[Discussion] rule profiling (was: Re: Non-tokenized preprocessor parameter lines)

Victor Julien lists at inliniac.net
Thu Feb 12 08:26:13 UTC 2009


Martin Holste wrote:
> I've done enough work with rule profiling to realize that I need help
> from someone who understands Aho-Corasick better, so that I can more
> accurately figure out what the load on the detection engine would be.
> (I posted something to this effect on the Emerging Threats list last
> week.)  One real performance factor, as far as I can tell, is whether
> a rule's content match already appears in another rule.  If it does,
> the effective load of adding that rule could be negligible, since the
> detection engine's load doesn't actually increase.  So a tool that
> takes the entire ruleset into account would be very helpful.  I know
> that Sourcefire already does something like this in Snort in a few
> places, but I think the number of people who understand the
> information in the state transition table reports could be counted on
> one hand (judging from the lack of comments on the ET list).  That
> information needs to be wrapped into a larger profiler.
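
To make the Aho-Corasick point concrete: a content match that already
occurs in another rule adds no new states to the state machine, so its
marginal cost is close to zero. A toy sketch (hypothetical structures,
not Snort's or any real engine's code):

#include <stdlib.h>

/* Toy Aho-Corasick trie insert that returns how many NEW states the
 * pattern created. A pattern that another rule already uses creates
 * zero states, which is why its marginal load is negligible.
 * (Error handling and failure links omitted.) */
struct trie_node {
    struct trie_node *child[256];
    int final;                           /* a pattern ends here */
};

static int trie_insert(struct trie_node *root,
                       const unsigned char *pat, size_t len)
{
    int new_states = 0;
    struct trie_node *n = root;
    for (size_t i = 0; i < len; i++) {
        if (n->child[pat[i]] == NULL) {
            n->child[pat[i]] = calloc(1, sizeof(*n));
            new_states++;
        }
        n = n->child[pat[i]];
    }
    n->final = 1;
    return new_states;                   /* 0 => pattern already covered */
}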

I'll add in some more variables that matter a lot:

Number of patterns. All pattern matching algorithms slow down as the
number of patterns increases, because the match verification process
gets more expensive as the hash tables fill up (with an 8-bit alphabet,
a 2-gram algorithm has a hash space of at most 65536 entries). Switching
from the 2-gram to the 3-gram (or longer) version of an algorithm
usually helps, but at the cost of more memory and higher computation
overhead. That influences performance quite a bit, probably because more
memory means more CPU cache misses. With more patterns the average
pattern length usually goes down too.
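
To make the hash-space point concrete, here is a minimal sketch
(hypothetical helper names, not engine code) of 2-gram versus 3-gram
hashing:

#include <stdint.h>

/* Hypothetical 2-gram hash: two 8-bit characters index a table of at
 * most 256 * 256 = 65536 buckets. Once the pattern set grows well past
 * that, buckets inevitably hold multiple candidate patterns, and every
 * candidate needs an expensive full verification. */
static inline uint16_t hash2(const uint8_t *buf)
{
    return (uint16_t)((buf[0] << 8) | buf[1]);
}

/* The 3-gram variant widens the hash space to 2^24 buckets, cutting
 * collisions, but the table is 256 times larger: more memory and more
 * CPU cache misses. */
static inline uint32_t hash3(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 16) | ((uint32_t)buf[1] << 8) | buf[2];
}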

The minimum "largest pattern per signature" size. Most pattern matching
algorithms perform best with longer patterns, since the matcher can step
through the searched data in bigger steps, and the shortest pattern in
the set caps the step size for the whole set. So if all your sigs have a
pattern of length at least 8 and you add in one of length 3, the
performance hit is going to be bigger than when there are already sigs
with short patterns (see the sketch below).
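
A sketch of why one short pattern caps everyone's step size. This is a
hypothetical Wu-Manber-style bad-character shift table, not the engine's
actual code; the point is that no safe shift can exceed the shortest
pattern's length:

#include <stddef.h>
#include <string.h>

/* Build a bad-character shift table over a pattern group. A shift is
 * only safe if it cannot jump over a possible match, so every entry is
 * capped at the shortest pattern's length: adding one 3-byte pattern
 * drags the maximum shift of the whole group down from, say, 8 to 3. */
static void build_shift(const char **pats, size_t npats, size_t shift[256])
{
    size_t minlen = (size_t)-1;
    for (size_t i = 0; i < npats; i++) {
        size_t len = strlen(pats[i]);
        if (len < minlen)
            minlen = len;
    }
    for (size_t c = 0; c < 256; c++)
        shift[c] = minlen;                /* best case: skip minlen bytes */
    for (size_t i = 0; i < npats; i++) {
        for (size_t j = 0; j + 1 < minlen; j++) {
            size_t s = minlen - 1 - j;    /* char this close to block end */
            unsigned char c = (unsigned char)pats[i][j];
            if (s < shift[c])
                shift[c] = s;
        }
    }
}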

Similar rules are grouped together to save memory on (among other
things) pattern matcher contexts. A bad rule will only influence the
groups it's in, but an even worse rule can end up in a lot of contexts.
(Using the emerging-all.rules file I can easily make the engine use a
few gigabytes of RAM, but with the grouping I have slimmed it down to
about 50mb. Guess which performs better? The smaller one :))
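
Roughly, the grouping looks like this (hypothetical structures, not the
actual engine internals): rules are bucketed by properties such as
protocol and destination port, each bucket gets its own pattern matcher
context, and a packet is only run against the contexts of the buckets it
falls into:

#include <stdint.h>

/* Hypothetical rule group: each group owns one pattern matcher context
 * built only from its own rules' patterns. A tcp/80 packet is matched
 * against the few contexts it belongs to instead of one giant context
 * holding every pattern. A rule with vague properties (any port, any
 * protocol) gets duplicated into many groups and bloats them all. */
struct mpm_ctx;                          /* opaque matcher state */

struct rule_group {
    uint8_t            ipproto;          /* e.g. IPPROTO_TCP */
    uint16_t           dst_port;         /* e.g. 80 */
    struct mpm_ctx    *mpm;              /* matcher for this group only */
    struct rule_group *next;
};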

CPU cache size and memory bus speed seem to make a big difference too.
In my code I've implemented the (simple) BNDM algorithm, in both a
2-gram and a 3-gram version. On a Core2duo T5500 (2mb cache) the 2-gram
sBNDM performed best, on a Core2duo E6600 (4mb cache) the 3-gram BNDM.
On my gateway box, a P3 500mhz with 512kb cache, again the 2-gram sBNDM,
but with different hash table sizes and so on...
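
For reference, a minimal single-pattern BNDM sketch. This is the 1-gram
textbook version, limited to patterns of up to 32 bytes; the 2-gram and
3-gram variants index the mask table with pairs or triples of characters
instead of single bytes:

#include <stdint.h>
#include <stdio.h>

/* Minimal 1-gram BNDM: one bit per pattern position, so m <= 32 here.
 * B[c] marks every position where character c occurs in the pattern. */
static void bndm(const uint8_t *text, size_t n, const uint8_t *pat, size_t m)
{
    if (m == 0 || m > 32)
        return;

    uint32_t B[256] = {0};
    for (size_t i = 0; i < m; i++)
        B[pat[i]] |= 1u << (m - 1 - i);

    size_t pos = 0;
    while (pos + m <= n) {
        size_t j = m, last = m;
        uint32_t D = (m == 32) ? ~0u : (1u << m) - 1;
        while (j > 0 && D != 0) {        /* scan the window right to left */
            D &= B[text[pos + j - 1]];
            j--;
            if (D & (1u << (m - 1))) {   /* window suffix = pattern prefix */
                if (j > 0)
                    last = j;            /* candidate shift */
                else
                    printf("match at %zu\n", pos);
            }
            D <<= 1;
        }
        pos += last;
    }
}

Note that 'last' can never exceed m, which is exactly where the minimum
pattern length point from above bites: short patterns force small steps.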

One thing that also influences performance is the likelihood of a match,
because after the pattern matcher suspects a match, it has to be
verified by the detection engine.  For example, I think HTTP keywords,
HTML fragments, SMTP commands, etc. all have a bigger likelihood of
matching and are thus more expensive... maybe a blacklist could help
there. Any pattern on that list would be classified as more expensive...
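
Such a blacklist could be as simple as this (purely hypothetical, just
to illustrate the idea; the pattern list is made up):

#include <stddef.h>
#include <strings.h>

/* Hypothetical blacklist of patterns that occur frequently in benign
 * traffic and so trigger lots of expensive verification. A profiler
 * could use it to weight a rule's estimated cost upwards. */
static const char *common_patterns[] = {
    "GET ", "POST ", "Content-Type:", "<html", "EHLO", NULL
};

static int is_costly_pattern(const char *pat)
{
    for (size_t i = 0; common_patterns[i] != NULL; i++)
        if (strcasecmp(pat, common_patterns[i]) == 0)
            return 1;                    /* matches often: count as expensive */
    return 0;
}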

Regards,
Victor

--
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------



