[Oisf-users] Workers vs AutoFP with more than 32 cores

Thu Aug 29 19:57:48 UTC 2013

Hi all.

I'm seeing ok results using PF_RING+Libzero, workers runmode, 42 cores, and
a full ruleset processing 260,000 - 320,000 pps ( ~1670-2200 Mbps).

"ok results" means stats.log infrequently reports drops and when it does,
fewer than 3 or 4 workers are involved.

For this setup it seems ~300,000 pps is about the limit before performance
degrades.  (For those claiming 10g on a single box with fewer cores, less
memory, and fewer rules... what are your packets per second?)
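For anyone who wants to compare setups by packet rate rather than bandwidth, the conversion is simple arithmetic.  A minimal Python sketch using the figures above (the function names are made up for illustration, and it ignores Ethernet framing overhead):

```python
def avg_packet_bytes(mbps: float, pps: float) -> float:
    """Average packet size in bytes implied by a bit rate and a packet rate."""
    return mbps * 1_000_000 / 8 / pps


def pps_at(mbps: float, packet_bytes: float) -> float:
    """Packet rate needed to fill a given bit rate at a given packet size."""
    return mbps * 1_000_000 / 8 / packet_bytes


# Figures quoted above: 260,000 pps at ~1670 Mbps, 320,000 pps at ~2200 Mbps.
low = avg_packet_bytes(1670, 260_000)
high = avg_packet_bytes(2200, 320_000)
print(round(low), round(high))  # -> 803 859

# Packet rate a full 10 Gbps link would carry at the same average size.
print(round(pps_at(10_000, low)))
```

So the traffic here averages roughly 800-860 bytes per packet, and a full 10g at that size would be on the order of 1.5 million pps, about five times what this box handles.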

The hardware configuration is a dual-port Intel x520-DA2 card with both
ports receiving a balanced share of traffic.  Each port is set up to use a
DNA Libzero cluster with 22 threads.  (DNA Libzero has a limit of 32
threads per cluster, which is the reason I used two ports and two clusters.)
The server is very similar to the one discussed in the Clarkson.edu paper
below and I used a number of their suggestions, such as setting
max-pending-packets to 65534.

Workers runmode is the only configuration that works for me.  I tried to
get AutoFP working while honoring cache coherency (afaiu..) but no matter
the configuration it always failed to process anything more than 200-300
Mbps and dropped 85%+ of the packets on each capture thread.

If I separated the capture threads onto different CPUs they showed very
little CPU usage, so I figured some other thread was limiting them.  I'm not
sure how the design of AutoFP is supposed to work in an ideal situation, as
it seems more likely to fail as the number of CPUs increases (the opposite
of the intended design).  For it to work correctly I think it must be
configured to act something like "workers" runmode per numa node, where
all threads stay local to that node.  But with more than one numa node I'm
not sure if the order of the CPUs determines their allocation or ..?  This
bios is old and the CPUs are not allocated in contiguous blocks but
interleaved one per node (1,5,9,13,17,21...).  I didn't check if the
cpu-affinity settings actually allocate the additional detect threads to
the same node... I gave up and moved on to trying workers again.
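One way to check which CPU ids belong to which numa node before writing cpu-affinity settings is to read the cpulist files Linux exposes under /sys/devices/system/node.  A rough Python sketch, assuming a Linux box with that sysfs tree; the helper names are hypothetical and not part of Suricata:

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8,12-13' into CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_layout(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map numa node id -> CPU ids, read from sysfs (Linux only)."""
    layout = {}
    for node_dir in sorted(Path(root).glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        layout[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return layout


# An interleaved list like the one described above parses as-is.
print(parse_cpulist("1,5,9,13,17,21"))  # -> [1, 5, 9, 13, 17, 21]
```

Calling numa_layout() on the sensor itself would show the real mapping, which could then be used to keep each group of threads on CPUs from a single node.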

I guess that's the tradeoff with workers runmode: each thread has only a
single detect module, and it can starve out the other modules on that
thread.  (If I'm totally wrong on some of this please point it out.)  I
think the drops are from one or more of the signatures consuming too much
time in the detect module, but I cannot tell since they are all in the same
thread.  Some workers see ~35-45k pps with no drops while others may drop
packets while processing under 10k pps.
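To see which workers are dropping, the per-thread counters in stats.log can be turned into a drop percentage.  A minimal Python sketch, assuming the pipe-separated "counter | thread name | value" layout and the capture.kernel_packets / capture.kernel_drops counters; the thread names and numbers in the sample are made up, not real measurements:

```python
from collections import defaultdict


def drop_rates(lines):
    """Return {thread_name: drop percentage} from 'counter | thread | value' lines."""
    counters = defaultdict(dict)
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip headers, separators, and date lines
        name, thread, value = fields
        counters[thread][name] = int(value)  # later snapshots overwrite earlier ones
    rates = {}
    for thread, c in counters.items():
        packets = c.get("capture.kernel_packets", 0)
        drops = c.get("capture.kernel_drops", 0)
        if packets:
            rates[thread] = 100.0 * drops / packets
    return rates


# Made-up sample lines for illustration only.
sample = """\
Counter                   | TM Name                   | Value
capture.kernel_packets    | worker-01                 | 40000
capture.kernel_drops      | worker-01                 | 0
capture.kernel_packets    | worker-02                 | 10000
capture.kernel_drops      | worker-02                 | 500
""".splitlines()

for thread, pct in sorted(drop_rates(sample).items()):
    print(f"{thread}: {pct:.1f}% dropped")
```

The ratio is only a rough indicator, since the counters are cumulative, but it makes it easy to spot the few workers that drop at a low packet rate.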

The design of workers runmode seems the best for installs with many CPUs,
so long as you can broker traffic among the available CPUs; the PF_RING DNA
cluster(s) seem to solve that problem well.

The memory usage is low but I think there is a leak somewhere, maybe
related to the higher thread count.  After ~12 hours of runtime the server
finally used all memory and crashed.  (i.e. if you're not using many, many
threads you may not trigger this condition.)

http://people.clarkson.edu/~jmatthew/publications/SPIE_SnortSuricata_2013.pdf

This paper was very useful and maybe good for debate on performance issues.
Where I'm confused is how they were able to get AutoFP working, because I
cannot make it scale as they seem to describe.  Maybe it has something to
do with how they consumed the traffic; they mention their live replay
results were very close to PCAP processing and thus used the latter for the
majority of their tests.  There was a mention that processing with PCAP
guarantees no packet drops because the system can adapt the read rate as
necessary.  As I remember it they ran a full ruleset with 24 cores and
managed about 300,000 pps before performance degraded.

--TC