<div dir="ltr"><div>Hi all.</div><div><br></div><div>I'm seeing OK results using PF_RING+Libzero, workers runmode, 42 cores, and a full ruleset, processing 260,000-320,000 pps (~1670-2200 Mbps).</div><div><br></div><div>

"OK results" means stats.log reports drops only infrequently, and when it does, fewer than 3 or 4 workers are involved.</div><div><br></div><div>For this setup, ~300,000 pps seems to be about the limit before performance degrades.  (For those claiming 10 Gbps on a single box with fewer cores, less memory, and fewer rules: what are your packets per second?)</div>
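<div><br></div><div>To make the pps question concrete, here is a rough sketch of the arithmetic behind the numbers above. It assumes 1 Mbps = 1,000,000 bits per second and that the Mbps figure counts the same bytes as the packet counter (no Ethernet framing overhead); the function names are made up for illustration.</div>
<pre>
```python
def avg_packet_bytes(mbps, pps):
    """Average bytes per packet implied by a bit rate and a packet rate."""
    return mbps * 1_000_000 / 8 / pps


def pps_at_rate(mbps, packet_bytes):
    """Packets per second needed to fill `mbps` at a given average packet size."""
    return mbps * 1_000_000 / 8 / packet_bytes


# The two ends of the range quoted above.
low = avg_packet_bytes(1670, 260_000)
high = avg_packet_bytes(2200, 320_000)
print(f"implied average packet size: {low:.0f} to {high:.0f} bytes")

# What 10 Gbps would mean in pps at a few average packet sizes.
for size in (64, 512, round(low), 1500):
    print(f"10 Gbps at {size} bytes/packet: {pps_at_rate(10_000, size):,.0f} pps")
```
</pre>
<div>The point is only that a "10G" claim says little without the average packet size: the same bit rate can mean very different packet rates, and it is the packet rate that loads the detect threads.</div>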

<div><br></div><div>The hardware configuration is a dual-port Intel X520-DA2 card with both ports receiving a balanced share of traffic.  Each port is set up to use a DNA Libzero cluster with 22 threads.  (DNA Libzero has a limit of 32 threads per cluster, which is why I used two ports and two clusters.)  The server is very similar to the one discussed in the Clarkson.edu paper below, and I used a number of their suggestions, such as setting max-pending-packets to 65534.</div>


<div><br></div><div>Workers runmode is the only configuration that works for me.  I tried to get AutoFP working while honoring cache coherency (as far as I understand it), but no matter the configuration it never processed more than 200-300 Mbps and dropped 85% or more of the packets on each capture thread.  </div>

<div><br></div><div>If I moved the capture threads to different CPUs, they showed very little CPU usage, so I figured some other thread was limiting them.  I'm not sure how the design of AutoFP is supposed to work in an ideal situation, as it seems more likely to fail as the number of CPUs increases (the opposite of the intended design).  For it to work correctly, I think it must be configured to act something like "workers" runmode within a NUMA node, where all threads stay local to that node.  But with more than one NUMA node, I'm not sure if the order of the CPUs determines their allocation or ...?  This BIOS is old, and the CPUs are not allocated in blocks but interleaved, one per node in turn (1,5,9,13,17,21...).  I didn't check whether the cpu-affinity settings actually allocate the additional detect threads to the same node... I gave up and moved on to trying workers again.  </div>
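<div><br></div><div>For anyone wanting to check which CPU ids share a NUMA node before writing cpu-affinity sets, a minimal sketch follows. It assumes a Linux box that exposes /sys/devices/system/node/nodeN/cpulist and that the lists use only the plain "a,b,c" and "a-b" forms; the function names and the example layout are hypothetical, not taken from my server.</div>
<pre>
```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def read_numa_layout(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs; an empty dict if nothing is found."""
    layout = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            layout[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return layout


def node_of(cpu, layout):
    """Find which node a CPU id belongs to, or None if it is not listed."""
    for node, cpus in layout.items():
        if cpu in cpus:
            return node
    return None


# Hypothetical interleaved layout (made-up values, for illustration only).
example = {0: parse_cpulist("0,4,8,12"), 1: parse_cpulist("1,5,9,13")}
print(example)
print(node_of(9, example))
```
</pre>
<div>On the real machine, calling read_numa_layout() would give the per-node CPU lists, which could then be compared against the CPU sets assigned to the capture and detect threads.</div>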

<div><br></div><div>I guess that's the tradeoff with workers runmode: each thread has only a single detect module, and it can starve out the other modules on that thread.  (If I'm totally wrong on some of this, please point it out.)  I think the drops come from one or more signatures consuming most of the time spent in the detect module, but I cannot tell since all the modules run in the same thread.  Some workers see ~35-45k pps with no drops, while others drop packets while processing under 10k pps.</div>
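<div><br></div><div>To see which workers are dropping, something like the following sketch can summarize stats.log per thread. It assumes the three-column "counter | thread | value" layout and the capture.kernel_packets and capture.kernel_drops counter names; the thread names and numbers in the sample are made up, and whether kernel_packets already includes the dropped packets depends on the capture method, so treat the percentage as an indicator only.</div>
<pre>
```python
from collections import defaultdict


def drop_rates(stats_text):
    """Return {thread: (packets, drops, drop_percent)} from the latest values seen."""
    counters = defaultdict(dict)
    for line in stats_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skips separators, the header row, and date lines
        name, thread, value = fields
        counters[thread][name] = int(value)  # later dumps overwrite earlier ones
    result = {}
    for thread, seen in counters.items():
        packets = seen.get("capture.kernel_packets", 0)
        drops = seen.get("capture.kernel_drops", 0)
        percent = 100.0 * drops / packets if packets else 0.0
        result[thread] = (packets, drops, percent)
    return result


# Made-up sample; thread names and numbers are illustrative only.
sample = """
capture.kernel_packets    | worker-a                  | 40000
capture.kernel_drops      | worker-a                  | 0
capture.kernel_packets    | worker-b                  | 8000
capture.kernel_drops      | worker-b                  | 400
"""
for thread, (packets, drops, percent) in sorted(drop_rates(sample).items()):
    print(f"{thread}: {packets} packets, {drops} drops, {percent:.1f}% dropped")
```
</pre>
<div>This only shows where the drops land, not which signature causes them; it is a starting point for picking a worker to look at more closely.</div>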


<div><br></div><div>The design of workers runmode seems the best for high-CPU-count installs, so long as you can broker traffic among the available CPUs; the PF_RING DNA cluster(s) seem to solve that problem well. </div><div><br>


</div><div>The memory usage is low, but I think there is a leak somewhere, maybe related to the high thread count.  After ~12 hours of runtime the server finally used all its memory and crashed.  (That is, if you're not using a large number of threads, you may not trigger this condition.)</div>

<div><br></div><div><a href="http://people.clarkson.edu/~jmatthew/publications/SPIE_SnortSuricata_2013.pdf" target="_blank">http://people.clarkson.edu/~jmatthew/publications/SPIE_SnortSuricata_2013.pdf</a><br></div><div><br>

</div>This paper was very useful and may be good material for debate on performance issues.  Where I'm confused is how they were able to get AutoFP working, because I cannot make it scale as they seem to describe.  Maybe it has something to do with how they consumed the traffic; they mention their live replay results were very close to PCAP processing and thus used the latter for the majority of their tests.  They also mention that processing with PCAP guarantees no packet drops because the system can adapt the read as necessary.  As I remember it, they ran a full ruleset with 24 cores and managed about 300,000 pps before performance degraded.<div>


<br></div><div><br></div><div>--TC</div><div><br></div></div>