[Oisf-users] Workers vs AutoFP with more than 32 cores

Peter Manev petermanev at gmail.com
Fri Aug 30 16:05:46 EDT 2013


On Thu, Aug 29, 2013 at 10:57 PM, Tritium Cat <tritium.cat at gmail.com> wrote:
> Hi all.
>
> I'm seeing ok results using PF_RING+Libzero, workers runmode, 42 cores, and
> a full ruleset processing 260,000 - 320,000 pps ( ~1670-2200 Mbps).

This seems strange.
Is that 1.6~2 Gbps combined across all interfaces/ports, or on a single one?

Below are some stats from a 1100~1600 Mbps, 4-core box with 16G of RAM:

root@snif01:/var/log/suricata# grep kernel stats.log | tail -8
capture.kernel_packets    | RxPFReth31                | 2585791020
capture.kernel_drops      | RxPFReth31                | 602961
capture.kernel_packets    | RxPFReth32                | 2514450402
capture.kernel_drops      | RxPFReth32                | 2709976
capture.kernel_packets    | RxPFReth33                | 2244859533
capture.kernel_drops      | RxPFReth33                | 124751
capture.kernel_packets    | RxPFReth34                | 2775241625
capture.kernel_drops      | RxPFReth34                | 1364053
root@snif01:/var/log/suricata#
root@snif01:/var/log/suricata# tcpstat -i eth3
Time:1377890508    n=729849    avg=1004.05    stddev=656.23    bps=1172488715.20
Time:1377890513    n=711964    avg=1019.76    stddev=650.83    bps=1161648092.80
Time:1377890518    n=727231    avg=1028.70    stddev=648.18    bps=1196969761.60
Time:1377890523    n=761041    avg=1007.06    stddev=656.89    bps=1226261100.80
Time:1377890528    n=812943    avg=989.00    stddev=664.39    bps=1286402824.00
Time:1377890533    n=866400    avg=1007.10    stddev=659.03    bps=1396084670.40
^CTime:1377890538    n=244884    avg=1009.87    stddev=655.87    bps=395679948.80
root@snif01:/var/log/suricata#
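
For reference, the overall drop ratio can be computed straight from those
counters; a quick sketch (the awk field positions assume the stats.log
layout shown above):

grep kernel stats.log | tail -8 | \
  awk -F'|' '/kernel_packets/ {p+=$3} /kernel_drops/ {d+=$3} END {printf "%.3f%% drops\n", 100*d/p}'

For the eight counters above this works out to roughly 0.05%.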

I am getting about 0.05% drops with 2.0dev (rev ff668c2), pfring (no DNA),
workers mode, 6K rules, at about 120K pps, while keeping the CPU cores
about 60% busy (E5420 @ 2.50GHz).

All that after running for 20 hrs and using about 10 of the 16G of RAM,
on Ubuntu with a 3.5 kernel:

root@snif01:/var/log/suricata# date && free -g
Fri Aug 30 21:46:46 CEST 2013
             total       used       free     shared    buffers     cached
Mem:            15         11          4          0          0          1
-/+ buffers/cache:          9          6
Swap:           15          0         15
root@snif01:/var/log/suricata#
root@snif01:/var/log/suricata# grep uptime stats.log | tail -3
Date: 8/30/2013 -- 21:38:54 (uptime: 0d, 19h 53m 12s)
Date: 8/30/2013 -- 21:39:13 (uptime: 0d, 19h 53m 31s)
Date: 8/30/2013 -- 21:39:32 (uptime: 0d, 19h 53m 50s)
root@snif01:/var/log/suricata#

Network card -

  *-network:1
       description: Ethernet interface
       product: 82599EB 10-Gigabit SFI/SFP+ Network Connection
       vendor: Intel Corporation
       physical id: 0.1
       bus info: pci@0000:0a:00.1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress bus_master cap_list ethernet physical fibre
       configuration: autonegotiation=off broadcast=yes driver=ixgbe driverversion=3.17.3 duplex=full firmware=0x61c10001 latency=0 link=yes multicast=yes port=fibre promiscuous=yes
       resources: irq:17 memory:d8080000-d80fffff ioport:dcc0(size=32) memory:d8178000-d817bfff memory:c0200000-c02fffff memory:c0300000-c03fffff

Suricata is listening on one interface/port - eth3.
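
For reference, a pfring workers setup on a single port like this one boils
down to something like the following in suricata.yaml (a sketch; the
cluster-id and thread count are only example values):

runmode: workers

pfring:
  - interface: eth3
    threads: 4                 # one worker per core on a 4 core box
    cluster-id: 99             # example id
    cluster-type: cluster_flow

With four capture threads on eth3 you end up with the RxPFReth31-34 worker
threads seen in the stats above.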

>
> "ok results" means stats.log infrequently reports drops and when it does,
> less than 3 or 4 workers are involved.
>
> For this setup it seems ~300,000 pps is about the limit before performance
> degrades.  (For those claiming 10G on a single box with fewer cores, less
> memory, and fewer rules... what are your packets per second?)
>
> The hardware configuration is a dual-port Intel x520-DA2 card with both
> ports receiving a balanced share of traffic.  Each port is set up to use a
> DNA Libzero cluster with 22 threads.  (DNA Libzero has a limit of 32 threads
> per cluster, which is why I used two ports and two clusters.)  The server is
> very similar to the one discussed in the Clarkson.edu paper below and I used
> a number of their suggestions, such as 65534 max-pending-packets.
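
For a dual-port DNA/Libzero setup like that, the Suricata side would look
roughly like the sketch below - assuming pfdnacluster_master (or similar) is
already feeding two Libzero clusters, one per port, and with cluster ids 10
and 11 picked purely as examples:

max-pending-packets: 65534

runmode: workers

pfring:
  - interface: dnacluster:10   # example id, cluster fed by the first port
    threads: 22
  - interface: dnacluster:11   # example id, cluster fed by the second port
    threads: 22

Each group of 22 workers then reads from its own cluster.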
>
> Workers runmode is the only configuration that works for me.  I tried to get
> AutoFP working while honoring cache coherency (afaiu..), but no matter the
> configuration it always failed to process anything more than 200-300 Mbps
> and dropped 85%+ of the packets on each capture thread.
>
> If I separated the capture threads onto different CPUs they showed very
> little CPU usage, so I figured some other thread was limiting them.  I'm not
> sure how the design of AutoFP is supposed to work in an ideal situation, as
> it seems more likely to fail as the number of CPUs increases (the opposite
> of the intended design).  For it to work correctly I think it must be
> configured to act something like "workers" runmode for a NUMA node, where
> all threads stay local to that node.  But with more than one NUMA node I'm
> not sure if the order of the CPUs determines their allocation or ..?  This
> BIOS is old and the CPUs are not enumerated in blocks but one per node
> (1,5,9,13,17,21...).  I didn't check if the cpu-affinity settings actually
> allocate the additional detect threads to the same node... I gave up and
> moved on to trying workers again.
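
For what it's worth, keeping everything on one node is done in the threading
section of suricata.yaml. A minimal sketch using the interleaved numbering
above, assuming CPUs 1,5,9,... really do end up on the same node:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]
    - receive-cpu-set:
        cpu: [ 1, 5 ]              # capture threads on node-local CPUs (assumed same node)
    - detect-cpu-set:
        cpu: [ 9, 13, 17, 21 ]     # detect threads kept on the same node (assumed)
        mode: "exclusive"
        prio:
          default: "high"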
>
> I guess that's the tradeoff with workers runmode: only a single detect
> module is available per thread, and it can starve out the other modules on
> that thread.  (If I'm totally wrong on some of this, please point it out.)
> I think the drops come from one or more of the signatures consuming the time
> spent in the detect module, but I cannot tell, since they are all in the
> same thread.  Some workers see ~35-45k pps with no drops while others may
> drop packets while processing under 10k pps.
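
One way to tell which signatures are eating the time in the detect module is
rule profiling: if Suricata is built with --enable-profiling, a profiling
section along these lines in suricata.yaml writes a per-rule report (the
filename and sort key are just example values):

profiling:
  rules:
    enabled: yes
    filename: rule_perf.log   # example output file
    append: yes
    sort: avgticks
    limit: 100

Profiling adds overhead of its own, so it is more of a debugging aid than
something to leave enabled.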
>
> The design of workers runmode seems the best for installs with many CPUs, so
> long as you can broker traffic among the available CPUs; the PF_RING DNA
> cluster(s) seem to solve that problem well.
>
> The memory usage is low, but I think there is a leak somewhere, maybe
> related to the higher thread count.  After ~12 hours of runtime the server
> finally used all of its memory and crashed.  (i.e. if you're not using many,
> many threads you may not trigger this condition.)
>
> http://people.clarkson.edu/~jmatthew/publications/SPIE_SnortSuricata_2013.pdf
>
> This paper was very useful and maybe a good basis for debate on performance
> issues.  Where I'm confused is how they were able to get AutoFP working,
> because I cannot make it scale as they seem to describe.  Maybe it has
> something to do with how they consumed the traffic; they mention their live
> replay results were very close to PCAP processing and so used the latter for
> the majority of their tests.  There was a mention that processing from PCAP
> guarantees no packet drops because the system can adapt the read rate as
> necessary.  As I remember it, they ran a full ruleset with 24 cores and
> managed about 300,000 pps before performance degraded.
>
>
> --TC
>
>
> _______________________________________________
> Suricata IDS Users mailing list: oisf-users at openinfosecfoundation.org
> Site: http://suricata-ids.org | Support: http://suricata-ids.org/support/
> List: https://lists.openinfosecfoundation.org/mailman/listinfo/oisf-users
> OISF: http://www.openinfosecfoundation.org/



-- 
Regards,
Peter Manev

