[Oisf-users] Workers vs AutoFP with more than 32 cores
Peter Manev
petermanev at gmail.com
Fri Aug 30 20:05:46 UTC 2013
On Thu, Aug 29, 2013 at 10:57 PM, Tritium Cat <tritium.cat at gmail.com> wrote:
> Hi all.
>
> I'm seeing ok results using PF_RING+Libzero, workers runmode, 42 cores, and
> a full ruleset processing 260,000 - 320,000 pps ( ~1670-2200 Mbps).
This seems strange.
Is that 1.6~2 Gbps on all combined interfaces/ports or on one?
Below are some stats from a 1100~1600 Mbps, 4-core box with 16 GB RAM:
root at snif01:/var/log/suricata# grep kernel stats.log | tail -8
capture.kernel_packets    | RxPFReth31    | 2585791020
capture.kernel_drops      | RxPFReth31    | 602961
capture.kernel_packets    | RxPFReth32    | 2514450402
capture.kernel_drops      | RxPFReth32    | 2709976
capture.kernel_packets    | RxPFReth33    | 2244859533
capture.kernel_drops      | RxPFReth33    | 124751
capture.kernel_packets    | RxPFReth34    | 2775241625
capture.kernel_drops      | RxPFReth34    | 1364053
root at snif01:/var/log/suricata#
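To put those counters in perspective, here is a quick way to turn the last
stats.log dump into a per-thread drop percentage (a throwaway awk sketch,
adjust the tail count to 2 x the number of capture threads):

grep -E 'capture\.kernel_(packets|drops)' stats.log | tail -8 | awk -F'|' '
  /kernel_packets/ { pkts[$2]  = $3 }
  /kernel_drops/   { drops[$2] = $3 }
  END { for (t in pkts) printf "%s drop%% = %.3f\n", t, 100 * drops[t] / pkts[t] }'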
root at snif01:/var/log/suricata# tcpstat -i eth3
Time:1377890508 n=729849 avg=1004.05 stddev=656.23 bps=1172488715.20
Time:1377890513 n=711964 avg=1019.76 stddev=650.83 bps=1161648092.80
Time:1377890518 n=727231 avg=1028.70 stddev=648.18 bps=1196969761.60
Time:1377890523 n=761041 avg=1007.06 stddev=656.89 bps=1226261100.80
Time:1377890528 n=812943 avg=989.00 stddev=664.39 bps=1286402824.00
Time:1377890533 n=866400 avg=1007.10 stddev=659.03 bps=1396084670.40
^CTime:1377890538 n=244884 avg=1009.87 stddev=655.87 bps=395679948.80
root at snif01:/var/log/suricata#
Getting about 0.05% drops with 2.0dev (rev ff668c2), pfring (no DNA),
workers mode, 6K rules, at about 120K pps, with the CPU cores around 60%
busy (E5420 @ 2.50GHz).
All that after running for 20 hrs and using 10 of the 16 GB RAM on
Ubuntu with a 3.5 kernel:
root at snif01:/var/log/suricata# date && free -g
Fri Aug 30 21:46:46 CEST 2013
             total       used       free     shared    buffers     cached
Mem:            15         11          4          0          0          1
-/+ buffers/cache:          9          6
Swap:           15          0         15
root at snif01:/var/log/suricata#
root at snif01:/var/log/suricata# grep uptime stats.log | tail -3
Date: 8/30/2013 -- 21:38:54 (uptime: 0d, 19h 53m 12s)
Date: 8/30/2013 -- 21:39:13 (uptime: 0d, 19h 53m 31s)
Date: 8/30/2013 -- 21:39:32 (uptime: 0d, 19h 53m 50s)
root at snif01:/var/log/suricata#
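For reference, a pfring workers setup like the above is typically started
along these lines (a sketch, not the exact command line from this box; the
cluster id is arbitrary):

suricata -c /etc/suricata/suricata.yaml -D \
  --pfring-int=eth3 \
  --pfring-cluster-id=99 \
  --pfring-cluster-type=cluster_flow \
  --runmode=workers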
Network card -
 *-network:1
      description: Ethernet interface
      product: 82599EB 10-Gigabit SFI/SFP+ Network Connection
      vendor: Intel Corporation
      physical id: 0.1
      bus info: pci@0000:0a:00.1
      width: 64 bits
      clock: 33MHz
      capabilities: pm msi msix pciexpress bus_master cap_list ethernet
       physical fibre
      configuration: autonegotiation=off broadcast=yes driver=ixgbe
       driverversion=3.17.3 duplex=full firmware=0x61c10001 latency=0
       link=yes multicast=yes port=fibre promiscuous=yes
      resources: irq:17 memory:d8080000-d80fffff ioport:dcc0(size=32)
       memory:d8178000-d817bfff memory:c0200000-c02fffff
       memory:c0300000-c03fffff
Suricata is listening on one interface/port - eth3.
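When kernel_drops show up it is also worth checking whether the NIC itself
is dropping before PF_RING ever sees the packets (standard ethtool checks,
nothing Suricata specific):

# NIC-level error/drop counters and current ring sizes on the sniffing port
ethtool -S eth3 | grep -iE 'miss|drop|error'
ethtool -g eth3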
>
> "ok results" means stats.log infrequently reports drops and when it does,
> less than 3 or 4 workers are involved.
>
> For this setup it seems ~300,000 pps is about the limit before
> performance degrades. (For those claiming 10G on a single box with
> fewer cores, less memory, and fewer rules... what are your packets per
> second?)
>
> The hardware configuration is a dual-port Intel x520-DA2 card with
> both ports receiving a balanced share of traffic. Each port is set up
> to use a DNA Libzero cluster with 22 threads. (DNA Libzero has a limit
> of 32 threads per cluster, which is the reason I used two ports and
> two clusters.) The server is very similar to the one discussed in the
> Clarkson.edu paper below, and I used a number of their suggestions,
> such as 65534 max-pending-packets.
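For anyone reproducing this: a Libzero DNA cluster like that is normally
created with pfdnacluster_master before Suricata is started, roughly along
these lines (flags quoted from memory, so verify against
pfdnacluster_master -h; the cluster ids below are arbitrary):

# one cluster per DNA port, 22 consumer slots each (sketch only)
pfdnacluster_master -i dna0 -c 10 -n 22 -d
pfdnacluster_master -i dna1 -c 11 -n 22 -d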
>
> Workers runmode is the only configuration that works for me. I tried
> to get AutoFP working while honoring cache coherency (afaiu) but, no
> matter the configuration, it always failed to process anything more
> than 200-300 Mbps and dropped 85%+ of packets on each capture thread.
>
> If I separated the capture threads onto different CPUs they showed very
> little CPU usage, so I figured some other thread was limiting them. I'm
> not sure how the design of AutoFP is supposed to work in an ideal
> situation, as it seems more likely to fail as the number of CPUs
> increases (the opposite of the intended design). For it to work
> correctly, I think it must be configured to act something like
> "workers" runmode for a NUMA node, where all threads stay local to
> that node. But with more than one NUMA node I'm not sure if the order
> of the CPUs determines their allocation or ..? This BIOS is old and
> the CPUs are not enumerated in blocks but one per node
> (1,5,9,13,17,21...). I didn't check if the cpu-affinity settings
> actually allocate the additional detect threads to the same node... I
> gave up and moved on to trying workers again.
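A quick way to see how those CPU numbers map to NUMA nodes before filling
in cpu-affinity (generic Linux tooling, nothing Suricata specific):

# logical CPU to NUMA node / socket mapping
lscpu -e=CPU,NODE,SOCKET,CORE
# or the per-node summary
numactl --hardware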
>
> I guess that's the tradeoff with workers runmode: only a single detect
> module is available per thread, and it can starve out the other
> modules on that thread. (If I'm totally wrong on some of this, please
> point it out.) I think the drops come from one or more of the
> signatures consuming the time spent in the detect module, but I cannot
> tell since they are all in the same thread. Some workers see ~35-45k
> pps with no drops while others may drop packets while processing under
> 10k pps.
>
> The design of workers runmode seems the best for high-core-count
> installs, so long as you can broker traffic among the available CPUs;
> the PF_RING DNA cluster(s) seem to solve that problem well.
>
> The memory usage is low, but I think there is a leak somewhere, maybe
> related to the higher thread count. After ~12 hours of runtime the
> server finally used all of its memory and crashed. (i.e. if you're not
> using many, many threads you may not trigger this condition.)
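If a leak is suspected, one crude way to confirm the growth is to sample
the resident size of the Suricata process over time (a throwaway loop,
adjust the interval to taste):

# log a timestamped RSS sample (in kB) every 5 minutes
while true; do
  echo "$(date +%s) $(ps -o rss= -C suricata)" >> suricata-rss.log
  sleep 300
done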
>
> http://people.clarkson.edu/~jmatthew/publications/SPIE_SnortSuricata_2013.pdf
>
> This paper was very useful and maybe a good basis for debate on
> performance issues. Where I'm confused is how they were able to get
> AutoFP working, because I cannot make it scale as they describe. Maybe
> it has something to do with how they consumed the traffic; they
> mention their live replay results were very close to PCAP processing
> and thus used the latter for the majority of their tests. They also
> mention that processing from PCAP guarantees no packet drops because
> the system can adapt the read rate as necessary. As I remember it,
> they ran a full ruleset with 24 cores and managed about 300,000 pps
> before performance degraded.
>
>
> --TC
>
>
--
Regards,
Peter Manev