[Oisf-users] AMD Piledriver "Elephant Gun" runmode (was Re: Question about cpu-affinity)

Cooper F. Nelson cnelson at ucsd.edu
Wed Mar 14 01:29:03 UTC 2018


Here are some notes on my current "Elephant Gun" AMD Piledriver suricata
build.

This is a 20G deployment, with the goal of handling big
gigabit/multi-gigabit flows without dropping too many packets.  So far
I've been mostly successful; packet drops are under 0.1% over 24 hours.
There appears to be a bug/feature in suricata that is keeping me from
the 'Holy Grail' of zero drops; I'll touch on that at the end of the post.

The 'mission statement' is basically to follow the existing SEPTUN
guides, but 'flip' the bits about using small buffers/blocks in order to
get around the lack of advanced caching on the AMD platform.  So in
general we want large buffers/blocks/pages with a low-res system timer
(100 Hz) in order to maintain cache coherency within the timeslices
allocated to suricata.  We also separate the RSS NIC/NUMA nodes from the
decode cores entirely and let AMD HyperTransport ship the packets to
decode threads on other nodes.

Config is as follows:

System: Follow the SEPTUN guides where possible, selecting for a 'server'
config.  Be sure to enable power saving so you can use the 'ondemand'
governor; AMD cores run hot, so it's critical to keep the system cool so
that individual cores can still be overclocked.
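For reference, switching every core to 'ondemand' is just a matter of
writing to the cpufreq sysfs knobs; something like this (assuming the
acpi-cpufreq driver is loaded and exposing them) does the trick:

> #Set every core to the 'ondemand' governor and confirm it took
> for gov in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
>     echo ondemand > "$gov"
> done
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor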

Kernel: Again, select for a 'server' config.  I use hardened Gentoo
vanilla sources, so at the very least enable VM huge pages, no kernel
preemption and the 100 Hz timer.  We want big blocks, buffers and time
slices to minimize cache/TLB thrashing.
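If your kernel exposes /proc/config.gz (CONFIG_IKCONFIG_PROC), a quick
sanity check for those options looks roughly like this:

> #Expect PREEMPT_NONE=y, HZ=100, and hugepage support enabled
> zgrep -E 'CONFIG_PREEMPT_NONE=|CONFIG_HZ_100=|CONFIG_HZ=|CONFIG_HUGETLBFS=|CONFIG_TRANSPARENT_HUGEPAGE=' /proc/config.gz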

NIC (ixgbe driver in my case): Use a 'warm boot' script to clear memory
and set 4k PCIe reads/buffers:

> #!/bin/bash
>
> #Clear caches/buffers
> free && sync && echo 3 > /proc/sys/vm/drop_caches && free
>
> echo "Tuning intel card..."
> ifconfig enp8s0f0 down && ifconfig enp35s0f0 down
> rmmod ixgbe
> sleep 1
> modprobe ixgbe
>
> #Set 4k PCIe max read request size on the 82599 (8086:10fb)
> setpci -v -d 8086:10fb e6.b=2e
>
> #Bring up the interface with jumbo frame (9k MTU) support
> ifconfig enp8s0f0 up
> ifconfig enp8s0f0 mtu 9000
>
> #Disable offloading except for RX checksums
> ethtool -K  enp8s0f0 rx on sg off gro off lro off tso off gso off
> #Enable ntuple filters and IRQ coalescing
> ethtool -K  enp8s0f0 ntuple on
> ethtool -C  enp8s0f0 adaptive-rx on rx-usecs 100
> #Enable 4k ring buffer
> ethtool -G  enp8s0f0 rx 4096
> # 64 core system; use all 8 cores on NUMA node 0 for RSS
> ethtool -L  enp8s0f0 combined 8
> # Force symmetric flow hashing with a repeating 0x6d5a RSS key
> ethtool -X  enp8s0f0 hkey \
> 6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a
> # Hash on src/dst IP and ports for IPv4 and IPv6 (the second interface, enp35s0f0, gets the same treatment)
> for proto in tcp4 udp4 tcp6 udp6; do
>   /usr/sbin/ethtool -N enp8s0f0 rx-flow-hash $proto sdfn
> done
> #Set IRQ affinity
> /usr/local/sbin/set_irq_affinity.sh 0-7 enp8s0f0
>
> #Start suri with libtcmalloc
> LD_PRELOAD="/usr/lib64/libtcmalloc_minimal.so.4" /usr/bin/suricata \
>   -vvv -D -c /etc/suricata/suricata.yaml --af-packet
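
After a warm boot I like to double-check that the NIC settings actually
stuck; roughly:

> #Confirm ring size, coalescing, RSS key/indirection and flow-hash fields
> ethtool -g enp8s0f0
> ethtool -c enp8s0f0
> ethtool -x enp8s0f0
> ethtool -n enp8s0f0 rx-flow-hash tcp4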

Suricata:  A SEPTUN build is fine.  I found that a tpacket-v3 block size
of 2MB (the same size as the AMD cache and huge pages) works best:

>
> af-packet:
>   - interface: enp8s0f0
>     threads: 24
>     cluster-id: 98
>     cluster-type: cluster_flow
>     defrag: yes
>     use-mmap: yes
>     mmap-locked: no
>     tpacket-v3: yes
>     ring-size: 500000
>     rollover: no
>     block-size: 2097152
>     block-timeout: 10
>     use-emergency-flush: yes
>     checksum-checks: no

I enable "midstream: true" in the stream config and delayed detection,
so any active 'elephant' flows are properly tracked and bypassed when
suri starts up and hopefully no packets dropped on startup. 
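For reference, the relevant bits of my stream section look roughly like
this (the depth value is just illustrative; stream.bypass is what hands
flows off to the capture bypass once that depth is hit):

> stream:
>   midstream: true   # pick up flows already in flight at startup
>   bypass: yes       # bypass flows once the reassembly depth is reached
>   reassembly:
>     depth: 1mb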

For a 64-core system, put the detect/management threads on the remaining
48 cores, 24 threads per NIC.  We have lots of cores, so it's not a big
deal to dedicate 16 of them to RSS+cluster_flow.  Plus, big 'elephant'
flows will be less likely to crush a single core.  With the server build,
you can also run other software, like elasticsearch, on the same system
without impacting performance significantly.  Load average is currently
under 10 (out of 64) at peak utilization.  You can also 'nice' background
tasks like GNU parallel and emerge so that they mostly run when the
suricata cores are idle.
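In suricata.yaml terms that boils down to the threading/cpu-affinity
section; the core ranges below are illustrative for my layout (NIC IRQs
pinned to the first 16 cores via set_irq_affinity, workers on the rest),
so adjust for your own NUMA topology:

> threading:
>   set-cpu-affinity: yes
>   cpu-affinity:
>     - management-cpu-set:
>         cpu: [ "0-7" ]     # illustrative
>     - worker-cpu-set:
>         cpu: [ "16-63" ]   # keep workers off the RSS/IRQ cores
>         mode: "exclusive"
>         prio:
>           default: "high"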

Re:  The 'holy grail' of zero packet drops.  Despite playing with
max-pending-packets, default-packet-size and ring-size, certain flows
seem to break the flow bypass feature and fill the AF_PACKET buffer
(even when ring-size is set to large values like 500k).  This appears to
happen in two scenarios:

1:  Non-TCP 'elephants', like Google's QUIC protocol (UDP port 443).
2:  TCP elephants, possibly using jumbo frames, that peg the decode
core, create a race condition and overwhelm the stream tracker.  For
example, I've observed TCP flows peg a core at 100% for an extended
period of time (much longer than it should take for the flow to be
bypassed) and then eventually appear to trigger an emergency flush.
What I think is happening is that packets are coming in so fast that the
decoding pipeline gets backed up and can't track the full stream depth
to allow a proper bypass before the ring buffer fills up.
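As an aside, the bypass can also be forced per-flow from a rule with the
'bypass' keyword, which is one way to take case 1 off the table entirely;
a rough example (the sid is arbitrary):

> alert udp any any <> any 443 (msg:"bypass QUIC elephant flows"; bypass; sid:9000001; rev:1;)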

As always, questions/comments/feedback welcome.  I'm trying to get the
new XDP stuff working, but so far I have not been able to on hardened Gentoo.

-Coop

On 3/13/2018 12:09 AM, Peter Manev wrote:
> feedback as promised...from what i have seen/tested.
> Dedicating Suricata workers cross different NUMA nodes with modern AMD
> CPUs (EPYC 7601 for example) seems to work quite ok - very good in
> fact - which would differ from the way you would "traditionally"
> deploy that with Intel CPUs (NIC and workers on same NUMA)
>
> In both cases a good NIC is an essential piece as well :)

-- 
Cooper Nelson
Network Security Analyst
UCSD ITS Security Team
cnelson at ucsd.edu x41042

