[Oisf-users] AMD Piledriver "Elephant Gun" runmode (was Re: Question about cpu-affinity)

Michał Purzyński michalpurzynski1 at gmail.com
Wed Mar 14 01:50:44 UTC 2018


Thanks for the field report.

Ideally you want to minimize the number of context switches, so I see why separating RSS from worker threads helps here.

This would also be a valid optimization for Intel.

Same story with timers: 100Hz max, and with careful process pinning, including moving kernel threads away from the cores doing RSS and the workers, the kernel ends up ticking only about once per second on those cores, which minimizes context switches and cache/TLB thrashing.
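
A rough sketch of what I mean (core numbers are only an example, adjust
them to wherever your RSS and worker threads actually live, and
nohz_full needs CONFIG_NO_HZ_FULL built into the kernel) - the kernel
command line gets something like:

  # isolate the capture/worker cores; push ticks, RCU callbacks and
  # unrelated IRQs onto the housekeeping cores instead
  isolcpus=0-31 nohz_full=0-31 rcu_nocbs=0-31 irqaffinity=32-63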

> On Mar 13, 2018, at 6:29 PM, Cooper F. Nelson <cnelson at ucsd.edu> wrote:
> 
> Here are some notes on my current "Elephant Gun" AMD Piledriver suricata
> build.
> 
> This is a 20G deployment, with the goal of handling big
> gigabit/multi-gigabit flows without dropping too many packets.  So far
> I've been mostly successful: packet drops are under 0.1% over 24 hours.
> There appears to be a bug/feature in suricata that is keeping me from
> the 'Holy Grail' of zero drops; I'll touch on that at the end of the post.
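> 
> (The drop figure comes from Suricata's own capture counters; assuming
> the default stats.log location, something like the following is enough
> to eyeball it:)
> 
>> # most recent kernel packet/drop counters from the stats log
>> grep -E 'capture\.kernel_(packets|drops)' /var/log/suricata/stats.log | tail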
> 
> The 'mission statement' is basically to follow the existing SEPTUN
> guides, but 'flip' the bits about using small buffers/blocks in order
> to work around the lack of advanced caching on the AMD platform.  So in
> general we want to use large buffers/blocks/pages with a low-res system
> timer (100Hz) in order to maintain cache coherency within the
> timeslices allocated to suricata.  We also separate the RSS NIC/NUMA
> nodes from the decode cores entirely and let AMD HyperTransport send
> the packets to decode threads on other nodes.
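> 
> (To see which NUMA node the capture NIC actually hangs off of before
> pinning anything, sysfs and numactl both work; these are the standard
> paths/tools:)
> 
>> cat /sys/class/net/enp8s0f0/device/numa_node   # node of the capture NIC
>> numactl --hardware                             # node/core/memory layout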
> 
> Config is as follows:
> 
> System: Follow the SEPTUN guides where possible, selecting for a
> 'server' config.  Be sure to enable power saving so you can use the
> 'ondemand' governor; AMD cores run hot, so it's critical to keep the
> system cool so that individual cores can still be overclocked.
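> 
> (For the governor itself, assuming cpupower or the cpufreq sysfs
> interface is available, it is just:)
> 
>> cpupower frequency-set -g ondemand
>> # or, without cpupower:
>> for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo ondemand > $g; done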
> 
> Kernel: Again, select for a 'server' config.  I use hardened Gentoo
> vanilla sources, so at the very least enable VM huge pages, disable
> kernel preemption and use a 100Hz timer.  We want big blocks, buffers
> and time slices to minimize cache/TLB thrashing.
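> 
> (The relevant kernel options can be checked against the running kernel
> if it exposes /proc/config.gz; the symbol names are the stock ones:)
> 
>> zgrep -E 'CONFIG_HZ=|CONFIG_HZ_100|CONFIG_PREEMPT_NONE|CONFIG_TRANSPARENT_HUGEPAGE=|CONFIG_HUGETLBFS' /proc/config.gz
>> # expect CONFIG_HZ=100, CONFIG_PREEMPT_NONE=y and huge pages enabled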
> 
> NIC (ixgbe driver in my case): Use 'warm boot' script to clear memory
> and set 4k PCI reads/buffers:
> 
>> #!/bin/bash
>> 
>> 
>> #clear caches/buffers
>> free && sync && echo 3 > /proc/sys/vm/drop_caches && free
>> 
>> 
>> echo "Tuning intel card..."
>> ifconfig enp8s0f0 down && ifconfig enp35s0f0 down
>> rmmod ixgbe
>> sleep 1
>> modprobe ixgbe
>> 
>> #Set 4k PCI reads
>> setpci -v -d 8086:10fb e6.b=2e
>> 
>> #Bring up the interface and enable jumbo frames (MTU 9000)
>> ifconfig enp8s0f0 up
>> ifconfig enp8s0f0 mtu 9000
>> 
>> #Disable offloading except for RX checksums
>> ethtool -K  enp8s0f0 rx on sg off gro off lro off tso off gso off
>> #Enable ntuple filters and IRQ coalescing
>> ethtool -K  enp8s0f0 ntuple on
>> ethtool -C  enp8s0f0 adaptive-rx on rx-usecs 100
>> #Enable 4k ring buffer
>> ethtool -G  enp8s0f0 rx 4096
>> # 64 cores; use all 8 cores on NUMA node 0
>> ethtool -L  enp8s0f0 combined 8
>> # Force symmetric flow hashing
>> ethtool -X  enp8s0f0 hkey \
>>   6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a
>> for proto in tcp4 udp4 tcp6 udp6; do
>>   /usr/sbin/ethtool -N enp35s0f0 rx-flow-hash $proto sdfn
>> done
>> #Set IRQ affinity
>> /usr/local/sbin/set_irq_affinity.sh 0-7 enp8s0f0
>> 
>> #Start suri with libtcmalloc
>> LD_PRELOAD="/usr/lib64/libtcmalloc_minimal.so.4" /usr/bin/suricata \
>>   -vvv -D -c /etc/suricata/suricata.yaml --af-packet
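> 
> (To double-check that the NIC settings actually stuck after the warm
> boot, the standard ethtool query flags are enough, e.g.:)
> 
>> ethtool -g enp8s0f0   # ring size (expect rx 4096)
>> ethtool -l enp8s0f0   # channels (expect 8 combined)
>> ethtool -x enp8s0f0   # RSS hash key / indirection table
>> ethtool -c enp8s0f0   # coalescing (adaptive-rx, rx-usecs)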
> 
> Suricata: The SEPTUN build is fine.  I found that a tpacket-v3 block
> size of 2MB (the same size as the AMD L2 cache and a huge page) works
> best:
> 
>> 
>> af-packet:
>>   - interface: enp8s0f0
>>     threads: 24
>>     cluster-id: 98
>>     cluster-type: cluster_flow
>>     defrag: yes
>>     use-mmap: yes
>>     mmap-locked: no
>>     tpacket-v3: yes
>>     ring-size: 500000
>>     rollover: no
>>     block-size: 2097152
>>     block-timeout: 10
>>     use-emergency-flush: yes
>>     checksum-checks: no
> 
> I enable "midstream: true" in the stream config, along with delayed
> detection, so that any active 'elephant' flows are properly tracked and
> bypassed when suri starts up and, hopefully, no packets are dropped on
> startup.
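> 
> (Both settings can also be flipped from the command line with --set
> overrides instead of editing the YAML; the exact key paths have moved
> around between Suricata versions, so check them first:)
> 
>> suricata --dump-config | grep -E 'midstream|delayed-detect'
>> # then e.g.:  --set stream.midstream=true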
> 
> For a 64-core system, put the detect/management threads on the
> remaining 48 cores, 24 threads per NIC.  We have lots of cores, so it's
> not a big deal to dedicate 16 to RSS+cluster_flow.  Plus, big 'elephant'
> flows will be less likely to crush a single core.  With the server
> build, you can also run other software, like elasticsearch, on the same
> system without significantly impacting performance.  Load average is
> currently under 10 (out of 64) at peak utilization.  You can also
> 'nice' background tasks such as GNU parallel and emerge so that they
> run mostly while the suricata cores are idle.
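> 
> (Illustrative only, the paths and jobs below are made up, but the idea
> is simply:)
> 
>> # run maintenance jobs at lowest CPU and idle I/O priority
>> nice -n 19 ionice -c 3 emerge --sync
>> nice -n 19 parallel -j 8 gzip ::: /var/log/suricata/archive/*.json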
> 
> Re:  The 'holy grail' of zero packet drops.  Despite playing with
> max-pending-packets, default-packet-size and ring-size, certain flows
> seem to break the flow bypass feature and fill the AF_PACKET buffer
> (even when set to large values like 500k).  This appears to happen in
> two scenarios:
> 
> 1:  Non-TCP 'elephants', like Google's QUIC protocol (UDP port 443). 
> 2:  TCP elephants, possibly using jumbo frames, that peg the decode
> core, create a race condition and overwhelm the stream tracker.  So, for
> example, I've observed TCP flows peg a core @100% for an extended period
> of time (much longer than it should take to be bypassed) and then
> eventually appear to trigger an emergency flush.  What I think is
> happening is that packets are coming in so fast that the decoding
> pipeline gets backed up and can't track the full stream depth to allow a
> proper bypass before the ring-buffer fills up. 
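> 
> (If your Suricata version exposes bypass counters in stats.log, a loose
> grep is an easy way to see whether bypass is kicking in on these flows
> at all:)
> 
>> grep -i bypass /var/log/suricata/stats.log | tail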
> 
> As always, questions/comments/feedback welcome.  I'm trying to get the
> new XDP stuff working, but so far I have not been able to on hardened
> Gentoo.
> 
> -Coop
> 
> On 3/13/2018 12:09 AM, Peter Manev wrote:
>> Feedback as promised... from what I have seen/tested.
>> Dedicating Suricata workers across different NUMA nodes with modern
>> AMD CPUs (the EPYC 7601, for example) seems to work quite OK - very
>> good, in fact - which differs from the way you would "traditionally"
>> deploy with Intel CPUs (NIC and workers on the same NUMA node).
>> 
>> In both cases a good NIC is an essential piece as well :)
> 
> -- 
> Cooper Nelson
> Network Security Analyst
> UCSD ITS Security Team
> cnelson at ucsd.edu x41042
> 
> 


