[Oisf-users] AMD Piledriver "Elephant Gun" runmode (was Re: Question about cpu-affinity)

Peter Manev petermanev at gmail.com
Thu Mar 15 12:55:47 UTC 2018


Cooper - thanks for sharing your findings!

On Wed, Mar 14, 2018 at 2:50 AM, Michał Purzyński
<michalpurzynski1 at gmail.com> wrote:
> Thanks for the field report.
>
> Ideally you want to minimize the number of context switches, so I see why separating RSS from worker threads helps here.
>
> This would also be a valid optimization for Intel.
>
> Same story with timers - 100 Hz max - and careful process pinning,
> including moving kernel threads away from the cores doing RSS and
> workers, makes the kernel tick only about once per second, minimizing
> context switches and cache/TLB thrashing.
>
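
A sketch of the boot-parameter side of that isolation (the core range
8-31 here is made up - adjust to your own layout; nohz_full needs
CONFIG_NO_HZ_FULL in the kernel):

# isolcpus  - keep the scheduler from placing other tasks on these cores
# nohz_full - stop the periodic tick on cores running a single task
# rcu_nocbs - move RCU callbacks off to the housekeeping cores
isolcpus=8-31 nohz_full=8-31 rcu_nocbs=8-31
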
>> On Mar 13, 2018, at 6:29 PM, Cooper F. Nelson <cnelson at ucsd.edu> wrote:
>>
>> Here are some notes on my current "Elephant Gun" AMD Piledriver suricata
>> build.
>>
>> This is a 20G deployment, with the goal of handling big
>> gigabit/multi-gigabit flows without dropping too many packets.  So far
>> I've been mostly successful; packet drops are under 0.1% over 24 hours.
>> There appears to be a bug/feature in suricata that is keeping me from
>> the 'Holy Grail' of zero drops; I'll touch on that at the end of the post.
>>

Nice :)

>> The 'mission statement' is basically to follow the existing SEPTUN
>> guides, but 'flip' the bits about using small buffers/blocks to get
>> around the lack of advanced caching on the AMD platform.  So in general
>> we want to use large buffers/blocks/pages with a low-res system timer
>> (100 Hz) in order to maintain cache coherency within the timeslices
>> allocated to suricata.  We also separate the RSS NIC/NUMA nodes from the
>> decode cores entirely and let AMD HyperTransport send the packets to
>> decode threads on other nodes.
>>

Yes - I have observed the same - AMD has a different architecture and
this is the approach to follow (there is a section on AMD in SEPTun
Mark II touching on that).
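
To map out the topology before deciding on placement, something like
this helps (a sketch - interface name borrowed from Cooper's setup):

# which NUMA node the NIC hangs off of (-1 means no locality info)
cat /sys/class/net/enp8s0f0/device/numa_node
# core-to-node layout, to pick the RSS set vs the worker set
numactl --hardware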

>> Config is as follows:
>>
>> System: Follow the SEPTUN guides where possible, selecting for a
>> 'server' config.  Be sure to enable power saving so you can use the
>> 'ondemand' governor; AMD cores run hot, so it's critical to keep the
>> system cool so that individual cores can still be overclocked.
>>
>> Kernel: Again, select for a 'server' config.  I use hardened Gentoo
>> vanilla sources, so at the very least enable VM huge pages, no kernel
>> preemption and a 100 Hz timer.  We want big blocks, buffers and time
>> slices to minimize cache/TLB thrashing.
>>
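
A quick way to double-check those settings on a running box (a sketch
- assumes CONFIG_IKCONFIG_PROC is enabled so /proc/config.gz exists):

# expect CONFIG_HZ=100, CONFIG_PREEMPT_NONE=y and huge pages enabled
zgrep -E 'CONFIG_HZ=|CONFIG_PREEMPT|CONFIG_TRANSPARENT_HUGEPAGE' /proc/config.gz
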
>> NIC (ixgbe driver in my case): Use 'warm boot' script to clear memory
>> and set 4k PCI reads/buffers:
>>

Another thing to be aware of (my 5 cents) is to know your "Bus" :)
There is a good read by Mellanox here (
https://community.mellanox.com/docs/DOC-2496 ) that one can use to get
a better view of the PCIe limitations in scenarios with multiple
heavily loaded apps residing on the same server.
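
E.g. something along these lines to see what the slot actually
negotiated (a sketch - the PCI address 08:00.0 is hypothetical; match
it to your NIC):

# link width/speed plus the payload/read-request sizes in use -
# a x8 card that negotiated x4 will quietly cap throughput
lspci -s 08:00.0 -vv | grep -E 'LnkSta:|MaxPayload|MaxReadReq'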


>>> #!/bin/bash
>>>
>>>
>>> #clear caches/buffers
>>> free && sync && echo 3 > /proc/sys/vm/drop_caches && free
>>>
>>>
>>> echo "Tuning intel card..."
>>> ifconfig enp8s0f0 down && ifconfig enp35s0f0 down
>>> rmmod ixgbe
>>> sleep 1
>>> modprobe ixgbe
>>>
>>> #Set 4k PCI reads
>>> setpci -v -d 8086:10fb e6.b=2e
>>>
>>> #Bring up interfaces with huge page support
>>> ifconfig enp8s0f0 up
>>> ifconfig enp8s0f0 mtu 9000
>>>

Out of curiosity -  do you have jumbo frames in the traffic?
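
(One quick way to check, as a sketch: grab a few oversized frames off
the sniffing interface.)

# anything over 1514 bytes on the wire is bigger than a standard
# Ethernet frame - with GRO/LRO disabled that means jumbo frames
tcpdump -i enp8s0f0 -n -c 10 'greater 1514'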

>>> #Disable offloading except for RX checksums
>>> ethtool -K  enp8s0f0 rx on sg off gro off lro off tso off gso off
>>> #Enable ntuple filters and IRQ coalescing
>>> ethtool -K  enp8s0f0 ntuple on
>>> ethtool -C  enp8s0f0 adaptive-rx on rx-usecs 100
>>> #Enable 4k ring buffer
>>> ethtool -G  enp8s0f0 rx 4096
>>> # 64 cores, use all 8 cores on NUMA node 0
>>> ethtool -L  enp8s0f0 combined 8
>>> # Force symmetric flow hashing
>>> ethtool -X  enp8s0f0 hkey \
>>> 6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a
>>> for proto in tcp4 udp4 tcp6 udp6; do
>>>   /usr/sbin/ethtool -N enp8s0f0 rx-flow-hash $proto sdfn
>>> done
>>> #Set IRQ affinity
>>> /usr/local/sbin/set_irq_affinity.sh 0-7 enp8s0f0
>>>
>>> #Start suri with libtcmalloc
>>> LD_PRELOAD="/usr/lib64/libtcmalloc_minimal.so.4" /usr/bin/suricata \
>>>   -vvv -D -c /etc/suricata/suricata.yaml --af-packet
>>
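
One sanity check worth adding after a script like that (a sketch):
read the RSS hash key and indirection table back, to confirm the
symmetric key and the 8-queue spread actually took.

# prints the RX indirection table and the configured hash key
ethtool -x enp8s0f0
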
>> Suricata:  SEPTUN build is fine.  I found that using a tpacket-v3 block
>> size of 2MB (the same size as the AMD cache and huge pages) works best:
>>

This is indeed very important and a good find for your setup - making
sure the "end to end buffer flow" is all tuned.
There is really no way to calculate that other than trying a few
different configs and seeing what works best.

>>>
>>> af-packet:
>>>   - interface: enp8s0f0
>>>     threads: 24
>>>     cluster-id: 98
>>>     cluster-type: cluster_flow
>>>     defrag: yes
>>>     use-mmap: yes
>>>     mmap-locked: no
>>>     tpacket-v3: yes
>>>     ring-size: 500000
>>>     rollover: no
>>>     block-size: 2097152
>>>     block-timeout: 10
>>>     use-emergency-flush: yes
>>>     checksum-checks: no
>>
>> I enable "midstream: true" in the stream config and delayed detection,
>> so any active 'elephant' flows are properly tracked and bypassed when
>> suri starts up, and hopefully no packets are dropped on startup.
>>
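
Roughly the relevant suricata.yaml pieces for that, as a sketch (the
depth value here is made up, and option placement shifts a bit between
versions):

stream:
  midstream: true      # pick up flows already in flight at startup
  bypass: yes          # stop inspecting a flow once reassembly depth is hit
  reassembly:
    depth: 1mb         # hypothetical - controls when bypass kicks in

detect:
  delayed-detect: yes  # open the capture before signatures finish loading
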
>> For a 64-core system, set detect/management threads on the remaining 48
>> cores, 24 threads per NIC.  We have lots of cores, so it's not a big
>> deal to dedicate 16 to RSS+cluster_flow.  Plus, big 'elephant' flows
>> will be less likely to crush a single core.  With the server build, you
>> can also run other software, like elasticsearch, on the same system
>> without impacting performance significantly.  Load average is currently
>> under 10 (out of 64) at peak utilization.  You can also 'nice'
>> background tasks like GNU parallel and emerge so that they run mostly
>> when the suricata cores are idle.
>>
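
In suricata.yaml terms that split looks roughly like this (a sketch -
the core ranges are illustrative, assuming RSS owns cores 0-15):

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ "8-15" ]     # hypothetical placement
    - worker-cpu-set:
        cpu: [ "16-63" ]    # the remaining 48 cores for the workers
        mode: "exclusive"
        prio:
          default: "high"
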
>> Re: the 'holy grail' of zero packet drops.  Despite playing with
>> max-pending-packets, default-packet-size and ring-size, certain flows
>> seem to break the flow bypass feature and fill the AF_PACKET buffer
>> (even when it is set to large values like 500k).  This appears to
>> happen in two scenarios:
>>
>> 1:  Non-TCP 'elephants', like Google's QUIC protocol (UDP port 443).
>> 2:  TCP elephants, possibly using jumbo frames, that peg the decode
>> core, create a race condition and overwhelm the stream tracker.  For
>> example, I've observed TCP flows peg a core at 100% for an extended
>> period of time (much longer than it should take for the flow to be
>> bypassed) and then eventually appear to trigger an emergency flush.
>> What I think is happening is that packets are coming in so fast that
>> the decoding pipeline gets backed up and can't track the full stream
>> depth to allow a proper bypass before the ring buffer fills up.
>>

Here (above) is a scenario where XDP bypass would shine :)
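
To confirm it is the bypass giving up rather than plain overload, one
option (a sketch - stat names shift a bit between Suricata versions)
is to watch the kernel drop and bypass counters side by side while an
elephant flow is running:

# capture.kernel_drops should stay flat while bypass counters climb
watch -n 5 "tail -n 300 /var/log/suricata/stats.log | grep -E 'kernel_drops|bypass'"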

I think you have done it already - but for the purpose of iteration -
I would (in general) suggest first looking at the rx_missed and
no_buffer (for Intel) counters on the NIC, and playing around with the
number of RSS queues - 2/4/6/8/10 etc. - to get those counters (in a
perfect world) to 0.  Then look at what you can optimize further on
Suri's side to get those down (it depends, in the case of af_packet,
on which cluster-type you use - cluster_flow/cpu/qm etc. - and that
gives you more variations to try :) )
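
E.g. something like this while varying the RSS queue count (a sketch -
exact counter names differ per driver; the ones in the comment are
ixgbe-ish):

# rx_missed_errors / rx_no_dma_resources climbing means the host is
# not draining the NIC queues fast enough
watch -n 1 "ethtool -S enp8s0f0 | grep -Ei 'miss|no_buff|no_dma'"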

>> As always, questions/comments/feedback welcome.  I'm trying to get the
>> new XDP stuff working but so far have not been able to on hardened Gentoo.
>>

Keep up the good work!
Thanks for sharing with the community! :)

>> -Coop
>>
>> On 3/13/2018 12:09 AM, Peter Manev wrote:
>>> Feedback as promised... from what I have seen/tested.
>>> Dedicating Suricata workers across different NUMA nodes with modern AMD
>>> CPUs (the EPYC 7601 for example) seems to work quite OK - very good in
>>> fact - which differs from the way you would "traditionally" deploy
>>> with Intel CPUs (NIC and workers on the same NUMA node).
>>>
>>> In both cases a good NIC is an essential piece as well :)
>>




-- 
Regards,
Peter Manev


