[Oisf-users] Question about cpu-affinity

Peter Manev petermanev at gmail.com
Tue Mar 13 07:09:15 UTC 2018


On Wed, Mar 7, 2018 at 4:03 AM, Michał Purzyński
<michalpurzynski1 at gmail.com> wrote:
> Minimizing cache misses was the biggest message of SEPTun indeed. Glad people can hear the message, especially when companies like AMD focus 100% on execution units and internal bandwidth, while completely missing the point here - those resources are unused most of the time.
>
> When we started I had IPC 0.4 - 0.7.
>
> Through the process of understanding what’s going on and measuring cache misses we got the IPC close to 3, compared to the ideal of 4 (Xeons are 4-wide). Not bad for a workload that has to experience cache misses by design (pats self and pevma on the back).
>
> Interesting side effect - libpcap will experience two cache misses per packet: one when it calculates the timestamp from the headers, and then another for the packet data itself.
>
> We recommend separating RSS from worker threads indeed. To each their own - sometimes combining them will work well - but I’d still separate them, to minimize context switches and TLB thrashing.
>
> BTW, Linux with the Meltdown patches and finally using PCID on Haswell might have an interesting effect here ;)
>
> Grab a few cores, make them do RSS with DDIO, dedicate everything else to workers, have fun :)
>
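
To make the "grab a few cores for RSS, dedicate everything else to workers" advice concrete, here is a minimal sketch of the threading section of suricata.yaml. The core numbers are placeholders - adjust them to your own CPU/NUMA layout, and steer the NIC's IRQs/RSS queues to the cores you keep out of the worker set (e.g. with the NIC driver's set_irq_affinity script):

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]
    - receive-cpu-set:
        cpu: [ 0 ]
    - worker-cpu-set:
        # keep the cores servicing NIC IRQs/RSS out of this set
        cpu: [ "2-15" ]
        mode: "exclusive"
        prio:
          default: "high"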

Feedback as promised, from what I have seen/tested:
Dedicating Suricata workers across different NUMA nodes with modern AMD
CPUs (EPYC 7601, for example) seems to work quite well - very well, in
fact - which differs from the way you would "traditionally" deploy with
Intel CPUs (NIC and workers on the same NUMA node).

In both cases a good NIC is an essential piece as well :)
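
On the EPYC box that simply means letting the worker-cpu-set span cores from more than one NUMA node, for example (the core ranges below are purely illustrative - check the actual layout with lscpu or "numactl --hardware"):

    - worker-cpu-set:
        # cores taken from two different NUMA nodes (illustrative numbering)
        cpu: [ "8-31", "40-63" ]
        mode: "exclusive"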

>> On Mar 6, 2018, at 3:01 PM, Cooper F. Nelson <cnelson at ucsd.edu> wrote:
>>
>> "All programming is an exercise in caching."
>>     -Terje Mathisen
>>
>> Regarding this deployment, since I was on old Intel hardware that is not
>> very IO-friendly either, I just copied that build to the new Piledriver
>> system, switched from cluster_cpu to cluster_flow, and separated the
>> detect threads from the RSS queues.  No need for the offloading features
>> this time (which, TBH, do impact detection for some sigs) with Hyperscan,
>> AVX and 56 detect threads!  The system is at around 12% load at peak, even
>> with the on-demand CPU frequency governor.
>>
>> I agree that the new Intel I/O innovations like DDIO are at this point
>> pretty much mandatory for 10 Gb HPC IDS deployments.  I'm already
>> looking at doing a 40 Gb build using a modern Intel system and the new
>> 40G NICs, which officially support symmetric hashing.
>>
>> -Coop
>>
>>> On 3/4/2018 11:31 PM, Michał Purzyński wrote:
>>> The SepTun Mark II we're about to publish should actually behave better on
>>> non-IO friendly architectures, like AMD.
>>>
>>> Speaking personally, this is my private opinion:
>>>
>>> I don't see any deeper thought process about IO optimization on the AMD
>>> side, other than increasing the throughput of every interconnect. That's
>>> nice, but those aren't even close to being saturated, as we're wasting
>>> cycles waiting for cache misses :/
>>>
>>> Intel approached this problem in a much more systematic way.
>>>
>>
>> --
>> Cooper Nelson
>> Network Security Analyst
>> UCSD ITS Security Team
>> cnelson at ucsd.edu x41042
>>
>>
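
For anyone reproducing Cooper's cluster_cpu -> cluster_flow switch, that setting lives in the af-packet section of suricata.yaml. A minimal sketch - interface name, thread count, cluster-id and ring-size are placeholders for your own setup:

af-packet:
  - interface: eth2
    threads: 16
    cluster-id: 99
    # cluster_flow load-balances packets per flow across the threads;
    # cluster_cpu keeps packets on the CPU that received them
    cluster-type: cluster_flow
    defrag: yes
    use-mmap: yes
    ring-size: 100000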



-- 
Regards,
Peter Manev


