[Oisf-users] Testers: please test our initial Hyperscan support

Michał Purzyński michalpurzynski1 at gmail.com
Wed Apr 6 07:59:09 UTC 2016


Are you saying you use as many workers as there are hyper-threads, rather than physical cores (minus some reserved for other Suricata threads)?  In other words, each Suricata worker thread gets its own hyper-thread?
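
For reference, this is roughly the threading layout I'm asking about -- a sketch only, the CPU numbers are just an example and will differ per box:

   threading:
     set-cpu-affinity: yes
     cpu-affinity:
       - management-cpu-set:
           cpu: [ 0 ]           # a logical CPU or two reserved for management/flow threads
       - worker-cpu-set:
           cpu: [ "1-15" ]      # every remaining hyper-thread, not just one per physical core
           mode: "exclusive"

i.e. the worker-cpu-set spanning all logical CPUs, not one per physical core.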

> On 05 Apr 2016, at 21:05, Cooper F. Nelson <cnelson at ucsd.edu> wrote:
> 
>> On 4/3/2016 7:33 PM, Viiret, Justin wrote:
>> Hi Cooper,
>> 
>> Thanks for all the details. Comments are inline below:
>> 
>>> My intuition for why I'm seeing similar performance with the
>>> Hyperscan pattern matcher vs. the 'ac' matcher is that the SIMD
>>> pipeline is shared among hyper-threaded cores.
>> 
>> I had a chat to a colleague with deeper architectural knowledge than
>> mine, and he said the following (between dashes):
>> 
>> ---- This statement about how the SIMD pipeline is shared among
>> hyper-threaded cores is not correct for modern Intel Architecture
>> post Core 2 – or at least, there is no difference between integer,
>> floating point and SIMD operations in this regard. There is a lot of
>> detail in Chapter 2 of the Intel 64 and IA-32 Architectures
>> Optimization Reference Manual:
> 
> Thanks for the feedback/link, I see my error now.  As your reference
> mentioned, I think this may have been the case with Intel architectures
> prior to the i7.
> 
>> However, you may be correct in essence: a matcher that spends a lot
>> of time waiting for cache misses (or other higher-latency operations)
>> may get more benefit from HT than one that uses the execution
>> resources (whether they are integer, FP or SIMD) intensively, as
>> their operations can be more effectively interleaved.
> 
> This is why I often say the RSS implementation of Suricata's 'workers'
> runmode is the poster child for hyperthreading.  Under real-world
> workloads you get roughly 2x the performance, because of all the I/O
> and cache misses involved in processing live packets.
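> 
> If you want to check which logical CPUs are hyper-thread siblings on
> the same physical core, something like this works (the '0,8' output
> here is only illustrative, the pairing depends on the box):
> 
>   $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
>   0,8
>   $ lscpu --extended=CPU,CORE,SOCKET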
> 
>> The profile is very interesting -- can you share a similar profile
>> from a version running with the AC matcher? I'm also curious about
>> how you are measuring performance; is this an inline IPS deployment,
>> or something different? Have you measured a "no MPM rules" baseline
>> as well?
> 
> Well, that's the thing.  It's hard to measure real performance on modern
> super-scalar architectures, as performance depends on a lot of
> variables: I/O, cache lines, pipelines, out-of-order execution,
> power management, hyper-threading, etc.
> 
> This is an IDS deployment, so I basically look at two things.  I watch a
> 'top' window and try to keep the 5-minute load average under 16 at peak
> (16 HT cores), and when I restart the suricata process I check the logs
> to make sure packet loss is under 1%.
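> 
> Concretely, for the packet loss number I just compare the kernel packet
> and drop counters in stats.log after a restart, along the lines of
> (default log path, adjust for your install):
> 
>   $ grep -E 'capture\.kernel_(packets|drops)' /var/log/suricata/stats.log | tail -2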
> 
> I ran the 'ac' algorithm last night during a period of lighter load; the
> profile below shows it using much more CPU time.
> 
>> 
>>   PerfTop:   61719 irqs/sec  kernel:32.6%  exact: 93.0% [4000Hz cycles:pp],  (all, 16 CPUs)
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>>    49.02%  suricata            [.] SCACSearch
>>    10.20%  [kernel]            [k] acpi_processor_ffh_cstate_enter
>>     9.77%  [kernel]            [k] __bpf_prog_run
>>     2.93%  suricata            [.] BoyerMoore
>>     1.53%  suricata            [.] IPOnlyMatchPacket
>>     1.51%  suricata            [.] SigMatchSignatures
>>     1.32%  [kernel]            [k] __memcpy
>>     1.27%  suricata            [.] StreamTcp
> 
> The system load was still around 14, vs 15-16 at peak using the hs algo.
> 
> One thing I have noticed is that the 'hs' matcher doesn't seem to drive
> any core down to '0.0' idle time, which is a big win for suricata: you
> start dropping packets once any core/thread hits 100% utilization.  And
> as mentioned, it uses less memory as well.
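> 
> For anyone who wants to try the comparison themselves: assuming your
> build was configured with Hyperscan support, the switch is just the
> mpm-algo setting in suricata.yaml, e.g.
> 
>   mpm-algo: hs    # 'ac' is what I'm comparing against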
> 
>> The time in the fdr_exec and nfaExec functions constitutes the
>> majority of the Hyperscan time in this profile, adding up to ~15% of
>> runtime -- this looks like a lighter workload than the web traffic
>> traces we tested with here, but there are a lot of variables that
>> could affect that (different rule set, the BPF filter, overhead of
>> live network scanning vs. our PCAP trace scanning for testing, etc).
> 
> We are running a tweaked config for our environment/hardware, especially
> for web traffic.
> 
>> One concrete suggestion: you may see some improvement from using
>> Hyperscan 4.1, which has some improvements to the literal matching
>> path. It's available here:
>> 
>> https://github.com/01org/hyperscan/releases/tag/v4.1.0
> 
> The 'make install' failed, but it looks like the libraries built, so I
> just copied them over manually.  Performance is a little better; perf
> top output is copied below.  You can infer the relative amount of IP
> traffic by looking at the percentage of CPU time spent running the BPF
> filter.
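> 
> From memory, the build went roughly like this (paths are just where I
> happened to put things, and the last step is only because 'make
> install' failed for me):
> 
>   $ cd hyperscan-4.1.0 && mkdir build && cd build
>   $ cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=on ..
>   $ make -j16
>   $ sudo cp lib/libhs* /usr/local/lib/ && sudo ldconfig
> 
> Suricata itself was already pointed at Hyperscan via the
> --with-libhs-includes and --with-libhs-libraries configure options.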
> 
>> 
>>   PerfTop:   63306 irqs/sec  kernel:39.9%  exact: 90.2% [4000Hz cycles:pp],  (all, 16 CPUs)
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>>    12.63%  [kernel]            [k] __bpf_prog_run
>>    12.32%  [kernel]            [k] acpi_processor_ffh_cstate_enter
>>     7.53%  libhs.so.4.1.0      [.] fdr_exec_x86_64_s1_w128
>>     4.20%  libhs.so.4.1.0      [.] nfaExecMcClellan16_B
>>     3.04%  libc-2.22.so        [.] __memset_sse2
>>     2.97%  suricata            [.] BoyerMoore
>>     2.75%  suricata            [.] IPOnlyMatchPacket
>>     2.39%  suricata            [.] SigMatchSignatures
>>     2.05%  suricata            [.] StreamTcp
>>     1.85%  [kernel]            [k] __memcpy
>>     1.81%  libhs.so.4.1.0      [.] fdr_exec_x86_64_s2_w128
>>     1.43%  gzip                [.] longest_match
>>     1.35%  [ixgbe]             [k] ixgbe_configure
>>     1.26%  libc-2.22.so        [.] vfprintf
>>     1.10%  suricata            [.] FlowManager
>>     1.07%  [kernel]            [k] tpacket_rcv
>>     1.06%  libpthread-2.22.so  [.] pthread_mutex_lock
>>     1.02%  [kernel]            [k] __memset
>>     0.88%  suricata            [.] AFPReadFromRing
>>     0.71%  suricata            [.] FlowGetFlowFromHash
>>     0.68%  [kernel]            [k] __netif_receive_skb_core
>>     0.68%  suricata            [.] StreamTcpPacket
>>     0.66%  libhs.so.4.1.0      [.] roseBlockExec_i
> 
> 
> -- 
> Cooper Nelson
> Network Security Analyst
> UCSD ITS Security Team
> cnelson at ucsd.edu x41042
> 


