[Oisf-users] Testers: please test our initial Hyperscan support
Cooper F. Nelson
cnelson at ucsd.edu
Tue Apr 5 19:05:28 UTC 2016
On 4/3/2016 7:33 PM, Viiret, Justin wrote:
> Hi Cooper,
>
> Thanks for all the details. Comments are inline below:
>
>> My intuition of why I'm seeing similar performance using the
>> Hyperscan pattern matcher vs. the 'ac' matcher is because the SIMD
>> pipeline is shared among hyper-threaded cores.
>
> I had a chat to a colleague with deeper architectural knowledge than
> mine, and he said the following (between dashes):
>
> ---- This statement about how the SIMD pipeline is shared among
> hyper-threaded cores is not correct for modern Intel Architecture
> post Core 2 – or at least, there is no difference between integer,
> floating point and SIMD operations in this regard. There is a lot of
> detail in Chapter 2 of the Intel 64 and IA-32 Architectures
> Optimization Reference Manual:
Thanks for the feedback/link; I see my error now. As your reference
mentions, I think this may have been the case on Intel architectures
prior to the i7.
> However, you may be correct in essence: a matcher that spends a lot
> of time waiting for cache misses (or other higher-latency operations)
> may get more benefit from HT than one that uses the execution
> resources (whether they are integer, FP or SIMD) intensively, as
> their operations can be more effectively interleaved.
This is why I often say Suricata's 'workers' runmode with RSS is the
poster child for hyperthreading. Under real-world workloads you get
roughly 2x the performance, because all the I/O and cache misses
involved in processing live packets leave plenty of stall cycles for
the second hardware thread to fill.
> The profile is very interesting -- can you share a similar profile
> from a version running with the AC matcher? I'm also curious about
> how you are measuring performance; is this an inline IPS deployment,
> or something different? Have you measured a "no MPM rules" baseline
> as well?
Well, that's the thing. It's hard to measure real performance on modern
super-scalar architectures, as performance depends on a lot of
variables: I/O, cache behavior, pipelines, out-of-order execution,
power management, hyper-threading, etc.
This is an IDS deployment, so I basically look at two things: I watch a
'top' window and try to make sure the 5-minute load average stays under
16 at peak (16 HT cores), and when I restart the Suricata process I
check the logs to make sure packet loss is under 1%.
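If you want to automate that second check, something like the rough
sketch below works; it pulls the most recent capture.kernel_packets and
capture.kernel_drops counters out of Suricata's stats.log and prints the
drop percentage. It assumes the usual "counter | TM name | value" layout
and aggregated counters; per-thread counters would need summing per dump
interval, so treat it as a starting point rather than anything polished.

/* Rough, untested sketch: report Suricata capture drop percentage from
 * stats.log. Assumes lines like "capture.kernel_drops | Total | 1234";
 * keeps the last (i.e. most recent) value seen for each counter. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "/var/log/suricata/stats.log";
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    char line[512];
    unsigned long long packets = 0, drops = 0;
    while (fgets(line, sizeof(line), f)) {
        unsigned long long val;
        char *bar = strrchr(line, '|');      /* value is the last column */
        if (!bar || sscanf(bar + 1, "%llu", &val) != 1)
            continue;
        if (strncmp(line, "capture.kernel_packets", 22) == 0)
            packets = val;
        else if (strncmp(line, "capture.kernel_drops", 20) == 0)
            drops = val;
    }
    fclose(f);

    if (packets == 0) { fprintf(stderr, "no capture counters found\n"); return 1; }
    printf("packets=%llu drops=%llu loss=%.3f%%\n",
           packets, drops, 100.0 * (double)drops / (double)packets);
    return 0;
}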
I ran the 'ac' algo last night during a period of lighter load, and the
profile shows that it uses much more CPU time.
>
> PerfTop: 61719 irqs/sec kernel:32.6% exact: 93.0% [4000Hz cycles:pp], (all, 16 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 49.02% suricata [.] SCACSearch
> 10.20% [kernel] [k] acpi_processor_ffh_cstate_enter
> 9.77% [kernel] [k] __bpf_prog_run
> 2.93% suricata [.] BoyerMoore
> 1.53% suricata [.] IPOnlyMatchPacket
> 1.51% suricata [.] SigMatchSignatures
> 1.32% [kernel] [k] __memcpy
> 1.27% suricata [.] StreamTcp
Even under that lighter load, the system load was still around 14, vs.
15-16 at peak using the hs algo.
One thing I have noticed is that the 'hs' matcher doesn't seem to result
in '0.0' idle times, which is a big win for Suricata; packets start
dropping once any core/thread hits 100% utilization. And as mentioned,
it uses less memory as well.
> The time in the fdr_exec and nfaExec functions constitute the
> majority of the time in Hyperscan in this profile, so they add up to
> ~ 15% of runtime -- this looks like a lighter workload than the Web
> traffic traces we tested with here, but there are a lot of variables
> that could affect that (different rule set, the BPF filter, overhead
> of live network scanning vs. our PCAP trace scanning for testing,
> etc).
We are running a tweaked config for our environment/hardware, especially
for web traffic.
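For anyone not familiar with what those fdr_exec/nfaExec entries are:
they're Hyperscan's internal literal-matcher and NFA engines, and
Suricata drives them through the library's block-mode scan API. The
sketch below is just a minimal illustration of that compile/scan cycle,
conceptually what the mpm layer does per inspected buffer; the patterns,
ids and flags are made up for the example, not anything Suricata
actually compiles.

/* Minimal block-mode Hyperscan example (illustrative only). */
#include <stdio.h>
#include <string.h>
#include <hs/hs.h>

/* Hyperscan reports the pattern id and the end offset of each match. */
static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
    printf("pattern %u matched, ending at offset %llu\n", id, to);
    return 0;                         /* non-zero would stop the scan */
}

int main(void) {
    const char *const patterns[] = { "uid=0", "cmd.exe" };  /* made-up literals */
    const unsigned int flags[] = { HS_FLAG_CASELESS, HS_FLAG_CASELESS };
    const unsigned int ids[] = { 1, 2 };

    hs_database_t *db = NULL;
    hs_compile_error_t *err = NULL;
    if (hs_compile_multi(patterns, flags, ids, 2, HS_MODE_BLOCK,
                         NULL, &db, &err) != HS_SUCCESS) {
        fprintf(stderr, "compile failed: %s\n", err->message);
        hs_free_compile_error(err);
        return 1;
    }

    /* Scratch space is per-thread in a real multi-threaded deployment. */
    hs_scratch_t *scratch = NULL;
    if (hs_alloc_scratch(db, &scratch) != HS_SUCCESS) {
        hs_free_database(db);
        return 1;
    }

    const char data[] = "GET /cmd.exe HTTP/1.0";
    hs_scan(db, data, (unsigned int)strlen(data), 0, scratch, on_match, NULL);

    hs_free_scratch(scratch);
    hs_free_database(db);
    return 0;
}

Builds with something like 'cc example.c $(pkg-config --cflags --libs
libhs)'. The per-thread scratch allocation is the part that matters for
a workers-runmode setup, since every detect thread needs its own.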
> One concrete suggestion: you may see some improvement from using
> Hyperscan 4.1, which has some improvements to the literal matching
> path. It's available here:
>
> https://github.com/01org/hyperscan/releases/tag/v4.1.0
The 'make install' failed, but it looks like the libraries built, so I
just copied them over manually. Performance is a little better; perf top
output is copied below (a quick way to double-check which library
actually gets loaded is sketched after the profile). You can infer the
relative amount of IP traffic by looking at the percentage of CPU time
spent running the BPF filter.
>
> PerfTop: 63306 irqs/sec kernel:39.9% exact: 90.2% [4000Hz cycles:pp], (all, 16 CPUs)
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 12.63% [kernel] [k] __bpf_prog_run
> 12.32% [kernel] [k] acpi_processor_ffh_cstate_enter
> 7.53% libhs.so.4.1.0 [.] fdr_exec_x86_64_s1_w128
> 4.20% libhs.so.4.1.0 [.] nfaExecMcClellan16_B
> 3.04% libc-2.22.so [.] __memset_sse2
> 2.97% suricata [.] BoyerMoore
> 2.75% suricata [.] IPOnlyMatchPacket
> 2.39% suricata [.] SigMatchSignatures
> 2.05% suricata [.] StreamTcp
> 1.85% [kernel] [k] __memcpy
> 1.81% libhs.so.4.1.0 [.] fdr_exec_x86_64_s2_w128
> 1.43% gzip [.] longest_match
> 1.35% [ixgbe] [k] ixgbe_configure
> 1.26% libc-2.22.so [.] vfprintf
> 1.10% suricata [.] FlowManager
> 1.07% [kernel] [k] tpacket_rcv
> 1.06% libpthread-2.22.so [.] pthread_mutex_lock
> 1.02% [kernel] [k] __memset
> 0.88% suricata [.] AFPReadFromRing
> 0.71% suricata [.] FlowGetFlowFromHash
> 0.68% [kernel] [k] __netif_receive_skb_core
> 0.68% suricata [.] StreamTcpPacket
> 0.66% libhs.so.4.1.0 [.] roseBlockExec_i
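Since the 'make install' failed and I copied the libraries over by hand,
a quick sanity check (just a suggestion, not something from this run) is
to confirm which Hyperscan build actually gets loaded by calling
hs_version():

#include <stdio.h>
#include <hs/hs.h>

int main(void) {
    /* hs_version() returns the library's version/build string,
     * which should start with "4.1.0" here. */
    printf("linked Hyperscan: %s\n", hs_version());
    return 0;
}

If that prints 4.1.0 you know the hand-copied libraries are the ones in
use; running 'ldd' on the suricata binary will tell you the same thing.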
--
Cooper Nelson
Network Security Analyst
UCSD ITS Security Team
cnelson at ucsd.edu x41042