[Oisf-users] Suricata pegs a detect thread and drops packets

Fri Jun 20 03:43:51 UTC 2014

On Thu, Jun 19, 2014 at 4:01 PM, Victor Julien <lists at inliniac.net> wrote:
> On 06/18/2014 04:54 PM, David Vasil wrote:
>> I have been trying to track down an issue I am having with Suricata
>> dropping packets (seems to be a theme on this list), requiring a restart
>> of the daemon to clear the condition.  My environment is not large
>> (average 40-80Mbps traffic, mostly user/http traffic) and I have Suricata
>> 2.0.1 running on a base installation of Security Onion 12.04.4 on a Dell
>> R610 (12GB RAM, Dual Intel X5570, Broadcom BCM5709 sniffing interface).
>>
>> About once a day, Zabbix shows that I am starting to see a large number
>> of capture.kernel_drops and some corresponding tcp.reassembly_gap.
>>  Looking at htop, I can see that one of the Detect threads (Detect1 in
>> this screenshot) is pegged at 100% utilization.  If I use 'perf top' to
>> look at the perf events on the system, I see libhtp consuming a large
>> number of the cycles (attached).  Restarting suricata using
>> 'nsm_sensor_stop --only-snort-alert' results in child threads exiting,
>> but the main suricata process itself never stops (requiring a kill -9).
>>  Starting suricata again with 'nsm_sensor_start --only-snort-alert'
>> starts up Suricata and shows that we are able to inspect traffic with no
>> drops.
>>
>> In the attached screenshots, I am only inspecting ~2k packets/sec
>> ~16Mbit/s when Suricata started dropping packets.  As I write this,
>> Suricata is processing ~7k packets/sec and ~40Mbit/s with no drops.  I
>> could not see anything that I can directly correlate to the drops and
>> the various tuning steps I have taken have not helped alleviate the
>> issue, so I was hoping to leverage the community's wisdom.
>>
>> Some observations I had:
>>
>> - Bro (running on the same system, on the same interface) drops 0%
>> packets without issue all day
>> - When I start seeing capture.kernel_drops, I also begin seeing an
>> uptick in flow_mgr.new_pruned and tcp.reassembly_gap, changing the
>> associated memcaps of each has not seemed to help
>> - tcp.reassembly_memuse jumps to a peak of around 2.66G even though my
>> reassembly memcap is set to 2gb
>> - http.memcap is set to 256mb in my config and logfile, but the
>> stats.log show http.memcap = 0 (bug?)
>
> When this happens, do you see a peak in syn/synack and flow manager
> pruned stats each time?
>
> The current flow timeout code has a weakness. When it injects fake
> packets into the engine to do some final processing, it currently only
> injects into Detect1. You might be seeing this here.
>

Considering that the load seems to be inside htp_list_array_get(), I'm
wondering if it's because the transaction list in htp has grown too
large, and the get() call now has to iterate through all the
transactions in the list.

I did a quick check through the htp list code and I see that the
transaction destruction process assigns a NULL but doesn't compact the
list/list_entry (I might not have searched thoroughly, but this seems
to be the case from my quick check).  If this is true, we have a huge
list full of dummy NULL entries, and a get() retrieval is going to
walk (inside htp_list_array_get()) through this huge, mostly NULL list
to retrieve the transaction it wants.
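
To make the suspected cost concrete, here is a toy model of the
behaviour described above. It is only a sketch of the hypothesis, not
libhtp's actual code: the class SparseTxList, its methods and
simulate() are all made up for illustration. It assumes that destroy
leaves a NULL placeholder behind and that get() walks from the head of
the list.

```python
class SparseTxList:
    """Hypothetical transaction list: destroy() leaves a None slot and
    nothing ever compacts the list (assumption, not libhtp's real code)."""

    def __init__(self):
        self._slots = []
        self.steps = 0  # total number of slots visited by get()

    def push(self, tx):
        """Append a transaction and return its index."""
        self._slots.append(tx)
        return len(self._slots) - 1

    def destroy(self, idx):
        """Mark the slot as dead but keep it in the list."""
        self._slots[idx] = None

    def get(self, idx):
        """Walk from the head until idx is reached (assumed linear walk)."""
        for i, slot in enumerate(self._slots):
            self.steps += 1
            if i == idx:
                return slot
        return None


def simulate(n_tx):
    """Create, look up and destroy n_tx transactions one after another,
    returning how many slots get() had to visit in total."""
    txs = SparseTxList()
    for n in range(n_tx):
        idx = txs.push(f"tx{n}")
        txs.get(idx)      # has to step over every dead slot before it
        txs.destroy(idx)  # slot becomes None, list keeps growing
    return txs.steps


if __name__ == "__main__":
    for n in (10, 1000, 10000):
        print(n, simulate(n))
```

Under these assumptions the total work is n(n+1)/2 slot visits for n
transactions, so a long-lived connection carrying many transactions
would get steadily slower per lookup, which would fit one detect
thread slowly climbing to 100%.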

-- 
-------------------------------
Anoop Saldanha
http://www.poona.me
-------------------------------