[Oisf-users] [EXT] Re: Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

Cloherty, Sean E scloherty at mitre.org
Tue Feb 19 00:12:09 UTC 2019


When I compiled 4.0.6 on the previously-4.1.2 host, I used the same arguments, including Rust.  I think all of the Rust parsers are disabled, but SMB is enabled.  In my case I’ve seen no packet loss in three days despite compiling with Rust.

From: Eric Urban <eurban at umn.edu>
Sent: Monday, February 18, 2019 5:35 PM
To: Peter Manev <petermanev at gmail.com>
Cc: Open Information Security Foundation <oisf-users at lists.openinfosecfoundation.org>; Cloherty, Sean E <scloherty at mitre.org>
Subject: Re: [EXT] Re: [Oisf-users] Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

First, in case it was lost in the details, I want to call out something from my first email: when compiling Suricata 4.1.2 without Rust, we have little to no packet loss, so it looks to be the same behavior as 4.0.6 compiled without Rust.  I wanted to highlight this point, as I probably should have used a subject line to that effect instead of writing that the upgrade is the main difference where we saw this increase in packet loss.  The upgrade to 4.1.2 likely just triggered the issue, since that is when Rust became enabled by default.  Do note, though, that in 4.1.2 if I compile with Rust and explicitly disable the 6 new parsers (not disabling SMB), we still have significant packet loss.  So it seems there is more to disabling Rust than just disabling the new parsers in the config?
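For reference, a sketch of what disabling the new 4.1 parsers under app-layer.protocols in suricata.yaml might look like. The parser names here are assumed from the 4.1 release notes, not taken from our actual config, so treat this as illustrative only (SMB left enabled, as described above):

```yaml
app-layer:
  protocols:
    # Rust-based parsers new in 4.1, explicitly disabled
    nfs:
      enabled: no
    ntp:
      enabled: no
    tftp:
      enabled: no
    krb5:
      enabled: no
    ikev2:
      enabled: no
    dhcp:
      enabled: no
    # SMB (Rust-based in 4.1) left enabled
    smb:
      enabled: yes
```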

Now on to responding to your most recent email:

This is very low (the lowest I have seen) for the "tcp.pkt_on_wrong_thread" counter, especially for a big run like the shared stats, over 20 days.
Do you mind sharing a bit more info on your NIC (Myricom, I think, if I am not mistaken) - driver/version/any specific setup?  We are trying to keep a record of that here -
https://redmine.openinfosecfoundation.org/issues/2725#note-13

We are using Myricom cards with SNFv3 (Sniffer 3) drivers.  The driver version on the system the stats log was taken from is 3.0.15.50857.  We were previously running 3.0.14.50843 and experienced a similar volume of drops, so we upgraded to this newer patch version to rule out issues with the driver.  Would you like me to add these details directly to the Redmine issue?  Are there any other specific details that would be helpful?


Observations:
With 4.1.2 these counters seem odd -
capture.kernel_packets                     | Total | 16345348068
capture.kernel_drops                       | Total | 33492892572
i.e. you have more kernel_drops than kernel_packets, which seems odd and makes me think it may be a "counter" bug of some sort.  Are the NIC driver versions the same on both boxes / same NIC config etc.?

When I look at the delta counters for packets and drops, there are many times these appear to overflow, as they result in large negative values.  These negative delta values are being added to the running counters, which makes it look like we have more drops than packets.  We see this more often with packets than with drops, which makes sense for an overflow issue since we often have much higher packet counts than drops.  One example I found: stats.capture.kernel_packets_delta was -4293864433.  The stats.capture.kernel_packets count prior to this was 20713149739, then went down to 16419285306 the next time stats were generated.  I had noticed this behavior prior to 4.1.2, and am quite sure we saw it in 3.2.2 and possibly earlier.  I typically just filter out these values when looking through the delta data.  I can file a bug for this if you'd like, as we have plenty of examples in our logs where this occurs with both packet and drop counters.
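The negative delta above is consistent with a 32-bit wraparound: -4293864433 is within about a million of -2^32. A small sketch using the two packet counts quoted above (the corrected_delta helper is my own illustration of how a wrap-aware delta computation might look, not Suricata's actual code):

```python
WRAP = 2 ** 32  # capacity of a 32-bit unsigned counter


def corrected_delta(prev, curr, wrap=WRAP):
    """Delta between two counter samples, compensating for a single
    wraparound of an underlying fixed-width counter."""
    delta = curr - prev
    if delta < 0:
        delta += wrap  # counter wrapped between the two samples
    return delta


prev_packets = 20713149739  # stats.capture.kernel_packets, first sample
curr_packets = 16419285306  # next sample, apparently lower

print(curr_packets - prev_packets)                   # -4293864433, as in the log
print(corrected_delta(prev_packets, curr_packets))   # 1102863
```

So the interval actually saw roughly 1.1 million packets; adding the raw negative delta into a running total is what makes drops appear to exceed packets.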

Also, to answer your question about the NIC drivers/config, even though it may not matter anymore given the info about the negative delta counters: at the time I submitted the stats logs to you, we were running slightly different Myricom driver versions.  The 4.0.6 sensors were running 3.0.14.50843 and the 4.1.2 ones were running 3.0.15.50857.  The reason for the different versions is, as I mentioned above, that we upgraded the driver to rule it out as a potential cause, and only did so on the set running 4.1.2.

Suggestions for the 4.1.2 set up:
Try a run with those disabled (set to false) and see if it makes any difference:
  midstream: true            # allow midstream session pickups
  async-oneside: true        # enable async stream handling

I just made these changes today.  I will let it run for a few days and get back to you with the results.
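For reference, a minimal sketch of the stream section with those two options set as suggested (other stream settings left at whatever the existing config uses; this is not our full config):

```yaml
stream:
  midstream: false           # do not pick up sessions mid-stream
  async-oneside: false       # disable async (one-sided) stream handling
```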

Thank you for your assistance,
Eric


On Thu, Feb 14, 2019 at 3:59 PM Cloherty, Sean E <scloherty at mitre.org<mailto:scloherty at mitre.org>> wrote:
That also seems to be the case with me regarding high counts on tcp.pkt_on_wrong_thread.  I've reverted to 4.0.6 using the same setup and YAML and the stats look much better with no packet loss.  I will forward the data.

Thanks.

-----Original Message-----
From: Peter Manev <petermanev at gmail.com<mailto:petermanev at gmail.com>>
Sent: Wednesday, February 13, 2019 3:52 PM
To: Eric Urban <eurban at umn.edu<mailto:eurban at umn.edu>>
Cc: Cloherty, Sean E <scloherty at mitre.org<mailto:scloherty at mitre.org>>; Open Information Security Foundation <oisf-users at lists.openinfosecfoundation.org<mailto:oisf-users at lists.openinfosecfoundation.org>>
Subject: Re: [EXT] Re: [Oisf-users] Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

On Fri, Feb 8, 2019 at 6:34 PM Eric Urban <eurban at umn.edu<mailto:eurban at umn.edu>> wrote:
>
> Peter, I emailed our config to you directly.  I mentioned in my original email that we did test having Rust enabled in 4.1.2, where I explicitly disabled the Rust parsers and still experienced significant packet loss.  In that case I added the following config under app-layer.protocols but left the rest of the config the same:
>


Thank you for sharing all the requested information.
Please find below my observations and some suggestions.

The good news with 4.1.2:
tcp.pkt_on_wrong_thread                    | Total                     | 100

This is very low (the lowest I have seen) for the "tcp.pkt_on_wrong_thread" counter, especially for a big run like the shared stats, over 20 days.
Do you mind sharing a bit more info on your NIC (Myricom, I think, if I am not mistaken) - driver/version/any specific setup?  We are trying to keep a record of that here -
https://redmine.openinfosecfoundation.org/issues/2725#note-13


Observations:
With 4.1.2 these counters seem odd -
capture.kernel_packets                     | Total | 16345348068
capture.kernel_drops                       | Total | 33492892572
i.e. you have more kernel_drops than kernel_packets, which seems odd and makes me think it may be a "counter" bug of some sort.  Are the NIC driver versions the same on both boxes / same NIC config etc.?


Suggestions for the 4.1.2 set up:
Try a run with those disabled (set to false) and see if it makes any difference:
  midstream: true            # allow midstream session pickups
  async-oneside: true        # enable async stream handling

Thank you