[Oisf-users] [EXT] Re: Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

Wed Feb 20 10:57:25 UTC 2019

On Mon, Feb 18, 2019 at 11:35 PM Eric Urban <eurban at umn.edu> wrote:
>
> First, just in case it was lost in the details, I do want to call out something from my first email.  When compiling Suricata 4.1.2 without Rust, we have little to no packet loss, so it looks to be the same behavior as we have with 4.0.6 compiled without Rust.  I wanted to highlight this point more, as I probably should have used a subject line to that effect instead of writing that the upgrade is the main difference where we saw this increase in packet loss.  The upgrade to 4.1.2 likely just triggered the issue, since that is when Rust is enabled by default.  Do note, though, that in 4.1.2 if I compile with Rust and explicitly disable the 6 new parsers (not disabling SMB though), then we still have significant packet loss.  So it seems there is more to disabling Rust than just disabling the new parsers in the config?

Noted - thank you.
To that point, what is your full compile/install line?

>
> Now on to responding to your most recent email:
>
>> This is very low (the lowest I have seen) for the "tcp.pkt_on_wrong_thread" counter, especially for a long run like the shared stats, which cover more than 20 days.
>> Do you mind sharing a bit more info on your NIC (Myricom, I think, if I am not mistaken): driver, version, and any specific setup? We are trying to keep a record of that here:
>> https://redmine.openinfosecfoundation.org/issues/2725#note-13
>
>
> We are using Myricom cards with SNFv3 (Sniffer 3) drivers.  The driver version on the system where the stats log was taken is 3.0.15.50857.  We were previously running 3.0.14.50843 and experienced a similar volume of drops, so we upgraded to this newer patch version to rule out issues with the driver.  Would you like me to add these details directly to the Redmine issue?  Are there any other specific details that would be helpful?
>

Please feel free to update the issue, and then I can further update the
matrix/table if needed.
I think info like:
- kernel (just uname -a)
- ethtool -i iface
- runmode (example: af-packet cluster_flow / pfring)
- ethtool -x iface

would help a lot! Thank you
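
A minimal sketch for gathering those outputs in one go, assuming Python 3.7+ is available on the sensor. The function names and the interface name "eth0" are placeholders made up for illustration, and what ethtool reports depends on the driver. The runmode comes from your suricata.yaml and has to be added by hand.

```python
import shutil
import subprocess


def diagnostic_commands(iface):
    """Return the commands from the list above for one interface."""
    return [
        ["uname", "-a"],            # kernel version
        ["ethtool", "-i", iface],   # driver name and version
        ["ethtool", "-x", iface],   # RSS hash / indirection table
    ]


def collect(iface):
    """Run each command and return {command string: output or error text}."""
    results = {}
    for cmd in diagnostic_commands(iface):
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            # Tool is not installed; record that instead of failing.
            results[key] = "not available on this host"
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[key] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the capture interface.
    for command, output in collect("eth0").items():
        print("$ " + command)
        print(output.strip())
```

The printed text can be pasted into the Redmine issue as-is.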

>
>> Observations:
>> with 4.1.2 these counters seem odd -
>> capture.kernel_packets                     | Total                     | 16345348068
>> capture.kernel_drops                       | Total                     | 33492892572
>> aka you have more kernel_drops than kernel_packets - seems odd.
>> Which makes me think it may be a "counter" bug of some sort. Are the NIC driver versions the same on both boxes / same NIC config etc.?
>
>
> When I look at the delta counters for packets and drops, there are many times when these appear to possibly overflow, as they result in large negative values.  These negative delta values are added to the running counters, which makes it look like we have more drops than packets.  We see this more often with packets than with drops.  If this is an overflow issue, that seems to make sense, as we often have much higher packet counts than drops.  In one example I found, stats.capture.kernel_packets_delta was -4293864433.  The stats.capture.kernel_packets count prior to this was 20713149739, then went down to 16419285306 the next time stats were generated.  I had noticed this behavior prior to 4.1.2, and am quite sure we saw it in 3.2.2 and possibly earlier than that.  I typically just filter out these values when looking through the delta data.  I can file a bug for this issue if you'd like, as we have plenty of examples in our logs where this occurs with both packet and drop counters.

Yes - could you please file a report including the details - OS/Suri
versions affected / NIC model used / runmode used (afpacket for
example) etc.
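
For the report, the example you quoted looks consistent with a single 32-bit wrap: adding 2**32 to the negative delta gives a small positive value. Below is a minimal Python sketch of that arithmetic. It assumes the underlying counter is 32 bits wide, which is a guess on my side and not confirmed for this driver, and the function name is made up for illustration.

```python
WRAP_32 = 2 ** 32  # assumed counter width; not confirmed for this driver


def unwrap_delta(delta, modulus=WRAP_32):
    """Return the delta corrected for a single counter wrap.

    A negative delta is treated as one wrap of an unsigned counter;
    non-negative deltas are returned unchanged.
    """
    return delta + modulus if delta < 0 else delta


# Values quoted in the thread
before = 20713149739
after = 16419285306
delta = after - before            # -4293864433, as logged
corrected = unwrap_delta(delta)   # 1102863 if a 32-bit counter wrapped once

print(delta, corrected)
```

If the corrected deltas in your logs all come out as plausible per-interval packet counts, that would be worth mentioning in the bug report.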

>
> Also, to answer your question about the NIC drivers/config, even though it may not matter anymore given the info about the negative delta counters: at the time I submitted the stats logs to you, we were running slightly different Myricom driver versions.  The 4.0.6 sensors were running 3.0.14.50843 and the 4.1.2 ones were running 3.0.15.50857.  The reason for the different driver versions is, as I mentioned above, that we upgraded the driver version to rule that out as a potential cause and only did it on our one set running 4.1.2.
>
>> Suggestions for the 4.1.2 set up:
>> Try a run where you disable these (set them to false) and see if there is any difference:
>>   midstream: true            # allow midstream session pickups
>>   async-oneside: true        # enable async stream handling
>
>
> I just made these changes today.  I will let it run for a few days and get back to you with the results.
>

Any difference observed yet?

> Thank you for your assistance,
> Eric

Thank you for the feedback!

>
>
> On Thu, Feb 14, 2019 at 3:59 PM Cloherty, Sean E <scloherty at mitre.org> wrote:
>>
>> That also seems to be the case with me regarding high counts on tcp.pkt_on_wrong_thread.  I've reverted to 4.0.6 using the same setup and YAML and the stats look much better with no packet loss.  I will forward the data.
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Peter Manev <petermanev at gmail.com>
>> Sent: Wednesday, February 13, 2019 3:52 PM
>> To: Eric Urban <eurban at umn.edu>
>> Cc: Cloherty, Sean E <scloherty at mitre.org>; Open Information Security Foundation <oisf-users at lists.openinfosecfoundation.org>
>> Subject: Re: [EXT] Re: [Oisf-users] Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support
>>
>> On Fri, Feb 8, 2019 at 6:34 PM Eric Urban <eurban at umn.edu> wrote:
>> >
>> > Peter, I emailed our config to you directly.  I mentioned in my original email that we did test having Rust enabled in 4.1.2 where I explicitly disabled the Rust parsers and still experienced significant packet loss.  In that case I added the following config under app-layer.protocols but left the rest of the config the same:
>> >
>>
>>
>> Thank you for sharing all the requested information.
>> Please find below my observations and some suggestions.
>>
>> The good news with 4.1.2:
>> tcp.pkt_on_wrong_thread                    | Total                     | 100
>>
>> This is very low (the lowest I have seen) for the "tcp.pkt_on_wrong_thread" counter, especially for a long run like the shared stats, which cover more than 20 days.
>> Do you mind sharing a bit more info on your NIC (Myricom, I think, if I am not mistaken): driver, version, and any specific setup? We are trying to keep a record of that here:
>> https://redmine.openinfosecfoundation.org/issues/2725#note-13
>>
>>
>> Observations:
>> with 4.1.2 these counters seem odd -
>> capture.kernel_packets                     | Total                     | 16345348068
>> capture.kernel_drops                       | Total                     | 33492892572
>> aka you have more kernel_drops than kernel_packets - seems odd.
>> Which makes me think it may be a "counter" bug of some sort. Are the NIC driver versions the same on both boxes / same NIC config etc.?
>>
>>
>> Suggestions for the 4.1.2 set up:
>> Try a run where you disable these (set them to false) and see if there is any difference:
>>   midstream: true            # allow midstream session pickups
>>   async-oneside: true        # enable async stream handling
>>
>> Thank you

-- 
Regards,
Peter Manev