[Oisf-users] [EXT] Re: Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

Eric Urban eurban at umn.edu
Mon Feb 25 19:28:36 UTC 2019


Sorry for the delay.  I am responding to where we left off as I don't
believe the additional info added later applies to my situation.  See
inline responses below:

On Wed, Feb 20, 2019 at 4:57 AM Peter Manev <petermanev at gmail.com> wrote:

> On Mon, Feb 18, 2019 at 11:35 PM Eric Urban <eurban at umn.edu> wrote:
> >
> > First, just in case it was lost in the details I do want to call out
> something from my first email.  When compiling Suricata 4.1.2 without Rust,
> we have little to no packet loss, which looks to be the same behavior as
> we have with 4.0.6 compiled without Rust.  I wanted to highlight this
> point, as I probably should have used a subject line to that effect
> instead of writing that the upgrade is the main difference where we saw
> this increase in packet loss.  The upgrade to 4.1.2 likely just triggered
> the issue, since that is when Rust became enabled by default.  Do note,
> though, that in 4.1.2 if I compile with Rust and explicitly disable the 6
> new parsers (not disabling SMB), we still have significant packet loss.
> So it seems there is more to disabling Rust than just disabling the new
> parsers in the config?
>
> Noted - thank you.
> To that point - what is your full compile / install line ?
>
>
HAVE_PYTHON=/usr/bin/python3 ./configure --with-libpcap=/opt/snf
--localstatedir=/var/ --with-libhs-includes=/usr/local/include/hs/
--with-libhs-libraries=/usr/local/lib64/
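
For the no-Rust comparison builds mentioned above, the configure line is
the same except that Rust is switched off; in the 4.1.x series that is the
--disable-rust option (shown here as an illustration of the difference
between the two builds, not a verbatim copy of our build script):

```shell
# Same configure line as above, with Rust explicitly disabled (4.1.x only;
# the option was removed in later releases where Rust became mandatory).
HAVE_PYTHON=/usr/bin/python3 ./configure --with-libpcap=/opt/snf \
    --localstatedir=/var/ \
    --with-libhs-includes=/usr/local/include/hs/ \
    --with-libhs-libraries=/usr/local/lib64/ \
    --disable-rust
```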


> >
> > Now on to responding to your most recent email:
> >
> >> This is very low (lowest i have seen) for the "tcp.pkt_on_wrong_thread
> " counter especially with a big run like the shared stats -over 20 days.
> >> Do you mind sharing a bit more info on your NIC (Myricom i think - if I
> am not mistaken) - driver/version/any specific set up - we are trying to
> keep a record for that here -
> >> https://redmine.openinfosecfoundation.org/issues/2725#note-13
> >
> >
> > We are using Myricom cards with SNFv3 (Sniffer 3) drivers.  The driver
> version from the system where the stats log was taken from is
> 3.0.15.50857.  We were previously running 3.0.14.50843 and experienced a
> similar volume of drops, so we upgraded to this newer patch version to rule
> out issues with the driver.  Would you like me to add these details
> directly to the Redmine issue?  Are there any other specific details that
> would be helpful?
> >
>
> Please feel free to update the issue and then i can further update the
> matrix/table if needed.
> I think info like:
> -  kernel (just uname -a)
> - ethtool -i iface
> - runmode (example : af-packet cluster_flow / pfring )
> - ethtool -x iface
>
> would help a lot! Thank you
>
>
I updated the Redmine issue.  Please let me know if you need any
additional info there.
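
For anyone gathering the same details, something like this collects them
in one pass (eth2 is a placeholder interface name -- substitute your
capture interface):

```shell
#!/bin/sh
# Collect the details requested above: kernel version, NIC driver info,
# and the RSS hash indirection table.
# "eth2" is a placeholder -- set IFACE to your capture interface.
IFACE="${IFACE:-eth2}"

uname -a                                                   # kernel
if command -v ethtool >/dev/null 2>&1; then
    ethtool -i "$IFACE" 2>/dev/null || echo "no driver info for $IFACE"
    ethtool -x "$IFACE" 2>/dev/null || echo "no RSS table for $IFACE"
fi
```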

>
> >> Observations:
> >> with 4.1.2 these counters seem odd -
> >> capture.kernel_packets                     | Total
> >> | 16345348068
> >> capture.kernel_drops                        | Total
> >>  | 33492892572
> >> aka you have more kernel_drops than kernel_packets - seems odd.
> >> Which makes me think it may be a "counter" bug of some sort. Are the NIC
> driver versions the same on both boxes / same NIC config etc ?
> >
> >
> > When I look at the delta counters for packets and drops, there are many
> times these appear to possibly overflow as they are resulting in large
> negative values.  These negative delta values are being added to the
> running counters, making it look like we have more drops than packets.  We
> see this more often with packets than with drops.  If this is an issue with
> overflow that seems to make sense as we often have much higher packet
> counts than drops.  One example I found where
> stats.capture.kernel_packets_delta: -4293864433.  The
> stats.capture.kernel_packets count prior to this was 20713149739 then went
> down to 16419285306 the next time stats were generated.  I had noticed this
> behavior prior to 4.1.2, and am quite sure we saw this in 3.2.2 and
> possibly earlier than that.  I typically just filter out these values when
> looking through the delta data.  I can file a bug for this issue if you'd
> like as we have plenty of examples in our logs where this occurs both with
> packet and drop counters.
>
> Yes - could you please file a report including the details - OS/Suri
> versions affected/ NIC model used / runmode used (afpacket for
> example) etc
>
>
I opened https://redmine.openinfosecfoundation.org/issues/2845 and will
work from there on that issue.
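
As a side note, the size of that negative delta is consistent with a
32-bit wrap somewhere in the counter path.  A quick sanity check using the
numbers above (the 2^32 wrap point is an assumption on my part, not
something confirmed in the code):

```shell
# Sanity-check the negative delta against a 32-bit counter wrap.
prev=20713149739    # kernel_packets before the drop (from the log above)
curr=16419285306    # kernel_packets the next time stats were generated
wrap=$((1 << 32))   # assumed wrap point: 4294967296

delta=$((curr - prev))
echo "raw delta:       $delta"            # -4293864433, the logged value
echo "corrected delta: $((delta + wrap))" # ~1.1M, a plausible interval count
```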



> >
> > Also, to answer your question about the NIC drivers/config (even though
> it may not matter anymore given the info about the negative delta
> counters), at the time I submitted the stats logs to you we were running
> slightly different Myricom driver versions.  The 4.0.6 sensors were
> running 3.0.14.50843 and the 4.1.2 ones were running 3.0.15.50857.  As I
> mentioned above, we upgraded the driver to rule it out as a potential
> cause, and only did so on our one set running 4.1.2.
> >
> >> Suggestions for the 4.1.2 set up:
> >> Try a run where you disable those (false) and run again to see if any
> difference (?) :
> >>   midstream: true            # allow midstream session pickups
> >>   async-oneside: true        # enable async stream handling
> >
> >
> > I just made these changes today.  I will let it run for a few days and
> get back to you with the results.
> >
>
> Any diff observed yet ?
>

It looks like the config changes you recommended did reduce the number of
drops we saw, but did not avoid them altogether.

Right now we have three Suricata instances that are set up to get the same
traffic for troubleshooting this issue.  We had a few periods of drops on
the 4.1.2 sensors (with Rust) where the 4.0.6 (no Rust) instance had no
drops.  The 4.1.2 instance with midstream and async disabled had some drops
but not as many as the 4.1.2 instance running our unmodified config that I
provided to you earlier on this thread.

I looked at one period (Feb 21 13:14 through Feb 21 13:26) where we had
some heavy packet loss on our 4.1.2 sensor interfaces.  Unfortunately I
don't have full stats logs to provide for this window, as they were deleted
before I got to them.  I am assuming you need the actual stats logs, so I
will make sure to grab them before they are deleted next time.

To give you a summary of what I saw:
The first sensor, running 4.0.6, had 0 packets dropped during this time.
The second sensor, running 4.1.2 with your recommended config options to
disable midstream sessions and async stream handling, still had a
significant number of drops (1.2 million to 16 million per minute).
However, its total number of drops during this period (77,782,338) was
much less than the third sensor's.
The third sensor, running 4.1.2 with our unmodified config, had the most
drops overall during this period, ranging from about 1 million to 39
million per minute and totaling 173,550,987.
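
As a rough cross-check (assuming the 13:14 through 13:26 window is 12
minutes), the per-minute averages implied by those totals sit inside the
ranges above:

```shell
# Average drops per minute over the assumed 12-minute window.
window_min=12
drops_modified=77782338     # 4.1.2, midstream/async disabled
drops_unmodified=173550987  # 4.1.2, unmodified config

echo "modified config:   $((drops_modified / window_min)) drops/min avg"
echo "unmodified config: $((drops_unmodified / window_min)) drops/min avg"
```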

Something that is strange to me is that even though these should be getting
the same traffic, the Suricata counters show a large spike in
packets_received for the two 4.1.2 hosts, so they appear to have
significantly more packets than the 4.0.6 host.  I checked our Myricom
stats for this period, and each host has a very similar number of packets
received, so from that point of view it does seem that they are getting
the same traffic.

I will get back to you once this happens again and I can get stats logs
from all three of these sensors to compare.


>
> > Thank you for your assistance,
> > Eric
>
> Thank you for the feedback!
>
> >
> >
> > On Thu, Feb 14, 2019 at 3:59 PM Cloherty, Sean E <scloherty at mitre.org>
> wrote:
> >>
> >> That also seems to be the case with me regarding high counts on
> tcp.pkt_on_wrong_thread.  I've reverted to 4.0.6 using the same setup and
> YAML and the stats look much better with no packet loss.  I will forward
> the data.
> >>
> >> Thanks.
> >>
> >> -----Original Message-----
> >> From: Peter Manev <petermanev at gmail.com>
> >> Sent: Wednesday, February 13, 2019 3:52 PM
> >> To: Eric Urban <eurban at umn.edu>
> >> Cc: Cloherty, Sean E <scloherty at mitre.org>; Open Information Security
> Foundation <oisf-users at lists.openinfosecfoundation.org>
> >> Subject: Re: [EXT] Re: [Oisf-users] Packet loss and increased resource
> consumption after upgrade to 4.1.2 with Rust support
> >>
> >> On Fri, Feb 8, 2019 at 6:34 PM Eric Urban <eurban at umn.edu> wrote:
> >> >
> >> > Peter, I emailed our config to you directly.  I mentioned in my
> original email that we did test having Rust enabled in 4.1.2 where I
> explicitly disabled the Rust parsers and still experienced significant
> packet loss.  In that case I added the following config under
> app-layer.protocols but left the rest of the config the same:
> >> >
> >>
> >>
> >> Thank you for sharing all the requested information.
> >> Please find below my observations and some suggestions.
> >>
> >> The good news with 4.1.2:
> >> tcp.pkt_on_wrong_thread                    | Total
>  | 100
> >>
> >> This is very low (lowest i have seen) for the "tcp.pkt_on_wrong_thread
> " counter especially with a big run like the shared stats -over 20 days.
> >> Do you mind sharing a bit more info on your NIC (Myricom i think - if I
> am not mistaken) - driver/version/any specific set up - we are trying to
> keep a record for that here -
> >> https://redmine.openinfosecfoundation.org/issues/2725#note-13
> >>
> >>
> >> Observations:
> >> with 4.1.2 these counters seem odd -
> >> capture.kernel_packets                     | Total
> >> | 16345348068
> >> capture.kernel_drops                        | Total
> >>  | 33492892572
> >> aka you have more kernel_drops than kernel_packets - seems odd.
> >> Which makes me think it may be a "counter" bug of some sort. Are the NIC
> driver versions the same on both boxes / same NIC config etc ?
> >>
> >>
> >> Suggestions for the 4.1.2 set up:
> >> Try a run where you disable those (false) and run again to see if any
> difference (?) :
> >>   midstream: true            # allow midstream session pickups
> >>   async-oneside: true        # enable async stream handling
> >>
> >> Thank you
>
>
>
> --
> Regards,
> Peter Manev
>

