[Oisf-users] [EXT] Re: Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

Eric Urban eurban at umn.edu
Thu Feb 28 17:44:31 UTC 2019


Hello Peter,

I sent our config in a separate email directly to you and will do the same
with the stats in a moment, because of the size limit on the mailing list.

The stats files are from our 4.0.6 (no drops) and two 4.1.2 instances.  The
drops occurred on Feb 27 around 21:05.  I included the Myricom stats from
the two 4.1.2 ones but not for the 4.0.6 as it is a little more difficult
to get those, but let me know if you need them.  Also note that the 4.0.6
logs don't have stats per thread because they are from a production host, so I
haven't made that change yet.

Thank you,
Eric



On Wed, Feb 27, 2019 at 2:53 AM Peter Manev <petermanev at gmail.com> wrote:

> On Tue, Feb 26, 2019 at 5:24 PM Eric Urban <eurban at umn.edu> wrote:
> >
> > Hello Peter,
> >
> > Here are the stats logs, which do not yet have per thread stats. I will
> make that change today to have per thread stats but figured I would still
> send these along for now.
> >
> > Log file stats_412_Rust_noAsyncMidstream_e02-snf0.log-2019022517.gz is
> the 4.1.2 instance with your config change recommendations.
> >
> > Also, in your last email you wrote:
> >>
> >> It appears to be some sort of counter issue (as you also detailed in the
> >> bug report).
> >> Do you mind also following the suggestion Victor had (per-thread stats)
> >> and sharing those stats too?
> >> Thank you
> >
> >
> > Do you mean that you feel at this time the drops are the result of a
> counter issue?  I may not be understanding you correctly so wanted to ask.
> I thought I should mention too that our Myricom stats show drops during
> these times and only on our 4.1.2 sensors.  In the time period that these
> attached stats logs cover, it is also true, as I described
> in my last email, that the Myricom stats show that all three sensors had
> very similar packet counts.
> >
>
> It seems so, at least for now (wrt counters).
> Thank you for attaching the stats. Seems we are down to 10% drops now
> based on the latest update.
>
> I have a couple more requests.
> 1 - Could you please reshare your latest yaml for 4.1.2
> 2 - Is it possible to share the Myricom NIC/packet counter stats as well?
>
>
>
> > Finally, I will make sure to follow up with Victor once I make this
> change and encounter a situation where we have negative deltas for packet
> counters.
> >
>
> Thank you
>
> > Thank you,
> > Eric
> >
> >
> > On Mon, Feb 25, 2019 at 4:29 PM Peter Manev <petermanev at gmail.com>
> wrote:
> >>
> >> >
> >> > I updated the Redmine issue.  Please let me know if you need any
> additional info there.
> >> >
> >>
> >> Thank you
> >>
> >> >> >
> >> >> >> Observations:
> >> >> >> with 4.1.2 these counters seem odd -
> >> >> >> capture.kernel_packets                     | Total | 16345348068
> >> >> >> capture.kernel_drops                       | Total | 33492892572
> >> >> >> aka you have more kernel_drops than kernel_packets - seems odd.
> >> >> >> Which makes me think it may be a "counter" bug of some sort. Are
> the NIC driver versions the same on both boxes / same NIC config etc ?
> >> >> >
> >> >> >
> >> >> > When I look at the delta counters for packets and drops, there are
> many times these appear to possibly overflow as they are resulting in large
> negative values.  These negative delta values are being added to the
> running counters so make it look like we have more drops than packets.  We
> see this more often with packets than with drops.  If this is an issue with
> overflow that seems to make sense as we often have much higher packet
> counts than drops.  One example I found had
> stats.capture.kernel_packets_delta: -4293864433.  The
> stats.capture.kernel_packets count prior to this was 20713149739 then went
> down to 16419285306 the next time stats were generated.  I had noticed this
> behavior prior to 4.1.2, and am quite sure we saw this in 3.2.2 and
> possibly earlier than that.  I typically just filter out these values when
> looking through the delta data.  I can file a bug for this issue if you'd
> like as we have plenty of examples in our logs where this occurs both with
> packet and drop counters.
> >> >>
> >> >> Yes - could you please file a report including the details - OS/Suri
> >> >> versions affected / NIC model used / runmode used (afpacket for
> >> >> example) etc
> >> >>
> >> >
> >> > I opened https://redmine.openinfosecfoundation.org/issues/2845 so
> will work from there on that issue.
> >> >
> >> >
> >> >>
> >> >> >
> >> >> > Also, to answer your question about the NIC drivers/config even
> though it may not matter anymore given the info about the negative delta
> counters, at the time I submitted the stats logs to you we were running
> slightly different Myricom driver versions.  The 4.0.6 sensors were running
> 3.0.14.50843 and the 4.1.2 ones were running 3.0.15.50857.  The reason for
> the different driver versions is as I mentioned above that we upgraded the
> driver version to rule that out as a potential cause and only did it on our
> one set running 4.1.2.
> >> >> >
> >> >> >> Suggestions for the 4.1.2 set up:
> >> >> >> Try a run where you disable these (set them to false) and see if there is
> any difference:
> >> >> >>   midstream: true            # allow midstream session pickups
> >> >> >>   async-oneside: true        # enable async stream handling
> >> >> >
> >> >> >
> >> >> > I just made these changes today.  I will let it run for a few days
> and get back to you with the results.
> >> >> >
> >> >>
> >> >> Any diff observed yet ?
> >> >
> >> >
> >> > It looks like the config changes you recommended did reduce the
> number of drops we saw, but did not avoid them altogether.
> >> >
> >> > Right now we have three Suricata instances that are set up to get the
> same traffic for troubleshooting this issue.  We had a few periods of drops
> on the 4.1.2 sensors (with Rust) where the 4.0.6 (no Rust) instance had no
> drops.  The 4.1.2 instance with midstream and async disabled had some drops
> but not as many as the 4.1.2 instance running our unmodified config that I
> provided to you earlier on this thread.
> >> >
> >> > I looked at one period (Feb 21 13:14 through Feb 21 13:26) where we
> had some heavy packet loss on our 4.1.2 sensor interfaces.  Unfortunately I
> don't have full stats logs to provide for this window as they were deleted
> before I got to them.  I am assuming you actually need the stats log so
> will make sure to pay more attention to this to grab the logs before they
> are deleted.
> >> >
> >> > To give you a summary of what I saw:
> >> > The first sensor running 4.0.6 had 0 packets dropped during this time.
> >> > The second sensor running 4.1.2 with your recommended config options
> to disable midstream sessions and async stream handling still had a
> significant number of drops (1.2 million to 16 million per minute).
> However, the total number of drops during this time period (77,782,338) was
> much lower than on the third sensor.
> >> > The third sensor running 4.1.2 and our unmodified config had the most
> drops overall during this period.  The drops ranged
> from about 1 million to 39 million per minute and totaled 173,550,987.
> >> >
> >> > Something that is strange to me is that even though these should be
> getting the same traffic, the Suricata counters show a large spike in
> packets_received for the two 4.1.2 hosts, which appear to have significantly
> more packets than the 4.0.6 host.  I checked our Myricom stats to compare
> with this period and each of these has a very similar number of packets
> received, so from that point of view it does seem that these are getting the
> same traffic.
> >> >
> >> > I will get back to you once this happens again and I can get stats
> logs from all three of these sensors to compare.
> >>
> >> It appears to be some sort of counter issue (as you also detailed in the
> >> bug report).
> >> Do you mind also following the suggestion Victor had (per-thread stats)
> >> and sharing those stats too?
> >> Thank you
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Peter Manev
>
>
>
> --
> Regards,
> Peter Manev
>
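
A note on the negative delta quoted in the thread above
(stats.capture.kernel_packets_delta: -4293864433, with the running total
going from 20713149739 to 16419285306): the magnitude is just under 2^32, which
would be consistent with an underlying 32-bit counter wrapping once between two
stats dumps. The thread does not confirm the counter width, so the 32-bit
modulus below is an assumption, and fix_delta is a hypothetical helper name
used only for illustration. This is a minimal sketch of how such deltas could be
corrected when post-processing the logs, rather than filtered out:

```python
# Sketch: test whether a negative stats delta is consistent with a single
# wrap of a 32-bit counter. The 32-bit width is an ASSUMPTION; the thread
# does not state the width of the underlying driver/capture counter.

WRAP_32 = 2 ** 32  # 4294967296


def fix_delta(delta, modulus=WRAP_32):
    """Hypothetical helper: if delta is negative, assume the source counter
    wrapped exactly once at `modulus` and return the corrected delta."""
    if delta < 0:
        return delta + modulus
    return delta


# Values quoted in the thread.
previous_total = 20713149739
reported_delta = -4293864433

# What the stats log showed: the negative delta was added to the total.
reported_total = previous_total + reported_delta

# What the total would have been if the delta were a single 32-bit wrap.
corrected_delta = fix_delta(reported_delta)
corrected_total = previous_total + corrected_delta

print("reported total: ", reported_total)
print("corrected delta:", corrected_delta)
print("corrected total:", corrected_total)
```

Under that assumption the real delta for the interval would be about 1.1
million packets. A helper like this only handles a single wrap per interval; if
the counter can wrap more than once between stats dumps, the true delta cannot
be recovered from the log alone.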