[Oisf-users] [EXT] Re: Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

Nelson, Cooper cnelson at ucsd.edu
Wed May 29 19:44:58 UTC 2019


Hi all,

I've been investigating the 'tcp.pkt_on_wrong_thread' issue on my system; the counters are currently very high.

I'm fairly certain I know what the issue is (at least on Intel cards).  See this blog post:

http://adrianchadd.blogspot.com/2014/08/receive-side-scaling-figuring-out-how.html

Pay close attention to this bit....

>The Intel and Chelsio NICs will hash on all packets that are fragmented by only hashing on the IPv4 details. So, if it's a fragmented TCP or UDP frame, it will hash the first fragment the same as the others - it'll ignore the TCP/UDP details and only hash on the IPv4 frame. This means that all the fragments in a given IP datagram will hash to the same value and thus the same queue.

>But if there are a mix of fragmented and non-fragmented packets in a given flow - for example, small versus larger UDP frames - then some may be hashed via the IPv4+TCP or IPv4+UDP details and some will just be hashed via the IPv4 details. This means that packets in the same flow will end up being received in different receive queues and thus highly likely be processed out of order.

The edge case described above is actually very, very common when monitoring live traffic on a big, busy network.  Large HTTP downloads will begin with small, unfragmented TCP packets.  However, as the receive window increases over time, the TCP packets will eventually become fragmented and end up on the wrong thread.  You also won't see this on test/simulated traffic unless you deliberately create these packets.

The easiest fix for this would be to simply force a trivial 'sd' (src->dst) hash on the Intel NIC or within the ixgbe driver.  However, ethtool does not seem to allow this for TCP traffic.  I'm thinking it might be possible by modifying the driver source code.  
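For reference, this is roughly what that looks like with ethtool (illustrative only - 'eth2' is a placeholder interface name, and the second command is the one that seems to get refused for TCP on ixgbe, which is the limitation described above):

  # show which fields the NIC currently hashes on for TCP-over-IPv4 flows
  ethtool -n eth2 rx-flow-hash tcp4

  # ask the NIC to hash TCP-over-IPv4 on src/dst IP only ('sd'), so that
  # fragments and full segments of the same flow land on the same queue
  ethtool -N eth2 rx-flow-hash tcp4 sd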

If anyone has any ideas I would appreciate it.    

-Coop

-----Original Message-----
From: Oisf-users <oisf-users-bounces at lists.openinfosecfoundation.org> On Behalf Of Cloherty, Sean E
Sent: Thursday, February 14, 2019 2:00 PM
To: Peter Manev <petermanev at gmail.com>; Eric Urban <eurban at umn.edu>
Cc: Open Information Security Foundation <oisf-users at lists.openinfosecfoundation.org>
Subject: Re: [Oisf-users] [EXT] Re: Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

That also seems to be the case for me regarding high counts on tcp.pkt_on_wrong_thread.  I've reverted to 4.0.6 using the same setup and YAML, and the stats look much better, with no packet loss.  I will forward the data.

Thanks.

-----Original Message-----
From: Peter Manev <petermanev at gmail.com> 
Sent: Wednesday, February 13, 2019 3:52 PM
To: Eric Urban <eurban at umn.edu>
Cc: Cloherty, Sean E <scloherty at mitre.org>; Open Information Security Foundation <oisf-users at lists.openinfosecfoundation.org>
Subject: Re: [EXT] Re: [Oisf-users] Packet loss and increased resource consumption after upgrade to 4.1.2 with Rust support

On Fri, Feb 8, 2019 at 6:34 PM Eric Urban <eurban at umn.edu> wrote:
>
> Peter, I emailed our config to you directly.  As I mentioned in my original email, we did test 4.1.2 with Rust enabled but with the Rust parsers explicitly disabled, and still experienced significant packet loss.  In that case I added the following config under app-layer.protocols but left the rest of the config the same:
>


Thank you for sharing all the requested information.
Please find below my observations and some suggestions.

The good news with 4.1.2:
tcp.pkt_on_wrong_thread                    | Total                     | 100

This is very low (the lowest I have seen) for the "tcp.pkt_on_wrong_thread" counter, especially for a big run like the shared stats - over 20 days.
Do you mind sharing a bit more info on your NIC (Myricom, I think, if I am not mistaken) - driver/version/any specific setup? We are trying to keep a record of that here -
https://redmine.openinfosecfoundation.org/issues/2725#note-13


Observations:
With 4.1.2 these counters seem odd -
capture.kernel_packets                     | Total                     | 16345348068
capture.kernel_drops                       | Total                     | 33492892572
aka you have more kernel_drops than kernel_packets - seems odd.
Which makes me think it may be a "counter" bug of some sort. Are the NIC driver versions the same on both boxes, with the same NIC config, etc.?


Suggestions for the 4.1.2 setup:
Try a run where you disable both of these (set them to false) and see if it makes any difference:
  midstream: true            # allow midstream session pickups
  async-oneside: true        # enable async stream handling
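For reference, both of those settings live in the stream section of suricata.yaml; a minimal sketch of the suggested test change (assuming an otherwise default stream config):

  stream:
    midstream: false           # do not pick up sessions midstream
    async-oneside: false       # disable async stream handling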

Thank you
_______________________________________________
Suricata IDS Users mailing list: oisf-users at openinfosecfoundation.org
Site: http://suricata-ids.org | Support: http://suricata-ids.org/support/
List: https://lists.openinfosecfoundation.org/mailman/listinfo/oisf-users

Conference: https://suricon.net
Trainings: https://suricata-ids.org/training/

