[Oisf-devel] improve flow table in workers runmode
vpiserchia at gmail.com
Wed Apr 17 14:35:33 UTC 2013
Hello all,
On 04/05/2013 11:09 AM, Victor Julien wrote:
> (please don't top-post in discussions like this and also don't use HTML)
damn Google Mail interface, sorry again
>
> On 04/04/2013 02:44 PM, Vito Piserchia wrote:
>> On Thu, Apr 4, 2013 at 12:10 PM, Victor Julien <victor at inliniac.net> wrote:
>>
>> On 04/04/2013 11:53 AM, Eric Leblond wrote:
>> > On Thu, 2013-04-04 at 11:05 +0200, Victor Julien wrote:
>> >> On 04/04/2013 10:25 AM, Eric Leblond wrote:
>> >>> Hello,
>> >>>
>> >>> On Wed, 2013-04-03 at 15:40 +0200, Victor Julien wrote:
>> >>>> On 04/03/2013 10:59 AM, Chris Wakelin wrote:
>> >>>>> On 03/04/13 09:19, Victor Julien wrote:
>> >>>>>>> On 04/03/2013 02:31 AM, Song liu wrote:
>> >>>>>>>>>
>> >>>>>>>>> Right now, all workers will share one big flow table, and
>> >>>>>>>>> there will be contention for it.
>> >>>>>>>>> Supposing that the network interface is flow-affine, each
>> >>>>>>>>> worker will handle individual flows.
>> >>>>>>>>> In this way, I think it makes more sense that each worker
>> >>>>>>>>> has its own flow table rather than one big table, to reduce
>> >>>>>>>>> contention.
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>> We've been discussing this before and I think it would make
>> >>>>>>> sense. It does require quite a bit of refactoring though,
>> >>>>>>> especially since we'd have to support the current setup as
>> >>>>>>> well for the non-workers runmodes.
>> >>>>>>>
Another approach could be something like a partitioned hash table with a
fine-grained locking strategy.
An even stronger option would be lock-free data structures; does anyone
have experience with this topic?
A last alternative could be to use the userspace RCU implementation [1].
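To make the partitioned-table idea concrete, here is a minimal sketch:
the flow hash selects a partition and only that partition's lock is
taken, so workers hitting different partitions never contend. All names
and sizes are made up for illustration, not Suricata's actual flow
structures.

/* Hypothetical partitioned flow table with per-partition locks. */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define FLOW_PARTITIONS 64              /* power of two, tune to worker count */
#define BUCKETS_PER_PARTITION 4096

typedef struct Flow_ {
    uint32_t hash;
    struct Flow_ *next;
    /* ... per-flow state ... */
} Flow;

typedef struct FlowPartition_ {
    pthread_mutex_t lock;               /* fine-grained: one lock per partition */
    Flow *buckets[BUCKETS_PER_PARTITION];
} FlowPartition;

static FlowPartition flow_table[FLOW_PARTITIONS];

void FlowTableInit(void)
{
    for (int i = 0; i < FLOW_PARTITIONS; i++)
        pthread_mutex_init(&flow_table[i].lock, NULL);
}

/* Look up a flow by hash; a real engine would return it locked or
 * reference-counted so it stays valid after the partition lock drops. */
Flow *FlowLookup(uint32_t hash)
{
    FlowPartition *p = &flow_table[hash & (FLOW_PARTITIONS - 1)];
    pthread_mutex_lock(&p->lock);
    Flow *f = p->buckets[hash % BUCKETS_PER_PARTITION];
    while (f != NULL && f->hash != hash)
        f = f->next;
    pthread_mutex_unlock(&p->lock);
    return f;
}

In the per-worker-table design discussed above the partition would simply
be the worker itself, so the lock could even go away entirely.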
>> >>>>> It sounds like a good idea when things like PF_RING are
>> >>>>> supposed to handle the flow affinity onto virtual interfaces
>> >>>>> for us (PF_RING DNA + libzero clusters do, and there's the
>> >>>>> PF_RING_DNA_SYMMETRIC_RSS flag for PF_RING DNA without libzero
>> >>>>> and interfaces that support RSS).
>> >>>>
>> >>>> Actually, all worker implementations currently share the same
>> >>>> assumption: flow-based load balancing in pf_ring, af_packet,
>> >>>> nfq, etc. So I think it makes sense to have a flow engine per
>> >>>> worker in all these cases.
>> >>>
>> >>> IPS mode may be a special case. For example, NFQ will soon
>> >>> provide a cpu fanout mode where the worker will be selected based
>> >>> on CPU. The idea is to have the NIC do the flow balancing. But
>> >>> this implies that the return packet may come to a different CPU,
>> >>> depending on the flow hash function used by the NIC.
>> >>> We have the same behavior in af_packet IPS mode...
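If I read the af_packet fanout options right, the two behaviours are just
different fanout modes handed to setsockopt(): PACKET_FANOUT_HASH spreads
packets by a kernel flow hash, PACKET_FANOUT_CPU by the CPU that received
the packet, which is exactly what can send return traffic to a different
worker. Rough sketch, error handling and interface binding omitted:

/* Rough af_packet fanout sketch; a real setup would also bind the
 * socket to an interface before joining the fanout group. */
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

int open_fanout_socket(int group_id, int cpu_fanout)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    /* Low 16 bits: fanout group id; high 16 bits: load balancing mode. */
    int mode = cpu_fanout ? PACKET_FANOUT_CPU : PACKET_FANOUT_HASH;
    int arg = group_id | (mode << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0)
        return -1;
    return fd;
}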
>> >>
>> >> I think this can lead to some weird packet order problems. T1
>> >> inspects toserver, T2 toclient. If the T1 worker is held up for
>> >> whatever reason, we may for example process ACKs in T2 for packets
>> >> we've not processed in T1 yet. I'm pretty sure this won't work
>> >> correctly.
>> >
>> > In the case of IPS mode, does inline streaming depend on ACKed packets?
>>
>> No, but the stream engine is written with the assumption that what we
>> see is the order of packets on the wire. TCP packets may still be out of
>> order of course, but in this case the end-host has to deal with it
>> as well.
>>
>> In cases like window checks, sequence validation, SACK checks, etc I can
>> imagine problems. We'd possibly reject/accept packets in the stream
>> handling that the end host will treat differently.
>>
>> >
>> >> This isn't limited to workers btw; in autofp, when using multiple
>> >> capture threads, we can have the same issue: one side of a
>> >> connection getting ahead of the other.
>> >
>> > Yes, I've observed this leading to strange behavior...
>> >
>> >> Don't think we can solve this in Suricata itself, as the OS has a
>> >> lot of liberty in scheduling threads. A full packet reordering
>> >> module might work, but its performance impact would probably
>> >> completely nix all gains from the said capture methods.
>> >
>> > Sure
>> >
>> >>> In this case, we may want to disable the per-worker flow engine,
>> >>> which is a really good idea for the other runmodes.
>> >>
>> >> Don't think it would be sufficient. The ordering problem won't be
>> >> solved by it.
>> >
>> > Yes, it may be interesting to study a bit the hash functions used
>> > by NICs to see if they behave symmetrically. In that case, this
>> > should fix the issue (at least for NFQ). I will have a look into it.
>>
>> IMHO the key to success is having a symmetric RSS hash function.
>> Experiments/studies on this already exist, e.g.
>> http://www.ndsl.kaist.edu/~shinae/papers/TR-symRSS.pdf
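Even without a symmetric NIC hash, the same effect can be had in software
by putting the endpoints in canonical order before mixing, so both
directions of a connection map to the same value. Just an illustration of
the idea, not the hash Suricata actually uses:

/* Illustrative symmetric 5-tuple hash: canonical endpoint order first,
 * so client->server and server->client packets hash the same.
 * FNV-1a style mixing over 32-bit words. */
#include <stdint.h>

uint32_t symmetric_flow_hash(uint32_t src_ip, uint32_t dst_ip,
                             uint16_t src_port, uint16_t dst_port,
                             uint8_t proto)
{
    int swap = (src_ip > dst_ip) ||
               (src_ip == dst_ip && src_port > dst_port);
    uint32_t ip_a = swap ? dst_ip : src_ip;
    uint32_t ip_b = swap ? src_ip : dst_ip;
    uint16_t pt_a = swap ? dst_port : src_port;
    uint16_t pt_b = swap ? src_port : dst_port;

    uint32_t words[4] = { ip_a, ip_b,
                          ((uint32_t)pt_a << 16) | pt_b, proto };
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    return h;
}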
>
> Interesting, thanks.
>
>> Obviously this could lead to unbalanced flow queues; think of
>> long-standing flows which remain alive for long periods of time... To
>> account for this kind of situation, one could assign a group of
>> processing CPU threads to packets that arrive from the same RSS
>> queue, losing, of course, the cache (and interrupt) affinity benefits
>> in that case.
>
> With our autofp mode this could be done. We could also consider a more
> advanced autofp mode where, instead of one global load balancer over
> all cpus/threads, we'd have autofp-style load balancing over a select
> group of threads that run on the same cpu.
>
A finer-grained autofp mode would certainly help a lot here. I would
also suggest taking into consideration the "new players" in the market
with even more advanced packet distribution policies [2].
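Something like that two-level scheme could be as simple as: the RSS queue
(or receiving CPU) picks the group of threads pinned to the same
CPU/cache domain, and the flow hash picks a thread within that group, so
a busy queue still spreads its flows. All names here are hypothetical:

/* Hypothetical two-level balancing: the RSS queue selects a group of
 * worker threads sharing a CPU/cache domain, the flow hash selects one
 * thread within that group, keeping per-flow ordering while spreading
 * the load of a busy queue over several threads. */
#include <stdint.h>

#define CPU_GROUPS        4     /* e.g. one group per RSS queue / NUMA node */
#define THREADS_PER_GROUP 4

/* Returns a worker index in [0, CPU_GROUPS * THREADS_PER_GROUP). */
int pick_worker(int rss_queue, uint32_t flow_hash)
{
    int group  = rss_queue % CPU_GROUPS;          /* keeps cache affinity */
    int member = flow_hash % THREADS_PER_GROUP;   /* balances within the group */
    return group * THREADS_PER_GROUP + member;
}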
[1] urcu at http://lttng.org/urcu
[2] DPDK at http://dpdk.org/