[Oisf-users] af_packet and rss queue count

Michał Purzyński michalpurzynski1 at gmail.com
Fri Jan 27 04:44:57 UTC 2017

Let's face the GRO!! Those endless nights spent reading the ixgbe source,
the i40e source, and the Linux networking stack code while doing perf
tracing in another window ;)

LRO = the card's CPU merges 'everything in sight' based on some very relaxed
rules before the card pushes data to the DMA area. Deprecated because it was
harmful; for example, it had to be disabled for bridging. The rules for
'what can be merged' were too relaxed.

GRO is done by the kernel.

SoftNet starts with net_rx_action() - which in turn calls the card-specific
polling routine (registered by the driver at initialization). That routine
moves packets from the 'DMA area' to 'consumers' such as the taps code
(like AF_PACKET), IP and the like.

ixgbe_poll() is our example hero. And it calls...

ixgbe_clean_rx_irq() - which constructs the skb (the socket buffer
structure) and calls...

napi_gro_receive() - even if you have GRO disabled ;) The name is tricky:
this function handles both cases. It passes the packet up the network
stack to the 'taps' / IP and so on. And it calls...

dev_gro_receive() - also called when GRO is disabled. It is
dev_gro_receive()'s responsibility to check whether GRO is enabled.

If it is, the upper (protocol) layer is consulted: "is this packet part of
a flow that is already being offloaded?". That's cool because, say, TCP can
decide whether or not an ACK gets merged with another packet. And that's
how it differs from LRO (in hardware) - the protocol layer gets to decide.
Once again - you can consult the upper layers about what gets merged, and
TCP knows better, because it has access to the current TCP state machine.
If you wanted that done in hardware, you would have to implement TCP in
hardware. With 'just taps' it is simpler, but at least the rules are
stricter than pure LRO's 'I WILL merge because I can. What?
Everythyyyyying!!'

Finally the protocol layer tells dev_gro_receive 'yo, give me all my
packets' and a giant packet is "pulled" from the DMA area up the stack
(usually without copying anything; it's pointer magic, unless it isn't).

Why is it faster with GRO than without? We save lots of per-packet function
calls and instead fetch the entire NN kB from the DMA area in one go.

An important point to remember - it is NOT about the bandwidth, it is all
about the latency. Each function call takes nn nanoseconds, and you only
have so many nanoseconds before the next packet overwrites your data.

I promise to be more active for next questions.

On Thu, Jan 26, 2017 at 12:14 PM, Cooper F. Nelson <cnelson at ucsd.edu> wrote:

> I just checked and using the 'rx-all: on' shorthand results in LRO being
> turned off and GRO/GSO on:
> > generic-segmentation-offload: on
> > generic-receive-offload: on
> > large-receive-offload: off
> -Coop
> On 1/26/2017 12:07 PM, Marcus Eagan wrote:
> > Just saw LRO and wanted to remark for the record that I had a ton of
> > problems with af_packet because of cheap Realtek NICs. I've heard they
> have
> > improved but I have since moved on.
> >
> > Marcus
> --
> Cooper Nelson
> Network Security Analyst
> UCSD ITS Security Team
> cnelson at ucsd.edu x41042
