Date:	Tue, 27 Nov 2012 13:28:49 -0000
From:	"David Laight" <David.Laight@...LAB.COM>
To:	<netdev@...r.kernel.org>
Subject: Ethernet deferred 'end of transmit' processing.

Eric and I have just had a private discussion about deferring
(or not) the ethernet 'end of tx' processing.
Below is Eric's last email.

> > Subject: RE: performance regression on HiperSockets depending on MTU size
> > 
> > On Mon, 2012-11-26 at 16:38 +0000, David Laight wrote:
> > > > For example, I had to change mlx4 driver for the same problem : Make
> > > > sure a TX packet can be "TX completed" in a short amount of time.
> > >
> > > I'm intrigued that Linux is going that way.
> > > It (effectively) requires the hardware to generate an interrupt
> > > for every transmit packet in order to get high throughput.
> > >
> > > I remember carefully designing ethernet drivers to avoid
> > > taking 'tx done' interrupts unless absolutely necessary
> > > in order to reduce system interrupt load.
> > > Some modern hardware probably allows finer control of 'tx done'
> > > interrupts, but it won't be universal.
> > >
> > > I realise that hardware TX segmentation offload can cause a
> > > single tx ring entry to take a significant amount of time to
> > > transmit - so allowing a lot of packets to sit in the tx
> > > ring causes latency issues.
> > >
> > > But there has to be a better solution than requiring every
> > > tx to complete very quickly - especially if the tx flow
> > > is actually a lot of small packets.
> > >
> > > 	David
> > >
> > 
> > 20 years ago, interrupts were expensive so you had to batch packets.
> > 
> > In 2012, we want low latencies, because hardware is fast and is able to
> > cope with the requirement.
> > 
> > Instead of one cpu, we now have 24 cpus or more per host.
> > 
> > And if there is enough load, NAPI will really avoid interrupts, and you
> > get full batch advantages (lowering the number of cpu cycles per packet).
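
[Sketch of the NAPI batching described above, for illustration only: the
napi_*()/IRQ calls are the real kernel API, but struct mydev and the
mydev_*() helpers are hypothetical stand-ins for a driver's own code.]

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mydev {				/* hypothetical driver private data */
	struct napi_struct napi;
	/* rings, register mappings, ... */
};

/* hypothetical device-specific helpers, assumed to exist elsewhere */
void mydev_mask_irqs(struct mydev *priv);
void mydev_unmask_irqs(struct mydev *priv);
void mydev_clean_tx_ring(struct mydev *priv);
int mydev_rx(struct mydev *priv, int budget);

static irqreturn_t mydev_interrupt(int irq, void *data)
{
	struct mydev *priv = data;

	mydev_mask_irqs(priv);		/* no more interrupts while polling */
	napi_schedule(&priv->napi);	/* defer the work to softirq context */
	return IRQ_HANDLED;
}

static int mydev_poll(struct napi_struct *napi, int budget)
{
	struct mydev *priv = container_of(napi, struct mydev, napi);
	int done;

	mydev_clean_tx_ring(priv);	/* reap TX completions */
	done = mydev_rx(priv, budget);	/* process up to 'budget' RX packets */

	if (done < budget) {
		/* Ran out of work: leave polling mode, re-enable interrupts. */
		napi_complete(napi);
		mydev_unmask_irqs(priv);
	}
	/* If done == budget, the poll function is called again without
	 * another interrupt - the batching described above. */
	return done;
}
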
>
> AFAICT some of the stuff being done to get 10G+ speeds is
> actually similar to what I was doing trying to saturate
> 10M ethernet. Network speeds have increased by a factor
> of (about) 800, cpu clock speeds only by 100 or so
> (we were doing quad cpu sparc systems with quite slow
> cache coherency operations).
> Somewhere in the last 20 years a lot of code has got very lazy!
> 
> Using 'source allocated' byte counts for flow control
> (which is what I presume the socket send code does) so that
> each socket has a limited amount of live data in the kernel
> and can't allocate a new buffer (skb) until the amount of
> kernel memory allocated to the 'live' buffers decreases
> (i.e. a transmit completes) certainly works a lot better
> than the target queue size flow control attempted by SYSV
> STREAMS (which doesn't work very well at all!).
> 
> What might work is to allow the ethernet driver to reassign
> some bytes of the SKB from the socket (or other source) to
> the transmit interface - then it need not request end of tx
> immediately for those bytes.

It seems you understood how it currently works.
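
[Roughly what "how it currently works" refers to: every skb queued for
transmit is charged against the sending socket, and the charge is only
returned when the skb is freed, normally at TX completion. This is a
simplified sketch, not the actual code; the real implementation is
skb_set_owner_w()/sock_wfree() in net/core/sock.c, though the fields and
callback used below are the real ones.]

#include <linux/skbuff.h>
#include <net/sock.h>

/* Destructor run when the driver finally frees the skb, i.e. at
 * TX completion time. */
static void wfree_sketch(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	/* Return the buffer's bytes to the socket's send budget ... */
	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
	/* ... and wake a sender blocked waiting for sndbuf space. */
	sk->sk_write_space(sk);
}

/* Charge a freshly built skb to the socket that owns its data. */
static void charge_skb_to_socket(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = wfree_sketch;
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}

A sender blocked in sendmsg() cannot allocate another skb until
sk_wmem_alloc drops back under the socket's send buffer limit, so how
quickly that happens is set by how quickly the driver signals TX
completion.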

> The amount it could take can be quite small - possibly one
> or two maximal sized ring entries, or (say) 100us of network
> time.
> 
> With multiple flows this will make little difference to the
> size of the burst that each socket gets to add into the
> interface's tx queue (unlike increasing the socket tx buffer).
> But with a single flow it will let the socket get the next
> tx data queued even if the tx interrupts are deferred.
> 
> The only time it doesn't help is when the next transmit
> can't be done until the reference count on the skb decreases.
> (We had some NFS code like that!)
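
[The nearest existing hook to "reassigning the bytes" is skb_orphan(),
which a driver can call from its xmit routine: it runs the skb's
destructor immediately, so the socket gets its whole byte budget back
before the packet is even on the wire. That hands over all of the bytes
rather than the one or two ring entries suggested above, and removes the
back-pressure entirely. A sketch, wrapping the real helper in a
hypothetical driver:]

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical helper, assumed to exist elsewhere */
void mydev_post_to_tx_ring(struct net_device *dev, struct sk_buff *skb);

static netdev_tx_t mydev_start_xmit(struct sk_buff *skb,
				    struct net_device *dev)
{
	/* Run the skb's destructor now: the sending socket's
	 * sk_wmem_alloc charge is released immediately, so the socket
	 * can queue more data without waiting for a TX completion. */
	skb_orphan(skb);

	mydev_post_to_tx_ring(dev, skb);
	return NETDEV_TX_OK;
}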

If you read the code, you'll see the current implementation is able to
keep a 20GbE link busy with a single TCP flow, with 2 TSO packets posted
on the device.

A TSO packet is about 545040 bits on the wire, or about 27 us at 20Gb/s.

That's 36694 interrupts per second. Even my phone is able to sustain this
rate of interrupts.
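
[Working those numbers, assuming a 64KB TSO packet leaves the NIC as 45
MSS-sized frames of 1514 bytes each (preamble and inter-frame gap
ignored):

    45 * 1514 bytes * 8    = 545040 bits
    545040 bits / 20e9 b/s = ~27.25 us per TSO packet
    20e9 b/s / 545040 bits = ~36694 completions per second]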

But if the device holds the TX completion interrupt for 100 us,
performance of a single TCP flow is hurt. I don't think it's hard to
understand.

The mlx4 driver handles 40GbE links; there 13 us is the needed value, not 100 us.
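
[Same arithmetic at 40Gb/s: 545040 bits / 40e9 b/s = ~13.6 us per TSO packet.]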

Please post these mails to netdev; there is no secret to protect.


