Date:   Thu, 13 Dec 2018 15:04:11 +0100
From:   Willy Tarreau <w@....eu>
To:     Marek Majkowski <marek@...udflare.com>
Cc:     eric.dumazet@...il.com, netdev@...r.kernel.org
Subject: Re: splice() performance for TCP socket forwarding

On Thu, Dec 13, 2018 at 02:17:20PM +0100, Marek Majkowski wrote:
> > splice code will be expensive if less than 1MB is present in the receive queue.
> 
> I'm not sure what you are suggesting. I'm just shuffling data between
> two sockets. Is there a better buffer size value? Is it possible to
> keep splice() blocked until it succeeds in forwarding N bytes of data?
> (I tried this unsuccessfully with SO_RCVLOWAT.)
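
FWIW I assume you're doing the classic two-splice loop through an
intermediate pipe, i.e. roughly the sketch below (untested; in_fd,
out_fd and pfd[] are placeholders, and EAGAIN/EINTR handling is
trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Forward bytes from one connected socket to another through a pipe.
 * Returns 0 on EOF, -1 on error. */
static int forward(int in_fd, int out_fd, int pfd[2])
{
    for (;;) {
        /* fill the pipe from the source socket (up to 1 MB, as in
         * your trace; splice() returns whatever was available) */
        ssize_t n = splice(in_fd, NULL, pfd[1], NULL, 1048576, 0);
        if (n <= 0)
            return n;

        /* drain exactly those bytes from the pipe to the destination */
        while (n > 0) {
            ssize_t m = splice(pfd[0], NULL, out_fd, NULL, n, 0);
            if (m <= 0)
                return -1;
            n -= m;
        }
    }
}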

I've personally observed a performance decrease once the pipe is
configured larger than 512 kB. I think it's related to the fact that
you're moving 256 pages around on each call, and that it might even
start to have some effect on L1 caches when touching lots of data,
though that could be completely unrelated.
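
If you want to experiment with a smaller pipe, F_SETPIPE_SZ lets you
cap its capacity; a minimal sketch (the kernel rounds the size up to a
power-of-two number of pages and returns the actual capacity):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];

    if (pipe(pfd) < 0)
        return 1;

    /* ask for 512 kB (128 pages) instead of your current 1 MB */
    int sz = fcntl(pfd[0], F_SETPIPE_SZ, 512 * 1024);
    if (sz < 0)
        perror("F_SETPIPE_SZ");
    else
        printf("pipe capacity: %d bytes (%d pages)\n", sz, sz / 4096);

    close(pfd[0]);
    close(pfd[1]);
    return 0;
}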

> Here is a snippet from strace:
> 
> splice(4, NULL, 11, NULL, 1048576, 0) = 373760 <0.000048>
> splice(10, NULL, 5, NULL, 373760, 0) = 373760 <0.000108>
> splice(4, NULL, 11, NULL, 1048576, 0) = 335800 <0.000065>
> splice(10, NULL, 5, NULL, 335800, 0) = 335800 <0.000202>
> splice(4, NULL, 11, NULL, 1048576, 0) = 227760 <0.000029>
> splice(10, NULL, 5, NULL, 227760, 0) = 227760 <0.000106>
> splice(4, NULL, 11, NULL, 1048576, 0) = 16060 <0.000019>
> splice(10, NULL, 5, NULL, 16060, 0) = 16060 <0.000028>
> splice(4, NULL, 11, NULL, 1048576, 0) = 7300 <0.000013>
> splice(10, NULL, 5, NULL, 7300, 0) = 7300 <0.000021>

I think your driver is returning one segment per page. Let's do some
rough maths: assuming you have an MSS of 1448 (timestamps enabled),
you'll retrieve 256*1448 = 370688 bytes at once per call, which closely
matches what you're seeing. Hmmm, checking closer, you're in fact running
at exactly 1460 (256*1460 = 373760), so you have timestamps disabled.
Your numbers seem normal to me (just the CPU usage doesn't, but maybe
it improves when using a smaller pipe).
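
By the way, if you want to double-check the MSS from userspace,
TCP_MAXSEG can be read on a connected socket. A quick sketch (assuming
"sock" is your connected TCP fd):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static void print_mss(int sock)
{
    int mss;
    socklen_t len = sizeof(mss);

    /* 1448 => timestamps enabled, 1460 => disabled (1500-byte MTU) */
    if (getsockopt(sock, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("mss=%d, 256 segments => %d bytes per splice\n",
               mss, 256 * mss);
    else
        perror("TCP_MAXSEG");
}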

Willy
