netdev - TCP receive splice performance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <200805121545.50390.opurdila@ixiacom.com>
Date:	Mon, 12 May 2008 15:45:50 +0300
From:	Octavian Purdila <opurdila@...acom.com>
To:	netdev@...r.kernel.org
Cc:	Jens Axboe <jens.axboe@...cle.com>
Subject: TCP receive splice performance

Hi,

I have been playing with the tcp receive splice implementation and noticed 
some strange results. 

The tests were run with client and server on two machines with GB interfaces 
(PPC 800MHz, 32K cache size) connected back to back. Both client and server 
machine were running a 2.6.25 kernel.

In the first test case the client was doing read(/dev/zero) write(socket), and 
the server read(socket) write(/dev/null). Throughput  was 68MB/s.

In the second test case the same test was performed, only this time with some 
modification in the network stack to disable the copying from/to userspace. 
Throughput was 122-123MB which is practically line-rate.

For the third testcase I was using splice(/some/file, socket) on the client 
and splice(socket, /dev/null) on the server. Throughput was 113MB/s. 

Now the strange part: when lowering the tcp_rmem buffer sizes to 32K or 
setting SOCK_RECVBUF to 16K (which AFAIK is the same thing) the throughput 
was 121-122MB/s.

oprofiling the two splice runs shows that in the 113 case tcp_read_sock is 
responsible for about 3% CPU time while in the 122 case tcp_read_sock is only 
responsible for 1.5% CPU time. In the second test case tcp_recvmsg is 
responsible for 1.5% CPU time.

For the 113 case there are about 400K tcp_read_sock calls per 5 seconds, and 
in the 122 case there are 420K tcp_read_sock calls per 5 seconds. Detailed 
profiling seems to point that about 50% from tcp_read_sock, in the 113 case, 
is spent when touching the tcp header in tcp_recv_skb.

The only clue for the 32K magic buffer size that I see is that L1cache of the 
processor is 32K. Could it be that the splice code has a big impact on the 
cache and that is pushing the tcp headers out the cache? 

tcp_headers should be in the cache before the tcp_read_sock call, since the 
softirq processing is touching the tcp headers, right?

Any ideas of how I can debug this further?

Thanks,
tavi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html