netdev - Re: Stable regression with 'tcp: allow splice() to build full TSO packets'

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1337287279.3403.44.camel@edumazet-glaptop>
Date:	Thu, 17 May 2012 22:41:19 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Willy Tarreau <w@....eu>
Cc:	netdev@...r.kernel.org
Subject: Re: Stable regression with 'tcp: allow splice() to build full TSO
 packets'

On Thu, 2012-05-17 at 14:18 +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
> 
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
> 
> I'm seeing this on haproxy running on a small ARM machine (a
> dockstar), which exchanges data through a gig switch with my
> development PC. The NIC (mv643xx) doesn't support TSO but has
> GSO enabled. If I disable GSO, the problem remains. I can however
> make the problem disappear by disabling SG or Tx checksumming.
> BTW, using recv/send() instead of splice() also gets rid of the
> problem.
> 
> I can also reduce the risk of seeing the problem by increasing
> the default TCP buffer sizes in tcp_wmem. By default I'm running
> at 16kB, but if I increase the output buffer size above the pipe
> size, the problem *seems* to disappear though I can't be certain,
> since larger buffers generally means the problem takes longer to
> appear, probably due to the fact that the buffers don't need to
> be filled. Still I'm certain that with 64k TCP buffers and 128k
> pipes I'm still seeing it.
> 
> With strace, I'm seeing data fill up the pipe with the splice()
> call responsible for pushing the data to the output socket returing
> -1 EAGAIN. During this time, the client receives no data.
> 
> Something bugs me, I have tested with a dummy server of mine,
> httpterm, which uses tee+splice() to push data outside, and it
> has no problem filling the gig pipe, and correctly recoverers
> from the EAGAIN :
> 
>   send(13, "HTTP/1.1 200\r\nConnection: close\r"..., 160, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
>   tee(0x3, 0x6, 0x10000, 0x2)             = 42552
>   splice(0x5, 0, 0xd, 0, 0xa00000, 0x2)   = 14440
>   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
>   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
>   ...
>   tee(0x3, 0x6, 0x10000, 0x2)             = 13880
>   splice(0x5, 0, 0xd, 0, 0x9fc798, 0x2)   = 51100
>   tee(0x3, 0x6, 0x10000, 0x2)             = 50744
>   splice(0x5, 0, 0xd, 0, 0x9efffc, 0x2)   = 32120
>   tee(0x3, 0x6, 0x10000, 0x2)             = 30264
>   splice(0x5, 0, 0xd, 0, 0x9e8284, 0x2)   = -1 EAGAIN (Resource temporarily unavailable)
> 
> etc...
> 
> It's only with haproxy which uses splice() to copy data between
> two sockets that I'm getting the issue (data forwarded from fd 0xe
> to fd 0x6) :
> 
>   16:03:17.797144 pipe([36, 37])          = 0
>   16:03:17.797318 fcntl64(36, 0x407 /* F_??? */, 0x20000) = 131072 ## note: fcntl(F_SETPIPE_SZ, 128k)
>   16:03:17.797473 splice(0xe, 0, 0x25, 0, 0x9f2234, 0x3) = 10220
>   16:03:17.797706 splice(0x24, 0, 0x6, 0, 0x27ec, 0x3) = 10220
>   16:03:17.802036 gettimeofday({1324652597, 802093}, NULL) = 0
>   16:03:17.802200 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 7
>   16:03:17.802363 gettimeofday({1324652597, 802419}, NULL) = 0
>   16:03:17.802530 splice(0xe, 0, 0x25, 0, 0x9efa48, 0x3) = 16060
>   16:03:17.802789 splice(0x24, 0, 0x6, 0, 0x3ebc, 0x3) = 16060
>   16:03:17.806593 gettimeofday({1324652597, 806651}, NULL) = 0
>   16:03:17.806759 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 4
>   16:03:17.806919 gettimeofday({1324652597, 806974}, NULL) = 0
>   16:03:17.807087 splice(0xe, 0, 0x25, 0, 0x9ebb8c, 0x3) = 17520
>   16:03:17.807356 splice(0x24, 0, 0x6, 0, 0x4470, 0x3) = 17520
>   16:03:17.809565 gettimeofday({1324652597, 809620}, NULL) = 0
>   16:03:17.809726 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
>   16:03:17.809883 gettimeofday({1324652597, 809937}, NULL) = 0
>   16:03:17.810047 splice(0xe, 0, 0x25, 0, 0x9e771c, 0x3) = 36500
>   16:03:17.810399 splice(0x24, 0, 0x6, 0, 0x8e94, 0x3) = 23360
>   16:03:17.810629 epoll_ctl(0x3, 0x1, 0x6, 0x85378) = 0       ## note: epoll_ctl(ADD, fd=6, dir=OUT).
>   16:03:17.810792 gettimeofday({1324652597, 810848}, NULL) = 0
>   16:03:17.810954 epoll_wait(0x3, 0x99250, 0x16, 0x3e8) = 1
>   16:03:17.811188 gettimeofday({1324652597, 811246}, NULL) = 0
>   16:03:17.811356 splice(0xe, 0, 0x25, 0, 0x9de888, 0x3) = 21900
>   16:03:17.811651 splice(0x24, 0, 0x6, 0, 0x88e0, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
> 

Willy you say output to fd 6 hangs, but splice() returns EAGAIN here ?
(because socket buffer is full)

> So output fd 6 hangs here and will not appear anymore until
> here where I pressed Ctrl-C to stop the test :
> 

I just want to make sure its not a userland error that triggers now much
faster than with prior kernels.

You drain bytes from fd 0xe to pipe buffers, but I dont see you check
write ability on destination socket prior the splice(pipe -> socket)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html