Date:	Sun, 26 Sep 2010 23:46:14 +0200
From:	Willy Tarreau <w@....eu>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org
Subject: Re: TCP: orphans broken by RFC 2525 #2.17

On Sun, Sep 26, 2010 at 11:01:11PM +0200, Eric Dumazet wrote:
> On Sunday, 26 September 2010 at 20:49 +0200, Willy Tarreau wrote:
> > On Sun, Sep 26, 2010 at 08:35:15PM +0200, Eric Dumazet wrote:
> > > I was referring to this code. It works well for me.
> > > 
> > > shutdown(fd, SHUT_RDWR);
> > > while (recv(fd, buff, sizeof(buff), 0) > 0)
> > > 	;
> > > close(fd);
> > 
> > Ah, this one yes, but it's overkill. We're actively pumping data from the
> > other side and dropping it on the floor until it finally closes, while we
> > only need to know when it has ACKed the FIN. In practice, doing that on a
> > POST request which causes a 302 or 401 will result in the whole amount of
> > data being transferred twice. Not only is this bad for bandwidth, it is
> > also bad for the user, as we cause him to experience a complete upload
> > twice just to be sure he has received the FIN, while it's pretty obvious
> > that this is not necessary in 99.9% of cases.
> > 
> 
> I don't understand how recv() could transfer data twice.

That's not what I said; I said the client would have to retransfer. Here's
what typically happens, for instance with an authentication requirement:

Client                                Server
           SYN ----------->
               <----------- SYN/ACK
           ACK ----------->

POST /some_url HTTP/1.1
Host: xxx
Content-length: 10 meg
             ------ headers are sent ---->
xxxxxxxxxxxxx
xxxxxxxxxxxxx ----- data are being sent ->  HTTP/1.1 401 Unauthorized
xxxxxxxxxxxxx                               WWW-Authenticate: basic realm="xxx"
...                                         Connection: close
             <----------------------------
xxxxxxxxxxxxx
             <---------------------------- FIN
xxxxxxxxxxxxx
...
10 megs of data being sent and drained by the server
xxxxxxxxxxxxx
             FIN --------------->
                 <--------------  ACK


second attempt with credentials this time

           SYN ----------->
               <----------- SYN/ACK
           ACK ----------->

POST /some_url HTTP/1.1
Host: x
authorization: basic xyz
Content-length: 10 meg
             ------ headers are sent ---->
xxxxxxxxxxxxx
xxxxxxxxxxxxx
xxxxxxxxxxxxx
etc...


So in this case the data is effectively transmitted twice. With an
RST once the client acks the FIN, the first transfer aborts very
early instead, saving half of the bandwidth.

> You only read from the socket receive queue to null buffer, and in most
> cases a single recv() call will be enough to drain the queue.

Indeed, in *most* cases, and right now in most cases there is no problem
either. As I said, this is the first reported issue in 10 years and hundreds
of billions of connections accumulated across various sites. But instead of
really fixing the issue, this approach only reduces its occurrences. Also, it
only works for low-bandwidth clients (the most common case too). That's what
I'm going to implement anyway, but it is an unreliable workaround. All I know
is that it will probably divide the number of times this problem is
encountered by 10, but it will not fix it.
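The workaround described above could be sketched as follows. This is only an
illustration, not code from the original mail: the function name and the 64 KB
drain cap are my own choices, added so a fast sender cannot keep us pumping
data indefinitely.

```c
/* Hedged sketch: shutdown, drain a bounded amount, then close.
   The 64 KB cap is illustrative, not from the original discussion. */
#include <sys/socket.h>
#include <unistd.h>

static int bounded_drain_close(int fd)
{
	char buf[4096];
	size_t drained = 0;
	ssize_t n;

	shutdown(fd, SHUT_RDWR);                  /* queue our FIN */
	while (drained < 65536 &&
	       (n = recv(fd, buf, sizeof(buf), 0)) > 0)
		drained += (size_t)n;             /* discard the data */
	return close(fd);                         /* may still RST if data remains */
}
```

As noted above, this remains unreliable: a client that keeps sending past the
cap can still trigger the RST-on-close behavior.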

> > Since this method is the least efficient one and clearly not acceptable
> > for practical cases, I wanted to dig at the root, where the information
> > is known. And the TCP recv code is precisely the place where we know
> > exactly when it's safe to reset.
> > 
> 
> And it's safe to reset exactly when the application just does close(), if
> unread data was not drained. Not only is it safe, it is required. A new RFC
> might be needed?

I'm not requesting a new RFC; I'm just trying to make correct use of orphans
as implemented in the Linux stack, and I'm realizing that since RFC 2525 was
implemented, orphans cannot be relied on at all anymore. We could simply
delete all the orphans code and emit an RST immediately upon close(), since
there is no safe use for them now. And that's my concern. In my opinion, the
code is there and was written precisely for that usage. Since I'm seeing that
it can't be used for what it was designed for, I'm naturally interested in
getting it usable again. And in fact, when I really want an RST, I can
already get one by disabling lingering before the close(). This too shows
that the default close() is meant to protect orphaned data.
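For reference, "disabling lingering before the close()" is done with the
standard SO_LINGER socket option; a minimal sketch (the helper name is mine):

```c
/* Hedged sketch: abortive close via SO_LINGER. Setting l_onoff=1 with
   l_linger=0 makes close() on a TCP socket drop pending data and send
   an RST instead of a normal FIN. */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int abortive_close(int fd)
{
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };

	if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
		return -1;
	return close(fd);
}
```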

> > Also there's another issue in doing this. It requires polling of the
> > receive side for all requests, which adds one epoll_ctl() syscall and
> > one recv() call, which have a very noticeable negative performance
> > impact at high rates (at 100000 connections per second, every syscall
> > counts). For now I could very well consider doing this only for POST
> > requests, which currently are the ones exhibiting the issue the most,
> > but since HTTP browsers will try to enable pipelining again soon, the
> > problem will generalize to all types of requests. Hence my attempts to
> > do it the optimal way.
> 
> This might be overkill but is a portable way of doing this, on all
> versions of linux.

I'm not discussing portability at all. You see, right now, Linux is by some
margin the fastest platform to run haproxy on, and the one I always recommend
for it. Some people experience good performance on FreeBSD too, but the
fine-grained control we have on Linux helps maintain a high level of
performance by avoiding many unnecessary steps when we can trust the OS to do
some things correctly. Having workarounds for some versions that we know
don't work as expected is not an issue, and it's even a good reason to
sometimes make people upgrade. But having to cut performance in half under
Linux on some workloads because userland can't know something obvious that
the OS knows is a bit of a waste.

> Sure, you could patch kernel to add a new system call
> 
> close_proper(fd);

That would just be the normal close() (the one with lingering enabled)
in theory.

> As shutdown() only uses two bits, you can eventually add another bit to
> flush receive queue as well (to avoid the copy of it)

This is a good idea, but it will still leave some incorrectly handled cases
where the other side has time to send a few packets between the shutdown()
and the close().

> Another question, is : why the server closes the socket, if you believe
> more pipelining is coming soon ?

There are quite a few situations in HTTP where you have no choice but to
close. All responses that don't advertise a length must be terminated by a
close. Haproxy is a reverse proxy, so it sits between the client and the
server. When a server sends such responses, haproxy must forward the close to
the client, regardless of what's in the request buffer. Also, some response
codes require a close. A 400 Bad Request, for instance, implies a mandatory
close (as do many 4xx and 5xx responses). All redirects (301/302/303/307)
should lead to a close if the target is another site.

Even though we optimize for the most common cases, that doesn't save us
from having to support the legacy cases.

Regards,
Willy
