netdev - Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5628C0AE.2020508@oracle.com>
Date:	Thu, 22 Oct 2015 11:55:42 +0100
From:	Alan Burlison <Alan.Burlison@...cle.com>
To:	Al Viro <viro@...IV.linux.org.uk>, Casper.Dik@...cle.com
CC:	Eric Dumazet <eric.dumazet@...il.com>,
	Stephen Hemminger <stephen@...workplumber.org>,
	netdev@...r.kernel.org, dholland-tech@...bsd.org
Subject: Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect
 for sockets in accept(3)

On 22/10/2015 05:21, Al Viro wrote:

>> Most of the work on using a file descriptor is local to the thread.
>
> Using - sure, but what of cacheline dirtied every time you resolve a
> descriptor to file reference?

Don't you have to do that anyway, to do anything useful with the file?

> How much does it cover and to what
> degree is that local to thread?  When it's a refcount inside struct file -
> no big deal, we'll be reading the same cacheline anyway and unless several
> threads are pounding on the same file with syscalls at the same time,
> that won't be a big deal.  But when refcounts are associated with
> descriptors...

There is a refcount in the struct held in the per-process list of open 
files and the 'slow path' processing is only taken if there's more than 
one LWP in the process that's accessing the file.

> In case of Linux we have two bitmaps and an array of pointers associated
> with descriptor table.  They grow on demand (in parallel)
> 	* reserving a descriptor is done under ->file_lock (dropped/regained
> around memory allocation if we end up expanding the sucker, actual reassignment
> of pointers to array/bitmaps is under that spinlock)
> 	* installing a pointer is lockless (we wait for ongoing resize to
> settle, RCU takes care of the rest)
> 	* grabbing a file by index is lockless as well
> 	* removing a pointer is under ->file_lock, so's replacing it by dup2().

Is that table per-process or global?

> The point is, dup2() over _unused_ descriptor is inherently racy, but dup2()
> over a descriptor we'd opened and kept open should be safe.  As it is,
> their implementation boils down to "userland must do process-wide exclusion
> between *any* dup2() and all syscalls that might create a new descriptor -
> open()/pipe()/socket()/accept()/recvmsg()/dup()/etc".  At the very least,
> it's a big QoI problem, especially since such userland exclusion would have
> to be taken around the operations that can block for a long time.  Sure,
> POSIX wording regarding dup2 is so weak that this behaviour can be argued
> to be compliant, but... replacement of the opened file associated with
> newfd really ought to be atomic to be of any use for multithreaded processes.

There's existing language in the Issue 7 dup2() description that says 
dup2() has to be atomic:

"the dup2( ) function provides unique services, as no other
interface is able to atomically replace an existing file descriptor."

And there is some new language in Issue 7 Technical Corrigenda 2 that 
reinforces that, when it's talking about reassignment of 
stdin/stdout/stderr:

"Furthermore, a close() followed by a reopen operation (e.g. open(), 
dup() etc) is not atomic; dup2() should be used to change standard file 
descriptors."

I don't think that it's possible to claim that a non-atomic dup2() is 
POSIX-compliant.

> IOW, if newfd is an opened descriptor prior to dup2() and no other thread
> attempts to close it by any means, there should be no window during which
> it would appear to be not opened.  Linux and FreeBSD satisfy that; OpenBSD
> seems to do the same, from the quick look.  NetBSD doesn't, no idea about
> Solaris.  FWIW, older NetBSD implementation (prior to "File descriptor changes,
> discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it
> had fixed a lot of crap, so it's entirely possible that OpenBSD simply has
> kept the old implementation, with tons of other races in that area, but this
> dup2() race got introduced in that rewrite.

Related to dup2(), there's some rather surprising behaviour on Linux. 
Here's the scenario:

----------
ThreadA opens, listens and accepts on socket fd1, waiting for incoming 
connections.

ThreadB waits for a while, then opens normal file fd2 for read/write.
ThreadB uses dup2 to make fd1 a clone of fd2.
ThreadB closes fd2.

ThreadA remains sat in accept on fd1 which is now a plain file, not a 
socket.

ThreadB writes to fd1, the result of which appears in the file, so fd1 
is indeed operating as a plain file.

ThreadB exits. ThreadA is still sat in accept on fd1.

A connection is made to fd1 by another process. The accept call succeeds 
and returns the incoming  connection. fd1 is still operating as a 
socket, even though it's now actually a plain file.
----------

I assume this is another consequence of the fact that threads waiting in 
accept don't get a notification if the fd they are using is closed, 
either directly by a call to close or by a syscall such as dup2. Not 
waking up other threads on a fd when it is closed seems like it's 
undesirable behaviour.

I can see the reasoning behind allowing shutdown to be used to do such a 
wakeup even if that's not POSIX-compliant - it may make it slightly 
easier for applications avoid fd recycling races. However the current 
situation is that shutdown is the *only* way to perform such a wakeup - 
simply closing the fd has no effect on any other threads. That seems 
incorrect.

-- 
Alan Burlison
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html