netdev - Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1445605340.22974.140.camel@edumazet-glaptop2.roam.corp.google.com>
Date:	Fri, 23 Oct 2015 06:02:20 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Casper.Dik@...cle.com
Cc:	Al Viro <viro@...IV.linux.org.uk>,
	Alan Burlison <Alan.Burlison@...cle.com>,
	David Miller <davem@...emloft.net>, stephen@...workplumber.org,
	netdev@...r.kernel.org, dholland-tech@...bsd.org
Subject: Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect
 for sockets in accept(3)

On Fri, 2015-10-23 at 11:52 +0200, Casper.Dik@...cle.com wrote:
> 
> >Ho-hum...  It could even be made lockless in fast path; the problems I see
> >are
> >	* descriptor-to-file lookup becomes unsafe in a lot of locking
> >conditions.  Sure, most of that happens on the entry to some syscall, with
> >very light locking environment, but... auditing every sodding ioctl that
> >might be doing such lookups is an interesting exercise, and then there are
> >->mount() instances doing the same thing.  And procfs accesses.  Probably
> >nothing impossible to deal with, but nothing pleasant either.
> 
> In the Solaris kernel code, the ioctl code is generally not handled a file 
> descriptor but instead a file pointer (i.e., the lookup is done early in 
> the system call).
> 
> In those specific cases where a system call needs to convert a file 
> descriptor to a file pointer, there is only one routines which can be used.
> 
> >	* memory footprint.  In case of Linux on amd64 or sparc64,
> >main()
> >{
> >	int i;
> >	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
> >		;
> >}
> >will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n,
> >of course).  How much will Solaris eat on the same?
> 
> Yeah, that is a large amount of memory.  Of course, the table is only 
> sized when it is extended and there is a reason why there is a limit on 
> file descriptors.  But we're using more data per file descriptor entry.
> 
> 
> >	* related to the above - how much cacheline sharing will that involve?
> >These per-descriptor use counts are bitch to pack, and giving each a cacheline
> >of its own...  <shudder>
> 
> As I said, we do actually use a lock and yes that means that you really  
> want to have a single cache line for each and every entry.  It does make 
> it easy to have non-racy file description updates.  You certainly do not 
> want false sharing when there is a lot of contention.
> 
> Other data is used to make sure that it only takes O(log(n)) to find the 
> lowest available file descriptor entry.  (Where n, I think, is the returned
> descriptor)

Yet another POSIX deficiency.

When a server deals with 10,000,000+ socks, we absolutely do not care of
this requirement.

O(log(n)) is still crazy if it involves O(log(n)) cache misses.

> 
> Not contended locks aren't expensive.  And all is done on a single cache 
> line.
> 
> One question about the Linux implementation: what happens when a socket in 
> select is closed?  I'm assuming that the kernel waits until "shutdown" is 
> given or when a connection comes in?
> 
> Is it a problem that you can "hide" your listening socket with a thread in 
> accept()?  I would think so (It would be visible in netstat but you can't 
> easily find out why has it)

Again, netstat -p on a server with 10,000,000 sockets never completes.

Never try this unless you are desperate and want to avoid a reboot
maybe.

If you absolutely want to nuke a listener because of untrusted
applications, we better implement a proper syscall.

Android has such a facility.

Alternative would be to extend netlink (ss command from iproute2
package) to carry one pid per socket.

ss -atnp state listening

-> would not have to readlink (/proc/*/fd/*)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html