netdev - Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Wed, 28 Oct 2015 21:13:47 +0000
From:	Al Viro <viro@...IV.linux.org.uk>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	David Miller <davem@...emloft.net>, stephen@...workplumber.org,
	netdev@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	dhowells@...hat.com, linux-fsdevel@...r.kernel.org
Subject: Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect
 for sockets in accept(3)

On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> 
> > Before I take a deep look at your suggestion, are you sure plain use of
> > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> 
> BTW, I am not convinced we need to spend so much energy and per-cpu
> memory for struct cred refcount.
> 
> The big problem is fd array spinlock of course and bitmap search for
> POSIX compliance.
> 
> The cache line trashing in struct cred is a minor one ;)

percpu-refcount isn't convenient - the only such candidate for ref_kill in
there is "all other references are gone", and that can happen in
interesting locking environments.  I doubt that it would be a good fit, TBH...

Cacheline pingpong on the descriptors bitmap is probably inevitable, but
it's not the only problem in the existing implementation - close a small
descriptor when you've got a lot of them and look for the second open
after that.  _That_ can lead to thousands of cachelines being read through,
all under the table spinlock.  It's literally orders of magnitude worse.
And if the first open after that close happens to be for a short-living
descriptor, you'll get the same situation back in your face as soon as you
close it.

I think we can seriously improve that without screwing the fast path by
adding "summary" bitmaps once the primary grows past the cacheline worth
of bits.  With bits in the summary bitmap corresponding to cacheline-sized
chunks of the primary, being set iff all bits in the corresponding chunk
are set.  If the summary map grows larger than one cacheline, add the
second-order one (that happens at quarter million descriptors and serves
until 128 million; adding the third-order map is probably worthless).

I want to maintain the same kind of "everything below this is known to be
in use" thing as we do now.  Allocation would start with looking into the
same place in primary bitmap where we'd looked now and similar search
forward for zero bit.  _However_, it would stop at cacheline boundary.
If nothing had been found, we look in the corresponding place in the
summary bitmap and search for zero bit there.  Again, no more than up
to the cacheline boundary.  If something is found, we've got a chunk in
the primary known to contain a zero bit; if not - go to the second-level
and search there, etc.

When a zero bit in the primary had been found, check if it's within the
rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
that bit.  If there are zero bits left in the same word - we are done,
otherwise check the still unread words in the cacheline and see if all
of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.

Normal case is exactly the same as now - one cacheline accessed and modified.
We might end up touching more than that, but it's going to be rare and
the cases when it happens are very likely to lead to much worse amount of
memory traffic with the current code.

Freeing is done by zeroing the bit in primary, checking for other zero bits
nearby and buggering off if there are such.  If the entire cacheline used
to be all-bits-set, clear the bit in summary and, if there's a second-order
summary, get the bit in there clear as well - it's probably not worth
bothering with checking that all the cacheline in summary bitmap had been
all-bits-set.  Again, the normal case is the same as now.

It'll need profiling and tuning, but AFAICS it's doable without making the
things worse than they are now, and it should get rid of those O(N) fetches
under spinlock cases.  And yes, those are triggerable and visible in
profiles.  IMO it's worth trying to fix...

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html