lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sun, 25 Dec 2016 13:51:17 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Nicholas Piggin <npiggin@...il.com>
Cc:     Dave Hansen <dave.hansen@...ux.intel.com>,
        Bob Peterson <rpeterso@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Steven Whitehouse <swhiteho@...hat.com>,
        Andrew Lutomirski <luto@...nel.org>,
        Andreas Gruenbacher <agruenba@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-mm <linux-mm@...ck.org>,
        Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for
 a page bit

On Sat, Dec 24, 2016 at 7:00 PM, Nicholas Piggin <npiggin@...il.com> wrote:
> Add a new page flag, PageWaiters, to indicate the page waitqueue has
> tasks waiting. This can be tested rather than testing waitqueue_active
> which requires another cacheline load.

Ok, I applied this one too. I think there's room for improvement, but
I don't think it's going to help to just wait another release cycle
and hope something happens.

Example room for improvement from a profile of unlock_page():

   46.44 │      lock   andb $0xfe,(%rdi)
   34.22 │      mov    (%rdi),%rax

this has the old "do atomic op on a byte, then load the whole word"
issue that we used to have with the nasty zone lookup code too. And it
causes a horrible pipeline hickup because the load will not forward
the data from the (partial) store.

 Its' really a misfeature of our asm optimizations of the atomic bit
ops. Using "andb" is slightly smaller, but in this case in particular,
an "andq" would be a ton faster, and the mask still fits in an imm8,
so it's not even hugely larger.

But it might also be a good idea to simply use a "cmpxchg" loop here.
That also gives atomicity guarantees that we don't have with the
"clear bit and then load the value".

Regardless, I think this is worth more people looking at and testing.
And merging it is probably the best way for that to happen.

                Linus

Powered by blists - more mailing lists