linux-kernel - Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for a page bit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20161226111654.76ab0957@roar.ozlabs.ibm.com>
Date:   Mon, 26 Dec 2016 11:16:54 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Dave Hansen <dave.hansen@...ux.intel.com>,
        Bob Peterson <rpeterso@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Steven Whitehouse <swhiteho@...hat.com>,
        Andrew Lutomirski <luto@...nel.org>,
        Andreas Gruenbacher <agruenba@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-mm <linux-mm@...ck.org>,
        Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting
 for a page bit

On Sun, 25 Dec 2016 13:51:17 -0800
Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Sat, Dec 24, 2016 at 7:00 PM, Nicholas Piggin <npiggin@...il.com> wrote:
> > Add a new page flag, PageWaiters, to indicate the page waitqueue has
> > tasks waiting. This can be tested rather than testing waitqueue_active
> > which requires another cacheline load.  
> 
> Ok, I applied this one too. I think there's room for improvement, but
> I don't think it's going to help to just wait another release cycle
> and hope something happens.
> 
> Example room for improvement from a profile of unlock_page():
> 
>    46.44 │      lock   andb $0xfe,(%rdi)
>    34.22 │      mov    (%rdi),%rax
> 
> this has the old "do atomic op on a byte, then load the whole word"
> issue that we used to have with the nasty zone lookup code too. And it
> causes a horrible pipeline hickup because the load will not forward
> the data from the (partial) store.
> 
>  Its' really a misfeature of our asm optimizations of the atomic bit
> ops. Using "andb" is slightly smaller, but in this case in particular,
> an "andq" would be a ton faster, and the mask still fits in an imm8,
> so it's not even hugely larger.

I did actually play around with that. I could not get my skylake
to forward the result from a lock op to a subsequent load (the
latency was the same whether you use lock ; andb or lock ; andl
(32 cycles for my test loop) whereas with non-atomic versions I
was getting about 15 cycles for andb vs 2 for andl.

I guess the lock op drains the store queue to coherency and does
not allow forwarding so as to provide the memory ordering
semantics.

> But it might also be a good idea to simply use a "cmpxchg" loop here.
> That also gives atomicity guarantees that we don't have with the
> "clear bit and then load the value".

cmpxchg ends up at 19 cycles including the initial load, so it
may be worthwhile. Powerpc has a similar problem with doing a
clear_bit; test_bit (not the size mismatch, but forwarding from
atomic ops being less capable).

Thanks,
Nick