lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wix0YNq1U8iroRLpx+fCUGE8RG3asY8Zm4vyH-g4UhbPg@mail.gmail.com>
Date:   Wed, 25 Nov 2020 13:30:20 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
        syzbot <syzbot+3622cea378100f45d59f@...kaller.appspotmail.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        "Theodore Ts'o" <tytso@....edu>, Linux-MM <linux-mm@...ck.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Nicholas Piggin <npiggin@...il.com>,
        Alex Shi <alex.shi@...ux.alibaba.com>, Qian Cai <cai@....pw>,
        Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        William Kucharski <william.kucharski@...cle.com>,
        Jens Axboe <axboe@...nel.dk>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 3:24 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> I've applied your second patch (the smaller one that just takes a ref
> around the critical section). If somebody comes up with some great
> alternative, we can always revisit this.

Hmm.

I'm not sure about "great alternative", but it strikes me that we
*could* move the clearing of the PG_writeback bit _into_
wake_up_page_bit(), under the page waitqueue lock.

IOW, we could make the rule be that the bit isn't actually cleared
before calling wake_up_page() at all, and we'd clear it with something
like

    unsigned long flags = READ_ONCE(page->flags);

    // We can clear PG_writeback directly if PG_waiters isn't set
    while (!(flags & (1ul << PG_waiters))) {
        unsigned long new = flags & ~(1ul << PG_writeback);
        // PG_writeback was already clear??!!?
        if (WARN_ON_ONCE(new == flags))
            return;
        new = cmpxchg(&page->flags, flags, new);
        if (likely(flags == new))
            return;
        flags = new;
    }

    // Otherwise, clear the bit at the end - but under the
    // page waitqueue lock - inside wake_up_page_bit()
    return wake_up_page_bit(..);

instead.

That would basically make the bit clearing atomic wrt the PG_waiters
flags - either using that atomic cmpxchg, or by doing it under the
page queue lock so that it's atomic wrt any new waiters.

This seems conceptually like the right thing to do - and if would also
make the (fair) exclusive lock hand-off case atomic too, because the
bit we're waking up on would never be cleared if it gets handed off
directly.

The above is entirely untested crap written in my MUA, and obviously
requires that all callers of wake_up_page() be moved to that new world
order, but I think we only have two cases: unlock_page() and
end_page_writeback().

And unlock_page() already has that
"clear_bit_unlock_is_negative_byte()" special case that is an ugly
special case of PG_waiters atomicity. So we'd get rid of that, because
the cmpxchg loop would be the better model.

I'm not sure I'm willing to write and test the real patch, but it
doesn't look _too_ nasty from just looking at the code. The bookmark
thing makes it important to only actually clear the bit at the end (as
does the handoff case anyway), but the way wake_up_page_bit() is
written, that's actually very straightforward - just after the
while-loop. That's when we've woken up everybody.

So I'm sending this idea out to see if somebody can shoot it down, or
even wants to possibly even try to do it..

                Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ