linux-kernel - Re: kernel BUG at fs/ext4/inode.c:LINE!

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201124201552.GE4327@casper.infradead.org>
Date:   Tue, 24 Nov 2020 20:15:52 +0000
From:   Matthew Wilcox <willy@...radead.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Hugh Dickins <hughd@...gle.com>, Jan Kara <jack@...e.cz>,
        syzbot <syzbot+3622cea378100f45d59f@...kaller.appspotmail.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        Theodore Ts'o <tytso@....edu>, Linux-MM <linux-mm@...ck.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Nicholas Piggin <npiggin@...il.com>,
        Alex Shi <alex.shi@...ux.alibaba.com>, Qian Cai <cai@....pw>,
        Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        William Kucharski <william.kucharski@...cle.com>,
        Jens Axboe <axboe@...nel.dk>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 11:00:42AM -0800, Linus Torvalds wrote:
> On Tue, Nov 24, 2020 at 10:33 AM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > We could fix this by turning that 'if' into a 'while' in
> > write_cache_pages().
> 
> That might be the simplest patch indeed.
> 
> At the same time, I do worry about other cases like this: while
> spurious wakeup events are normal and happen in other places, this is
> a bit different.
> 
> This is literally a wakeup that leaks from a previous use of a page,
> and makes us think that something could have happened to the new use.
> 
> The unlock_page() case presumably never hits that, because even if we
> have some unlock without a page ref (which I don't think can happen,
> but whatever..), the exclusive nature of "lock_page()" means that no
> locker can care - once you get the lock, you own the page./
> 
> The writeback code is special in that the writeback bit isn't some
> kind of exclusive bit, but this code kind of expected it to be that.
> 
> So I'd _like_ to have something like
> 
>         WARN_ON_ONCE(!page_count(page));
> 
> in the wake_up_page_bit() function, to catch things that wake up a
> page that has already been released and might be reused..
> 
> And that would require the "get_page()" to be done when we set the
> writeback bit and queue the page up for IO (so that then
> end_page_writeback() would clear the bit, do the wakeup, and then drop
> the ref).
> 
> Hugh's second patch isn't pretty - I think the "get_page()" is
> conceptually in the wrong place - but it "works" in that it keeps that
> "implicit page reference" being kept by the PG_writeback bit, and then
> it takes an explicit page reference before it clears the bit.
> 
> So while I don't love the whole "PG_writeback is an implicit reference
> to the page" model, Hugh's patch at least makes that model much more
> straightforward: we really either have that PG_writeback, _or_ we have
> a real ref to the page, and we never have that odd "we could actually
> lose the page" situation.
> 
> So I think I prefer Hugh's two-liner over your one-liner suggestion.
> 
> But your one-liner is technically not just smaller, it obviously also
> avoids the whole mucking with the atomic page ref.
> 
> I don't _think_ that the extra get/put overhead could possibly really
> matter: doing the writeback is going to be a lot more expensive
> anyway. And an atomic access to a 'struct page' sounds expensive, but
> that cacheline is already likely dirty in the L1 cache because we've
> touch page->flags and done other things to it).
> 
> So I'd personally be inclined to go with Hugh's patch. Comments?

My only objection to Hugh's patch is that it may cause us to fail
to split pages when we can currently split them.  That is, we do:

	wait_on_page_writeback()
	if (page_has_private(page))
		do_invalidatepage(page, offset, length);
	split_huge_page()

(at least we do in my THP patchset; not sure if there's any of that
in the kernel today), and the extra reference held for a few nanoseconds
after calling wake_up_page() will cause us to fail to split the page.
It probably doesn't matter; there has to be a fallback path anyway.

Now I'm looking at that codepath, and the race that Hugh uncovered now
looks like a real bug.  Consider this sequence:

page allocated, added to page cache, dirtied, writeback started

--- thread A ---
end_page_writeback()
	test_clear_page_writeback
--- ctx switch to thread B ---
alloc page, add to page cache, dirty page, start page writeback,
truncate_inode_pages_range()
	wait_on_page_writeback()
--- ctx switch to thread A ---
	wake_up_page()
--- ctx switch to thread B ---
free page
alloc page
write new data to page

... now the DMA actually starts to do page writeback, and it's writing
the new data.

So my s/if/while/ suggestion is wrong and we need to do something to
prevent spurious wakeups.  Unless we bury the spurious wakeup logic
inside wait_on_page_writeback() ...