linux-kernel - Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.LSU.2.11.2101110947280.1731@eggly.anvils>
Date:   Mon, 11 Jan 2021 09:54:37 -0800 (PST)
From:   Hugh Dickins <hughd@...gle.com>
To:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>
cc:     linux-kernel@...r.kernel.org, stable@...r.kernel.org,
        syzbot+2fc0712f8f8b8b8fa0ef@...kaller.appspotmail.com,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>, stable@...nel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for
 multiple pending writebacks

On Mon, 11 Jan 2021, Greg Kroah-Hartman wrote:

> From: Linus Torvalds <torvalds@...ux-foundation.org>
> 
> commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.
> 
> Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
> logic") we've had some very occasional reports of BUG_ON(PageWriteback)
> in write_cache_pages(), which we thought we already fixed in commit
> 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
> 
> But syzbot just reported another one, even with that commit in place.
> 
> And it turns out that there's a simpler way to trigger the BUG_ON() than
> the one Hugh found with page re-use.  It all boils down to the fact that
> the page writeback is ostensibly serialized by the page lock, but that
> isn't actually really true.
> 
> Yes, the people _setting_ writeback all do so under the page lock, but
> the actual clearing of the bit - and waking up any waiters - happens
> without any page lock.
> 
> This gives us this fairly simple race condition:
> 
>   CPU1 = end previous writeback
>   CPU2 = start new writeback under page lock
>   CPU3 = write_cache_pages()
> 
>   CPU1          CPU2            CPU3
>   ----          ----            ----
> 
>   end_page_writeback()
>     test_clear_page_writeback(page)
>     ... delayed...
> 
>                 lock_page();
>                 set_page_writeback()
>                 unlock_page()
> 
>                                 lock_page()
>                                 wait_on_page_writeback();
> 
>     wake_up_page(page, PG_writeback);
>     .. wakes up CPU3 ..
> 
>                                 BUG_ON(PageWriteback(page));
> 
> where the BUG_ON() happens because we woke up the PG_writeback bit
> becasue of the _previous_ writeback, but a new one had already been
> started because the clearing of the bit wasn't actually atomic wrt the
> actual wakeup or serialized by the page lock.
> 
> The reason this didn't use to happen was that the old logic in waiting
> on a page bit would just loop if it ever saw the bit set again.
> 
> The nice proper fix would probably be to get rid of the whole "wait for
> writeback to clear, and then set it" logic in the writeback path, and
> replace it with an atomic "wait-to-set" (ie the same as we have for page
> locking: we set the page lock bit with a single "lock_page()", not with
> "wait for lock bit to clear and then set it").
> 
> However, out current model for writeback is that the waiting for the
> writeback bit is done by the generic VFS code (ie write_cache_pages()),
> but the actual setting of the writeback bit is done much later by the
> filesystem ".writepages()" function.
> 
> IOW, to make the writeback bit have that same kind of "wait-to-set"
> behavior as we have for page locking, we'd have to change our roughly
> ~50 different writeback functions.  Painful.
> 
> Instead, just make "wait_on_page_writeback()" loop on the very unlikely
> situation that the PG_writeback bit is still set, basically re-instating
> the old behavior.  This is very non-optimal in case of contention, but
> since we only ever set the bit under the page lock, that situation is
> controlled.
> 
> Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@...kaller.appspotmail.com
> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> Acked-by: Hugh Dickins <hughd@...gle.com>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Matthew Wilcox <willy@...radead.org>
> Cc: stable@...nel.org
> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@...uxfoundation.org>

I think it's too early to push this one through to stable:
Linus mentioned on Friday that Michael Larabel of Phoronix
has observed a performance regression from this commit.

Correctness outweighs performance of course, but I think
stable users might see the performance issue much sooner
than they would ever see the BUG fixed.  Wait a bit,
while we think some more about what to try next?

Hugh

> 
> ---
>  mm/page-writeback.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2826,7 +2826,7 @@ EXPORT_SYMBOL(__test_set_page_writeback)
>   */
>  void wait_on_page_writeback(struct page *page)
>  {
> -	if (PageWriteback(page)) {
> +	while (PageWriteback(page)) {
>  		trace_wait_on_page_writeback(page, page_mapping(page));
>  		wait_on_page_bit(page, PG_writeback);
>  	}