Message-ID: <CAHk-=wgzT1QsSCF-zN+eS06WGVTBg4sf=6oTMg95+AEq7QrSCQ@mail.gmail.com>
Date:   Fri, 28 Oct 2022 17:42:39 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Jann Horn <jannh@...gle.com>,
        John Hubbard <jhubbard@...dia.com>, X86 ML <x86@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        jroedel@...e.de, ubizjak@...il.com
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

On Fri, Oct 28, 2022 at 4:57 PM Nadav Amit <nadav.amit@...il.com> wrote:
>
> The problem is in the following code of zap_pte_range():
>
>         if (!PageAnon(page)) {
>                 if (pte_dirty(ptent)) {
>                         force_flush = 1;
>                         set_page_dirty(page);
>                 }
>                 …
>         }
>         page_remove_rmap(page, vma, false);
>
> Once we remove the rmap, rmap_walk() would not acquire the page-table lock
> anymore. As a result, nothing prevents the kernel from performing writeback
> and cleaning the page-struct dirty-bit before the TLB is actually flushed.

Hah.

The original reason for force_flush there was similar, with a race wrt
page_mkclean(): this code doesn't take the page lock that would
normally serialize things, because the page lock is so painful and
ends up having nasty nasty interactions with slow IO operations.

So all the page-dirty handling really *wants* to use the page lock,
and for the IO side (writeback) that ends up being acceptable and
works well, but from that "serialize VM state" angle it's horrendous.

So instead the code intentionally serialized on the rmap data
structures, which page_mkclean() also walks, and as you point out,
that's broken. It's not broken at the point where we do
set_page_dirty(), but it *becomes* broken when we drop the rmap, and
the problem is exactly that "we still have the dirty bit hidden in the
TLB state" issue that you pinpoint.

I think the proper fix (or at least _a_ proper fix) would be to
actually carry the dirty bit along to the __tlb_remove_page() point,
and actually treat it exactly the same way as the page pointer itself
- set the page dirty after the TLB flush, the same way we can free the
page after the TLB flush.
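
Something like this, as a sketch (entirely untested, and batch_page()
and batch_page_dirty() are made-up accessors for whatever encoding we
pick, not existing kernel helpers):

	static void tlb_batch_pages_flush(struct mmu_gather_batch *batch)
	{
		unsigned int i;

		for (i = 0; i < batch->nr; i++) {
			/* batch_page()/batch_page_dirty() are hypothetical */
			struct page *page = batch_page(batch, i);

			/* Only now, after the TLB flush, mark it dirty */
			if (batch_page_dirty(batch, i))
				set_page_dirty(page);
			put_page(page);
		}
		batch->nr = 0;
	}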

We could easily hide said dirty bit in the low bits of the
"batch->pages[]" array or something like that. We'd just have to add
the 'dirty' argument to __tlb_remove_page_size() and friends.
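
For example, a minimal sketch of the encoding (names made up; struct
page is always at least word-aligned, so bit 0 of the pointer is
free):

	/* Hypothetical helpers, not existing kernel API */
	#define TLB_PAGE_DIRTY	1UL

	static inline struct page *tlb_encode_page(struct page *page, bool dirty)
	{
		return (struct page *)((unsigned long)page |
				       (dirty ? TLB_PAGE_DIRTY : 0));
	}

	static inline struct page *tlb_decode_page(struct page *entry)
	{
		return (struct page *)((unsigned long)entry & ~TLB_PAGE_DIRTY);
	}

	static inline bool tlb_page_dirty(struct page *entry)
	{
		return (unsigned long)entry & TLB_PAGE_DIRTY;
	}

and __tlb_remove_page_size() would just grow the extra argument:

	bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
				    bool dirty, int page_size);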

Hmm?

Your idea of "do the page_remove_rmap() late instead" would also work,
but the reason I think just squirrelling away the dirty bit is the
"proper" fix is that it would get rid of the whole need for
'force_flush' in this area entirely. So we'd not only fix that race
you noticed, we'd reduce the number of TLB flushes too.

I don't know. Maybe I'm missing something fundamental, and my idea is
just stupid.

               Linus
