[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wjX_P78xoNcGDTjhkgffs-Bhzcwp-mdsE1maeF57Sh0MA@mail.gmail.com>
Date: Sun, 30 Oct 2022 11:19:36 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Nadav Amit <nadav.amit@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Jann Horn <jannh@...gle.com>,
John Hubbard <jhubbard@...dia.com>, X86 ML <x86@...nel.org>,
Matthew Wilcox <willy@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
kernel list <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
Andrea Arcangeli <aarcange@...hat.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
jroedel@...e.de, ubizjak@...il.com,
Alistair Popple <apopple@...dia.com>
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
On Sat, Oct 29, 2022 at 7:17 PM Nadav Amit <nadav.amit@...il.com> wrote:
>
> Running the PoC on Linux 6.0.6 with these patches caused the following splat
> on the following line:
>
> WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio));
Yeah, this is a sign of that "folio_mkclean() serializes with
folio_mark_dirty using rmap and the page table lock".
And page_remove_rmap() could *almost* be called later, but it does
have code that also depends on the page table lock, although it looks
like realistically that's just because it "knows" that means that
preemption is disabled, so it uses non-atomic statistics update.
I say "knows" in quotes, because that's what the comment says, but it
turns out that __mod_node_page_state() has to deal with CONFIG_RT
anyway and does that
preempt_disable_nested();
...
preempt_enable_nested();
thing.
And then it wants to see the vma, although that's actually only to see
if it's 'mlock'ed, so we could just squirrel that away.
So we *could* move page_remove_rmap() later into the TLB flush region,
but then we would have lost the page table lock anyway, so then
folio_mkclean() can come in regardless.
So that doesn't even help.
End result: we do want to do the page_set_dirty() and the
remove_rmap() under the paeg table lock, because it's what serializes
folio_mkclean().
And we'd _like_ to do the TLB flush before the remove_rmap(), but we
*really* don't want to do that for every page.
So my current gut feel is that we should just say that if you do
"MADV_DONTNEED or do a munmap() (which includes the "re-mmap() over
the area", while some other thread is still writing to that memory
region, you may lose writes.
IOW, just accept the behavior that Nadav's test-program tries to show,
and say "look, you're doing insane things, we've never given you any
other semantics, it's your problem" to any user program that does
that.
If a user program does MADV_DONTNEED on an area that it is actively
using at the same time in another thread, that sounds really really
bogus. Same goes doubly for 'munmap()' or over-mapping.
Linus
Powered by blists - more mailing lists