[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <37a9442e-f6e5-35f5-0d51-669d60936b5f@redhat.com>
Date: Wed, 7 Dec 2022 20:53:36 +0100
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Ives van Hoorne <ives@...esandbox.io>,
Andrew Morton <akpm@...ux-foundation.org>,
Alistair Popple <apopple@...dia.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Nadav Amit <nadav.amit@...il.com>,
Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while
userfaultfd-wp is enabled for a VMA
>>
>> On upstream during the next write fault, we'll end up in do_numa_page() and
>> simply remap the page writable due to vm_page_prot, not triggering a write
>> fault. I can see the "numa_hint_faults" counter in /proc/vmstat increasing
>> accordingly, so we're really in do_numa_page().
>
> Seems true. I think fundamentally it's because numa hint rely on PROT_NONE
> as the hint, and it explicitly checks against mprotect(PROT_NONE) using the
> accessible check:
>
> if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> return do_numa_page(vmf);
>
> I'm not sure whether we should also add a pte_uffd_wp(vmf->orig_pte) here
> to mask out the uffd-wp cases.
:/ more special UFFD-wp casing, I'm not sure sure about that.
Most importantly, once someone unlocks NUMA hinting for shmem (e.g.,
MPOL_MF_LAZY, MPOL_F_NUMA_BALANCING) this might be problematic. That at
least makes it sound fragile to me.
>
> So far it seems the outcome is not extremely bad - PROT_WRITE only mappings
> are rare in real life, and also with the protnone recovery code (and along
> with the vm_page_prot patch coming) we'll be able to still recover the pte
> into a uffd-wp-ed pte without PROTNONE bit set. But I don't have a solid
> clue yet on what's the best.
>
Yes, just another way to trigger surprise uffd-wp behavior (at least
surprising for me ;) ). But this time, not involving mprotect(). I
suspect there are more cases, but I might be wrong.
I was primarily trying to find out which other cases might be affected.
[..]
>>
>>
>> Independent of uffd-wp on shmem, we seem to be missing propagating the
>> uffd-wp bit when splitting the huge zeropage. So uffd-wp'ing the huge
>> zeropage and then splitting it loses the uffd-wp markers. :/
>
> For this one, thanks for the reproducer. I'm not extremely sure whether
> it's a bug.
>
> Firstly, I think your reproducer should just work well with shmem, afaiu,
> because shmem is based on pte markers and it should only work on pte level
> (not pmd). The huge zero pmd should got split right after wr-protected.
> So the reproducer shouldn't go wrong on shmem at all with/without any
> recent fix. Let me know otherwise.
shmem doesn't use the huge shared zeropage, so it should be fine.
>
> For anon, I'm not sure it's a bug, because there's a semantic difference on
> anon/shmem. The thing is losing wr-protect on the zero page is the same as
> losing wr-protect on a page that is not mapped. For anon currently we
> can't track a page that is not mapped and we skip those ranges (being zero
> when read). So fundamentally I am not sure whether it'll be an issue for
> existing anon uffd-wp users because if it is then it's more than zero
> pages.
I think it's a bug, although most probably a low priority one.
Once user space successfully placed an uffd-wp marker, and e.g.,
verified using pagemap that it is indeed placed, the system should not
silently drop it.
The behavior between an ordinary THP and a huge zeropage differs. For
THP, we handle the split correctly and don't lose the marker. Assuming
the huge zeropage woud be disabled, the behavior would be (IMHO)
correct. The test case would pass.
For example, QEMU with uffd-wp based snapshotting will make sure that
all virtual addresses are populated (e.g., mapping the shared,
eventually the huge zeropage -- populate_read_range()), before
protecting using uffd-wp. Losing a uffd-wp marker would be problematic.
The good news is that we barely will end up PTE-mapping the huge
zeropage unless there is real user-space interaction (mprotect(),
mremap(), mmap()), so this shouldn't trigger in the QEMU use-case.
Anyhow, I'll send a patch in a couple of days and we can discuss
further. It's independent of the other discussion, just wanted to report
my findings after staring at that code for way too long today.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists