linux-kernel - Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <92173bad-caa3-6b43-9d1e-9a471fdbc184@redhat.com>
Date:   Tue, 6 Dec 2022 17:28:07 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Peter Xu <peterx@...hat.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Ives van Hoorne <ives@...esandbox.io>,
        stable@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hugh@...itas.com>,
        Alistair Popple <apopple@...dia.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while
 userfaultfd-wp is enabled for a VMA

>>>> I think fundamentally the core is uffd-wp is pte-based, so the information
>>>> resides in pte not vma.  I'm not strongly objecting this patch, especially
>>>> you mentioned auto-numa so I need to have a closer look later there.
>>>> However I do think uffd-wp is slightly special because we always need to
>>>> consider pte information anyway, so a per-vma information doesn't hugely
>>>> help, IMHO.
>>>
>>> That's the same as softdirty tracking, IMHO.
> 
> Soft-dirty doesn't have a bit in the pte showing whether the page is
> protected.

If the page is already softdirty (PTE bit), we don't need write faults 
and consequently can map the page writable. If the page is not 
softdirty, we need !write as default. But let's talk in detail about 
vma->vm_page_prot and interaction with other wrprotect features below.

> 
> One wr-protects in soft-dirty with either ALL or NONE.  That's per-vma.
> 
> One wr-protects in uffd-wp by wr-protect specific page or range of pages.
> That's per-page.
> 

Let me try to explain clearer what the issue with uffd-wp on shmem is 
right now and how to proceed from here. maybe that makes it clearer what 
the vma->vm_page_prot semantics are.

vma->vm_page_prot defines the default PTE/PMD/... protection. This 
protection is always *safe*, meaning, if you blindly use that protection 
when (re)mapping a page, there won't be correctness issues. No need to 
manually wrprotect afterwards.

That's why migration, mprotect(), NUMA faults that all rely on 
vma->vm_page_prot work as expected for now.


When calculating vma->vm_page_prot, we considers three things:

(1) VMA protection. For example, no VM_WRITE? then it won't include
     write permissions.
(2) Private vs. shared: If the mapping is private, we will never have
     write permissions in vma->vm_page_prot, because we might have to
     COW individual PTEs. We really want write faults as default.
(3) Write-notify: Some mechanism (softdirty tracking, file system, uffd-
     wp) *might* require default write-protection, such that we always
     properly trigger a write fault instead where we can handle these.
     This only applies to shared mappings, because private mappings
     always default to !write (2).

As vma->vm_page_prot is safe, it is always valid to apply them when 
mapping a PTE/PMD/ ... *in addition* we can use other information, to 
detect if we still can map the PTE writable (if they were removed due to 
(2) or (3)).

In migration code, we use the migration type (writable migration 
entries). In NUMA-hinting code we used the stored savedwrite value. 
Absence of these bits does not imply that we have to map the page 
wrprotected.


We never had to remove write permissions so far from the 
vma->vm_page_prot default. We always only added permissions.


Now, uffd-wp on shmem as of now violates these semantics. 
vma->vm_page_prot cannot always be used as a safe default, because we 
might have to wrprotect individual PTEs. Note that for uffd-wp on 
anonymous memory this was not an issue, because we default to !write in 
vma->vm_page_prot.


The two possible ways to approach this for uffd-wp on shmem are:

(1) Obey existing vma->vm_page_prot semantics. Default to !write and
     optimize the relevant cases to *add* the write bit. This is
     essentially what this patch does, minus
     can_change_pte_writable() optimizations on relevant code paths where
     we'd want to maintain the write bit. For example, when removing
     uffd-wp protection we might want to restore the write bit directly.

(2) Default to write permissions and check each and every code location
     where we remap for uffd-wp ptes, to manuall wrprotect -- *remove*
     the write bit. Alternatively, as you said, always wrprotect when
     setting the PTE bit, which might work as well.


My claim is that (1) is less error prone, because in the worst case we 
forget to optimize one code location -- instead to resulting in a BUG if 
we forget to wrprotect (what we have now). But I am not going to fight 
for it, because I can see that (2) can be made to work as well, as you 
outline in your patch.

You seem to have decided on (2). Fair enough, you know uffd-wp best. We 
just have to fix it properly and make the logic consistent whenever we 
remap a page.


Is anything I said fundamentally wrong?

>>>
>>> [...]
>>>
>>>>> Running the mprotect() reproducer [1] without this commit:
>>>>>      $ ./uffd-wp-mprotect
>>>>>      FAIL: uffd-wp did not fire
>>>>> Running the mprotect() reproducer with this commit:
>>>>>      $ ./uffd-wp-mprotect
>>>>>      PASS: uffd-wp fired
>>>>>
>>>>> [1] https://lore.kernel.org/all/222fc0b2-6ec0-98e7-833f-ea868b248446@redhat.com/T/#u
>>>>
>>>> I still hope for a formal patch (non-rfc) we can have a reproducer outside
>>>> mprotect().  IMHO mprotect() is really ambiguously here being used with
>>>> uffd-wp, so not a good example IMO as I explained in the other thread [1].
>>>
>>> I took the low hanging fruit to showcase that this is a more generic problem.
>>> The reproducer is IMHO nice because it's simple and race-free.
> 
> If no one is using mprotect() with uffd-wp like that, then the reproducer
> may not be valid - the reproducer is defining how it should work, but does
> that really stand?  That's why I said it's ambiguous, because the
> definition in this case is unclear.

There are interesting variations like:

mmap(PROT_READ, MAP_POPULATE|MAP_SHARED)
uffd_wp()
mprotect(PROT_READ|PROT_WRITE)

Where we start out with all-write permissions before we enable selective 
write permissions.

But I'm not going to argue about whats valid and whats not as long as 
it's documented ;). I primarily wanted to showcase that the same logic 
based on vma->vm_page_prot is used elsewhere, and that migration PTE 
restoration is not particularly special.

> 
> I think numa has the problem too which I agree with you.  If you attach a
> numa reproducer it'll be nicer.  But again I'm not convinced uffd-wp is a
> per-vma thing, which seems to be what this patch is based upon.

I might be able to give NUMA hinting a shot later this week, but I have 
other stuff to focus on.

> 
> Now I really wonder whether I should just simply wr-protect pte for
> pte_mkuffd_wp() always, attached.  I didn't do that from the start because
> I wanted to keep the helpers operate on one bit only.  But I found that
> it's actually common technique to use in pgtable arch code, and it really
> doesn't make sense to not wr-protect a pte if uffd-wp is set on a present
> entry.  It's much safer.

Indeed, that would be much safer.

> 
>>>
>>>>
>>>> I'll need to off-work most of the rest of today, but maybe I can also have
>>>> a look in the weekend or Monday more on the numa paths.  Before that, can
>>>> we first reach a consensus that we have the mm/migrate patch there to be
>>>> merged first?  These are two issues, IMHO.
>>>>
>>>> I know you're against me for some reason, but until now I sincerely don't
>>>> know why.  That patch sololy recovers write bit status (by removing it for
>>>> read-only) for a migration entry and that definitely makes sense to me.  As
>>>> I also mentioned in the old version of that thread, we can rework migration
>>>> entries and merge READ|WRITE entries into a GENERIC entry one day if you
>>>> think proper, but that's for later.
>>>
>>> I'm not against you. I'm against changing well-working, common code
>>> when it doesn't make any sense to me to change it.
> 
> This goes back to the original question of whether we should remove the
> write bit for read migration entry.  Well, let's just focus on others;
> we're all tired of this one.

I hope my lengthy explanation above was helpful and correct.

> 
>>> And now we have proof that
>>> mprotect() just behaves exactly the same way, using the basic rules of vma->vm_page_prot.
>>>
>>> Yes, there is broken sparc64 (below), but that shouldn't dictate our implementation.
> 
> I doubt whether sparc64 is broken if it has been like that anyway, because
> I know little on sparc64 so I guess I'd not speak on that.
> 

I'd wish some of the sparc64 maintainers would speak up finally. Anyhow, 
most probably I'll write a reproducer for some broken case and send it 
to the sparc64 list. Yeah, I need to get Linux for sparc64 running 
inside a VM ... :/

[...]

> 
> Yes, but now I'm prone to the patch I attached which should just cover all
> pte_mkuffd_wp().

We should probably do the same for the pmd variants, and clean up the 
existing wrprotect like:
	pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));

But that's of course, not a fix.


> 
> Side note: since looking at the numa code, I found that after the recent
> rework of removing savedwrite for numa, cdb205f9e220 ("mm/autonuma: use
> can_change_(pte|pmd)_writable() to replace savedwrite"), I think it can
> happen that after numa balancing one pte under uffd-wp vma (but not
> wr-protected) can have its write bit lost if the migration failed during
> recovering, because vma_wants_manual_pte_write_upgrade() will return false
> for such case.  Is it true?

Yes, you are correct. I added that to the patch description:

"
Note that we don't optimize for the actual migration case:
     (1) When migration succeeds the new PTE will not be writable because
         the source PTE was not writable (protnone); in the future we
         might just optimize that case similarly by reusing
         can_change_pte_writable()/can_change_pmd_writable() when
         removing migration PTEs.
     (2) When migration fails, we'd have to recalculate the "writable"
         flag because we temporarily dropped the PT lock; for now keep it
         simple and set "writable=false".
"

Case (1) would, with your current patch, always lose the write bit 
during migration, even if vma->vm_page_prot included it. We most might 
want to optimize that in the future.

Case (2) is rather a corner case, and unless people complain about it 
being a real performance issue, it felt cleaner (less code) to not 
optimize for that now.



Again Peter, I am not against you, not at all. Sorry if I gave you the 
impression. I highly appreciate your work and this discussion.

-- 
Thanks,

David / dhildenb