linux-kernel - Re: [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs

Message-ID: <53dc6054-07eb-f97b-7b2f-558f02d1b90a@redhat.com>
Date:   Fri, 17 Feb 2023 09:53:47 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Peter Xu <peterx@...hat.com>
Cc:     Muhammad Usama Anjum <usama.anjum@...labora.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kernel@...labora.com, Paul Gofman <pgofman@...eweavers.com>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs

On 16.02.23 21:25, Peter Xu wrote:
> On Thu, Feb 16, 2023 at 10:37:36AM +0100, David Hildenbrand wrote:
>> On 16.02.23 10:16, Muhammad Usama Anjum wrote:
>>> mwriteprotect_range() errors out if [start, end) doesn't fall in one
>>> VMA. We are facing a use case where multiple VMAs are present in one
>>> range of interest. For example, the following pseudocode reproduces the
>>> error which we are trying to fix:
>>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>>> - Register userfaultfd
>>> - Change protection of the first half (1 to 8 pages) of memory to
>>>     PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>>>     out.
>>
>> I think, in QEMU, with partial madvise()/mmap(MAP_FIXED) while handling
>> memory remapping during reboot to discard pages with memory errors, it would
>> be possible that we get multiple VMAs and could not enable uffd-wp for
>> background snapshots anymore. So this change makes sense to me.
> 
> Any pointer for this one?

In qemu, softmmu/physmem.c:qemu_ram_remap() is instructed on reboot to 
remap VMAs due to MCE pages. We apply QEMU_MADV_MERGEABLE (if configured 
for the machine) and QEMU_MADV_DONTDUMP (if configured for the machine), 
so the kernel could merge the VMAs again.

(a) From experiments (~2 years ago), I recall that some VMAs never get 
merged again. I faintly remember that this was the case for hugetlb. It 
might have changed in the meantime; I haven't tried it again. But 
looking at is_mergeable_vma(), we refuse to merge a VMA that has 
vma->vm_ops->close set. I think that might be set for hugetlb 
(hugetlb_vm_op_close).

(b) We don't consider memory-backend overrides, like a backend toggling 
QEMU_MADV_MERGEABLE or QEMU_MADV_DONTDUMP from backends/hostmem.c, 
resulting in multiple unmergeable VMAs.

(c) We don't consider memory-backend mbind(): we don't re-apply the 
mbind() policy, resulting in unmergeable VMAs.

The correct way to handle (b) and (c) would be to notify the memory 
backend, to let it reapply the correct flags, and to reapply the mbind() 
policy (I once had patches for that, have to look them up again).

So in these rare setups with MCEs, we would be getting more VMAs and 
while the uffd-wp registration would succeed, uffd-wp protection would fail.

Note that this is purely theoretical; people don't heavily use 
background snapshots yet, so I am not aware of any reports. Further, I 
expect it to happen only very rarely (MCE+reboot+a/b/c).

So it's more of a "the app doesn't necessarily keep track of the exact 
VMAs".
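
To illustrate how easily a range ends up spanning more than one VMA, 
here is a minimal sketch (not taken from QEMU or from the patch) of the 
mmap()+mprotect() sequence from the patch description. It does not touch 
userfaultfd at all; it only counts the /proc/self/maps entries that 
overlap the range. The helper names count_vmas_in_range() and 
split_demo() are made up for this example.

```c
#define _GNU_SOURCE
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the VMAs listed in /proc/self/maps that
 * overlap [start, end). Returns -1 if the file cannot be opened.
 */
static int count_vmas_in_range(uintptr_t start, uintptr_t end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[8192];
	int count = 0;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		uintptr_t lo, hi;

		/* Each entry starts with "<start>-<end>" in hex. */
		if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) != 2)
			continue;
		if (lo < end && hi > start)
			count++;
	}

	fclose(f);
	return count;
}

/*
 * Hypothetical demo: map 16 pages PROT_NONE, make the first half
 * readable and writable, and report how many VMAs now cover the
 * 16-page range. Returns -1 on any failure.
 */
static int split_demo(void)
{
	size_t len = 16 * (size_t)sysconf(_SC_PAGESIZE);
	char *mem = mmap(NULL, len, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int n;

	if (mem == MAP_FAILED)
		return -1;

	/* Different protections cannot share one VMA, so this splits it. */
	if (mprotect(mem, len / 2, PROT_READ | PROT_WRITE)) {
		munmap(mem, len);
		return -1;
	}

	n = count_vmas_in_range((uintptr_t)mem, (uintptr_t)mem + len);
	munmap(mem, len);
	return n;
}
```

On Linux, split_demo() should report two VMAs for the range. An 
application that only remembers the original mmap() would not notice 
the split unless it parses /proc/self/maps like this.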

[I am not sure how helpful remapping !anon memory really is; we 
should be getting the same messed-up MCE pages from the fd again, but 
that's a different discussion I guess]

-- 
Thanks,

David / dhildenb