linux-kernel - Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <32fdc2c8-b86b-92f3-1d5e-64db6be29126@redhat.com>
Date:   Fri, 19 May 2023 10:38:29 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Axel Rasmussen <axelrasmussen@...gle.com>,
        Peter Xu <peterx@...hat.com>
Cc:     Jiaqi Yan <jiaqiyan@...gle.com>,
        James Houghton <jthoughton@...gle.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Christian Brauner <brauner@...nel.org>,
        Hongchen Zhang <zhanghongchen@...ngson.cn>,
        Huang Ying <ying.huang@...el.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        "Mike Rapoport (IBM)" <rppt@...nel.org>,
        Nadav Amit <namit@...are.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Shuah Khan <shuah@...nel.org>,
        ZhangPeng <zhangpeng362@...wei.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        Anish Moorthy <amoorthy@...gle.com>
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl

On 18.05.23 22:38, Axel Rasmussen wrote:
> On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@...hat.com> wrote:
>>
>> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
>>> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@...gle.com> wrote:
>>>>
>>>> On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@...hat.com> wrote:
>>>>>
>>>>> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
>>>>>> On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
>>>>>>> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
>>>>>>> <axelrasmussen@...gle.com> wrote:
>>>>>>>>
>>>>>>>> So the basic way to use this new feature is:
>>>>>>>>
>>>>>>>> - On the new host, the guest's memory is registered with userfaultfd, in
>>>>>>>>    either MISSING or MINOR mode (doesn't really matter for this purpose).
>>>>>>>> - On any first access, we get a userfaultfd event. At this point we can
>>>>>>>>    communicate with the old host to find out if the page was poisoned.
>>>>>>>> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
>>>>>>>>    so any future accesses will SIGBUS. Because the pte is now "present",
>>>>>>>>    future accesses won't generate more userfaultfd events, they'll just
>>>>>>>>    SIGBUS directly.
>>>>>>>
>>>>>>> I want to clarify the SIGBUS mechanism here when KVM is involved,
>>>>>>> keeping in mind that we need to be able to inject an MCE into the
>>>>>>> guest for this to be useful.
>>>>>>>
>>>>>>> 1. vCPU gets an EPT violation --> KVM attempts GUP.
>>>>>>> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
>>>>>>> 3. KVM finds that GUP failed and returns -EFAULT.
>>>>>>>
>>>>>>> This is different than if GUP found poison, in which case KVM will
>>>>>>> actually queue up a SIGBUS *containing the address of the fault*, and
>>>>>>> userspace can use it to inject an appropriate MCE into the guest. With
>>>>>>> UFFDIO_SIGBUS, we are missing the address!
>>>>>>>
>>>>>>> I see three options:
>>>>>>> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
>>>>>>> this is pointless.
>>>>>>> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
>>>>>>> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
>>>>>>> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
>>>>>>> accesses, just like how we get repeated signals for real poison.
>>>>>>> 3. Use this in conjunction with the additional KVM EFAULT info that
>>>>>>> Anish proposed (the first part of [1]).
>>>>>>>
>>>>>>> I think option 3 is fine. :)
>>>>>>
>>>>>> Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
>>>>>
>>>>> I just remember Axel mentioned this in the commit message, and just in case
>>>>> this is why option 4) was ruled out:
>>>>>
>>>>>          They expect that once poisoned, pages can never become
>>>>>          "un-poisoned". So, when we live migrate the VM, we need to preserve
>>>>>          the poisoned status of these pages.
>>>>>
>>>>> Just to supplement on this point: we do have unpoison (echoing to
>>>>> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
>>>
>>> If I read unpoison_memory() correctly, once there is a real hardware
>>> memory corruption (hw_memory_failure will be set), unpoison will stop
>>> working and return EOPNOTSUPP.
>>>
>>> I know some cloud providers evacuating VMs once a single memory error
>>> happens, so not supporting unpoison is probably not a big deal for
>>> them. BUT others do keep VM running until more errors show up later,
>>> which could be long after the 1st error.
>>
>> We're talking about postcopy migrating a VM has poisoned page on src,
>> rather than on dst host, am I right?  IOW, the dest hwpoison should be
>> fake.
>>
>> If so, then I would assume that's the case where all the pages on the dest
>> host is still all good (so hw_memory_failure not yet set, or I doubt the
>> judgement of being a migration target after all)?
>>
>> The other thing is even if dest host has hw poisoned page, I'm not sure
>> whether hw_memory_failure is the only way to solve this.
>>
>> I saw that this is something got worked on before from Zhenwei, David used
>> to have some reasoning on why it was suggested like using a global knob:
>>
>> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
>>
>> Two major issues here afaics:
>>
>>    - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
>>      having !present in entries but that's x86 specific rather than generic
>>      to memory_failure.c.
>>
>>    - It is _assumed_ that hwpoison injection is for debugging only.
>>
>> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
>> host just remember all the hardware poisoned pfns (or remember
>> soft-poisoned ones, but then here we need to be careful on removing them
>> from the list when it's hwpoisoned for real)?  It sounds like there's
>> opportunity on providing a generic solution rather than relying on
>> !pte_present().
>>
>> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
>> but production systems so as to boost the feature importance with a real
>> use case.
>>
>> So far I'd say it'll be great to leverage what it's already there in linux
>> and make it as generic as possible. The only issue is probably
>> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
>> somehow, or you can simply work around this issue.
> 
> As you mention below I think the key distinction is the scope - I
> think MADV_HWPOISON affects the whole system, including other
> processes.
> 
> For our purposes, we really just want to "poison" this particular
> virtual address (the HVA, from the VM's perspective), not even other
> mappings of the same shared memory. I think that behavior is different
> from MADV_HWPOISON, at least.

MADV_HWPOISON really is the wrong interface to use. See "man madvise".

We don't want to allow arbitrary users to hwpoison+offline absolutely 
healthy physical memory, which is what MADV_HWPOISON is all about.

As you say, we want to turn an unpopulated (!present) virtual address to 
mimic like we had a MCE on a page that would have been previously mapped 
here: install a hwpoison marker without actually poisoning any present 
page. In fact, we'd even want to fail if there *is* something mapped.

Sure, one could teach MADV_HWPOISON to allow unprivileged users to do 
that for !present PTE entries, and fail for unprivileged users if there 
is a present PTE entry. I'm not sure if that's the cleanest approach, 
though, and a new MADV as suggested in this thread would eventually be 
cleaner.

-- 
Thanks,

David / dhildenb