[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <32fdc2c8-b86b-92f3-1d5e-64db6be29126@redhat.com>
Date: Fri, 19 May 2023 10:38:29 +0200
From: David Hildenbrand <david@...hat.com>
To: Axel Rasmussen <axelrasmussen@...gle.com>,
Peter Xu <peterx@...hat.com>
Cc: Jiaqi Yan <jiaqiyan@...gle.com>,
James Houghton <jthoughton@...gle.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Christian Brauner <brauner@...nel.org>,
Hongchen Zhang <zhanghongchen@...ngson.cn>,
Huang Ying <ying.huang@...el.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Miaohe Lin <linmiaohe@...wei.com>,
"Mike Rapoport (IBM)" <rppt@...nel.org>,
Nadav Amit <namit@...are.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Shuah Khan <shuah@...nel.org>,
ZhangPeng <zhangpeng362@...wei.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
Anish Moorthy <amoorthy@...gle.com>
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl
On 18.05.23 22:38, Axel Rasmussen wrote:
> On Thu, May 18, 2023 at 9:05 AM Peter Xu <peterx@...hat.com> wrote:
>>
>> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
>>> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen <axelrasmussen@...gle.com> wrote:
>>>>
>>>> On Wed, May 17, 2023 at 3:20 PM Peter Xu <peterx@...hat.com> wrote:
>>>>>
>>>>> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
>>>>>> On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
>>>>>>> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
>>>>>>> <axelrasmussen@...gle.com> wrote:
>>>>>>>>
>>>>>>>> So the basic way to use this new feature is:
>>>>>>>>
>>>>>>>> - On the new host, the guest's memory is registered with userfaultfd, in
>>>>>>>> either MISSING or MINOR mode (doesn't really matter for this purpose).
>>>>>>>> - On any first access, we get a userfaultfd event. At this point we can
>>>>>>>> communicate with the old host to find out if the page was poisoned.
>>>>>>>> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
>>>>>>>> so any future accesses will SIGBUS. Because the pte is now "present",
>>>>>>>> future accesses won't generate more userfaultfd events, they'll just
>>>>>>>> SIGBUS directly.
>>>>>>>
>>>>>>> I want to clarify the SIGBUS mechanism here when KVM is involved,
>>>>>>> keeping in mind that we need to be able to inject an MCE into the
>>>>>>> guest for this to be useful.
>>>>>>>
>>>>>>> 1. vCPU gets an EPT violation --> KVM attempts GUP.
>>>>>>> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
>>>>>>> 3. KVM finds that GUP failed and returns -EFAULT.
>>>>>>>
>>>>>>> This is different than if GUP found poison, in which case KVM will
>>>>>>> actually queue up a SIGBUS *containing the address of the fault*, and
>>>>>>> userspace can use it to inject an appropriate MCE into the guest. With
>>>>>>> UFFDIO_SIGBUS, we are missing the address!
>>>>>>>
>>>>>>> I see three options:
>>>>>>> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
>>>>>>> this is pointless.
>>>>>>> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
>>>>>>> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
>>>>>>> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
>>>>>>> accesses, just like how we get repeated signals for real poison.
>>>>>>> 3. Use this in conjunction with the additional KVM EFAULT info that
>>>>>>> Anish proposed (the first part of [1]).
>>>>>>>
>>>>>>> I think option 3 is fine. :)
>>>>>>
>>>>>> Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
>>>>>
>>>>> I just remember Axel mentioned this in the commit message, and just in case
>>>>> this is why option 4) was ruled out:
>>>>>
>>>>> They expect that once poisoned, pages can never become
>>>>> "un-poisoned". So, when we live migrate the VM, we need to preserve
>>>>> the poisoned status of these pages.
>>>>>
>>>>> Just to supplement on this point: we do have unpoison (echoing to
>>>>> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
>>>
>>> If I read unpoison_memory() correctly, once there is a real hardware
>>> memory corruption (hw_memory_failure will be set), unpoison will stop
>>> working and return EOPNOTSUPP.
>>>
>>> I know some cloud providers evacuating VMs once a single memory error
>>> happens, so not supporting unpoison is probably not a big deal for
>>> them. BUT others do keep VM running until more errors show up later,
>>> which could be long after the 1st error.
>>
>> We're talking about postcopy migrating a VM has poisoned page on src,
>> rather than on dst host, am I right? IOW, the dest hwpoison should be
>> fake.
>>
>> If so, then I would assume that's the case where all the pages on the dest
>> host is still all good (so hw_memory_failure not yet set, or I doubt the
>> judgement of being a migration target after all)?
>>
>> The other thing is even if dest host has hw poisoned page, I'm not sure
>> whether hw_memory_failure is the only way to solve this.
>>
>> I saw that this is something got worked on before from Zhenwei, David used
>> to have some reasoning on why it was suggested like using a global knob:
>>
>> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
>>
>> Two major issues here afaics:
>>
>> - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
>> having !present in entries but that's x86 specific rather than generic
>> to memory_failure.c.
>>
>> - It is _assumed_ that hwpoison injection is for debugging only.
>>
>> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
>> host just remember all the hardware poisoned pfns (or remember
>> soft-poisoned ones, but then here we need to be careful on removing them
>> from the list when it's hwpoisoned for real)? It sounds like there's
>> opportunity on providing a generic solution rather than relying on
>> !pte_present().
>>
>> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
>> but production systems so as to boost the feature importance with a real
>> use case.
>>
>> So far I'd say it'll be great to leverage what it's already there in linux
>> and make it as generic as possible. The only issue is probably
>> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
>> somehow, or you can simply work around this issue.
>
> As you mention below I think the key distinction is the scope - I
> think MADV_HWPOISON affects the whole system, including other
> processes.
>
> For our purposes, we really just want to "poison" this particular
> virtual address (the HVA, from the VM's perspective), not even other
> mappings of the same shared memory. I think that behavior is different
> from MADV_HWPOISON, at least.
MADV_HWPOISON really is the wrong interface to use. See "man madvise".
We don't want to allow arbitrary users to hwpoison+offline absolutely
healthy physical memory, which is what MADV_HWPOISON is all about.
As you say, we want to turn an unpopulated (!present) virtual address to
mimic like we had a MCE on a page that would have been previously mapped
here: install a hwpoison marker without actually poisoning any present
page. In fact, we'd even want to fail if there *is* something mapped.
Sure, one could teach MADV_HWPOISON to allow unprivileged users to do
that for !present PTE entries, and fail for unprivileged users if there
is a present PTE entry. I'm not sure if that's the cleanest approach,
though, and a new MADV as suggested in this thread would eventually be
cleaner.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists