Message-ID: <CAJHvVcgkQK+YpWhpmHzjBGFUbHLLSoaq9jHfzCH052OEZAWs5w@mail.gmail.com>
Date:   Thu, 11 May 2023 14:05:23 -0700
From:   Axel Rasmussen <axelrasmussen@...gle.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Christian Brauner <brauner@...nel.org>,
        David Hildenbrand <david@...hat.com>,
        Hongchen Zhang <zhanghongchen@...ngson.cn>,
        Huang Ying <ying.huang@...el.com>,
        James Houghton <jthoughton@...gle.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        "Mike Rapoport (IBM)" <rppt@...nel.org>,
        Nadav Amit <namit@...are.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Peter Xu <peterx@...hat.com>, Shuah Khan <shuah@...nel.org>,
        ZhangPeng <zhangpeng362@...wei.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        Jiaqi Yan <jiaqiyan@...gle.com>
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl

On Thu, May 11, 2023 at 1:40 PM Axel Rasmussen <axelrasmussen@...gle.com> wrote:
>
> On Thu, May 11, 2023 at 1:29 PM Mike Kravetz <mike.kravetz@...cle.com> wrote:
> >
> > On 05/11/23 11:24, Axel Rasmussen wrote:

Apologies for the noise. I should have CC'ed +Jiaqi on this series
too, since he is working on other parts of the memory poisoning /
recovery work internally.

> > > The basic idea here is to "simulate" memory poisoning for VMs. A VM
> > > running on some host might encounter a memory error, after which some
> > > page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> > > once poisoned, pages can never become "un-poisoned". So, when we live
> > > migrate the VM, we need to preserve the poisoned status of these pages.
> > >
> > > When live migrating, we try to get the guest running on its new host as
> > > quickly as possible. So, we start it running before all memory has been
> > > copied, and before we're certain which pages should be poisoned or not.
> > >
> > > So the basic way to use this new feature is:
> > >
> > > - On the new host, the guest's memory is registered with userfaultfd, in
> > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > - On any first access, we get a userfaultfd event. At this point we can
> > >   communicate with the old host to find out if the page was poisoned.
> >
> > Just curious, what is this communication channel with the old host?
>
> James can probably describe it in more detail / more correctly than I
> can. My (possibly wrong :) ) understanding is:
>
> On the source machine we maintain a bitmap indicating which pages are
> clean or dirty (meaning, modified after the initial "precopy" of
> memory to the target machine) or poisoned. Eventually the entire
> bitmap is sent to the target machine, but this takes some time (maybe
> seconds on large machines). After this point, though, we have all the
> information we need, so we no longer need to communicate with the
> source to find out the status of pages (although there may still be
> some memory contents to finish copying over).
>
> In the meantime, I think the target machine can also ask the source
> machine about the status of individual pages (for quick on-demand
> paging).
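
As an illustration only (this is not the internal implementation, which
is not public), a per-page status bitmap like the one described above
could be sketched as follows. The names page_status, status_map and the
two-bits-per-page packing are all assumptions made for this example. The
only behaviour taken from the thread is that a page is clean, dirty, or
poisoned, and that a poisoned page never becomes "un-poisoned".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-page states; names are illustrative only. */
enum page_status { PAGE_CLEAN = 0, PAGE_DIRTY = 1, PAGE_POISONED = 2 };

/* Two bits per page, packed four pages to a byte. */
struct status_map {
    size_t npages;
    uint8_t *bits;
};

/* Allocate a map in which every page starts out CLEAN. Returns 0 on success. */
static int status_map_init(struct status_map *m, size_t npages)
{
    m->npages = npages;
    m->bits = calloc((npages + 3) / 4, 1);
    return m->bits ? 0 : -1;
}

static enum page_status status_map_get(const struct status_map *m, size_t page)
{
    assert(page < m->npages);
    return (enum page_status)((m->bits[page / 4] >> ((page % 4) * 2)) & 3u);
}

/* Record a new status. Poison is sticky: a poisoned page stays poisoned. */
static void status_map_set(struct status_map *m, size_t page, enum page_status s)
{
    assert(page < m->npages);
    if (status_map_get(m, page) == PAGE_POISONED)
        return;
    unsigned shift = (unsigned)(page % 4) * 2;
    unsigned cleared = m->bits[page / 4] & ~(3u << shift);
    m->bits[page / 4] = (uint8_t)(cleared | ((unsigned)s << shift));
}

static void status_map_free(struct status_map *m)
{
    free(m->bits);
    m->bits = NULL;
    m->npages = 0;
}
```

In this sketch the source side would call status_map_set() as pages are
dirtied or poisoned, and the target side would call status_map_get()
once the whole map has arrived, instead of asking the source per page.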
>
> As for the underlying mechanism, it's an internal protocol but the
> publicly-available thing it's most similar to is probably gRPC [1]. At
> a really basic level, we send binary serialized protocol buffers [2]
> over the network in a request / response fashion.
>
> [1] https://grpc.io/
> [2] https://protobuf.dev/
>
> > --
> > Mike Kravetz
> >
> > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > >   future accesses won't generate more userfaultfd events, they'll just
> > >   SIGBUS directly.
> > >
> > > UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
> > > isn't needed, because during live migration we want to intercept
> > > all accesses with userfaultfd (not just writes, so WP mode isn't useful
> > > for this). So whether minor or missing mode is being used (or both), the
> > > PTE won't be present in any case, so handling that case isn't needed.
> > >