[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y6Om/dvlt1Wl2uZw@monkey>
Date: Wed, 21 Dec 2022 16:38:21 -0800
From: Mike Kravetz <mike.kravetz@...cle.com>
To: James Houghton <jthoughton@...gle.com>
Cc: Peter Xu <peterx@...hat.com>,
Muchun Song <songmuchun@...edance.com>,
David Hildenbrand <david@...hat.com>,
David Rientjes <rientjes@...gle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Mina Almasry <almasrymina@...gle.com>,
Zach O'Keefe <zokeefe@...gle.com>,
Manish Mishra <manish.mishra@...anix.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
"Dr . David Alan Gilbert" <dgilbert@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Vlastimil Babka <vbabka@...e.cz>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Yang Shi <shy828301@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 33/47] userfaultfd: add
UFFD_FEATURE_MINOR_HUGETLBFS_HGM
On 12/21/22 19:02, James Houghton wrote:
> On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz <mike.kravetz@...cle.com> wrote:
> >
> > On 12/21/22 17:10, Peter Xu wrote:
> > > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > > > On 12/21/22 15:21, James Houghton wrote:
> > > > > Thanks for bringing this up, Peter. I think the main reason was:
> > > > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > > > and is not supported.
> > > >
> > > > IIRC, I think we wanted to initially limit the usage to the very
> > > > specific use case (live migration). The idea is that we could then
> > > > expand usage as more use cases came to light.
> > > >
> > > > Another good thing is that userfaultfd has versioning built into the
> > > > API. Thus a user can determine if HGM is enabled in their running
> > > > kernel.
> > >
> > > I don't worry much on this one, afaiu if we have any way to enable hgm then
> > > the user can just try enabling it on a test vma, just like when an app
> > > wants to detect whether a new madvise() is present on the current host OS.
>
> That would be enough to test if HGM was merely present, but if
> specific features like 4K UFFDIO_CONTINUEs or 4K UFFDIO_WRITEPROTECTs
> were available. You could always check these by making a HugeTLB VMA
> and setting it up correctly for userfaultfd/etc., but that's a little
> messy.
>
> > >
> > > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> > > would work too.
>
> I'm not opposed to this.
>
> > >
> > > >
> > > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > > > allowed as of this patch series, but it could be allowed in the
> > > > > future. To add support in the same way as this series, we would add
> > > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > > > having to add another feature isn't great; is this what you're
> > > > > concerned about?
> > > > >
> > > > > Considering MADV_ENABLE_HUGETLB...
> > > > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > > > high-granularities, provided the support exists", but it becomes
> > > > > unclear to userspace to know what's supported and what isn't.
> > > > > 2. We would then need to keep track if a user explicitly enabled it,
> > > > > or if it got enabled automatically in response to memory poison, for
> > > > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > > > got enabled for poison, suddenly userspace would be allowed to do
> > > > > things it wasn't allowed to do before.)
> > >
> > > We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> > > hgm_enabled. (a) always sets (b) but not vice versa. We can limit poison
> > > to set (b) only. For this patchset, it can be all about (a).
>
> My thoughts exactly. :)
>
> > >
> > > > > 3. This API makes sense for enabling HGM for something outside of
> > > > > userfaultfd, like MADV_DONTNEED.
> > > >
> > > > I think #3 is key here. Once we start applying HGM to things outside
> > > > userfaultfd, then more thought will be required on APIs. The API is
> > > > somewhat limited by design until the basic functionality is in place.
> > >
> > > Mike, could you elaborate what's the major concern of having hgm used
> > > outside uffd and live migration use cases?
> > >
> > > I feel like I miss something here. I can understand we want to limit the
> > > usage only when the user specifies using hgm because we want to keep the
> > > old behavior intact. However if we want another way to enable hgm it'll
> > > still need one knob anyway even outside uffd, and I thought that'll service
> > > the same purpose, or maybe not?
> >
> > I am not opposed to using hgm outside the use cases targeted by this series.
> >
> > It seems that when we were previously discussing the API we spent a bunch of
> > time going around in circles trying to get the API correct. That is expected
> > as it is more difficult to take all users/uses/abuses of the API into account.
> >
> > Since the initial use case was fairly limited, it seemed like a good idea to
> > limit the API to userfaultfd. In this way we could focus on the underlying
> > code/implementation and then expand as needed. Of course, with an eye on
> > anything that may be a limiting factor in the future.
> >
> > I was not aware of the uffd-wp use case, and am more than happy to discuss
> > expanding the API.
>
> So considering two API choices:
>
> 1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
> UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
> UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
> allowing high-granularity choices (not sure if this is bad; we started
> allowing it for HugeTLB recently with no other API change, AFAIA).
I don't think we can just start allowing HGM for MADV_DONTNEED without
some type of user interaction/request. Otherwise, a user that passes
in non-hugetlb page size requests may get unexpected results. And, one
of the threads about MADV_DONTNEED points out a valid use cases where
the caller may not know the mapping is hugetlb or not and is likely to
pass in non-hugetlb page size requests.
> 2. MADV_ENABLE_HGM or something similar. The changes to
> UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
> provided they are implemented.
>
> I don't mind one way or the other. Peter, I assume you prefer #2.
> Mike, what about you? If we decide on something other than #1, I'll
> make the change before sending v1 out.
Since I do not believe 1) is an option, MADV_ENABLE_HGM might be the way
to go. Any thoughts about MADV_ENABLE_HGM? I'm thinking:
- Make it have same restrictions as other madvise hugetlb calls,
. addr must be huge page aligned
. length is rounded down to a multiple of huge page size
- We split the vma as required
- Flags carrying HGM state reside in the hugetlb_shared_vma_data struct
--
Mike Kravetz
Powered by blists - more mailing lists