[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <82d294-c9b0-d7b4-71c9-cfed3925c47b@google.com>
Date: Tue, 22 Aug 2023 11:54:39 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Jann Horn <jannh@...gle.com>
cc: Hugh Dickins <hughd@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...nel.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Qi Zheng <zhengqi.arch@...edance.com>,
Yang Shi <shy828301@...il.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Xu <peterx@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Alistair Popple <apopple@...dia.com>,
Ralph Campbell <rcampbell@...dia.com>,
Ira Weiny <ira.weiny@...el.com>,
Steven Price <steven.price@....com>,
SeongJae Park <sj@...nel.org>,
Lorenzo Stoakes <lstoakes@...il.com>,
Huang Ying <ying.huang@...el.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Zack Rusin <zackr@...are.com>, Jason Gunthorpe <jgg@...pe.ca>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Minchan Kim <minchan@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
Song Liu <song@...nel.org>,
Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
Russell King <linux@...linux.org.uk>,
"David S. Miller" <davem@...emloft.net>,
Michael Ellerman <mpe@...erman.id.au>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Claudio Imbrenda <imbrenda@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Vishal Moola <vishal.moola@...il.com>,
Vlastimil Babka <vbabka@...e.cz>, Zi Yan <ziy@...dia.com>,
Zach O'Keefe <zokeefe@...gle.com>,
Linux ARM <linux-arm-kernel@...ts.infradead.org>,
sparclinux@...r.kernel.org,
linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
linux-s390 <linux-s390@...r.kernel.org>,
kernel list <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp()
versus uffd
On Tue, 22 Aug 2023, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@...gle.com> wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
> > > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@...gle.com> wrote:
> > > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > > it gives any protection against this case itself, but because ptlock
> > > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > > In other cases, continue to minimize the pmd_lock() hold time.
> > >
> > > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > > I also can't find anything other than userfaultfd that would insert
> > > pages into regions that are khugepaged-compatible, so I guess this
> > > works?
> >
> > I'm as sure as I can be that it's solely because userfaultfd breaks
> > the usual rules here (and in fairness, IIRC Andrea did ask my permission
> > before making it behave that way on shmem, COWing without a source page).
> >
> > Perhaps something else will want that same behaviour in future (it's
> > tempting, but difficult to guarantee correctness); for now, it is just
> > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > expecting uffd to add more such exceptional modes in future).
>
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)
I see the mailthread has taken inspiration from your comment there,
and veered off in that direction: but I'll ignore those futures.
>
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> >
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know). It may be a
> > better idea anyway (saving a drop and retake of ptlock).
>
> I was thinking it also has the advantage that it would still perform
> okay if we got rid of the userfaultfd_armed() condition at some point
> - though I realize that designing too much for hypothetical future
> features is an antipattern.
>
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> >
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
>
> I prefer this version because it would make it easier to remove the
> "userfaultfd_armed()" check in the future if we have to, but I guess
> we could also always change it later if that becomes necessary, so I
> don't really have strong feelings on it at this point.
Thanks for considering them both, Jann. I do think your trylock way,
as in v2, is in principle superior, and we may well have good reason
to switch over to it in future; but I find it slightly more confusing,
so will follow your and Peter's "no strong feelings" for now, and ask
Andrew please to take the original (implicit v1).
Overriding reason: I realized overnight that v2 is not quite correct:
I was clever enough to realize that nr_ptes needed to be reset to 0
to get the accounting right with a recheck pass, but not clever enough
to realize that resetting it to 0 there would likely skip the abort
path's flush_tlb_mm(mm), when we actually had cleared entries on the
first pass. It needs a separate bool to decide the flush_tlb_mm(mm),
or it needs that (ridiculously minor!) step 3 to be moved down.
But rather than reworking it, please let's just go with v1 for now.
Thanks,
Hugh
Powered by blists - more mailing lists