[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1b7c7056-d742-86bf-fec-fdb024b2381@google.com>
Date: Tue, 22 Aug 2023 11:34:19 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Peter Xu <peterx@...hat.com>
cc: Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...nel.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Qi Zheng <zhengqi.arch@...edance.com>,
Yang Shi <shy828301@...il.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>,
Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Alistair Popple <apopple@...dia.com>,
Ralph Campbell <rcampbell@...dia.com>,
Ira Weiny <ira.weiny@...el.com>,
Steven Price <steven.price@....com>,
SeongJae Park <sj@...nel.org>,
Lorenzo Stoakes <lstoakes@...il.com>,
Huang Ying <ying.huang@...el.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Zack Rusin <zackr@...are.com>, Jason Gunthorpe <jgg@...pe.ca>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Minchan Kim <minchan@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
Song Liu <song@...nel.org>,
Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
Russell King <linux@...linux.org.uk>,
"David S. Miller" <davem@...emloft.net>,
Michael Ellerman <mpe@...erman.id.au>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Claudio Imbrenda <imbrenda@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Vishal Moola <vishal.moola@...il.com>,
Vlastimil Babka <vbabka@...e.cz>, Zi Yan <ziy@...dia.com>,
Zach O'Keefe <zokeefe@...gle.com>,
Linux ARM <linux-arm-kernel@...ts.infradead.org>,
sparclinux@...r.kernel.org,
linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
linux-s390 <linux-s390@...r.kernel.org>,
kernel list <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp()
versus uffd
On Tue, 22 Aug 2023, Peter Xu wrote:
> On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
...
> > >
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> >
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know). It may be a
> > better idea anyway (saving a drop and retake of ptlock).
> >
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> >
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
>
> No strong opinion here, either. Just one trivial comment/question below on
> the new patch (if that will be preferred)..
I'm going to settle for the original v1 for now (I'll explain why in reply
to Jann next) - which already has the blessing of your Acked-by, thanks.
(Yes, the locking is a bit confusing: but mainly for the unrelated reason,
that with the split locking configs, we never quite know whether this lock
is the same as that lock or not, and so have to be rather careful.)
> > [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
...
> > @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > haddr, haddr + HPAGE_PMD_SIZE);
> > mmu_notifier_invalidate_range_start(&range);
> > notified = true;
> > - start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > - if (!start_pte) /* mmap_lock + page lock should prevent this */
> > - goto abort;
> > + spin_lock(ptl);
>
> .. here will the ptl always be valid?
>
> That comes from the previous round of pte_offset_map_lock(), and I assume
> after this whole "thp collapse without write lock" work landed, it has the
> same lifecycle with the *pte pointer, so can be invalid right after the rcu
> read lock released; mmap read lock isn't strong enough to protect the ptl,
> not anymore.
>
> Maybe it's all fine because the thp collapse path is the solo path(s) that
> will release the pte pgtable page without write mmap lock (so as to release
> the ptl too when doing so), and we at least still hold the page lock, so
> the worst case is the other concurrent "thp collapse" will still serialize
> with this one on the huge page lock. But that doesn't look as solid as
> fetching again the ptl from another pte_offset_map_nolock(). So still just
> raise this question up. It's possible I just missed something.
It is safe, as you say because of us holding the hpage lock, which stops
any racing callers of collapse_pte_mapped_thp() or retract_page_tables():
and these are the functions which (currently) make the *pmd transition
which pte_offset_map_lock() etc. are being careful to guard against.
[In future we can imagine empty page table removal making that transition
too: and that wouldn't even have any hpage to lock. Will it rely on
mmap_lock for write? or pmd_lock? probably both, but no need to design
for it now.]
But I agree that it does *look* more questionable in this patch: there was
a reassuring pte_offset_map_lock() there before, and now I rely more on
the assumptions and just use the "previous" ptl (and that's why I chose
to make the !start_pte case a VM_BUG_ON a few lines later).
I expect, with more time spent, I could cast it back into more reassuring
form: but it's all a bit of a con trick - if you look further down (even
before v2 or v1 fixes) to "step 4", there we have "if (ptl != pml)" which
is also relying on the fact that ptl cannot have changed. And no doubt
that too could be recast into more reassuring-looking form, but it
wouldn't actually be worthwhile.
Thanks for considering these, Peter: I'll recommend v1 to Andrew.
Hugh
Powered by blists - more mailing lists