[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210611194249.GS1096940@ziepe.ca>
Date: Fri, 11 Jun 2021 16:42:49 -0300
From: Jason Gunthorpe <jgg@...pe.ca>
To: Hugh Dickins <hughd@...gle.com>
Cc: "Kirill A. Shutemov" <kirill@...temov.name>,
Andrew Morton <akpm@...ux-foundation.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Yang Shi <shy828301@...il.com>,
Wang Yugui <wangyugui@...-tech.com>,
Matthew Wilcox <willy@...radead.org>,
Alistair Popple <apopple@...dia.com>,
Ralph Campbell <rcampbell@...dia.com>, Zi Yan <ziy@...dia.com>,
Peter Xu <peterx@...hat.com>, Will Deacon <will@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 03/11] mm: page_vma_mapped_walk(): use pmd_read_atomic()
On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > index e896ebef8c24cb..0bf1fdec928e71 100644
> > +++ b/arch/x86/include/asm/pgtable-3level.h
> > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> > static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > {
> > pmdval_t ret;
> > - u32 *tmp = (u32 *)pmdp;
> > + u32 *tmp = READ_ONCE((u32 *)pmdp);
> >
> > ret = (pmdval_t) (*tmp);
> > if (ret) {
> > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > * or we can end up with a partial pmd.
> > */
> > smp_rmb();
> > - ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > + ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> > }
>
> Maybe that. Or maybe now (since Will's changes) it can just do
> one READ_ONCE() of the whole, then adjust its local copy.
I think the smb_rmb() is critical here to ensure a PTE table pointer
is coherent, READ_ONCE is not a substitute, unless I am miss
understanding what Will's changes are???
> > @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > * only going to work, if the pmdval_t isn't larger than
> > * an unsigned long.
> > */
> > - return *pmdp;
> > + return READ_ONCE(*pmdp);
> > }
> > #endif
>
> And it should now be possible to delete the #ifdef THP barrier() in
> function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().
Yes - I think I know what you mean :)
> > If I recall, (and I didn't recheck this right now) the only thing
> > pmd_read_atomic() provides is the special property that if the pmd's
> > flags are observed to point to a pte table then pmd_read_atomic() will
> > reliably return the pte table pointer.
>
> I expect your right, I haven't rechecked. But it does also provide that
> special guarantee on matching pmd_none() when half matches pmd_none():
> which one or more callers want, but irrelevant where I added it.
Ah, this I don't know much about.. Hum, I'd have to think about it way
too much to have an opinion if the races around pmd_none are
meaningful enough to resolve with _atomic vs READ_ONCE()
> > As far as is page_vma_mapped_walk correct.. Everything except
> > is_pmd_migration_entry() is fine to my eye, and I simply don't know
> > the rules aroudn migration entries to know right/wrong.
>
> So long as the swap "type" is entirely in the same word as the pte
> flags would be, I think is_pmd_migration_entry() should be fine.
> There it's just looking for a hint, is it worth taking pmd_lock()
> to obtain a reliable pmde.
I wonder if anyone knows to guarantee that in the arches?
> > I suspect it probably is required to manipulate a migration entry
> > while holding the mmap_sem in write mode??
>
> I don't think in terms of mmap_sem/mmap_lock here: this is on the
> rmap lookup path, and typically mmap_lock is not held (whereas I
> was surprised to find the comment above pmd_read_atomic() saying a
> lot about it - another reason it's inappropriate in pvmw I guess).
Ah.. Honestly I'm not familiar with all the ways to lock a VMA so that
the page tables it spans are guaranteed not to be freed.
I just saw the _offset() ladder without any page table spinlocks and
knew this was one of the lockless walkers that, somehow, relies on
another lock to ensure the page table level memory is not concurrently
freed.
In that case I take it back, I have no idea if this is correct or not
because it calls map_pte() which does pte_offset_map() based on that
READ_ONCE.
pte_offset_map() without holding the pmd_lock is only OK if you know
that the pmd can't be upgraded to a THP - and the only locking for
that I have memorized is the mmap lock :\
I can't tell what other lock is protecting the page tables here or if
that lock is held during the THP upgrade process.. Sorry
> > There is some list of changes to the page table that require
> > holding the mmap sem in write mode that I've never seen documented
> > for..
>
> That is a subtler subject than I dare get into at the moment.
> But if you're just doing lookups, pmd_lock() is enough.
It is enough if you take it, I don't see page_vma_mapped_walk() taking
it in the map_pte() flow :\
Jason
Powered by blists - more mailing lists