[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200903164032.GM24045@ziepe.ca>
Date: Thu, 3 Sep 2020 13:40:32 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Zi Yan <ziy@...dia.com>
CC: <linux-mm@...ck.org>, Roman Gushchin <guro@...com>,
Rik van Riel <riel@...riel.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
Shakeel Butt <shakeelb@...gle.com>,
Yang Shi <yang.shi@...ux.alibaba.com>,
David Nellans <dnellans@...dia.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64
On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
>
> > On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> >> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> >>
> >>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >>>
> >>>>> Surprised this doesn't touch mm/pagewalk.c ?
> >>>>
> >>>> 1GB PUD page support is present for DAX purpose, so the code is there
> >>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >>>> the functions in mm/pagewalk.c. :)
> >>>
> >>> Yes, but doesn't this change what is possible under the mmap_sem
> >>> without the page table locks?
> >>>
> >>> ie I would expect some thing like pmd_trans_unstable() to be required
> >>> as well for lockless walkers. (and I don't think the pmd code is 100%
> >>> right either)
> >>>
> >>
> >> Right. I missed that. Thanks for pointing it out.
> >> The code like this, right?
> >
> > Technically all those *pud's are racy too, the design here with the
> > _unstable function call always seemed weird. I strongly suspect it
> > should mirror how get_user_pages_fast works for lockless walking
>
> You mean READ_ONCE on page table entry pointer first, then use the value
> for the rest of the loop? I am not quite familiar with this racy check
> part of the code and happy to hear more about it.
There are two main issues with the THPs and lockless walks
- The *pXX value can change at any time, as THPs can be split at any
moment. However, once observed to be a sub page table pointer the
value is fixed under the read side of the mmap (I think, I never
did find the code path supporting this, but everything is busted if
it isn't true...)
- Reading the *pXX without load tearing is difficult on 32 bit arches
So if you do READ_ONCE() it defeats the first problem.
However if the sizeof(*pXX) is 8 on a 32 bit platform then load
tearing is a problem. At lest the various pXX_*() test functions
operate on a single 32 bit word so don't tear, but to to convert the
*pXX to a lower level page table pointer a coherent, untorn, read is
required.
So, looking again, I remember now, I could never quite figure out why
gup_pmd_range() was safe to do:
pmd_t pmd = READ_ONCE(*pmdp);
[..]
} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
ptem = ptep = pte_offset_map(&pmd, addr);
As I don't see what prevents load tearing a 64 bit pmd.. Eg no
pmd_trans_unstable() or equivalent here.
But we see gup_get_pte() using an anti-load tearing technique..
Jason
Powered by blists - more mailing lists