[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c5d839da-cb7e-e887-11a6-30ccd1c3c845@nvidia.com>
Date: Tue, 6 Dec 2022 14:31:30 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Peter Xu <peterx@...hat.com>
CC: Mike Kravetz <mike.kravetz@...cle.com>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>,
James Houghton <jthoughton@...gle.com>,
"Jann Horn" <jannh@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"Andrea Arcangeli" <aarcange@...hat.com>,
Rik van Riel <riel@...riel.com>,
Nadav Amit <nadav.amit@...il.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Muchun Song <songmuchun@...edance.com>,
David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd
unshare
On 12/6/22 13:51, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote:
>> On 12/6/22 08:45, Peter Xu wrote:
>>> I've got a fixup attached. John, since this got your attention please also
>>> have a look too in case there's further issues.
>>>
>>
>> Well, one question: Normally, the pattern of "release_lock(A); call f();
>> acquire_lock(A);" is tricky, because one must revalidate that the state
>> protected by A has not changed while the lock was released. However, in
>> this case, it's letting page fault handling proceed, which already
>> assumes that pages might be gone, so generally that seems OK.
>
> Yes it's tricky, but not as tricky in this case.
>
> I hope my documentation supplemented that (in the fixup patch):
>
> + * @hugetlb_entry: if set, called for each hugetlb entry. Note that
> + * currently the hook function is protected by hugetlb
> + * vma lock to make sure pte_t* and the spinlock is valid
> + * to access. If the hook function needs to yield the
So far so good...
> + * thread or retake the vma lock for some reason, it
> + * needs to properly release the vma lock manually,
> + * and retake it before the function returns.
...but you can actually delete this second sentence. It does not add
any real information--clearly, if you must drop the lock, then you must
"manually" drop the lock.
And it still ignores my original question, which I don't think I've
fully communicated. Basically, what can happen to the protected data
during the time when the lock is not held?
>
> The vma lock here makes sure the pte_t and the pgtable spinlock being
> stable. Without the lock, they're prone to be freed in parallel.
>
Yes, but think about this: if the vma lock protects against the pte
going away, then:
lock()
get a pte
unlock()
...let hmm_vma_fault() cond_resched() run...
lock()
...whoops, something else release the pte that I'd previously
retrieved.
>>
>> However, I'm lagging behind on understanding what the vma lock actually
>> protects. It seems to be a hugetlb-specific protection for concurrent
>> freeing of the page tables?
>
> Not exactly freeing, but unsharing. Mike probably has more to say. The
> series is here:
>
> https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t
>
>> If so, then running a page fault handler seems safe. If there's something
>> else it protects, then we might need to revalidate that after
>> re-acquiring the vma lock.
>
> Nothing to validate here. The only reason to take the vma lock is to match
> with the caller who assumes the lock taken, so either it'll be released
> very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset).
>
ummm, see above. :)
>>
>> Also, scattering hugetlb-specific locks throughout mm seems like an
>> unfortuate thing, I wonder if there is a longer term plan to Not Do
>> That?
>
> So far HMM is really the only one - normally hugetlb_entry() hook is pretty
> light, so not really throughout the whole mm yet. It's even not urgently
> needed for the other two places calling cond_sched(), I added it mostly
> just for completeness, and with the slight hope that maybe we can yield
> earlier for some pmd unsharers.
>
> But yes it's unfortunate, I just didn't come up with a good solution.
> Suggestion is always welcomed.
>
I guess it's on me to think of something cleaner, so if I do I'll pipe
up. :)
thanks,
--
John Hubbard
NVIDIA
Powered by blists - more mailing lists