linux-kernel - Re: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb

Message-ID: <c5d839da-cb7e-e887-11a6-30ccd1c3c845@nvidia.com>
Date:   Tue, 6 Dec 2022 14:31:30 -0800
From:   John Hubbard <jhubbard@...dia.com>
To:     Peter Xu <peterx@...hat.com>
CC:     Mike Kravetz <mike.kravetz@...cle.com>, <linux-mm@...ck.org>,
        <linux-kernel@...r.kernel.org>,
        James Houghton <jthoughton@...gle.com>,
        "Jann Horn" <jannh@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Andrea Arcangeli" <aarcange@...hat.com>,
        Rik van Riel <riel@...riel.com>,
        Nadav Amit <nadav.amit@...il.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Muchun Song <songmuchun@...edance.com>,
        David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd
 unshare

On 12/6/22 13:51, Peter Xu wrote:
> On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote:
>> On 12/6/22 08:45, Peter Xu wrote:
>>> I've got a fixup attached.  John, since this got your attention please also
>>> have a look too in case there's further issues.
>>>
>>
>> Well, one question: Normally, the pattern of "release_lock(A); call f();
>> acquire_lock(A);" is tricky, because one must revalidate that the state
>> protected by A has not changed while the lock was released. However, in
>> this case, it's letting page fault handling proceed, which already
>> assumes that pages might be gone, so generally that seems OK.
> 
> Yes it's tricky, but not as tricky in this case.
> 
> I hope my documentation supplemented that (in the fixup patch):
> 
> + * @hugetlb_entry:     if set, called for each hugetlb entry.  Note that
> + *                     currently the hook function is protected by hugetlb
> + *                     vma lock to make sure pte_t* and the spinlock is valid
> + *                     to access.  If the hook function needs to yield the

So far so good...

> + *                     thread or retake the vma lock for some reason, it
> + *                     needs to properly release the vma lock manually,
> + *                     and retake it before the function returns.

...but you can actually delete this second sentence. It does not add
any real information--clearly, if you must drop the lock, then you must
"manually" drop the lock.

And it still doesn't answer my original question, which I don't think
I've fully communicated. Basically, what can happen to the protected
data during the time when the lock is not held?

> 
> The vma lock here makes sure the pte_t and the pgtable spinlock are
> stable.  Without the lock, they're prone to be freed in parallel.
> 

Yes, but think about this: if the vma lock protects against the pte
going away, then:

lock()
    get a pte
unlock()

...let hmm_vma_fault() or cond_resched() run...

lock()
    ...whoops, something else released the pte that I'd previously
    retrieved.
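
To make the concern concrete, here is a rough userspace sketch of the
same pattern. It uses pthreads rather than kernel locking, and every
name in it (struct table, table_unshare(), walk_entry(), the gen
counter) is hypothetical, made up for illustration rather than taken
from the patch. It only shows that a pointer looked up before the lock
was dropped has to be treated as stale afterward unless something
proves otherwise:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table guarded by a "vma lock". */
struct table {
	pthread_mutex_t lock;
	unsigned long gen;	/* bumped whenever slot is invalidated */
	int *slot;		/* stand-in for a pte_t pointer */
};

/* Stand-in for an unsharer: invalidates the slot under the lock. */
static void table_unshare(struct table *t)
{
	pthread_mutex_lock(&t->lock);
	t->slot = NULL;
	t->gen++;
	pthread_mutex_unlock(&t->lock);
}

/*
 * Look up the entry, drop the lock around slow_path (think: a fault
 * handler that may sleep), then retake the lock and revalidate.
 * Returns true if the earlier lookup is still good, false if the
 * caller has to look the entry up again.
 */
static bool walk_entry(struct table *t, void (*slow_path)(struct table *))
{
	assert(t != NULL);

	pthread_mutex_lock(&t->lock);
	int *entry = t->slot;		/* "get a pte" */
	unsigned long seen = t->gen;
	pthread_mutex_unlock(&t->lock);

	if (slow_path)
		slow_path(t);		/* runs without the lock held */

	pthread_mutex_lock(&t->lock);
	/* Revalidate: anything may have changed while unlocked. */
	bool still_valid = (entry != NULL && seen == t->gen);
	pthread_mutex_unlock(&t->lock);

	return still_valid;
}
```

If the slow path (or any other thread) unshares while the lock is
dropped, walk_entry() returns false and the old pointer must not be
used. Whether the hugetlb walker needs an equivalent check, or simply
never touches the old pte again after retaking the lock, is exactly
what I'm asking.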

>>
>> However, I'm lagging behind on understanding what the vma lock actually
>> protects. It seems to be a hugetlb-specific protection for concurrent
>> freeing of the page tables?
> 
> Not exactly freeing, but unsharing.  Mike probably has more to say.  The
> series is here:
> 
> https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@oracle.com/#t
> 
>> If so, then running a page fault handler seems safe. If there's something
>> else it protects, then we might need to revalidate that after
>> re-acquiring the vma lock.
> 
> Nothing to validate here.  The only reason to retake the vma lock is to
> match the caller, which assumes the lock is held, so either it'll be
> released very soon or it prepares for the next hugetlb pgtable walk
> (huge_pte_offset).
> 

ummm, see above. :)

>>
>> Also, scattering hugetlb-specific locks throughout mm seems like an
>> unfortunate thing. I wonder if there is a longer term plan to Not Do
>> That?
> 
> So far HMM is really the only one - normally the hugetlb_entry() hook is
> pretty light, so not really throughout the whole mm yet.  It's not even
> urgently needed for the other two places calling cond_resched().  I added
> it mostly just for completeness, and with the slight hope that maybe we
> can yield earlier for some pmd unsharers.
> 
> But yes, it's unfortunate; I just didn't come up with a good solution.
> Suggestions are always welcome.
> 

I guess it's on me to think of something cleaner, so if I do I'll pipe
up. :)

thanks,
-- 
John Hubbard
NVIDIA