[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <fe96ea70-5796-483b-ac1f-e58feeb63676@arm.com>
Date: Mon, 2 Dec 2024 15:24:18 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, ryan.roberts@....com,
kirill.shutemov@...ux.intel.com, willy@...radead.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, ziy@...dia.com
Subject: Re: [QUESTION] mmap lock in khugepaged
On 29/11/24 5:00 pm, David Hildenbrand wrote:
>>>
>>> Thinking about it, I am also not sure if most other code can deal with
>>> temporary pmd_none(). These don't necessarily take the PMD lock,
>>> because "why would they" right now.
>>>
>>> See walk_pmd_range() as one example, if it spots pmd_none() it assumes
>>> "there really is nothing" without checking the PMD lock.
>>>
>>> As a more concrete example, assume someone calls MADV_DONTNEED and we
>>> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
>>> we'll skip that entry without getting the PMD lock involved. That
>>> would mean that you would be breaking MADV_DONTNEED if you managed to
>>> collapse or not -- memory would not get discarded.
>>>
>>> This is a real problem with anonymous memory.
>>>
>> Yes, I thought of different locking fashions but the problem seems to be
>> that any pagetable walker will choose an action path according to the
>> value
>> it sees.
>>
>>>
>>> Unless I am missing something it's all very tricky and there might be
>>> a lot of such code that assumes "if I hold a mmap lock / VMA lock in
>>> read mode, pmd_none() means there is nothing even without holding the
>>> PMD lock when checking".
>>
>> Yes, I was betting on the fact that, if the code assumes that pmd_none()
>> means there is nothing, eventually when it will take the PMD lock to
>> write to it, it will check whether
>> the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED
>> thingy, thanks.
>
> Note that this is just the tip of the iceberg. Most page table walkers
> that deal with anonymous memory have the same requirement.
>
>>>>>
>>>>>
>>>>> I recall that for shmem that's "easier", because we don't have to
>>>>> reinstall a PMD immediately, we cna be sure that the page table is
>>>>> kept empty/unmodified, ...
>>>>>
>>>>
>>
>> All in all, the general solution seems to be that, if I can take all
>> pagetable walkers into an invalid state and make them backoff, then I am
>> safe.
> > For example, we do not zero out the PMD, we take the pte PTL, we do>
> stuff, then instead of setting the PTEs to zero, we set it to a universal
>> invalid state upon which no pagetable walker can take an action; an
>> instance of that can be to set the PTE to a swap entry such that the
>> fault
>> handler ends up in do_swap_page() ->print_bad_pte(). So now we can take
>> the PMD lock (in fact we don't need it since any pagetable walking
>> is rendered useless) and populate the PMD to resume the new pagetable
>> walking. Another *ridiculous* idea may be to remember the PGD we
>> came from and nuke it (but I guess there is an alternate path for that
>> in walk_pgd_range() and so on?)
>
>
> What might work is introducing a PMD marker (note: we don't have PMD
> markers yet) for that purpose. Likely, the PMD marker may only be set
> while the PMD lock is being held, and we must not drop the PMD lock
> temporarily (otherwise people will spin on the value by repeadetly
> (un)locking the PMD lock, which is stupid).
>
>
> Then you have to make sure that each and every page table walker
> handles that properly.
>
> Then, there is the problem with holding the PMD lock for too long as I
> mentioned.
>
> Last but not least, there are more issues that I haven't even
> described before (putting aside the other issue):
>
>
> Our page table walkers can handle the transitions:
>
> PMD leaf -> PMD none
> * concurrent zap / MADV_DONTNEED / !anon PMD split
>
> PMD leaf -> PTE table
> * concurrent anon PMD split
> * concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
> allocation
>
> PTE table (empty) -> PMD none
> * concurrent page table reclaim, part of collapsing file THPs
> * collapse_pte_mapped_thp()
> * retract_page_tables()
>
>
> I think they *cannot* tolerate the transition *properly*:
>
> PTE table (non-empty) -> PMD leaf
> * what you want do do ;)
>
> -> Observe how we skip to the next PMD in all page table walkers /
> give up if pte_offset_map_lock() and friends fail! I even think there
> are more issues hiding there with pte_offset_map_ro_nolock().
>
>
> Of course, on top of all of this, you should net significantly degrade
> the ordinary page table walker performance ...
>
>
> I don't want to discourage you, but it's all extremely complicated.
>
Thanks for the detailed reply!
Powered by blists - more mailing lists