linux-kernel - Re: [QUESTION] mmap lock in khugepaged

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d5b827a4-81ae-4283-9bb4-2e38adc47d71@redhat.com>
Date: Fri, 29 Nov 2024 12:30:25 +0100
From: David Hildenbrand <david@...hat.com>
To: Dev Jain <dev.jain@....com>, ryan.roberts@....com,
 kirill.shutemov@...ux.intel.com, willy@...radead.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, ziy@...dia.com
Subject: Re: [QUESTION] mmap lock in khugepaged

>>
>> Thinking about it, I am also not sure if most other code can deal with
>> temporary pmd_none(). These don't necessarily take the PMD lock,
>> because "why would they" right now.
>>
>> See walk_pmd_range() as one example, if it spots pmd_none() it assumes
>> "there really is nothing" without checking the PMD lock.
>>
>> As a more concrete example, assume someone calls MADV_DONTNEED and we
>> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing"
>> we'll skip that entry without getting the PMD lock involved. That
>> would mean that you would be breaking MADV_DONTNEED if you managed to
>> collapse or not -- memory would not get discarded.
>>
>> This is a real problem with anonymous memory.
>>
> Yes, I thought of different locking fashions but the problem seems to be
> that any pagetable walker will choose an action path according to the value
> it sees.
> 
>>
>> Unless I am missing something it's all very tricky and there might be
>> a lot of such code that assumes "if I hold a mmap lock / VMA lock in
>> read mode, pmd_none() means there is nothing even without holding the
>> PMD lock when checking".
> 
> Yes, I was betting on the fact that, if the code assumes that pmd_none()
> means there is nothing, eventually when it will take the PMD lock to
> write to it, it will check whether
> the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED
> thingy, thanks.

Note that this is just the tip of the iceberg. Most page table walkers 
that deal with anonymous memory have the same requirement.

>>>>
>>>>
>>>> I recall that for shmem that's "easier", because we don't have to
>>>> reinstall a PMD immediately, we cna be sure that the page table is
>>>> kept empty/unmodified, ...
>>>>
>>>
> 
> All in all, the general solution seems to be that, if I can take all
> pagetable walkers into an invalid state and make them backoff, then I am
> safe.
 > For example, we do not zero out the PMD, we take the pte PTL, we do> 
stuff, then instead of setting the PTEs to zero, we set it to a universal
> invalid state upon which no pagetable walker can take an action; an
> instance of that can be to set the PTE to a swap entry such that the fault
> handler ends up in do_swap_page() ->print_bad_pte(). So now we can take
> the PMD lock (in fact we don't need it since any pagetable walking
> is rendered useless) and populate the PMD to resume the new pagetable
> walking. Another *ridiculous* idea may be to remember the PGD we
> came from and nuke it (but I guess there is an alternate path for that
> in walk_pgd_range() and so on?)


What might work is introducing a PMD marker (note: we don't have PMD 
markers yet) for that purpose. Likely, the PMD marker may only be set 
while the PMD lock is being held, and we must not drop the PMD lock 
temporarily (otherwise people will spin on the value by repeadetly 
(un)locking the PMD lock, which is stupid).


Then you have to make sure that each and every page table walker handles 
that properly.

Then, there is the problem with holding the PMD lock for too long as I 
mentioned.

Last but not least, there are more issues that I haven't even described 
before (putting aside the other issue):


Our page table walkers can handle the transitions:

PMD leaf -> PMD none
* concurrent zap / MADV_DONTNEED / !anon PMD split

PMD leaf -> PTE table
* concurrent anon PMD split
* concurrent zap / MADV_DONTNEED / !anon PMD split + PTE table
   allocation

PTE table (empty) -> PMD none
* concurrent page table reclaim, part of collapsing file THPs
  * collapse_pte_mapped_thp()
  * retract_page_tables()


I think they *cannot* tolerate the transition *properly*:

PTE table (non-empty) -> PMD leaf
* what you want do do ;)

-> Observe how we skip to the next PMD in all page table walkers /  give 
up if pte_offset_map_lock() and friends fail! I even think there are 
more issues hiding there with pte_offset_map_ro_nolock().


Of course, on top of all of this, you should net significantly degrade 
the ordinary page table walker performance ...


I don't want to discourage you, but it's all extremely complicated.

-- 
Cheers,

David / dhildenb