linux-kernel - Re: [QUESTION] mmap lock in khugepaged

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com>
Date: Fri, 29 Nov 2024 12:40:08 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, ryan.roberts@....com,
 kirill.shutemov@...ux.intel.com, willy@...radead.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, ziy@...dia.com
Subject: Re: [QUESTION] mmap lock in khugepaged


+the list and Zi Yan.

On 28/11/24 5:04 pm, David Hildenbrand wrote:
>>
>>
>>> Maybe, we'd have to do the isolation+copy under the PMD lock. And
>>> currently, we have to drop the PMD lock in order to have the
>>> pte_offset_map_lock() work IIRC.
>>>
>>
>> Is there a problem in holding two page table locks simultaneously?
>
> Depending on CONFIG_SPLIT_PTE_PTLOCKS, it might not even be two locks 
> (I assume we could have such configs with khugepaged).
>
> Not sure if there could be an issue with lock inversion.
>
> So I suspect this no not be 100% trivial :)
>
>>
>>
>>>
>>> Most importantly, the copy that currently runs under no spinlocks
>>> would now run under spinlock. Up to 512 MiB on arm64 64K, not sure if
>>> that can be a problem ... we currently seem to take care of that
>>
>> But we already are taking mmap_write_lock(), so that should not matter?
>
> We are dealing with a spinlock vs. a rwsem.
>
> We usually want to avoid holding spinlocks for an excessive amount of 
> time, because all other CPUs waiting for that lock will ... spin with 
> preemption disabled instead of rescheduling and doing something useful.
>

Ah okay.


>
> Further, without CONFIG_SPLIT_PMD_PTLOCKS, in fact everybody who wnats 
> to take a PMD lock in that process would be spinning on the same PMD 
> lock :)
>
>
>> I mean, if we can get rid of the mmap exclusive lock, then the copying
>> would still be a bottleneck, and all fault handlers will back off, but
>
> I'm trying to digest it once again, but I'm afraid I don't see how 
> fault handlers will back off.
>
> Won't they either see pmd_none(), to then either call pte_alloc_one() 
> where they would spin on the PMD lock, or try allocating a PMD THP to 
> insert it, and then spin on the PMD lock, to figure out later that it 
> was all in vain?

Ya that's what I meant when I said "back off", sorry I wasn't clear, I 
didn't mean VM_FAULT_FALLBACK/RETRY or something...yes, the work will be 
in vain and the MMU will refault...
>
>
> Thinking about it, I am also not sure if most other code can deal with 
> temporary pmd_none(). These don't necessarily take the PMD lock, 
> because "why would they" right now.
>
> See walk_pmd_range() as one example, if it spots pmd_none() it assumes 
> "there really is nothing" without checking the PMD lock.
>
> As a more concrete example, assume someone calls MADV_DONTNEED and we 
> end up in zap_pmd_range(). If we assume "pmd_none() == really nothing" 
> we'll skip that entry without getting the PMD lock involved. That 
> would mean that you would be breaking MADV_DONTNEED if you managed to 
> collapse or not -- memory would not get discarded.
>
> This is a real problem with anonymous memory.
>
Yes, I thought of different locking fashions but the problem seems to be 
that any pagetable walker will choose an action path according to the value
it sees.

>
> Unless I am missing something it's all very tricky and there might be 
> a lot of such code that assumes "if I hold a mmap lock / VMA lock in 
> read mode, pmd_none() means there is nothing even without holding the 
> PMD lock when checking".

Yes, I was betting on the fact that, if the code assumes that pmd_none() 
means there is nothing, eventually when it will take the PMD lock to 
write to it, it will check whether
the PMD changed, and back off. I wasn't aware of the MADV_DONTNEED 
thingy, thanks.
>
>
>> at least processes will be able to mmap() and do stuff with their VMAs,
>> and I would guess that this is worth optimizing...
>
> It would certainly be interesting to get rid of the mmap lock in write 
> mode here, but it's all rater tricky (and the code has rather nasty 
> hidden implications).
>
>>>
>>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>> if (pte) {
>>>      ...
>>>      spun_unlock(pte);
>>> } ...
>>>
>>> result = __collapse_huge_page_copy(...);
>>> pte_unmap(pte);
>>>
>>>
>>> Deep in __collapse_huge_page_copy() we seem to re-rake the PTL lock.
>>> No-split-spinlock confiogs might be problematic ...
>>
>> Could you elaborate a little? I haven't read about the older config...
>
> See above regarding CONFIG_SPLIT_PTE_PTLOCKS and friends.
>
>>>
>>>
>>> I recall that for shmem that's "easier", because we don't have to
>>> reinstall a PMD immediately, we cna be sure that the page table is
>>> kept empty/unmodified, ...
>>>
>>

All in all, the general solution seems to be that, if I can take all 
pagetable walkers into an invalid state and make them backoff, then I am 
safe.
For example, we do not zero out the PMD, we take the pte PTL, we do 
stuff, then instead of setting the PTEs to zero, we set it to a universal
invalid state upon which no pagetable walker can take an action; an 
instance of that can be to set the PTE to a swap entry such that the fault
handler ends up in do_swap_page() ->print_bad_pte(). So now we can take 
the PMD lock (in fact we don't need it since any pagetable walking
is rendered useless) and populate the PMD to resume the new pagetable 
walking. Another *ridiculous* idea may be to remember the PGD we
came from and nuke it (but I guess there is an alternate path for that 
in walk_pgd_range() and so on?)


>
>