Message-ID: <6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com>
Date: Fri, 16 Aug 2024 10:55:21 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: david@...hat.com, hughd@...gle.com, willy@...radead.org,
muchun.song@...ux.dev, mgorman@...e.de, vbabka@...nel.org,
akpm@...ux-foundation.org, zokeefe@...gle.com, rientjes@...gle.com
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
On 2024/8/6 11:31, Qi Zheng wrote:
> Hi all,
>
> On 2024/8/5 20:55, Qi Zheng wrote:
>
> [...]
>
>>
>> 2. When we use mmu_gather to batch flush tlb and free PTE pages, the
>>    TLB is not flushed before the pmd lock is unlocked. This may result
>>    in the following two situations:
>>
>>    1) Userland can trigger a page fault and fill a huge page, which
>>       will cause a small size TLB entry and a huge TLB entry to exist
>>       for the same address.
>>
>>    2) Userland can also trigger a page fault and fill a PTE page, which
>>       will cause two small size TLB entries to exist, but the PTE pages
>>       they map are different.
>>
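(Just to make the window concrete, below is a rough sketch of the ordering in
question. The function name and the exact calls are illustrative only, not the
actual code in this series.)

```
static void reclaim_empty_pte_page(struct mmu_gather *tlb,
				   struct vm_area_struct *vma,
				   pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
	pmd_t pmdval = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmd);

	/* The empty PTE page is only queued here; the TLB flush is deferred. */
	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
	spin_unlock(ptl);

	/*
	 * Window: once the pmd lock is dropped, a concurrent page fault can
	 * install a huge page (case 1) or a fresh PTE page (case 2) for this
	 * range while stale small size TLB entries still exist, until
	 * tlb_finish_mmu() finally flushes them.
	 */
}
```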
>> For case 1), according to Intel's TLB Application note (317080), some
>> x86 CPUs do not allow it:
>>
>> ```
>> If software modifies the paging structures so that the page size used
>> for a 4-KByte range of linear addresses changes, the TLBs may
>> subsequently contain both ordinary and large-page translations for the
>> address range. A reference to a linear address in the address range may
>> use either translation. Which of the two translations is used may vary
>> from one execution to another and the choice may be
>> implementation-specific.
>>
>> Software wishing to prevent this uncertainty should not write to a
>> paging-structure entry in a way that would change, for any linear
>> address, both the page size and either the page frame or attributes. It
>> can instead use the following algorithm: first mark the relevant
>> paging-structure entry (e.g., PDE) not present; then invalidate any
>> translations for the affected linear addresses (see Section 5.2); and
>> then modify the relevant paging-structure entry to mark it present and
>> establish translation(s) for the new page size.
>> ```
>>
>> We can also learn more information from the comments above
>> pmdp_invalidate() in __split_huge_pmd_locked().
>>
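(As an aside, the sequence the SDM asks for, mark the entry not present, flush,
and only then repopulate it, is essentially what __split_huge_pmd_locked()
relies on pmdp_invalidate() for. A minimal sketch of that pattern is below; the
function name and its caller are illustrative, not actual kernel code.)

```
/*
 * Illustrative sketch only: change the page size used for a range by
 * first invalidating the old translation, as the SDM recommends.
 */
static void repopulate_pmd_safely(struct vm_area_struct *vma, pmd_t *pmd,
				  unsigned long haddr, pmd_t new_pmd)
{
	/*
	 * pmdp_invalidate() marks the PMD entry non-present and flushes
	 * the TLB for the covered range, so ordinary and large-page
	 * translations can never coexist afterwards.
	 */
	pmdp_invalidate(vma, haddr, pmd);

	/* Only now establish the translation for the new page size. */
	set_pmd_at(vma->vm_mm, haddr, pmd, new_pmd);
}
```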
>> For case 2), we can see from the comments above ptep_clear_flush() in
>> wp_page_copy() that this situation is also not allowed. Even without
>> this patch series, madvise(MADV_DONTNEED) can also cause this situation:
>>
>>     CPU 0                           CPU 1
>>
>>     madvise(MADV_DONTNEED)
>>     --> clear pte entry
>>         pte_unmap_unlock
>>                                     touch and tlb miss
>>                                     --> set pte entry
>>     mmu_gather flush tlb
>>
>> But strangely, I didn't see any relevant fix code. Maybe I missed
>> something, or is this guaranteed by userland?
>
> I'm still quite confused about this. Is there anyone who is familiar
> with this part?
This is not a new issue introduced by this patch series, and I have sent
a separate RFC patch [1] to track it. I will remove this part of the
handling in the next version.

[1]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
>
> Thanks,
> Qi
>
>>
>> Anyway, this series defines the following two functions to be
>> implemented by the architecture. If the architecture does not allow
>> the above two situations, it should define these two functions to
>> flush the TLB before set_pmd_at():
>>
>> - arch_flush_tlb_before_set_huge_page
>> - arch_flush_tlb_before_set_pte_page
>>
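Roughly speaking, on x86 these hooks can end up as thin wrappers around a
ranged TLB flush, along the lines of the sketch below. The signatures and
bodies here are illustrative assumptions; the actual implementation in the
series may differ.

```
/*
 * Illustrative x86-style sketch: flush the range covered by the PMD
 * entry before set_pmd_at() installs the new mapping, so no stale
 * small size TLB entries can coexist with it. The real hooks in the
 * series may take different arguments.
 */
static inline void arch_flush_tlb_before_set_huge_page(struct vm_area_struct *vma,
							unsigned long haddr)
{
	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
}

static inline void arch_flush_tlb_before_set_pte_page(struct vm_area_struct *vma,
						       unsigned long haddr)
{
	flush_tlb_range(vma, haddr, haddr + PMD_SIZE);
}
```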
>
> [...]
>
>>