Message-ID: <6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com>
Date: Fri, 16 Aug 2024 10:55:21 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: david@...hat.com, hughd@...gle.com, willy@...radead.org,
muchun.song@...ux.dev, mgorman@...e.de, vbabka@...nel.org,
akpm@...ux-foundation.org, zokeefe@...gle.com, rientjes@...gle.com
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
the arch/x86 maintainers <x86@...nel.org>
Subject: Re: [RFC PATCH v2 0/7] synchronously scan and reclaim empty user PTE pages
On 2024/8/6 11:31, Qi Zheng wrote:
> Hi all,
>
> On 2024/8/5 20:55, Qi Zheng wrote:
>
> [...]
>
>>
>> 2. When we use mmu_gather to batch flush tlb and free PTE pages, the
>>    TLB is not flushed before the pmd lock is unlocked. This may result
>>    in the following two situations:
>>
>>    1) Userland can trigger a page fault and fill a huge page, which
>>       will cause a small size TLB entry and a huge TLB entry to exist
>>       for the same address.
>>
>>    2) Userland can also trigger a page fault and fill a PTE page, which
>>       will cause two small size TLB entries to exist, but the PTE pages
>>       they map are different.
>>
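(Just to make the window concrete, below is a rough sketch of the ordering in
question. The function name and the exact calls are illustrative only, not the
actual code in this series.)

```
static void reclaim_empty_pte_page(struct mmu_gather *tlb,
				   struct vm_area_struct *vma,
				   pmd_t *pmd, unsigned long addr)
{
	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
	pmd_t pmdval = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmd);

	/* The empty PTE page is only queued here; the TLB flush is deferred. */
	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
	spin_unlock(ptl);

	/*
	 * Window: once the pmd lock is dropped, a concurrent page fault can
	 * install a huge page (case 1) or a fresh PTE page (case 2) for this
	 * range while stale small size TLB entries still exist, until
	 * tlb_finish_mmu() finally flushes them.
	 */
}
```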
>> For case 1), according to Intel's TLB Application note (317080), some
>> x86 CPUs do not allow it:
>>
>> ```
>> If software modifies the paging structures so that the page size used
>> for a 4-KByte range of linear addresses changes, the TLBs may
>> subsequently contain both ordinary and large-page translations for the
>> address range. A reference to a linear address in the address range may
>> use either translation. Which of the two translations is used may vary
>> from one execution to another and the choice may be
>> implementation-specific.
>>
>> Software wishing to prevent this uncertainty should not write to a
>> paging-structure entry in a way that would change, for any linear
>> address, both the page size and either the page frame or attributes. It
>> can instead use the following algorithm: first mark the relevant
>> paging-structure entry (e.g., PDE) not present; then invalidate any
>> translations for the affected linear addresses (see Section 5.2); and
>> then modify the relevant paging-structure entry to mark it present and
>> establish translation(s) for the new page size.
>> ```
>>
>> We can also learn more information from the comments above
>> pmdp_invalidate() in __split_huge_pmd_locked().
>>
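(As an aside, the sequence the SDM asks for, mark the entry not present, flush,
and only then repopulate it, is essentially what __split_huge_pmd_locked()
relies on pmdp_invalidate() for. A minimal sketch of that pattern is below; the
function name and its caller are illustrative, not actual kernel code.)

```
/*
 * Illustrative sketch only: change the page size used for a range by
 * first invalidating the old translation, as the SDM recommends.
 */
static void repopulate_pmd_safely(struct vm_area_struct *vma, pmd_t *pmd,
				  unsigned long haddr, pmd_t new_pmd)
{
	/*
	 * pmdp_invalidate() marks the PMD entry non-present and flushes
	 * the TLB for the covered range, so ordinary and large-page
	 * translations can never coexist afterwards.
	 */
	pmdp_invalidate(vma, haddr, pmd);

	/* Only now establish the translation for the new page size. */
	set_pmd_at(vma->vm_mm, haddr, pmd, new_pmd);
}
```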
>> For case 2), we can see from the comments above ptep_clear_flush() in
>> wp_page_copy() that this situation is also not allowed. Even without
>> this patch series, madvise(MADV_DONTNEED) can also cause this situation:
>>
>>     CPU 0                           CPU 1
>>
>>     madvise(MADV_DONTNEED)
>>     --> clear pte entry
>>         pte_unmap_unlock
>>                                     touch and tlb miss
>>                                     --> set pte entry
>>     mmu_gather flush tlb
>>
>> But strangely, I didn't see any relevant fix code. Maybe I missed
>> something, or is this guaranteed by userland?
>
> I'm still quite confused about this. Is there anyone who is familiar
> with this part?
This is not a new issue introduced by this patch series, and I have sent
a separate RFC patch [1] to track it. I will remove this part of the
handling in the next version.

[1]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
>
> Thanks,
> Qi
>
>>
>> Anyway, this series defines the following two functions to be
>> implemented by the architecture. If the architecture does not allow
>> the above two situations, it should define these two functions to
>> flush the TLB before set_pmd_at():
>>
>> - arch_flush_tlb_before_set_huge_page
>> - arch_flush_tlb_before_set_pte_page
>>
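Roughly speaking, on x86 these hooks can end up as thin wrappers around a
ranged TLB flush, along the lines of the sketch below. The signatures and
bodies here are illustrative assumptions; the actual implementation in the
series may differ.

```
/*
 * Illustrative x86-style sketch: flush the range covered by the PMD
 * entry before set_pmd_at() installs the new mapping, so no stale
 * small size TLB entries can coexist with it. The real hooks in the
 * series may take different arguments.
 */
static inline void arch_flush_tlb_before_set_huge_page(struct vm_area_struct *vma,
							unsigned long haddr)
{
	flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
}

static inline void arch_flush_tlb_before_set_pte_page(struct vm_area_struct *vma,
						       unsigned long haddr)
{
	flush_tlb_range(vma, haddr, haddr + PMD_SIZE);
}
```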
>
> [...]
>
>>