linux-kernel - Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <99aa61b6-afc9-445f-8f50-1e017450efd1@huawei.com>
Date: Tue, 25 Jun 2024 20:23:20 +0800
From: Kefeng Wang <wangkefeng.wang@...wei.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>, Ryan Roberts
	<ryan.roberts@....com>, Catalin Marinas <catalin.marinas@....com>, Will
 Deacon <will@...nel.org>, Ard Biesheuvel <ardb@...nel.org>, Marc Zyngier
	<maz@...nel.org>, James Morse <james.morse@....com>, Andrey Ryabinin
	<ryabinin.a.a@...il.com>, Andrew Morton <akpm@...ux-foundation.org>, Matthew
 Wilcox <willy@...radead.org>, Mark Rutland <mark.rutland@....com>, David
 Hildenbrand <david@...hat.com>, John Hubbard <jhubbard@...dia.com>, Zi Yan
	<ziy@...dia.com>, Barry Song <21cnbao@...il.com>, Alistair Popple
	<apopple@...dia.com>, Yang Shi <shy828301@...il.com>, Thomas Gleixner
	<tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov
	<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin"
	<hpa@...or.com>, "Yin, Fengwei" <fengwei.yin@...el.com>
CC: <linux-arm-kernel@...ts.infradead.org>, <x86@...nel.org>,
	<linuxppc-dev@...ts.ozlabs.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings



On 2024/6/25 15:23, Baolin Wang wrote:
> 
> 
> On 2024/6/25 11:16, Kefeng Wang wrote:
>>
>>
>> On 2024/6/24 23:56, Ryan Roberts wrote:
>>> + Baolin Wang and Yin Fengwei, who maybe able to help with this.
>>>
>>>
>>> Hi Kefeng,
>>>
>>> Thanks for the report!
>>>
>>>
>>> On 24/06/2024 15:30, Kefeng Wang wrote:
>>>> Hi Ryan,
>>>>
>>>> A big regression on page-fault3("Separate file shared mapping page
>>>> fault") testcase from will-it-scale on arm64, no issue on x86,
>>>>
>>>> ./page_fault3_processes -t 128 -s 5
>>>
>>> I see that this program is mkstmp'ing a file at 
>>> "/tmp/willitscale.XXXXXX". Based
>>> on your description, I'm inferring that /tmp is backed by ext4 with 
>>> your large
>>> folio patches enabled?
>>
>> Yes, mount /tmp by ext4, sorry to forget to mention that.
>>
>>>
>>>>
>>>> 1) large folio disabled on ext4:
>>>>     92378735
>>>> 2) large folio  enabled on ext4 +  CONTPTE enabled
>>>>     16164943
>>>> 3) large folio  enabled on ext4 +  CONTPTE disabled
>>>>     80364074
>>>> 4) large folio  enabled on ext4 +  CONTPTE enabled + large folio 
>>>> mapping enabled
>>>> in finish_fault()[2]
>>>>     299656874
>>>>
>>>> We found *contpte_convert* consume lots of CPU(76%) in case 2),
>>>
>>> contpte_convert() is expensive and to be avoided; In this case I 
>>> expect it is
>>> repainting the PTEs with the PTE_CONT bit added in, and to do that it 
>>> needs to
>>> invalidate the tlb for the virtual range. The code is there to mop up 
>>> user space
>>> patterns where each page in a range is temporarily made RO, then 
>>> later changed
>>> back. In this case, we want to re-fold the contpte range once all 
>>> pages have
>>> been serviced in RO mode.
>>>
>>> Of course this path is only intended as a fallback, and the more 
>>> optimium
>>> approach is to set_ptes() the whole folio in one go where possible - 
>>> kind of
>>> what you are doing below.
>>>
>>>> and disappeared
>>>> by following change[2], it is easy to understood the different 
>>>> between case 2)
>>>> and case 4) since case 2) always map one page
>>>> size, but always try to fold contpte mappings, which spend a lot of
>>>> time. Case 4) is a workaround, any other better suggestion?
>>>
>>> See below.
>>>
>>>>
>>>> Thanks.
>>>>
>>>> [1] https://github.com/antonblanchard/will-it-scale
>>>> [2] enable large folio mapping in finish_fault()
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 00728ea95583..5623a8ce3a1e 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4880,7 +4880,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>           * approach also applies to non-anonymous-shmem faults to 
>>>> avoid
>>>>           * inflating the RSS of the process.
>>>>           */
>>>> -       if (!vma_is_anon_shmem(vma) || 
>>>> unlikely(userfaultfd_armed(vma))) {
>>>> +       if (unlikely(userfaultfd_armed(vma))) {
>>>
>>> The change to make finish_fault() handle multiple pages in one go are 
>>> new; added
>>> by Baolin Wang at [1]. That extra conditional that you have removed 
>>> is there to
>>> prevent RSS reporting bloat. See discussion that starts at [2].
>>>
>>> Anyway, it was my vague understanding that the fault around mechanism
>>> (do_fault_around()) would ensure that (by default) 64K worth of pages 
>>> get mapped
>>> together in a single set_ptes() call, via filemap_map_pages() ->
>>> filemap_map_folio_range(). Looking at the code, I guess fault around 
>>> only
>>> applies to read faults. This test is doing a write fault.
>>>
>>> I guess we need to do a change a bit like what you have done, but 
>>> also taking
>>> into account fault_around configuration?
> 
> For the writable mmap() of tmpfs, we will use mTHP interface to control 
> the size of folio to allocate, as discussed in previous meeting [1], so 
> I don't think fault_around configuration will be helpful for tmpfs.

Yes, tmpfs is different from ext4.

> 
> For other filesystems, like ext4, I did not found the logic to determin 
> what size of folio to allocate in writable mmap() path (Kefeng, please 
> correct me if I missed something). If there is a control like mTHP, we 
> can rely on that instead of 'fault_around'?

For ext4 or most filesystems, the folio is allocated from filemap_fault(),
we don't have explicit interface like mTHP to control the folio size.


> 
> [1] 
> https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
> 
>> Yes, the current changes is not enough, I hint some issue and still 
>> debugging, so our direction is trying to map large folio for 
>> do_shared_fault(), right?
> 
> I think this is the right direction to do. I add this 
> '!vma_is_anon_shmem(vma)' conditon to gradually implement support for 
> large folio mapping buidling, especially for writable mmap() support in 
> tmpfs.
> 
>>> [1]
>>> https://lore.kernel.org/all/3a190892355989d42f59cf9f2f98b94694b0d24d.1718090413.git.baolin.wang@linux.alibaba.com/
>>> [2] 
>>> https://lore.kernel.org/linux-mm/13939ade-a99a-4075-8a26-9be7576b7e03@arm.com/