Message-ID: <483fd440-df7b-fab3-b138-f3789f2dc078@intel.com>
Date: Mon, 20 Mar 2023 21:38:55 +0800
From: "Yin, Fengwei" <fengwei.yin@...el.com>
To: Matthew Wilcox <willy@...radead.org>
CC: Ryan Roberts <ryan.roberts@....com>, <linux-arch@...r.kernel.org>,
<will@...nel.org>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

Hi Matthew,

On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>> would be evaluated per page, now it's evaluated once for the whole range. The
>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>
>>>>> Yes. You are right here.
>>>>>
>>>>> Looking at prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>> can avoid handling vmf->address == addr specially. It's OK to
>>>>> drop prefault and change the logic here a little bit to:
>>>>>         if (arch_wants_old_prefaulted_pte())
>>>>>                 entry = pte_mkold(entry);
>>>>>         else
>>>>>                 entry = pte_sw_mkyoung(entry);
>>>>>
>>>>> It's not necessary to use pte_sw_mkyoung() for vmf->address == addr
>>>>> because HW will set the Access bit in the page table entry.
>>>>>
>>>>> Add Will Deacon in case I missed something here. Thanks.
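
For reference, the arm64 hook mentioned above was added by commit
46bdb4277f98 ("mm: Allow architectures to request 'old' entries when
prefaulting"):

	/* arch/arm64/include/asm/pgtable.h: prefaulted PTEs can start
	 * "old" because the CPU sets the Access flag in hardware. */
	#define arch_wants_old_prefaulted_pte	cpu_has_hw_af
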
>>>>
>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>> this to old initially, we will get a second fault to set the access bit, which
>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>> calls to set_ptes()?
>>>
>>> I don't think we should do that. The limited information I have from
>>> various microarchitectures is that the PTEs must differ only in their
>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>> folio, at least initially.
>>>
>>> That said, we should still do this conditionally. We'll prefault some
>>> other folios too. So I think this should be:
>>>
>>> bool prefault = (addr > vmf->address) || ((addr + nr * PAGE_SIZE) <= vmf->address);
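
For context, this flag would feed the same old/young choice that
do_set_pte() makes today (mm/memory.c, from commit 46bdb4277f98),
roughly:

	if (prefault && arch_wants_old_prefaulted_pte())
		entry = pte_mkold(entry);
	else
		entry = pte_sw_mkyoung(entry);
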
>>>
>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if the hardware
>> access flag is supported on ARM64, there is a benefit to setting prefault PTEs
>> as "old". If we change prefault like above, the PTEs are set as "young", which
>> loses that benefit on ARM64 with the hardware access flag.
>>
>> OTOH, if going from "old" to "young" is cheap, why not leave all PTEs of the
>> folio as "old" and let the hardware update them to "young"?
>
> Because we're tracking the entire folio as a single entity. So we're
> better off avoiding the extra pagefaults to update the accessed bit,
> which won't actually give us any information (vmscan needs to know "were
> any of the accessed bits set", not "how many of them were set").
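
As a sketch of that point (folio_range_any_young() is a hypothetical
helper, not a kernel API): reclaim only needs one bit of information
per folio, so the per-PTE young bits collapse into a single boolean.

	static bool folio_range_any_young(pte_t *ptep, unsigned int nr)
	{
		bool young = false;
		unsigned int i;

		/* "Were any of the accessed bits set" for the range. */
		for (i = 0; i < nr; i++)
			young |= pte_young(ptep_get(ptep + i));
		return young;
	}
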
>
> Anyway, hopefully Ryan can test this and let us know if it fixes the
> regression he sees.

Thanks a lot to Ryan for helping to test the debug patch I made. Ryan
confirmed that the following change fixes the kernel build regression:

diff --git a/mm/filemap.c b/mm/filemap.c
index db86e459dde6..343d6ff36b2c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3557,7 +3557,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		ret |= filemap_map_folio_range(vmf, folio,
 				xas.xa_index - folio->index, addr, nr_pages);
-		xas.xa_index += nr_pages;
+		xas.xa_index += folio_test_large(folio) ? nr_pages : 0;

 		folio_unlock(folio);
 		folio_put(folio);

I will make the upstream-able change "xas.xa_index += nr_pages - 1;", so
that xa_index is left on the last page handled: next_uptodate_folio()
advances the iterator itself when it looks for the next folio.
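
A minimal sketch of the loop this lands in (simplified from
filemap_map_pages(); locking and accounting details omitted):

	do {
		/* Map nr_pages PTEs for this folio in one call. */
		ret |= filemap_map_folio_range(vmf, folio,
				xas.xa_index - folio->index, addr, nr_pages);
		/*
		 * Leave xa_index on the last page handled; the
		 * iterator's own advance in next_uptodate_folio()
		 * supplies the final step to the next entry.
		 */
		xas.xa_index += nr_pages - 1;

		folio_unlock(folio);
		folio_put(folio);
	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
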
Ryan and I also identified some other changes that are needed. I am not
sure how to integrate those changes into this series. Maybe an add-on
patch after this series? Thanks.

Regards
Yin, Fengwei