[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fac9dd13-0c1b-40de-a5ac-d2c4c2012385@arm.com>
Date: Wed, 17 Dec 2025 16:06:00 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: akpm@...ux-foundation.org, david@...nel.org, catalin.marinas@....com,
will@...nel.org, Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
surenb@...gle.com, mhocko@...e.com, riel@...riel.com, harry.yoo@...cle.com,
jannh@...gle.com, willy@...radead.org, baohua@...nel.org,
linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/3] arm64: mm: support batch clearing of the young
flag for large folios
On 17/12/2025 14:50, Lorenzo Stoakes wrote:
> On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
>>> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>>>> To support batch PTE operations for other sized large folios in the following
>>>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>>>
>>>>>> While we are at it, rename the functions to maintain consistency with other
>>>>>> contpte_*() functions.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>>>> ---
>>>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>> index 0944e296dd4a..e03034683156 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>> unsigned long addr, pte_t *ptep,
>>>>>> unsigned int nr, int full);
>>>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>
>>>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>>>> as we go as they're not necessary.
>>>>
>>>> Sure. I can remove the 'extern'.
>>>
>>> Thanks.
>>>
>>>>
>>>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>>>> we are doing something... different and 'nr' is a bit vague.
>>>>>
>>>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>>>
>>>> Yes, 'nr' here means nr_pages, which follows the convention of the
>>>> parameters in contpte_xxx_() functions.
>>>
>>> Except you're violating the convention already by doing non-contpte stuff in
>>> contpte functions?
>>
>> Nope. Sorry for my poor English, which might have caused some confusion. I
>> followed the contpte's convention, and let me try to explain it again below.
>>
>>> The confusion is between nr of contpte ranges and nr of pages, so I think we
>>> should be explicit.
>>>
>>>>
>>>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>>>
>>>>> I wonder whether we'd be better off introducing a new function that
>>>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>>>
>>>> OK. I get your point. However, this is outside the scope of my patchset.
>>>
>>> Well, no, this is review feedback that needs to be addressed, I really don't
>>> like this patch as-is.
>>>
>>> So for this to be upstreamable you really need to separate this out.
>>>
>>>>
>>>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>>>> that this is confusing.
>>>
>>> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
>>> reject the patch unless we do this.
>>>
>>> The contpte logic is already very confusing, and this is only adding to it.
>>>
>>>>
>>>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>>>
>>>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> pte_t *ptep, unsigned int nr);
>>>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>
>>>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>> }
>>>>>>
>>>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>>>
>>>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>> }
>>>>>>
>>>>>> #define wrprotect_ptes wrprotect_ptes
>>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>>> index c0557945939c..19b122441be3 100644
>>>>>> --- a/arch/arm64/mm/contpte.c
>>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>>>
>>>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep)
>>>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep,
>>>>>> + unsigned int nr)
>>>>>> {
>>>>>> /*
>>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> * having to unfold.
>>>>>> */
>>>>>
>>>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>>>
>>>>> "And since we only create a contig range when the range is covered by a single
>>>>> folio, we can get away with clearing young for the whole contig range here, so
>>>>> we avoid having to unfold."
>>>>
>>>> I think the original comments are still clear:
>>>> "
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> * flag for a _single_ pte. However, the core-mm code actually tracks
>>>> * access/dirty per folio, not per page. And since we only create a
>>>> * contig range when the range is covered by a single folio, we can get
>>>> * away with clearing young for the whole contig range here, so we avoid
>>>> * having to unfold.
>>>> */
>>>> "
>>>
>>> Except nr means you can span more than a single contig pte range? So this
>>> comment is no longer applicable as-is?
>>
>> The nr means consecutive (present) PTEs that map consecutive pages of the
>> same large folio in a single VMA and a single page table.
>>
>> I think this is a prerequisite. If you think it's necessary to explain it, I
>> can add it to the comments.
>>
>>> You've changed the function, the comment needs to change too.
>>>
>>>>
>>>>> However now you're allowing for large folios that are not contpte?
>>>>>
>>>>> Overall feeling pretty iffy about implementing this this way.
>>>>
>>>> Please see the above comments, which explain why we should do that.
>>>
>>> You haven't explained anything? Maybe I missed it?
>>>
>>> Can you explain clearly what nr might actually be in relation to contpte's? I'm
>>> confused even as to the utility here.
>>>
>>> Is it batched ranges of contptes? But then why are you trying to see if end is a
>>> contpte + expand the range if so?
>>
>> See below.
>>
>>>>>> + unsigned long start = addr;
>>>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>>>
>>>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>>>> table?
>>>>
>>>> Caller has made sure that (see folio_referenced_one()).
>>>
>>> You should comment this somehow, I hate this nested assumptions stuff 'function
>>> X which calls function Y which calls function Z makes sure that parameter A
>>> fulfils requirements B but we don't mention it anywhere'.
>>
>> Sure.
>>
>>>>>> int young = 0;
>>>>>> int i;
>>>>>>
>>>>>> - ptep = contpte_align_down(ptep);
>>>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>>>
>>>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>>>> correct me if I'm wrong!)
>>>>>
>>>>> I guess then no worry about running off the end of the PTE table here.
>>>>>
>>>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>>>> intending to limit to a range here but rather capture the entire contpte
>>>>> range of the end.
>>>>
>>>> Right.
>>>>
>>>>> I don't love that now a function prefixed with contpte_... can have a
>>>>> condition where part of the input range is _not_ a contpte entry, or is
>>>>> specified partially...
>>>>
>>>> Like I said above, this is outside the scope of my patchset. If Ryan also
>>>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>>>> the contpte.c file, we can send a follow-up patchset to rename all the
>>>> contpte_xxx() functions to make it clear.
>>>
>>> It's really not outside of the scope of the patch set, this is review feedback
>>> you have to address :) trying not to be a grumpy kernel dev here :>)
>>>
>>> In general in the kernel we don't accept confusing patches then defer making
>>> them not-confusing until later.
>>>
>>> We review until the series is at the correct standard to be accepted before we
>>> merge.
>>>
>>> As always I'm happy to be convinced that I'm mistaken or something's not
>>> necesary, however!
>>
>> Yes, let me try to explain it again. See below.
>>
>>>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>>>> + ptep = contpte_align_down(ptep);
>>>>>> + }
>>>>>> +
>>>>>> + nr = (end - start) / PAGE_SIZE;
>>>>>
>>>>> Hm don't love this reuse of input param.
>>>>
>>>> OK. Will use another local variable instead.
>>>
>>> Thanks.
>>>
>>>>
>>>>>
>>>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>>>
>>>>> Again, overall find it a bit iffy that we are essentially overloading the
>>>>> code with non-contpte behaviour.
>>>>
>>>> Let me clarify further: this is the handling logic of the contpte_xxx()
>>>> functions in the contpte.c file. For example, the function
>>>> contpte_set_ptes() has a parameter 'nr', which may be greater than
>>>> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
>>>> large folio.
>>>>
>>>> It might be a bit confusing, and I understand this is why you want to change
>>>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>>>> can provide some input on this.
>>>>
>>>> Thanks for reviewing.
>>>
>>> OK so we have some reeeeal confusion here. And this speaks to the patch needing
>>> work.
>>>
>>> This seems to answer what I asked above - basically - are we doing _batches_ of
>>> _contpte_ entries, or are we just accepting arbitrary large folios here and
>>> batching up operations on them, including non-contpte entries.
>>>
>>> Presumably from the above you're saying - this is dealing with batches of
>>> contpte entries.
>>>
>>> In which case this has to be made _super_ clear. Not just adding an arbitrary
>>> 'nr' parameter and letting people guess.
>>>
>>> The contpte code is _already_ confusing and somewhat surprising if you're not
>>> aware of it, we need to be going in the other direction.
>>
>> Sure. Sorry for causing confusion. I try to explain it again to make it
>> clear.
>>
>> First, it is necessary to explain how the contpte implementation in Arm64 is
>> exposed, using set_ptes() as an example.
>>
>> Arm64 defines an arch-specific implementation of set_ptes():
>> "
>> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long
>> addr,
>> pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> pte = pte_mknoncont(pte);
>>
>> if (likely(nr == 1)) {
>> contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> __set_ptes(mm, addr, ptep, pte, 1);
>> contpte_try_fold(mm, addr, ptep, pte);
>> } else {
>> contpte_set_ptes(mm, addr, ptep, pte, nr);
>> }
>> }
>> "
>>
>> Here, 'nr' refers to consecutive (present) PTEs that map consecutive pages
>> of the same large folio within a single VMA and a single page table. Based
>> on 'nr', it determines whether the PTE_CONT attribute should be set or not.
>>
>> Second, the underlying implementation of contpte_set_ptes() also calculates
>> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
>> For example:
>>
>> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
>> be set since the 'next' is not aligned with CONT_PTE_MASK.
>>
>> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
>> the PTE_CONT attribute to be set for the entire PTE entries of the large
>> folio.
>>
>> For a large folio with order = 5 (nr = 32), this involves two ranges of
>> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
>> be set separately for each range.
>>
>> "
>> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> .......
>>
>> end = addr + (nr << PAGE_SHIFT);
>> pfn = pte_pfn(pte);
>> prot = pte_pgprot(pte);
>>
>> do {
>> next = pte_cont_addr_end(addr, end);
>> nr = (next - addr) >> PAGE_SHIFT;
>> pte = pfn_pte(pfn, prot);
>>
>> if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> pte = pte_mkcont(pte);
>> else
>> pte = pte_mknoncont(pte);
>>
>> __set_ptes(mm, addr, ptep, pte, nr);
>>
>> addr = next;
>> ptep += nr;
>> pfn += nr;
>>
>> } while (addr != end);
>> }
>> "
>>
>> Third, regarding the implementation of my patch, I have followed the contpte
>> logic here by adding the 'nr' parameter to
>> contpte_ptep_test_and_clear_young() for batch processing consecutive
>> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
>> original contpte functions, and I have not broken the existing contpte
>> implementation logic.
>
> OK that helps a lot thanks.
>
> So the nr will be number of PTEs that are consecutive, belong to the same folio,
> and have the same PTE bit set excluding accessed, writable, dirty, soft-dirty.
I'm not sure test_and_clear_young_ptes() would technically require all of these
constraints. I think it would be sufficient to just ensure that all the PTEs in
the batch are pte_present(). And I assume the return value is intended to
indicate if *any* of the PTEs in the batch were previously young?
The usual order for adding new batched PTE helpers is to define the generic
function along with it's spec in a comment block and provide a default
implementation that is implemented as a loop around the existing non-batched
version of the interface. Then you can convert the core-mm to use that API. And
finally you can opt arm64 into overriding the API with a more efficient
implementation.
If you were to do it in that order, it tells a better story and is easier to
follow, I think.
>
> But since we are retrieving accessed bit state (young), isn't it a problem we're
> ignoring this bit when considering what goes into the batch?
>
> I mean now it's going to return true even if some do not have the accessed flag
> set?
test_and_clear_young_ptes() is exploiting the fact that the kernel tracks young
(and dirty) per *folio*. So having a young and dirty bit per PTE is providing
more information than the kernel needs if a large folio is mapped across
multiple PTEs. Which is why returning the logical OR of the access bits is ok here.
>
> I am more convinced things are correct in style at least then this way as, as
> you say, this is the approach taken in set_ptes().
I'm personally ok with the style and I believe the implementation is correct.
But it would help to reformulate the series in the order I suggest above since
then we can review against the spec of what the core-mm helper is supposed to do.
Thanks,
Ryan
>
> Thanks, Lorenzo
Powered by blists - more mailing lists