Message-ID: <db4d0a19-598b-48ff-accc-f5940a481035@arm.com>
Date: Fri, 16 Jan 2026 15:23:02 +0530
From: Dev Jain <dev.jain@....com>
To: Wei Yang <richard.weiyang@...il.com>, Barry Song <21cnbao@...il.com>
Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>, akpm@...ux-foundation.org,
david@...nel.org, catalin.marinas@....com, will@...nel.org,
lorenzo.stoakes@...cle.com, ryan.roberts@....com, Liam.Howlett@...cle.com,
vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
riel@...riel.com, harry.yoo@...cle.com, jannh@...gle.com,
willy@...radead.org, linux-mm@...ck.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large
folios
On 07/01/26 7:16 am, Wei Yang wrote:
> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@...il.com> wrote:
>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>> Similar to folio_referenced_one(), we can apply batched unmapping to file
>>>> large folios to optimize the performance of file folio reclamation.
>>>>
>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>> large folios at that stage. As for file-backed large folios, the batched
>>>> unmapping support is relatively straightforward, as we only need to clear
>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>
>>>> Performance testing:
>>>> Allocate 10G of clean file-backed folios via mmap() in a memory cgroup, and try
>>>> to reclaim 8G of file-backed folios via the memory.reclaim interface. I can
>>>> observe a 75% performance improvement on my Arm64 32-core server (and a 50%+
>>>> improvement on my x86 machine) with this patch.
>>>>
>>>> W/o patch:
>>>> real 0m1.018s
>>>> user 0m0.000s
>>>> sys 0m1.018s
>>>>
>>>> W/ patch:
>>>> real 0m0.249s
>>>> user 0m0.000s
>>>> sys 0m0.249s
>>>>
>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>>>> Acked-by: Barry Song <baohua@...nel.org>
>>>> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>> ---
>>>> mm/rmap.c | 7 ++++---
>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 985ab0b085ba..e1d16003c514 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>          end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>          max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>
>>>> -        /* We only support lazyfree batching for now ... */
>>>> -        if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>> +        /* We only support lazyfree or file folios batching for now ... */
>>>> +        if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>                  return 1;
>>>> +
>>>>          if (pte_unused(pte))
>>>>                  return 1;
>>>>
>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                           *
>>>>                           * See Documentation/mm/mmu_notifier.rst
>>>>                           */
>>>> -                        dec_mm_counter(mm, mm_counter_file(folio));
>>>> +                        add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>                  }
>>>>  discard:
>>>>                  if (unlikely(folio_test_hugetlb(folio))) {
>>>> --
>>>> 2.47.3
>>>>
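
For readers skimming the hunk above: the condition flip is easy to misread, so
here it is spelled out as an eligibility predicate. This is an illustrative
sketch only; no such helper exists in mm/rmap.c, where folio_unmap_pte_batch()
simply open-codes the early "return 1":

/* Illustrative sketch only, not an actual mm/rmap.c helper. */
static inline bool folio_can_batch_unmap(struct folio *folio)
{
        /*
         * Before this patch, only lazyfree anon folios were batched:
         *     folio_test_anon(folio) && !folio_test_swapbacked(folio)
         *
         * After this patch, file-backed folios are batched as well; only
         * swap-backed anon folios still fall back to unmapping a single
         * PTE at a time.
         */
        return !(folio_test_anon(folio) && folio_test_swapbacked(folio));
}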
>>> Hi, Baolin
>>>
>>> While reading your patch, one small question came to mind.
>>>
>>> The current try_to_unmap_one() has the following structure:
>>>
>>> try_to_unmap_one()
>>>         while (page_vma_mapped_walk(&pvmw)) {
>>>                 nr_pages = folio_unmap_pte_batch()
>>>
>>>                 if (nr_pages == folio_nr_pages(folio))
>>>                         goto walk_done;
>>>         }
>>>
>>> I am wondering what happens when nr_pages > 1 but nr_pages != folio_nr_pages().
>>>
>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>> (pvmw->address + PAGE_SIZE) in the next iteration, but we have already cleared
>>> up to (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>
>>> I am not sure my understanding is correct; if it is, is there some reason not
>>> to skip the already-cleared range?
>> I don't quite understand your question. For nr_pages > 1 but not equal to
>> folio_nr_pages(), page_vma_mapped_walk() will skip those nr_pages - 1 PTEs
>> internally.
>>
>> take a look:
>>
>> next_pte:
>>         do {
>>                 pvmw->address += PAGE_SIZE;
>>                 if (pvmw->address >= end)
>>                         return not_found(pvmw);
>>                 /* Did we cross page table boundary? */
>>                 if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>                         if (pvmw->ptl) {
>>                                 spin_unlock(pvmw->ptl);
>>                                 pvmw->ptl = NULL;
>>                         }
>>                         pte_unmap(pvmw->pte);
>>                         pvmw->pte = NULL;
>>                         pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>                         goto restart;
>>                 }
>>                 pvmw->pte++;
>>         } while (pte_none(ptep_get(pvmw->pte)));
>>
> Yes, we already handle that in page_vma_mapped_walk() now: since those entries
> are pte_none(), they will be skipped.
>
> What I mean is that maybe we could skip them directly in try_to_unmap_one(),
> for example:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e5bd4834481..ea1afec7c802 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>                   */
>                  if (nr_pages == folio_nr_pages(folio))
>                          goto walk_done;
> +                else {
> +                        pvmw.address += PAGE_SIZE * (nr_pages - 1);
> +                        pvmw.pte += nr_pages - 1;
> +                }
>                  continue;
>  walk_abort:
>                  ret = false;
I am of the opinion that we should do something like this. In the internal pvmw
code, we only keep skipping PTEs as long as they are none. With my proposed uffd
fix [1], if the old PTEs were uffd-wp armed, pte_install_uffd_wp_if_needed() will
convert all of the cleared PTEs from none back to non-none, and we will lose the
batching effect. I also plan to extend support to anonymous folios (thereby
generalizing batching for all types of memory), which will set a batch of PTEs
to swap entries, and again the internal pvmw code won't be able to skip through
the batch.
[1] https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/
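
For reference, this is roughly how I picture the skip folding into the walk
loop of try_to_unmap_one(), building on Wei Yang's diff above. It is only a
sketch with explanatory comments, not a tested patch, and the exact placement
is my assumption:

        if (nr_pages == folio_nr_pages(folio))
                goto walk_done;
        /*
         * Step the walk past the PTEs we already processed in this batch,
         * so page_vma_mapped_walk() does not have to re-inspect them (or
         * whatever uffd-wp markers / swap entries end up installed in
         * their place).
         *
         * folio_unmap_pte_batch() clamps the batch to pmd_addr_end(), so
         * advancing pvmw.pte within the same page table is safe.
         */
        pvmw.address += (nr_pages - 1) * PAGE_SIZE;
        pvmw.pte += nr_pages - 1;
        continue;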
>
> Not sure this is reasonable.
>
>