Message-ID: <ccce0551-489a-4612-ab5d-2dd8a5cae66c@arm.com>
Date: Sun, 18 Jan 2026 11:16:40 +0530
From: Dev Jain <dev.jain@....com>
To: Barry Song <21cnbao@...il.com>
Cc: Wei Yang <richard.weiyang@...il.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, akpm@...ux-foundation.org,
 david@...nel.org, catalin.marinas@....com, will@...nel.org,
 lorenzo.stoakes@...cle.com, ryan.roberts@....com, Liam.Howlett@...cle.com,
 vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
 riel@...riel.com, harry.yoo@...cle.com, jannh@...gle.com,
 willy@...radead.org, linux-mm@...ck.org,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 5/5] mm: rmap: support batched unmapping for file large
 folios


On 16/01/26 7:58 pm, Barry Song wrote:
> On Fri, Jan 16, 2026 at 5:53 PM Dev Jain <dev.jain@....com> wrote:
>>
>> On 07/01/26 7:16 am, Wei Yang wrote:
>>> On Wed, Jan 07, 2026 at 10:29:25AM +1300, Barry Song wrote:
>>>> On Wed, Jan 7, 2026 at 2:22 AM Wei Yang <richard.weiyang@...il.com> wrote:
>>>>> On Fri, Dec 26, 2025 at 02:07:59PM +0800, Baolin Wang wrote:
>>>>>> Similar to folio_referenced_one(), we can apply batched unmapping for file
>>>>>> large folios to optimize the performance of file folio reclamation.
>>>>>>
>>>>>> Barry previously implemented batched unmapping for lazyfree anonymous large
>>>>>> folios[1] and did not further optimize anonymous large folios or file-backed
>>>>>> large folios at that stage. As for file-backed large folios, the batched
>>>>>> unmapping support is relatively straightforward, as we only need to clear
>>>>>> the consecutive (present) PTE entries for file-backed large folios.
>>>>>>
>>>>>> Performance testing:
>>>>>> Allocate 10G of clean file-backed folios via mmap() in a memory cgroup, then try to
>>>>>> reclaim 8G of them via the memory.reclaim interface. I observe a 75% performance
>>>>>> improvement on my Arm64 32-core server (and a 50%+ improvement on my x86 machine)
>>>>>> with this patch.
>>>>>>
>>>>>> W/o patch:
>>>>>> real    0m1.018s
>>>>>> user    0m0.000s
>>>>>> sys     0m1.018s
>>>>>>
>>>>>> W/ patch:
>>>>>> real   0m0.249s
>>>>>> user   0m0.000s
>>>>>> sys    0m0.249s
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>>>>>> Acked-by: Barry Song <baohua@...nel.org>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>>>> ---
>>>>>> mm/rmap.c | 7 ++++---
>>>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 985ab0b085ba..e1d16003c514 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1863,9 +1863,10 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>>>>>       end_addr = pmd_addr_end(addr, vma->vm_end);
>>>>>>       max_nr = (end_addr - addr) >> PAGE_SHIFT;
>>>>>>
>>>>>> -      /* We only support lazyfree batching for now ... */
>>>>>> -      if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
>>>>>> +      /* We only support lazyfree or file folios batching for now ... */
>>>>>> +      if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>>>>>>               return 1;
>>>>>> +
>>>>>>       if (pte_unused(pte))
>>>>>>               return 1;
>>>>>>
>>>>>> @@ -2231,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>>>                        *
>>>>>>                        * See Documentation/mm/mmu_notifier.rst
>>>>>>                        */
>>>>>> -                      dec_mm_counter(mm, mm_counter_file(folio));
>>>>>> +                      add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
>>>>>>               }
>>>>>> discard:
>>>>>>               if (unlikely(folio_test_hugetlb(folio))) {
>>>>>> --
>>>>>> 2.47.3
>>>>>>
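For anyone who wants to reproduce the numbers above, here is a rough userspace
sketch (a sketch only: the file path and cgroup name are assumptions, cgroup v2
must be mounted, and the task must already be attached to the test memcg):

/* Sketch: populate 10G of clean page cache, then reclaim 8G via the memcg. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (10UL << 30)  /* 10G of file-backed mappings */

int main(void)
{
        /* Hypothetical 10G test file; adjust to your setup. */
        int fd = open("/mnt/testfile", O_RDONLY);
        char *p;
        volatile char sink;
        unsigned long off;
        int rfd;

        if (fd < 0)
                return 1;
        p = mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Fault in every page (4K assumed) to fill the page cache. */
        for (off = 0; off < MAP_SIZE; off += 4096)
                sink = p[off];
        (void)sink;

        /* Time this write to compare kernels with and without the patch. */
        rfd = open("/sys/fs/cgroup/test/memory.reclaim", O_WRONLY);
        if (rfd < 0)
                return 1;
        if (write(rfd, "8G", strlen("8G")) < 0)
                perror("memory.reclaim");

        close(rfd);
        munmap(p, MAP_SIZE);
        close(fd);
        return 0;
}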
>>>>> Hi, Baolin
>>>>>
>>>>> While reading your patch, I came up with one small question.
>>>>>
>>>>> Current try_to_unmap_one() has following structure:
>>>>>
>>>>>     try_to_unmap_one()
>>>>>         while (page_vma_mapped_walk(&pvmw)) {
>>>>>             nr_pages = folio_unmap_pte_batch()
>>>>>
>>>>>             if (nr_pages == folio_nr_pages(folio))
>>>>>                 goto walk_done;
>>>>>         }
>>>>>
>>>>> I am wondering what happens if nr_pages > 1 but nr_pages != folio_nr_pages().
>>>>>
>>>>> If my understanding is correct, page_vma_mapped_walk() would start from
>>>>> (pvmw->address + PAGE_SIZE) in next iteration, but we have already cleared to
>>>>> (pvmw->address + nr_pages * PAGE_SIZE), right?
>>>>>
>>>>> I am not sure my understanding is correct; if it is, is there some reason not to
>>>>> skip the already-cleared range?
>>>> I don't quite understand your question. For nr_pages > 1 but not equal to
>>>> folio_nr_pages(folio), page_vma_mapped_walk() will skip the remaining
>>>> nr_pages - 1 PTEs internally.
>>>>
>>>> take a look:
>>>>
>>>> next_pte:
>>>>                do {
>>>>                        pvmw->address += PAGE_SIZE;
>>>>                        if (pvmw->address >= end)
>>>>                                return not_found(pvmw);
>>>>                        /* Did we cross page table boundary? */
>>>>                        if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>>>>                                if (pvmw->ptl) {
>>>>                                        spin_unlock(pvmw->ptl);
>>>>                                        pvmw->ptl = NULL;
>>>>                                }
>>>>                                pte_unmap(pvmw->pte);
>>>>                                pvmw->pte = NULL;
>>>>                                pvmw->flags |= PVMW_PGTABLE_CROSSED;
>>>>                                goto restart;
>>>>                        }
>>>>                        pvmw->pte++;
>>>>                } while (pte_none(ptep_get(pvmw->pte)));
>>>>
>>> Yes, page_vma_mapped_walk() handles it now: since the cleared entries are
>>> pte_none(), they will be skipped.
>>>
>>> I mean maybe we could skip them directly in try_to_unmap_one(), for example:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 9e5bd4834481..ea1afec7c802 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2250,6 +2250,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>                */
>>>               if (nr_pages == folio_nr_pages(folio))
>>>                       goto walk_done;
>>> +             else {
>>> +                     pvmw.address += PAGE_SIZE * (nr_pages - 1);
>>> +                     pvmw.pte += nr_pages - 1;
>>> +             }
>>>               continue;
>>>  walk_abort:
>>>               ret = false;
>> I am of the opinion that we should do something like this. In the internal pvmw code,
> I am still not convinced that skipping PTEs in try_to_unmap_one()
> is the right place. If we really want to skip certain PTEs early,
> should we instead hint page_vma_mapped_walk()? That said, I don't
> see much value in doing so, since in most cases nr is either 1 or
> folio_nr_pages(folio).
>
>> we keep skipping ptes as long as they are none. With my proposed uffd fix [1], if the old
>> ptes were uffd-wp armed, pte_install_uffd_wp_if_needed will convert all ptes from none
>> to not-none, and we will lose the batching effect. I also plan to extend support to
>> anonymous folios (thereby generalizing to all types of memory), which will set a
>> batch of ptes as swap entries, and the internal pvmw code won't be able to skip
>> through the batch.
> Thanks for catching this, Dev. I already filter out some of the more
> complex cases, for example:
> if (pte_unused(pte))
>         return 1;
>
> Since the userfaultfd write-protection case is also a corner case,
> could we filter it out as well?
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c86f1135222b..6bb8ba6f046e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1870,6 +1870,9 @@ static inline unsigned int
> folio_unmap_pte_batch(struct folio *folio,
>         if (pte_unused(pte))
>                 return 1;
>
> +       if (userfaultfd_wp(vma))
> +               return 1;
> +
>         return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> }
>
> Just offering a second option — yours is probably better.

No. This is not an edge case. It is a case that gets exposed by your work, and
I believe that if you intend to get file folio batching merged, then you need
to fix the uffd handling too.
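
To make this concrete, here is a toy userspace model (illustrative only, not
kernel code; TOY_MARKER is just a stand-in for a uffd-wp pte marker) of why
installed markers defeat the pte_none() fast path in the walk:

#include <stdio.h>

#define NR_PTES 16

enum toy_pte { TOY_NONE = 0, TOY_MARKER };

/* Count how many times the walk hands an entry back for full processing. */
static int callbacks(const enum toy_pte ptes[NR_PTES])
{
        int n = 0, i;

        for (i = 0; i < NR_PTES; i++) {
                if (ptes[i] == TOY_NONE)
                        continue;       /* cheap skip, like the pte_none() inner loop */
                n++;                    /* full per-pte handling in the caller */
        }
        return n;
}

int main(void)
{
        enum toy_pte cleared[NR_PTES] = { TOY_NONE };   /* whole batch cleared */
        enum toy_pte marked[NR_PTES];
        int i;

        /* uffd-wp marker install turns the cleared ptes back into non-none. */
        for (i = 0; i < NR_PTES; i++)
                marked[i] = TOY_MARKER;

        printf("cleared to none: %d extra callbacks\n", callbacks(cleared));
        printf("markers present: %d extra callbacks\n", callbacks(marked));
        return 0;
}

With everything cleared to none the walk makes no further callbacks for the
folio; with markers in place it stops at every single entry, which is the lost
batching effect described above.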

>
> Thanks
> Barry
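
For reference, a similar toy model (again illustrative only, not kernel code)
of the two skip strategies discussed above: relying on the walk's inner
pte_none() loop versus advancing the caller's cursor by the batch size, as in
Wei Yang's diff:

#include <stdio.h>

#define NR_PTES 16

/* Inner-walk skipping: every cleared entry is still loaded and tested. */
static int reads_inner_skip(void)
{
        int reads = 0, i;

        for (i = 0; i < NR_PTES; i++)
                reads++;        /* one ptep_get() + pte_none() test per entry */
        return reads;
}

/* Caller-side skipping: jump the cursor past each batch in one step. */
static int reads_caller_skip(int nr_batch)
{
        int reads = 0, i;

        for (i = 0; i < NR_PTES; i += nr_batch)
                reads++;        /* only the first entry of each batch is read */
        return reads;
}

int main(void)
{
        printf("inner skip : %d pte reads\n", reads_inner_skip());
        printf("caller skip: %d pte reads\n", reads_caller_skip(4));
        return 0;
}

Either way, as Barry notes, nr_pages is usually either 1 or
folio_nr_pages(folio), so the caller-side jump rarely has anything to skip in
practice.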
