Message-ID: <1a985416-c8c5-429f-a83a-3c66be939439@linux.alibaba.com>
Date: Thu, 15 May 2025 11:40:59 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
David Hildenbrand <david@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
Kairui Song <kasong@...cent.com>, Chris Li <chrisl@...nel.org>,
Baoquan He <bhe@...hat.com>, Dan Schatzberg <schatzberg.dan@...il.com>,
Kaixiong Yu <yukaixiong@...wei.com>, Fan Ni <fan.ni@...sung.com>,
Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH RFC] mm: make try_to_unmap_one support batched unmap for
anon large folios
On 2025/5/15 09:35, Barry Song wrote:
> On Wed, May 14, 2025 at 8:11 PM Baolin Wang
> <baolin.wang@...ux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/5/13 16:46, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@...o.com>
>>>
>>> My commit 354dffd29575c ("mm: support batched unmap for lazyfree large
>>> folios during reclamation") introduced support for unmapping entire
>>> lazyfree anonymous large folios at once, instead of one page at a time.
>>> This patch extends that support to generic (non-lazyfree) anonymous
>>> large folios.
>>>
>>> Handling __folio_try_share_anon_rmap() and swap_duplicate() becomes
>>> extremely complex—if not outright impractical—for non-exclusive
>>> anonymous folios. As a result, this patch limits support to exclusive
>>> large folios. Fortunately, most anonymous folios are exclusive in
>>> practice, so this restriction should be acceptable in the majority of
>>> cases.
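
To make sure I read the exclusivity restriction correctly, here is a rough
sketch of the resulting behaviour. The helper name below is made up purely for
illustration; the actual gate from the patch is quoted further down in this
mail.

    /*
     * Rough sketch only: folio_ptes_fully_mapped_and_exclusive() is a
     * made-up placeholder for the check the patch performs
     * (can_batch_unmap_folio_ptes() in the snippet quoted below). Only a
     * fully PTE-mapped, exclusive anon folio is unmapped in one go;
     * everything else stays on the existing per-page path, where
     * swap_duplicate() and __folio_try_share_anon_rmap() still run once
     * per PTE.
     */
    if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
        folio_ptes_fully_mapped_and_exclusive(folio, address, pvmw.pte))
        nr_pages = folio_nr_pages(folio);   /* batched unmap */
    else
        nr_pages = 1;                       /* per-page unmap */
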
>>>
>>> SPARC is currently the only architecture that implements
>>> arch_unmap_one(), which also needs to be batched for consistency.
>>> However, this is not yet supported, so the platform is excluded for
>>> now.
>>>
>>> The following micro-benchmark measures the time taken to perform
>>> MADV_PAGEOUT on 256MB of 64KiB anonymous large folios.
>>>
>>> #define _GNU_SOURCE
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <sys/mman.h>
>>> #include <string.h>
>>> #include <time.h>
>>> #include <unistd.h>
>>> #include <errno.h>
>>>
>>> #define SIZE_MB 256
>>> #define SIZE_BYTES (SIZE_MB * 1024 * 1024)
>>>
>>> int main() {
>>>     /* Map 256MB of private anonymous memory. */
>>>     void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
>>>                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>     if (addr == MAP_FAILED) {
>>>         perror("mmap failed");
>>>         return 1;
>>>     }
>>>
>>>     /* Touch every byte so the whole range is faulted in. */
>>>     memset(addr, 0, SIZE_BYTES);
>>>
>>>     struct timespec start, end;
>>>     clock_gettime(CLOCK_MONOTONIC, &start);
>>>
>>>     /* Time how long it takes to reclaim the range via MADV_PAGEOUT. */
>>>     if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
>>>         perror("madvise(MADV_PAGEOUT) failed");
>>>         munmap(addr, SIZE_BYTES);
>>>         return 1;
>>>     }
>>>
>>>     clock_gettime(CLOCK_MONOTONIC, &end);
>>>
>>>     long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
>>>                        (end.tv_nsec - start.tv_nsec);
>>>     printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
>>>            duration_ns, duration_ns / 1e6);
>>>
>>>     munmap(addr, SIZE_BYTES);
>>>     return 0;
>>> }
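
A reproduction note, in case it is useful: the program itself never asks for
large folios, so I assume the 64KiB numbers were collected with that mTHP size
enabled system-wide (e.g. writing "always" to
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled). If the per-size
policy is "madvise" instead, a hint right after mmap(), before the memset()
that faults the range in, should be enough:

    /* only needed when the 64KiB mTHP policy is "madvise" rather than "always" */
    madvise(addr, SIZE_BYTES, MADV_HUGEPAGE);
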
>>>
>>> w/o patch:
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 1337334000 ns (1337.334 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 1340471008 ns (1340.471 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 1385718992 ns (1385.719 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 1366070000 ns (1366.070 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 1347834992 ns (1347.835 ms)
>>>
>>> w/patch:
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 698178000 ns (698.178 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 708570000 ns (708.570 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 693884000 ns (693.884 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 693366000 ns (693.366 ms)
>>> ~ # ./a.out
>>> madvise(MADV_PAGEOUT) took 690790000 ns (690.790 ms)
>>>
>>> We found that the time to reclaim this memory was reduced by half.
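
For reference, averaging the five runs gives roughly 1355 ms without the patch
and 697 ms with it, i.e. about a 1.9x speedup (a ~49% reduction), which matches
the "reduced by half" summary above.
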
>>
>> Do you have some performance numbers for the base page?
>
> We verified that the batched path is only taken when folio_test_large(folio)
> is true; for a base page that check fails, so nr_pages remains 1 for each
> folio:
>
>     if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
>         can_batch_unmap_folio_ptes(address, folio, pvmw.pte,
>                                    anon_exclusive))
>         nr_pages = folio_nr_pages(folio);
>
> I didn't expect any noticeable performance change for base pages, but testing
> shows the patch appears to make them slightly faster—likely due to test noise or
> jitter.
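
I assume the base-page runs were done either with the 64KiB mTHP size disabled
or with the benchmark tweaked to force base pages, e.g. (hypothetical tweak,
not part of the program above):

    /* force order-0 pages: call right after mmap(), before memset() faults the range in */
    madvise(addr, SIZE_BYTES, MADV_NOHUGEPAGE);
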
>
> W/o patch:
>
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5686488000 ns (5686.488 ms)
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5628330992 ns (5628.331 ms)
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5771742992 ns (5771.743 ms)
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5672108000 ns (5672.108 ms)
>
>
> W/ patch:
>
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5481578000 ns (5481.578 ms)
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5425394992 ns (5425.395 ms)
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5522109008 ns (5522.109 ms)
> ~ # ./a.out
> madvise(MADV_PAGEOUT) took 5506832000 ns (5506.832 ms)
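
Averaging these four runs gives roughly 5690 ms without the patch and 5484 ms
with it, a difference of about 3.6%, which is indeed small enough to be
run-to-run noise as you say.
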
Thanks. My expectation is also that batching the unmap of large folios should
not affect the performance of base pages, but it would be best to state this
clearly in the commit message.