Message-ID: <aGOyhvR-GaUYgLwQ@hyeyoo>
Date: Tue, 1 Jul 2025 19:03:50 +0900
From: Harry Yoo <harry.yoo@...cle.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
        baolin.wang@...ux.alibaba.com, chrisl@...nel.org, david@...hat.com,
        ioworker0@...il.com, kasong@...cent.com,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-riscv@...ts.infradead.org, lorenzo.stoakes@...cle.com,
        ryan.roberts@....com, v-songbaohua@...o.com, x86@...nel.org,
        ying.huang@...el.com, zhengtangquan@...o.com
Subject: Re: [PATCH v4 3/4] mm: Support batched unmap for lazyfree large
 folios during reclamation

On Fri, Feb 14, 2025 at 10:30:14PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@...o.com>
> 
> Currently, the PTEs and rmap of a large folio are removed one at a time.
> This is not only slow but also causes the large folio to be unnecessarily
> added to deferred_split, which can lead to races between the
> deferred_split shrinker callback and memory reclamation. This patch
> releases all PTEs and rmap entries in a batch.
> Currently, it only handles lazyfree large folios.
> 
> The below microbench tries to reclaim 128MB lazyfree large folios
> whose sizes are 64KiB:
> 
>  #include <stdio.h>
>  #include <sys/mman.h>
>  #include <string.h>
>  #include <time.h>
> 
>  #define SIZE 128*1024*1024  // 128 MB
> 
>  unsigned long read_split_deferred()
>  {
>  	FILE *file = fopen("/sys/kernel/mm/transparent_hugepage"
> 			"/hugepages-64kB/stats/split_deferred", "r");
>  	if (!file) {
>  		perror("Error opening file");
>  		return 0;
>  	}
> 
>  	unsigned long value;
>  	if (fscanf(file, "%lu", &value) != 1) {
>  		perror("Error reading value");
>  		fclose(file);
>  		return 0;
>  	}
> 
>  	fclose(file);
>  	return value;
>  }
> 
>  int main(int argc, char *argv[])
>  {
>  	while(1) {
>  		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>  				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
>  		memset((void *)p, 1, SIZE);
> 
>  		madvise((void *)p, SIZE, MADV_FREE);
> 
>  		clock_t start_time = clock();
>  		unsigned long start_split = read_split_deferred();
>  		madvise((void *)p, SIZE, MADV_PAGEOUT);
>  		clock_t end_time = clock();
>  		unsigned long end_split = read_split_deferred();
> 
>  		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
>  		printf("Time taken by reclamation: %f seconds, split_deferred: %ld\n",
>  			elapsed_time, end_split - start_split);
> 
>  		munmap((void *)p, SIZE);
>  	}
>  	return 0;
>  }
> 
> w/o patch:
> ~ # ./a.out
> Time taken by reclamation: 0.177418 seconds, split_deferred: 2048
> Time taken by reclamation: 0.178348 seconds, split_deferred: 2048
> Time taken by reclamation: 0.174525 seconds, split_deferred: 2048
> Time taken by reclamation: 0.171620 seconds, split_deferred: 2048
> Time taken by reclamation: 0.172241 seconds, split_deferred: 2048
> Time taken by reclamation: 0.174003 seconds, split_deferred: 2048
> Time taken by reclamation: 0.171058 seconds, split_deferred: 2048
> Time taken by reclamation: 0.171993 seconds, split_deferred: 2048
> Time taken by reclamation: 0.169829 seconds, split_deferred: 2048
> Time taken by reclamation: 0.172895 seconds, split_deferred: 2048
> Time taken by reclamation: 0.176063 seconds, split_deferred: 2048
> Time taken by reclamation: 0.172568 seconds, split_deferred: 2048
> Time taken by reclamation: 0.171185 seconds, split_deferred: 2048
> Time taken by reclamation: 0.170632 seconds, split_deferred: 2048
> Time taken by reclamation: 0.170208 seconds, split_deferred: 2048
> Time taken by reclamation: 0.174192 seconds, split_deferred: 2048
> ...
> 
> w/ patch:
> ~ # ./a.out
> Time taken by reclamation: 0.074231 seconds, split_deferred: 0
> Time taken by reclamation: 0.071026 seconds, split_deferred: 0
> Time taken by reclamation: 0.072029 seconds, split_deferred: 0
> Time taken by reclamation: 0.071873 seconds, split_deferred: 0
> Time taken by reclamation: 0.073573 seconds, split_deferred: 0
> Time taken by reclamation: 0.071906 seconds, split_deferred: 0
> Time taken by reclamation: 0.073604 seconds, split_deferred: 0
> Time taken by reclamation: 0.075903 seconds, split_deferred: 0
> Time taken by reclamation: 0.073191 seconds, split_deferred: 0
> Time taken by reclamation: 0.071228 seconds, split_deferred: 0
> Time taken by reclamation: 0.071391 seconds, split_deferred: 0
> Time taken by reclamation: 0.071468 seconds, split_deferred: 0
> Time taken by reclamation: 0.071896 seconds, split_deferred: 0
> Time taken by reclamation: 0.072508 seconds, split_deferred: 0
> Time taken by reclamation: 0.071884 seconds, split_deferred: 0
> Time taken by reclamation: 0.072433 seconds, split_deferred: 0
> Time taken by reclamation: 0.071939 seconds, split_deferred: 0
> ...
> 
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
> ---

I'm still following the long discussions and follow-up patch series,
but let me ask a possibly silly question here :)

>  mm/rmap.c | 72 ++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 50 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 89e51a7a9509..8786704bd466 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1933,23 +1953,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			if (pte_dirty(pteval))
>  				folio_mark_dirty(folio);
>  		} else if (likely(pte_present(pteval))) {
> -			flush_cache_page(vma, address, pfn);
> -			/* Nuke the page table entry. */
> -			if (should_defer_flush(mm, flags)) {
> -				/*
> -				 * We clear the PTE but do not flush so potentially
> -				 * a remote CPU could still be writing to the folio.
> -				 * If the entry was previously clean then the
> -				 * architecture must guarantee that a clear->dirty
> -				 * transition on a cached TLB entry is written through
> -				 * and traps if the PTE is unmapped.
> -				 */
> -				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> +			if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
> +			    can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
> +				nr_pages = folio_nr_pages(folio);
> +			end_addr = address + nr_pages * PAGE_SIZE;
> +			flush_cache_range(vma, address, end_addr);
>  
> -				set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
> -			} else {
> -				pteval = ptep_clear_flush(vma, address, pvmw.pte);
> -			}
> +			/* Nuke the page table entry. */
> +			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
> +			/*
> +			 * We clear the PTE but do not flush so potentially
> +			 * a remote CPU could still be writing to the folio.
> +			 * If the entry was previously clean then the
> +			 * architecture must guarantee that a clear->dirty
> +			 * transition on a cached TLB entry is written through
> +			 * and traps if the PTE is unmapped.
> +			 */
> +			if (should_defer_flush(mm, flags))
> +				set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);

When the first pte of a PTE-mapped THP has the _PAGE_PROTNONE bit set
(by NUMA balancing), can set_tlb_ubc_flush_pending() mistakenly conclude
that it doesn't need to flush the whole range, even though some ptes in
the range don't have the _PAGE_PROTNONE bit set?

> +			else
> +				flush_tlb_range(vma, address, end_addr);
>  			if (pte_dirty(pteval))
>  				folio_mark_dirty(folio);
>  		} else {

-- 
Cheers,
Harry / Hyeonggon
