Message-ID: <755343a1-ce94-4d38-8317-0925e2dae3bc@arm.com>
Date: Mon, 27 Nov 2023 09:24:31 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, andreyknvl@...il.com,
anshuman.khandual@....com, ardb@...nel.org,
catalin.marinas@....com, david@...hat.com, dvyukov@...gle.com,
glider@...gle.com, james.morse@....com, jhubbard@...dia.com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, mark.rutland@....com, maz@...nel.org,
oliver.upton@...ux.dev, ryabinin.a.a@...il.com,
suzuki.poulose@....com, vincenzo.frascino@....com,
wangkefeng.wang@...wei.com, will@...nel.org, willy@...radead.org,
yuzenghui@...wei.com, yuzhao@...gle.com, ziy@...dia.com
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
On 27/11/2023 05:54, Barry Song wrote:
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> + pte_t *dst_pte, pte_t *src_pte,
>> + unsigned long addr, unsigned long end,
>> + int *rss, struct folio **prealloc)
>> {
>> struct mm_struct *src_mm = src_vma->vm_mm;
>> unsigned long vm_flags = src_vma->vm_flags;
>> pte_t pte = ptep_get(src_pte);
>> struct page *page;
>> struct folio *folio;
>> + int nr = 1;
>> + bool anon;
>> + bool any_dirty = pte_dirty(pte);
>> + int i;
>>
>> page = vm_normal_page(src_vma, addr, pte);
>> - if (page)
>> + if (page) {
>> folio = page_folio(page);
>> - if (page && folio_test_anon(folio)) {
>> - /*
>> - * If this page may have been pinned by the parent process,
>> - * copy the page immediately for the child so that we'll always
>> - * guarantee the pinned page won't be randomly replaced in the
>> - * future.
>> - */
>> - folio_get(folio);
>> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> - /* Page may be pinned, we have to copy. */
>> - folio_put(folio);
>> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> - addr, rss, prealloc, page);
>> + anon = folio_test_anon(folio);
>> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> + end, pte, &any_dirty);
>
> in case we have a large folio with 16 CONTPTE basepages, and userspace
> does madvise(addr + 4KB * 5, DONTNEED);
nit: if you are offsetting by 5 pages from addr, then below I think you mean
page0~page4 and page6~page15?
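Just to be sure we're describing the same scenario, I'm reading it as something
like the below (illustrative only; addr is assumed to be the start of a 64K
region backed by a single 16-page large folio):

#include <sys/mman.h>

/*
 * Discard only the 6th 4K basepage (page5, zero-indexed), leaving
 * page0~page4 and page6~page15 mapped.
 */
void punch_hole(char *addr)
{
	madvise(addr + 5 * 4096, 4096, MADV_DONTNEED);
}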
>
> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> will return 15. in this case, we should copy page0~page3 and page5~page15.
No, I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
not how it's intended to work. The function scans forwards from the current
pte until it finds the first pte that does not fit in the batch - either because
it maps a PFN that is not contiguous, or because the permissions are different
(although this is being relaxed a bit; see the conversation with DavidH against
this same patch).
So the first time through this loop, folio_nr_pages_cont_mapped() will return 5
(page0~page4), then the next time through the loop we will go through the
!present path and process the single swap marker. Then the 3rd time through the
loop, folio_nr_pages_cont_mapped() will return 10 (page6~page15).
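To make that concrete, the scanning logic I'm describing is roughly the below.
This is an illustrative sketch only, not the function from this patch: it omits
the dirty accumulation and the pgprot handling mentioned above, and the name and
helpers used (folio_page_idx(), min_t()) are just what I'd reach for in a sketch.

static int nr_cont_mapped_sketch(struct folio *folio, struct page *page,
				 pte_t *ptep, unsigned long addr,
				 unsigned long end, pte_t pte)
{
	unsigned long expected_pfn = pte_pfn(pte) + 1;
	/* Never scan past the end of the range or the end of the folio. */
	int max_nr = min_t(int, (end - addr) >> PAGE_SHIFT,
			   folio_nr_pages(folio) -
			   folio_page_idx(folio, page));
	int nr;

	for (nr = 1; nr < max_nr; nr++, expected_pfn++) {
		pte_t cur = ptep_get(++ptep);

		/*
		 * A pte that is not present (a hole left by MADV_DONTNEED, a
		 * swap entry, ...) or that maps a non-contiguous pfn ends the
		 * batch; holes stop the scan, they are never skipped over.
		 */
		if (!pte_present(cur) || pte_pfn(cur) != expected_pfn)
			break;
	}

	return nr;
}

So for the hole-at-page5 case, the scan starting at page0 stops at the
non-present pte and returns 5, and the scan starting at page6 runs to the end
of the folio and returns 10.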
Thanks,
Ryan
>
> but the current code is copying page0~page14, right? unless we immediately
> split_folio to basepages in zap_pte_range(), we will have problems?
>
>> +
>> + for (i = 0; i < nr; i++, page++) {
>> + if (anon) {
>> + /*
>> + * If this page may have been pinned by the
>> + * parent process, copy the page immediately for
>> + * the child so that we'll always guarantee the
>> + * pinned page won't be randomly replaced in the
>> + * future.
>> + */
>> + if (unlikely(page_try_dup_anon_rmap(
>> + page, false, src_vma))) {
>> + if (i != 0)
>> + break;
>> + /* Page may be pinned, we have to copy. */
>> + return copy_present_page(
>> + dst_vma, src_vma, dst_pte,
>> + src_pte, addr, rss, prealloc,
>> + page);
>> + }
>> + rss[MM_ANONPAGES]++;
>> + VM_BUG_ON(PageAnonExclusive(page));
>> + } else {
>> + page_dup_file_rmap(page, false);
>> + rss[mm_counter_file(page)]++;
>> + }
>
> Thanks
> Barry
>