Message-ID: <ca198de7-1ebf-4749-855c-f8358046cd22@arm.com>
Date: Wed, 18 Dec 2024 14:54:15 +0530
From: Dev Jain <dev.jain@....com>
To: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
david@...hat.com, willy@...radead.org, kirill.shutemov@...ux.intel.com
Cc: anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org,
vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
jhubbard@...dia.com, 21cnbao@...il.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
On 17/12/24 11:45 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
>> derived in conjunction with sysfs THP settings. Scale down the tunables; in
>> case of collapse failure, we drop down to the next order. Otherwise, we try to
>> jump to the highest possible order and then start a fresh scan. Note that
>> madvise(MADV_COLLAPSE) has not been generalized.
> Is there a reason you are not modifying MADV_COLLAPSE? It's really just a
> synchronous way to do what khugepaged does asynchronously (isn't it?), so it would
> behave the same way in an ideal world.
Correct, but I started running into return value problems for madvise(). For example,
the return value of hpage_collapse_scan_ptes() will be the return value of the last
mTHP scan. In that case, what do we want madvise() to return? If I collapse the range
into multiple 64K mTHPs, I should still return failure, because otherwise the caller
would logically assume that MADV_COLLAPSE succeeded and that a PMD hugepage is mapped
there. But returning failure is misleading too, since the caller did end up collapsing
memory... and if I return success instead, the khugepaged selftest starts failing.
Basically, this would be (kind of?) an ABI change, and I really didn't want to sway
the discussion away from khugepaged, so I just kept it simple :)
>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> ---
>> mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 69 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 886c76816963..078794aa3335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>> #include <linux/swapops.h>
>> #include <linux/shmem_fs.h>
>> #include <linux/ksm.h>
>> +#include <linux/count_zeros.h>
>>
>> #include <asm/tlb.h>
>> #include <asm/pgalloc.h>
>> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>> }
>>
>> static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> - int referenced, int unmapped,
>> + int referenced, int unmapped, int order,
>> struct collapse_control *cc)
>> {
>> LIST_HEAD(compound_pagelist);
>> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> unsigned long address, bool *mmap_locked,
>> struct collapse_control *cc)
>> {
>> - pmd_t *pmd;
>> - pte_t *pte, *_pte;
>> - int result = SCAN_FAIL, referenced = 0;
>> - int none_or_zero = 0, shared = 0;
>> - struct page *page = NULL;
>> + unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
>> + int referenced, shared, none_or_zero, unmapped;
>> + unsigned long _address, org_address = address;
> nit: Perhaps it's clearer to keep the original address in address and use a
> variable, start, for the starting point of each scan?
Probably...will keep it in mind.
>
>> struct folio *folio = NULL;
>> - unsigned long _address;
>> - spinlock_t *ptl;
>> - int node = NUMA_NO_NODE, unmapped = 0;
>> + struct page *page = NULL;
>> + int node = NUMA_NO_NODE;
>> + int result = SCAN_FAIL;
>> bool writable = false;
>> + unsigned long orders;
>> + pte_t *pte, *_pte;
>> + spinlock_t *ptl;
>> + pmd_t *pmd;
>> + int order;
>>
>> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>
>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
> Perhaps THP_ORDERS_ALL instead of "BIT(PMD_ORDER + 1) - 1"?
Ah yes, THP_ORDERS_ALL_ANON.
>
>> + orders = thp_vma_suitable_orders(vma, address, orders);
>> + order = highest_order(orders);
>> +
>> + /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
>> + if (!cc->is_khugepaged)
>> + order = HPAGE_PMD_ORDER;
>> +
>> +scan_pte_range:
>> +
>> + max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> + max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> + max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
>> + referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
>> +
>> + /* Check pmd after taking mmap lock */
>> result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> if (result != SCAN_SUCCEED)
>> goto out;
>>
>> memset(cc->node_load, 0, sizeof(cc->node_load));
>> nodes_clear(cc->alloc_nmask);
>> +
>> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>> if (!pte) {
>> result = SCAN_PMD_NULL;
>> goto out;
>> }
>>
>> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> + for (_address = address, _pte = pte; _pte < pte + (1UL << order);
>> _pte++, _address += PAGE_SIZE) {
>> pte_t pteval = ptep_get(_pte);
>> if (is_swap_pte(pteval)) {
>> ++unmapped;
>> if (!cc->is_khugepaged ||
>> - unmapped <= khugepaged_max_ptes_swap) {
>> + unmapped <= max_ptes_swap) {
>> /*
>> * Always be strict with uffd-wp
>> * enabled swap entries. Please see
>> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> ++none_or_zero;
>> if (!userfaultfd_armed(vma) &&
>> (!cc->is_khugepaged ||
>> - none_or_zero <= khugepaged_max_ptes_none)) {
>> + none_or_zero <= max_ptes_none)) {
>> continue;
>> } else {
>> result = SCAN_EXCEED_NONE_PTE;
>> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> if (folio_likely_mapped_shared(folio)) {
>> ++shared;
>> if (cc->is_khugepaged &&
>> - shared > khugepaged_max_ptes_shared) {
>> + shared > max_ptes_shared) {
>> result = SCAN_EXCEED_SHARED_PTE;
>> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> goto out_unmap;
>> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> result = SCAN_PAGE_RO;
>> } else if (cc->is_khugepaged &&
>> (!referenced ||
>> - (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> + (unmapped && referenced < (1UL << order) / 2))) {
>> result = SCAN_LACK_REFERENCED_PAGE;
>> } else {
>> result = SCAN_SUCCEED;
>> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>> pte_unmap_unlock(pte, ptl);
>> if (result == SCAN_SUCCEED) {
>> result = collapse_huge_page(mm, address, referenced,
>> - unmapped, cc);
>> + unmapped, order, cc);
>> /* collapse_huge_page will return with the mmap_lock released */
>> *mmap_locked = false;
>> +
>> + /* Immediately exit on exhaustion of range */
>> + if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> + goto out;
> Looks like this assumes this function is always asked to scan a full PTE table?
> Does that mean that you can't handle collapse for VMAs that don't span a whole
> PMD entry? I think we will want to support that.
Correct. Yes, we should support that; otherwise khugepaged will scan only large VMAs.
Will have to make a change in khugepaged_scan_mm_slot() for the anon case.
>
>> + }
>> + if (result != SCAN_SUCCEED) {
>> +
>> + /* Go to the next order. */
>> + order = next_order(&orders, order);
>> + if (order < 2)
> This should be:
> if (!orders)
>
> I think the return order is undefined when order is the last order in orders.
The returned order is -1, from what I could gather from reading the code.
>
>> + goto out;
>> + goto maybe_mmap_lock;
>> + } else {
>> + address = _address;
>> + pte = _pte;
>> +
>> +
>> + /* Get highest order possible starting from address */
>> + order = count_trailing_zeros(address >> PAGE_SHIFT);
>> +
>> + /* This needs to be present in the mask too */
>> + if (!(orders & (1UL << order)))
>> + order = next_order(&orders, order);
> Not quite; if the exact order isn't in the bitmap, this will pick out the
> highest order in the bitmap, which may be higher than count_trailing_zeros()
> returned.
Oh okay, nice catch.
> You could do:
>
> order = count_trailing_zeros(address >> PAGE_SHIFT);
> orders &= (1UL << order + 1) - 1;
> order = next_order(&orders, order);
> if (!orders)
> goto out;
>
> That will mask out any orders that are bigger than the one returned by
> count_trailing_zeros() then next_order() will return the highest order in the
> remaining set.
>
> But even that doesn't quite work because next_order() is destructive. Once you
> arrive on a higher order address boundary, you want to be able to select a
> higher order from the original orders bitmap. But you have lost them on a
> previous go around the loop.
>
> Perhaps stash orig_orders at the top of the function when you first calculate
> it. Then I think this works (totally untested):
>
> order = count_trailing_zeros(address >> PAGE_SHIFT);
> orders = orig_orders & (1UL << order + 1) - 1;
> order = next_order(&orders, order);
> if (!orders)
> goto out;
>
> You might want to do something like this for the first go around the loop, but I
> think address is currently always at the start of the PMD on entry, so not
> needed until that restriction is removed.
Will take a look. We just need order <= the order derived from the trailing zeroes,
and then the highest enabled order at or below that in the bitmask; shouldn't
be too complicated.
>
>> + if (order < 2)
>> + goto out;
>> +
>> +maybe_mmap_lock:
>> + if (!(*mmap_locked)) {
>> + mmap_read_lock(mm);
> Given the lock was already held in read mode on entering this function, then
> released by collapse_huge_page(), is it definitely safe to retake this lock and
> rerun this function? Is it possible that state that was checked before entering
> this function has changed since the lock was released that would now need
> re-checking?
Thanks, what I am missing is a hugepage_vma_revalidate().
>> + *mmap_locked = true;
>> + }
>> + goto scan_pte_range;
>> }
>> out:
>> trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,