Message-ID: <972a602e-27c0-48bc-aa69-acf6425a3871@os.amperecomputing.com>
Date: Fri, 30 May 2025 14:33:12 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, Dev Jain <dev.jain@....com>,
akpm@...ux-foundation.org, david@...hat.com, catalin.marinas@....com,
will@...nel.org
Cc: lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, suzuki.poulose@....com, steven.price@....com,
gshan@...hat.com, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH 1/3] mm: Allow pagewalk without locks
On 5/30/25 3:33 AM, Ryan Roberts wrote:
> On 30/05/2025 10:04, Dev Jain wrote:
>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
>> locks.
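(As a point of reference, a minimal sketch of what a PGWALK_NOLOCK caller
could look like; the ops struct and callback names below are made up for
illustration - only walk_page_range_novma() and the new enum value come
from this patch:)

	/* Sketch: walking a kernel VA range from non-sleepable context. */
	static int nolock_pte_entry(pte_t *pte, unsigned long addr,
				    unsigned long next, struct mm_walk *walk)
	{
		/* Inspect or update *pte; must not sleep. No lock is taken
		 * by the walk core - the caller guarantees the page tables
		 * for [start, end) cannot be freed concurrently. */
		return 0;
	}

	static const struct mm_walk_ops nolock_ops = {
		.pte_entry	= nolock_pte_entry,
		.walk_lock	= PGWALK_NOLOCK,
	};

	/* e.g. from softirq context: */
	walk_page_range_novma(&init_mm, start, end, &nolock_ops, NULL, NULL);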
> It looks like riscv solved this problem by moving from walk_page_range_novma()
> to apply_to_existing_page_range() in Commit fb1cf0878328 ("riscv: rewrite
> __kernel_map_pages() to fix sleeping in invalid context"). That won't work for
> us because the whole point is that we want to support changing permissions on
> block mappings.
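(For comparison, the shape of that riscv approach, paraphrased; the
callback body is elided and its name is made up here:)

	static int fix_pte(pte_t *ptep, unsigned long addr, void *data)
	{
		/* Operates on PTE-level entries only, which is exactly why
		 * it cannot cover arm64 block (PMD/PUD) mappings. */
		return 0;
	}

	apply_to_existing_page_range(&init_mm, start, size, fix_pte, NULL);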
>
> Yang:
>
> Not directly relevant to this patch, but I do worry about the potential need to
> split the range here once Yang's series comes in - we would need to
> allocate memory for pgtables atomically in softirq context. KFENCE is intended
> to be enabled in production IIRC, so we can't simply disallow block mappings when
> KFENCE is enabled and will likely need to think of a solution for this?
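(If a split in softirq context ever did become necessary, the pgtable
allocation would have to be atomic, roughly along these lines; this is a
hypothetical fragment, not from any posted series:)

	/* Cannot sleep here, so GFP_KERNEL is off the table. */
	pte_t *ptep = (pte_t *)__get_free_page(GFP_ATOMIC | __GFP_ZERO);

	if (!ptep)
		return -ENOMEM;	/* atomic reserves can fail under pressure */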
IIRC KFENCE typically uses a dedicated pool by default (when the sample
interval is not 0 on arm64); the pool is separate from the linear mapping
and is mapped at PTE level. This should be the preferred way for
production environments.

But if the pool is not used early, typically with a 0 sample interval, we
have to keep the whole linear mapping mapped at PTE level always. I don't
see a simple solution for it.
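(Roughly the current arm64 logic, paraphrased from memory of
arch/arm64/mm - not verbatim, please check the tree:)

	bool can_set_direct_map(void)
	{
		return rodata_full || debug_pagealloc_enabled() ||
		       arm64_kfence_can_set_direct_map();
	}

	/* arm64_kfence_can_set_direct_map() returns true only when the
	 * KFENCE pool is allocated late, i.e. sample_interval was 0 at
	 * boot; in that case map_mem() sets
	 *	flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
	 * so the whole linear map drops to PTE granularity. */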
Thanks,
Yang
>
>
>> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
>>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> ---
>> include/linux/pagewalk.h | 2 ++
>> mm/pagewalk.c | 12 ++++++++----
>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 9700a29f8afb..9bc8853ed3de 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>> PGWALK_WRLOCK = 1,
>> /* vma is expected to be already write-locked during the walk */
>> PGWALK_WRLOCK_VERIFY = 2,
>> + /* no lock is needed */
>> + PGWALK_NOLOCK = 3,
> I'd imagine you either want to explicitly forbid this option for the other
> entrypoints (i.e. the non-_novma variants) or you need to be able to handle
> this option being passed in to the other functions, which you currently don't
> do. I'd vote for explicitly disallowing (document it and return an error if passed in).
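(i.e. presumably something like this at the top of each non-_novma
entrypoint:)

	if (ops->walk_lock == PGWALK_NOLOCK)
		return -EINVAL;	/* only meaningful for walk_page_range_novma() */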
>
>> };
>>
>> /**
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index e478777c86e1..9657cf4664b2 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>> case PGWALK_RDLOCK:
>> /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>> break;
>> + default:
>> + break;
>> }
>> #endif
>> }
>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>> * specified address range from being freed. The caller should take
>> * other actions to prevent this race.
>> */
>> - if (mm == &init_mm)
>> - mmap_assert_locked(walk.mm);
> Given apply_to_page_range() doesn't do any locking for kernel pgtable walking, I
> can be convinced that it's also not required for our case using this framework.
> But why does this framework believe it is necessary? Should the comment above
> this at least be updated?
>
> Thanks,
> Ryan
>
>> - else
>> - mmap_assert_write_locked(walk.mm);
>> + if (ops->walk_lock != PGWALK_NOLOCK) {
>> + if (mm == &init_mm)
>> + mmap_assert_locked(walk.mm);
>> + else
>> + mmap_assert_write_locked(walk.mm);
>> + }
>>
>> return walk_pgd_range(start, end, &walk);
>> }