Message-ID: <90ed2574-74f8-47a0-ac46-4b9418c3242d@arm.com>
Date: Tue, 2 Sep 2025 17:54:07 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: Will Deacon <will@...nel.org>, Mark Rutland <mark.rutland@....com>,
James Morse <james.morse@....com>, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 2/2] arm64: tlbflush: Don't broadcast if mm was
only active on local cpu
On 02/09/2025 17:23, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:08PM +0100, Ryan Roberts wrote:
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index f66b8c4696d0..651440e0aff9 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -182,6 +182,52 @@ static inline unsigned long get_trans_granule(void)
>> (__pages >> (5 * (scale) + 1)) - 1; \
>> })
>>
>> +/*
>> + * Determines whether the user tlbi invalidation can be performed only on the
>> + * local CPU or whether it needs to be broadcast. (Returns true for local).
>> + * Additionally issues the appropriate barrier to ensure prior pgtable updates are
>> + * visible to the table walker. Must be paired with flush_tlb_user_post().
>> + */
>> +static inline bool flush_tlb_user_pre(struct mm_struct *mm)
>> +{
>> + unsigned int self, active;
>> + bool local;
>> +
>> + migrate_disable();
>> +
>> + self = smp_processor_id();
>> +
>> + /*
>> + * The load of mm->context.active_cpu must not be reordered before the
>> + * store to the pgtable that necessitated this flush. This ensures that
>> + * if the value read is our cpu id, then no other cpu can have seen the
>> + * old pgtable value and therefore does not need this old value to be
>> + * flushed from its tlb. But we don't want to upgrade the dsb(ishst),
>> + * needed to make the pgtable updates visible to the walker, to a
>> + * dsb(ish) by default. So speculatively load without a barrier and if
>> + * it indicates our cpu id, then upgrade the barrier and re-load.
>> + */
>> + active = READ_ONCE(mm->context.active_cpu);
>> + if (active == self) {
>> + dsb(ish);
>> + active = READ_ONCE(mm->context.active_cpu);
>> + } else {
>> + dsb(ishst);
>> + }
>
> Does the ISH vs ISHST make much difference in practice? I wonder whether
> we could keep this simple.
I don't know. I was being conservative - I'm a bit nervous about upgrading a
barrier unconditionally. I'll run some benchmarks with the simpler version and
compare.
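
Something like this, presumably (untested sketch; I'm assuming the elided
tail of flush_tlb_user_pre() just compares the loaded value against
smp_processor_id() and returns, per the "(Returns true for local)" comment):

/*
 * Simpler variant: pay for dsb(ish) unconditionally so that a single
 * load of mm->context.active_cpu suffices and the re-load dance goes
 * away entirely.
 */
static inline bool flush_tlb_user_pre(struct mm_struct *mm)
{
	migrate_disable();

	/*
	 * dsb(ish) makes the prior pgtable stores visible to the walker
	 * and orders them before the load of active_cpu below.
	 */
	dsb(ish);

	return READ_ONCE(mm->context.active_cpu) == smp_processor_id();
}
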
>
>> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
>> index b2ac06246327..adf4fc26ddb6 100644
>> --- a/arch/arm64/mm/context.c
>> +++ b/arch/arm64/mm/context.c
>> @@ -214,9 +214,10 @@ static u64 new_context(struct mm_struct *mm)
>>
>> void check_and_switch_context(struct mm_struct *mm)
>> {
>> - unsigned long flags;
>> - unsigned int cpu;
>> + unsigned int cpu = smp_processor_id();
>> u64 asid, old_active_asid;
>> + unsigned int active;
>> + unsigned long flags;
>>
>> if (system_supports_cnp())
>> cpu_set_reserved_ttbr0();
>> @@ -251,7 +252,6 @@ void check_and_switch_context(struct mm_struct *mm)
>> atomic64_set(&mm->context.id, asid);
>> }
>>
>> - cpu = smp_processor_id();
>> if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>> local_flush_tlb_all();
>>
>> @@ -262,6 +262,30 @@ void check_and_switch_context(struct mm_struct *mm)
>>
>> arm64_apply_bp_hardening();
>>
>> + /*
>> + * Update mm->context.active_cpu in such a manner that we avoid cmpxchg
>> + * and dsb unless we definitely need it. If initially ACTIVE_CPU_NONE
>> + * then we are the first cpu to run so set it to our id. If initially
>> + * any id other than ours, we are the second cpu to run so set it to
>> + * ACTIVE_CPU_MULTIPLE. If we update the value then we must issue
>> + * dsb(ishst) to ensure stores to mm->context.active_cpu are ordered
>> + * against the TTBR0 write in cpu_switch_mm()/uaccess_enable(); the
>> + * store must be visible to another cpu before this cpu could have
>> + * populated any TLB entries based on the pgtables that will be
>> + * installed.
>> + */
>> + active = READ_ONCE(mm->context.active_cpu);
>> + if (active != cpu && active != ACTIVE_CPU_MULTIPLE) {
>> + if (active == ACTIVE_CPU_NONE)
>> + active = cmpxchg_relaxed(&mm->context.active_cpu,
>> + ACTIVE_CPU_NONE, cpu);
>> +
>> + if (active != ACTIVE_CPU_NONE)
>> + WRITE_ONCE(mm->context.active_cpu, ACTIVE_CPU_MULTIPLE);
>> +
>> + dsb(ishst);
>> + }
>
> I think this works. I recall James had a litmus test for the model
> checker confirming this.
>
> In a cut-down version, we'd have:
>
> P0:                       P1:
>
> set_pte();                WRITE_ONCE(active_cpu);
> dsb();                    dsb();
> READ_ONCE(active_cpu);    READ_ONCE(pte);
>
> The pte read on P1 is implied by the subsequent TTBR0 write. As you
> described, if P0 did not see the active_cpu update, then P1 must have
> seen the updated pte.
>
> So far I couldn't get this to fail, so:
>
> Reviewed-by: Catalin Marinas <catalin.marinas@....com>
Hmm... Rb at v1... are you feeling ok? :)
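
For the archive, here is a rough LKMM approximation of that cut-down test
(hypothetical sketch: James's actual litmus test would presumably target
the Arm model with real DSBs, so smp_mb() is only standing in for dsb()
here). Running it through herd7 with tools/memory-model should report the
exists clause as never matching:

C active-cpu-sb

(*
 * pte stands in for the pgtable store and active_cpu for
 * mm->context.active_cpu. The exists clause is the bad outcome:
 * P0 misses the active_cpu update *and* P1 misses the pte update.
 *)

{}

P0(int *pte, int *active_cpu)
{
	int r0;

	WRITE_ONCE(*pte, 1);		/* set_pte() */
	smp_mb();			/* dsb() */
	r0 = READ_ONCE(*active_cpu);
}

P1(int *pte, int *active_cpu)
{
	int r0;

	WRITE_ONCE(*active_cpu, 1);
	smp_mb();			/* dsb() */
	r0 = READ_ONCE(*pte);		/* implied by the TTBR0 write */
}

exists (0:r0=0 /\ 1:r0=0)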