Message-ID: <bc8f62a8-8e9d-45a2-8d77-e193b12d465d@arm.com>
Date: Thu, 23 Oct 2025 11:18:53 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Barry Song <21cnbao@...il.com>, Huang Ying <ying.huang@...ux.alibaba.com>
Cc: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Yang Shi <yang@...amperecomputing.com>,
"Christoph Lameter (Ampere)" <cl@...two.org>, Dev Jain <dev.jain@....com>,
Anshuman Khandual <anshuman.khandual@....com>,
Yicong Yang <yangyicong@...ilicon.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Kevin Brodsky <kevin.brodsky@....com>,
Yin Fengwei <fengwei_yin@...ux.alibaba.com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH -v2 2/2] arm64, tlbflush: don't TLBI broadcast if page
reused in write fault
Hi Barry, Huang,
On 22/10/2025 05:08, Barry Song wrote:
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index aa89c2e67ebc..35bae2e4bcfe 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> /*
>> - * Outside of a few very special situations (e.g. hibernation), we always
>> - * use broadcast TLB invalidation instructions, therefore a spurious page
>> - * fault on one CPU which has been handled concurrently by another CPU
>> - * does not need to perform additional invalidation.
>> + * We use a local TLB invalidation instruction when reusing a page in
>> + * the write protection fault handler to avoid TLBI broadcast in the
>> + * hot path. This may cause spurious page faults if stale read-only
>> + * TLB entries exist.
>> */
>> -#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
>> +#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
>> + local_flush_tlb_page_nonotify(vma, address)
>> +
>> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
>> + local_flush_tlb_page_nonotify(vma, address)
>>
>> /*
>> * ZERO_PAGE is a global shared page that is always zero: used
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 18a5dc0c9a54..651b31fd18bb 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -249,6 +249,18 @@ static inline unsigned long get_trans_granule(void)
>> * cannot be easily determined, the value TLBI_TTL_UNKNOWN will
>> * perform a non-hinted invalidation.
>> *
>> + * local_flush_tlb_page(vma, addr)
>> + * Local variant of flush_tlb_page(). Stale TLB entries may
>> + * remain in remote CPUs.
>> + *
>> + * local_flush_tlb_page_nonotify(vma, addr)
>> + * Same as local_flush_tlb_page() except MMU notifier will not be
>> + * called.
>> + *
>> + * local_flush_tlb_contpte_range(vma, start, end)
>> + * Invalidate the virtual-address range '[start, end)' mapped with
>> + * contpte on local CPU for the user address space corresponding
>> + * to 'vma->mm'. Stale TLB entries may remain in remote CPUs.
>> *
>> * Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
>> * on top of these routines, since that is our interface to the mmu_gather
>> @@ -282,6 +294,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
>> }
>>
>> +static inline void __local_flush_tlb_page_nonotify_nosync(
>> + struct mm_struct *mm, unsigned long uaddr)
>> +{
>> + unsigned long addr;
>> +
>> + dsb(nshst);
I've skimmed through this thread, so apologies if I've missed some of the
detail, but I thought it might be useful to give my opinion as a summary...
> We were issuing dsb(ishst) even for the nosync case, likely to ensure
> PTE visibility across cores.
The leading dsb prior to issuing the tlbi is to ensure that the HW table
walker(s) will always see the new pte immediately after the tlbi completes.
Without it, you could end up with the old value immediately re-cached after the
tlbi completes. So if you are broadcasting the tlbi, the dsb needs to be to ish.
If you're doing local invalidation, then nsh is sufficient.
"nosync" is just saying that we will not wait for the tlbi to complete. You
still need to issue the leading dsb to ensure that the table walkers see the
latest pte once the tlbi (eventually) completes.
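To make that concrete, a minimal sketch of the local variant's ordering
(illustrative only; __set_pte_nosync() stands in for whatever writes the
pte without its own barrier):

	__set_pte_nosync(ptep, pte);	/* publish the new pte */
	dsb(nshst);			/* order the pte store before the tlbi */
	addr = __TLBI_VADDR(uaddr, ASID(mm));
	__tlbi(vale1, addr);		/* local (non-broadcast) invalidation */
	/*
	 * A "nosync" variant stops here; a sync variant adds a trailing
	 * dsb(nsh) to wait for the tlbi to complete. The broadcast
	 * equivalent needs dsb(ishst), vale1is and dsb(ish).
	 */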
> However, since set_ptes already includes a
> dsb(ishst) in __set_pte_complete(), does this mean we’re being overly
> cautious in __flush_tlb_page_nosync() in many cases?
We only issue a dsb in __set_pte_complete() for kernel ptes. We elide it for
user ptes because we can safely take a fault (for the case where we transition
invalid->valid) for user mappings and that race will resolve itself with the
help of the PTL. For valid->valid or valid->invalid, there will be an associated
tlb flush, which has the barrier.
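From memory, that logic is roughly (a simplified sketch, not a verbatim
quote of the current code):

	if (pte_valid_not_user(pte)) {
		/* kernel mapping: we can't rely on taking a fault */
		dsb(ishst);
		isb();
	}
	/* user ptes: no barrier; a spurious fault resolves under the PTL */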
>
> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> unsigned long uaddr)
> {
> unsigned long addr;
>
> dsb(ishst);
> addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
> (uaddr & PAGE_MASK) +
> PAGE_SIZE);
> }
>
> On the other hand, __ptep_set_access_flags() doesn’t seem to use
> set_ptes(), so there’s no guarantee the updated PTEs are visible to all
> cores. If a remote CPU later encounters a page fault and performs a TLB
> invalidation, will it still see a stable PTE?
Yes, because the reads and writes are done under the PTL; that synchronizes the
memory for us.
You were discussing the potential value of upgrading the leading dsb from nshst
to ishst. IMHO that's neither required nor desirable - the memory
synchronization is handled by the PTL. Overall, this optimization relies on the
premise that synchronizing with remote CPUs is expensive and races are rare, so
we should keep everything local for as long as possible and not worry about
micro-optimizing the efficiency of the race case.
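For reference, the losing CPU's side of the race looks roughly like this
(heavily simplified from the generic fault path):

	/* remote CPU takes a spurious write fault on a stale RO entry */
	spin_lock(ptl);			/* acquire orders the pte read */
	entry = ptep_get(ptep);		/* observes the updated, writable pte */
	if (pte_write(entry))
		/* pte already correct: just drop the stale local TLB entry */
		flush_tlb_fix_spurious_fault(vma, address, ptep);
	spin_unlock(ptl);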
>
>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>> + __tlbi(vale1, addr);
>> + __tlbi_user(vale1, addr);
>> +}
>> +
>> +static inline void local_flush_tlb_page_nonotify(
>> + struct vm_area_struct *vma, unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + dsb(nsh);
>> +}
>> +
>> +static inline void local_flush_tlb_page(struct vm_area_struct *vma,
>> + unsigned long uaddr)
>> +{
>> + __local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
>> + mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
>> + (uaddr & PAGE_MASK) + PAGE_SIZE);
>> + dsb(nsh);
>> +}
>> +
>> static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>> unsigned long uaddr)
>> {
>> @@ -472,6 +511,23 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>> dsb(ish);
>> }
>>
>
> We already have functions like
> __flush_tlb_page_nosync() and __flush_tlb_range_nosync().
> Is there a way to factor out or extract their common parts?
>
> Is it because of the differences in barriers that this extraction of
> common code isn’t feasible?
I've proposed re-working these functions to add independent flags for
sync/nosync, local/broadcast and notify/nonotify (see the sketch below). I
think that will clean it up quite a bit, but I was going to wait for this to
land first. Also, Will has an RFC for some other tlbflush API cleanup
(converting it to C functions), so we might want to wait for or incorporate
that too.
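Roughly the shape I have in mind (names purely invented for illustration):

	#define TLBF_SYNC	BIT(0)	/* trailing dsb: wait for completion */
	#define TLBF_BROADCAST	BIT(1)	/* *is tlbi + ish-domain barriers */
	#define TLBF_NOTIFY	BIT(2)	/* call the mmu notifiers */

	static inline void __flush_tlb_page_flags(struct mm_struct *mm,
						  unsigned long uaddr,
						  unsigned long flags);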
Thanks,
Ryan
>
> Thanks
> Barry