linux-kernel - Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry remains valid

Message-ID: <CAGsJ_4zZdfgiy0k+kK3Gqg5KvuFs8Rx3zQXLhkLk-K-+Hf0grw@mail.gmail.com>
Date: Tue, 3 Jun 2025 13:55:31 +1200
From: Barry Song <21cnbao@...il.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Will Deacon <will@...nel.org>, Catalin Marinas <catalin.marinas@....com>, 
	Mark Rutland <mark.rutland@....com>, Yicong Yang <yangyicong@...ilicon.com>, 
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org, 
	Mel Gorman <mgorman@...e.de>
Subject: Re: [PATCH v1] arm64/mm: Close theoretical race where stale TLB entry
 remains valid

On Tue, Jun 3, 2025 at 2:00 AM Ryan Roberts <ryan.roberts@....com> wrote:
>
> On 02/06/2025 13:00, Will Deacon wrote:
> > On Fri, May 30, 2025 at 04:23:47PM +0100, Ryan Roberts wrote:
> >> Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with
> >> a parallel reclaim leaving stale TLB entries") describes a race that,
> >> prior to the commit, could occur between reclaim and operations such as
> >> mprotect() when using reclaim's tlbbatch mechanism. See that commit for
> >> details but the summary is:
> >>
> >> """
> >> Nadav Amit identified a theoretical race between page reclaim and
> >> mprotect due to TLB flushes being batched outside of the PTL being held.
> >>
> >> He described the race as follows:
> >>
> >>      CPU0                            CPU1
> >>      ----                            ----
> >>                                      user accesses memory using RW PTE
> >>                                      [PTE now cached in TLB]
> >>      try_to_unmap_one()
> >>      ==> ptep_get_and_clear()
> >>      ==> set_tlb_ubc_flush_pending()
> >>                                      mprotect(addr, PROT_READ)
> >>                                      ==> change_pte_range()
> >>                                      ==> [ PTE non-present - no flush ]
> >>
> >>                                      user writes using cached RW PTE
> >>      ...
> >>
> >>      try_to_unmap_flush()
> >> """
> >>
> >> The solution was to insert flush_tlb_batched_pending() in mprotect() and
> >> friends to explicitly drain any pending reclaim TLB flushes. In the
> >> modern version of this solution, arch_flush_tlb_batched_pending() is
> >> called to do that synchronisation.
> >>
> >> arm64's tlbbatch implementation simply issues TLBIs at queue-time
> >> (arch_tlbbatch_add_pending()), eliding the trailing dsb(ish). The
> >> trailing dsb(ish) is finally issued in arch_tlbbatch_flush() at the end
> >> of the batch to wait for all the issued TLBIs to complete.
> >>
> >> Now, the Arm ARM states:
> >>
> >> """
> >> The completion of the TLB maintenance instruction is guaranteed only by
> >> the execution of a DSB by the observer that performed the TLB
> >> maintenance instruction. The execution of a DSB by a different observer
> >> does not have this effect, even if the DSB is known to be executed after
> >> the TLB maintenance instruction is observed by that different observer.
> >> """
> >>
> >> arch_tlbbatch_add_pending() and arch_tlbbatch_flush() conform to this
> >> requirement because they are called from the same task (either kswapd or
> >> caller of madvise(MADV_PAGEOUT)), so either they are on the same CPU or
> >> if the task was migrated, __switch_to() contains an extra dsb(ish).
> >>
> >> HOWEVER, arm64's arch_flush_tlb_batched_pending() is also implemented as
> >> a dsb(ish). But this may be running on a CPU remote from the one that
> >> issued the outstanding TLBIs. So there is no architectural guarantee of
> >> synchronization. Therefore we are still vulnerable to the theoretical
> >> race described in Commit 3ea277194daa ("mm, mprotect: flush TLB if
> >> potentially racing with a parallel reclaim leaving stale TLB entries").
> >>
> >> Fix this by flushing the entire mm in arch_flush_tlb_batched_pending().
> >> This aligns with what the other arches that implement the tlbbatch
> >> feature do.
> >>
> >> Fixes: 43b3dfdd0455 ("arm64: support batched/deferred tlb shootdown during page reclamation/migration")
> >
> > Barry -- it would be great if you could re-run some of the benchmarks
> > from that commit with this fix applied.
>
> Worth rerunning if possible, but I would guess that those benchmarks will still
> show a similar improvement because they are measuring the cost of doing the
> TLB flushing. But with the fix, there is an extra cost that those benchmarks
> probably won't measure: subsequent work within the target mm will have no VAs
> cached in the TLB, so the miss rate will be much higher.

Right, not sure if we have a suitable benchmark to measure the
side effect, but I assume reclamation speed is more important
when we're reclaiming memory.

This flush was originally introduced in commit 3ea277194daae
("mm, mprotect: flush TLB if potentially racing with a parallel
reclaim leaving stale TLB entries").

Cc'ing Mel to see if he has any comments.

>
> >
> >> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
> >> ---
> >>  arch/arm64/include/asm/tlbflush.h | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >> index eba1a98657f1..7d564c2a126f 100644
> >> --- a/arch/arm64/include/asm/tlbflush.h
> >> +++ b/arch/arm64/include/asm/tlbflush.h
> >> @@ -323,13 +323,14 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>  }
> >>
> >>  /*
> >> - * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
> >> - * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
> >> - * flush_tlb_batched_pending().
> >> + * If mprotect/munmap/etc occurs during TLB batched flushing, we need to ensure
> >> + * all the previously issued TLBIs targeting mm have completed. But since we
> >> + * can be executing on a remote CPU, a DSB cannot guarantee this like it can
> >> + * for arch_tlbbatch_flush(). Our only option is to flush the entire mm.
> >>   */
> >>  static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> >>  {
> >> -    dsb(ish);
> >> +    flush_tlb_mm(mm);
> >>  }
> >
> > Thanks, Ryan. I'll pick this as a fix, but perhaps the core code should
> > do this given that all the architectures selecting
> > ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH now have an identical implementation
> > of arch_flush_tlb_batched_pending()?
>
> Ha, yes... in fact it looks like that's what it did prior to commit db6c1f6f236d
> ("mm/tlbbatch: introduce arch_flush_tlb_batched_pending()").

Yep, it was just a flush_tlb_mm(mm) inside flush_tlb_batched_pending().
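
For anyone following along, here is a rough user-space model of that
shape: check whether a batched flush is still outstanding for the mm
and, if so, do a full-mm flush. It is only a sketch, not the kernel
source. The struct, the *_model/*_stub names, the MODEL_* constants and
the counter layout (pending count in the low bits, flushed count in the
high bits) are all assumptions made for illustration, and the stub just
counts calls instead of touching any TLB.

```c
/* User-space model only. Every name below is a hypothetical stand-in. */
#include <assert.h>
#include <stdatomic.h>

/* Assumed layout: low bits hold "pending", high bits hold "flushed". */
#define MODEL_FLUSHED_SHIFT 16
#define MODEL_PENDING_MASK  ((1 << (MODEL_FLUSHED_SHIFT - 1)) - 1)

struct mm_model {
	atomic_int tlb_flush_batched; /* packed pending/flushed counters */
	int full_flushes;             /* number of stubbed full-mm flushes */
};

/* Stand-in for flush_tlb_mm(mm): only records that a flush happened. */
static void flush_tlb_mm_stub(struct mm_model *mm)
{
	mm->full_flushes++;
}

/* Reclaim side: note that a batched (deferred) flush is now pending. */
static void set_flush_pending_model(struct mm_model *mm)
{
	atomic_fetch_add(&mm->tlb_flush_batched, 1);
}

/* mprotect/munmap side: drain any pending batched flush for this mm. */
static void flush_tlb_batched_pending_model(struct mm_model *mm)
{
	int batch = atomic_load(&mm->tlb_flush_batched);
	int pending = batch & MODEL_PENDING_MASK;
	int flushed = batch >> MODEL_FLUSHED_SHIFT;

	if (pending != flushed) {
		int expected = batch;

		flush_tlb_mm_stub(mm);
		/*
		 * Only record the flush if nothing new was queued meanwhile;
		 * otherwise leave the counter alone so the next caller
		 * flushes again.
		 */
		atomic_compare_exchange_strong(&mm->tlb_flush_batched, &expected,
			pending | (pending << MODEL_FLUSHED_SHIFT));
	}
}
```

In this model a call with nothing pending does no flush, a call after
set_flush_pending_model() does exactly one full-mm flush, and a repeat
call is a no-op until something new is queued.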

>
> I'll do that tidy up once this fix appears in mm-unstable.
>
> Thanks,
> Ryan
>
>
> >
> > Will
>

Thanks
Barry