linux-kernel - Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from tlbbatch code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+i-1C16x5u-1qAqDPSONgs+pGWrfUTO1zq2r6Rrvq=q48NHpw@mail.gmail.com>
Date: Tue, 11 Feb 2025 11:02:27 +0100
From: Brendan Jackman <jackmanb@...gle.com>
To: Rik van Riel <riel@...riel.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, bp@...en8.de, 
	peterz@...radead.org, dave.hansen@...ux.intel.com, zhengqi.arch@...edance.com, 
	nadav.amit@...il.com, thomas.lendacky@....com, kernel-team@...a.com, 
	linux-mm@...ck.org, akpm@...ux-foundation.org, jannh@...gle.com, 
	mhklinux@...look.com, andrew.cooper3@...rix.com, 
	Manali Shukla <Manali.Shukla@....com>
Subject: Re: [PATCH v9 10/12] x86/mm: do targeted broadcast flushing from
 tlbbatch code

On Tue, 11 Feb 2025 at 04:50, Rik van Riel <riel@...riel.com> wrote:
>
> On Mon, 2025-02-10 at 16:27 +0100, Brendan Jackman wrote:
> > On Thu, 6 Feb 2025 at 05:46, Rik van Riel <riel@...riel.com> wrote:
> > >  /* Wait for INVLPGB originated by this CPU to complete. */
> > > -static inline void tlbsync(void)
> > > +static inline void __tlbsync(void)
> > >  {
> > > -       cant_migrate();
> >
> > Why does this have to go away?
>
> I'm not sure the current task in sched_init() has
> all the correct bits set to prevent the warning
> from firing, but on the flip side it won't have
> called INVLPGB yet at that point, so the call to
> enter_lazy_tlb() won't actually end up here.
>
> I'll put it back.

Sounds good.

FWIW I think if we do run into early-boot code hitting false
DEBUG_ATOMIC_SLEEP warnings, the best response might be to update the
DEBUG_ATOMIC_SLEEP code. Like maybe there's a more targeted solution
but something roughly equivalent to checking if (system_state ==
SYSTEM_STATE_SCHEDULING) before the warning.

> > > @@ -794,6 +825,8 @@ void switch_mm_irqs_off(struct mm_struct
> > > *unused, struct mm_struct *next,
> > >         if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> > >                 WARN_ON_ONCE(!irqs_disabled());
> > >
> > > +       tlbsync();
> > > +
> > >         /*
> > >          * Verify that CR3 is what we think it is.  This will catch
> > >          * hypothetical buggy code that directly switches to
> > > swapper_pg_dir
> > > @@ -973,6 +1006,8 @@ void switch_mm_irqs_off(struct mm_struct
> > > *unused, struct mm_struct *next,
> > >   */
> > >  void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> > >  {
> > > +       tlbsync();
> > > +
> >
> > I have a feeling I'll look stupid for asking this, but why do we need
> > this and the one in switch_mm_irqs_off()?
>
> This is an architectural thing: TLBSYNC waits for
> the INVLPGB flushes to finish that were issued
> from the same CPU.
>
> That means if we have pending flushes (from the
> pageout code), we need to wait for them at context
> switch time, before the task could potentially be
> migrated to another CPU.

Oh right thanks, that makes sense.

So I think here we're encoding the assumption that context_switch()
always calls either enter_lazy_tlb() or switch_mm_irqs_off(), which is
a little awkward, plus the job of these functions is already kinda
hazy and this makes it even hazier. What about doing it in
arch_start_context_switch() instead?

That would mean a bit of plumbing since we'd still wanna have the
tlbsync() in tlb.c, but that seems worth it to me. Plus, having it in
one place would give us a spot to add a comment. Now that you point it
out it does indeed seem obvious but it didn't seem so yesterday.

Now I think about it... if we always tlbsync() before a context
switch, is the cant_migrate() above actually required? I think with
that, even if we migrated in the middle of e.g.
broadcast_kernel_range_flush(), we'd be fine? (At least, from the
specific perspective of the invplgb code, presumably having preemption
on there would break things horribly in other ways).