linux-kernel - Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <99ef5eec8502a7b53eee362063b9b2252a5a47da.camel@amazon.com>
Date:   Sun, 22 Mar 2020 05:08:20 +0000
From:   "Herrenschmidt, Benjamin" <benh@...zon.com>
To:     "tglx@...utronix.de" <tglx@...utronix.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "Singh, Balbir" <sblbir@...zon.com>
CC:     "keescook@...omium.org" <keescook@...omium.org>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> Balbir,
> 
> "Singh, Balbir" <sblbir@...zon.com> writes:
> > On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> > > What's the point? The attack surface is the L1D content of the scheduled
> > > out task. If the malicious task schedules out, then why would you care?
> > > 
> > > I might be missing something, but AFAICT this is beyond paranoia.
> > > 
> > 
> > I think there are two cases
> > 
> > 1. Task with important data schedules out
> > 2. Malicious task schedules in
> > 
> > These patches address 1, but call out case #2
> 
> The point is if the victim task schedules out, then there is no reason
> to flush L1D immediately in context switch. If that just schedules a
> kernel thread and then goes back to the task, then there is no point
> unless you do not even trust the kernel thread.

A switch to a kernel thread will not call switch_mm, will it ? At least it used not to...

> > > > 3. There is a fallback software L1D load, similar to what L1TF does, but
> > > >     we don't prefetch the TLB, is that sufficient?
> > > 
> > > If we go there, then the KVM L1D flush code wants to move into general
> > > x86 code.
> > 
> > OK.. we can definitely consider reusing code, but I think the KVM bits require
> > tlb prefetching, IIUC before cache flush to negate any bad translations
> > associated with an L1TF fault, but the code/comments are not clear on the need
> > to do so.
> 
> I forgot the gory details by now, but having two entry points or a
> conditional and share the rest (page allocation etc.) is definitely
> better than two slightly different implementation which basically do the same thing.
> 
> > > > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > > > +{
> > > > +     struct page *page;
> > > > +
> > > > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > > > +             goto done;
> > > > +
> > > > +     mutex_lock(&l1d_flush_mutex);
> > > > +     if (l1d_flush_pages)
> > > > +             goto done;
> > > > +     /*
> > > > +      * These pages are never freed, we use the same
> > > > +      * set of pages across multiple processes/contexts
> > > > +      */
> > > > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > > > +     if (!page)
> > > > +             return;
> > > > +
> > > > +     l1d_flush_pages = page_address(page);
> > > > +     /* I don't think we need to worry about KSM */
> > > 
> > > Why not? Even if it wouldn't be necessary why would we care as this is a
> > > once per boot operation in fully preemptible code.
> > 
> > Not sure I understand your question, I was stating that even if KSM was
> > running, it would not impact us (with dedup), as we'd still be writing out 0s
> > to the cache line in the fallback case.
> 
> I probably confused myself vs. the comment in the VMX code, but that
> mentions nested virt. Needs at least some consideration when we reuse
> that code.
> 
> > > >   void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > >               struct task_struct *tsk)
> > > >   {
> > > > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > > > mm_struct *next,
> > > >                trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > > >        }
> > > > 
> > > > +     l1d_flush(next, tsk);
> > > 
> > > This is really the wrong place. You want to do that:
> > > 
> > >    1) Just before return to user space
> > >    2) When entering a guest
> > > 
> > > and only when the previously running user space task was the one which
> > > requested this massive protection.
> > > 
> > 
> > Cases 1 and 2 are handled via
> > 
> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
> 
> How so? SWAPGS mitigation does not flush L1D. It merily serializes SWAPGS.

> > 2. L1TF fault handling
> > 
> > This mechanism allows for flushing not restricted to 1 or 2, the idea is to
> > immediately flush L1D for paranoid processes on mm switch.
> 
> Why? To protect the victim task against the malicious kernel?

Mostly malicious other tasks for us. As I said, I don't think switch_mm
is called on switching to a kernel thread and is definitely a colder
path than the return to userspace, so it felt like the right place to
put this, but I don't mind if you prefer it elsewhere as long as it
does the job which is to prevent task B to snoop task A data.

> The L1D content of the victim is endangered in the following case:
> 
>     victim out -> attacker in
> 
> The attacker can either run in user space or in guest mode. So the flush
> is only interesting when the attacker actually goes back to user space
> or reenters the guest.
> 
> The following is completely uninteresting:
> 
>     victim out -> kernel thread in/out -> victim in

Sure but will that call switch_mm to be called ?

> Even this is uninteresting:
> 
>     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>     attacker out -> victim in

I don't get this ... how do you get attacker_in without victim_out
first ? In which case you have a victim_out -> attacker_in transition
which is what we are trying to protect.

I still think flushing the "high value" process L1D on switch_mm out is
the way to go here...

> So the point where you want to flush conditionally is when the attacker
> leaves kernel space either to user mode or guest mode.
> 
> So if the victim schedules out it sets a per cpu request to flush L1D
> on the borders.
> 
> And then you have on return to user:
> 
>     if (this_cpu_flush_l1d())
>         flush_l1d()
> 
> and in kvm:
> 
>     if (this_cpu_flush_l1d() || L1TF_flush_L1D)
>         flush_l1d()
> 
> The request does:
> 
>     if (!this_cpu_read(l1d_flush_for_task))
>         this_cpu_write(l1d_flush_for_task, current)
> 
> The check does:
> 
>     p = this_cpu_read(l1d_flush_for_task);
>     if (p) {
>         this_cpu_write(l1d_flush_for_task, NULL);
>         return p != current;
>     }
>     return false;
> 
> Hmm?
> 
> Thanks,
> 
>         tglx
> 
>