linux-kernel - Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <034a2c0e2cc1bb0f4f7ff9a2c5cbdc269a483a71.camel@amazon.com>
Date:   Sat, 21 Mar 2020 01:42:34 +0000
From:   "Singh, Balbir" <sblbir@...zon.com>
To:     "tglx@...utronix.de" <tglx@...utronix.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     "keescook@...omium.org" <keescook@...omium.org>,
        "Herrenschmidt, Benjamin" <benh@...zon.com>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
> 
> Balbir,
> 
> "Singh, Balbir" <sblbir@...zon.com> writes:
> > On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
> > > What's the point? The attack surface is the L1D content of the scheduled
> > > out task. If the malicious task schedules out, then why would you care?
> > > 
> > > I might be missing something, but AFAICT this is beyond paranoia.
> > > 
> > 
> > I think there are two cases
> > 
> > 1. Task with important data schedules out
> > 2. Malicious task schedules in
> > 
> > These patches address 1, but call out case #2
> 
> The point is if the victim task schedules out, then there is no reason
> to flush L1D immediately in context switch. If that just schedules a
> kernel thread and then goes back to the task, then there is no point
> unless you do not even trust the kernel thread.

Yes, the implementation tries to do that by tracking switch_mm(). 

> 
> > > > 3. There is a fallback software L1D load, similar to what L1TF does,
> > > > but
> > > >    we don't prefetch the TLB, is that sufficient?
> > > 
> > > If we go there, then the KVM L1D flush code wants to move into general
> > > x86 code.
> > 
> > OK.. we can definitely consider reusing code, but I think the KVM bits
> > require
> > tlb prefetching, IIUC before cache flush to negate any bad translations
> > associated with an L1TF fault, but the code/comments are not clear on the
> > need
> > to do so.
> 
> I forgot the gory details by now, but having two entry points or a
> conditional and share the rest (page allocation etc.) is definitely
> better than two slightly different implementation which basically do the
> same thing.

OK, I can try and dedup them to the extent possible, but please do remember
that 

1. KVM is usually loaded as a module
2. KVM is optional

We can share code, by putting the common bits in the core kernel.

> 
> > > > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > > > +{
> > > > +     struct page *page;
> > > > +
> > > > +     if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > > > +             goto done;
> > > > +
> > > > +     mutex_lock(&l1d_flush_mutex);
> > > > +     if (l1d_flush_pages)
> > > > +             goto done;
> > > > +     /*
> > > > +      * These pages are never freed, we use the same
> > > > +      * set of pages across multiple processes/contexts
> > > > +      */
> > > > +     page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > > > +     if (!page)
> > > > +             return;
> > > > +
> > > > +     l1d_flush_pages = page_address(page);
> > > > +     /* I don't think we need to worry about KSM */
> > > 
> > > Why not? Even if it wouldn't be necessary why would we care as this is a
> > > once per boot operation in fully preemptible code.
> > 
> > Not sure I understand your question, I was stating that even if KSM was
> > running, it would not impact us (with dedup), as we'd still be writing out
> > 0s
> > to the cache line in the fallback case.
> 
> I probably confused myself vs. the comment in the VMX code, but that
> mentions nested virt. Needs at least some consideration when we reuse
> that code.

Yes, definitely!

> 
> > > >  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > >              struct task_struct *tsk)
> > > >  {
> > > > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev,
> > > > struct
> > > > mm_struct *next,
> > > >               trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > > >       }
> > > > 
> > > > +     l1d_flush(next, tsk);
> > > 
> > > This is really the wrong place. You want to do that:
> > > 
> > >   1) Just before return to user space
> > >   2) When entering a guest
> > > 
> > > and only when the previously running user space task was the one which
> > > requested this massive protection.
> > > 
> > 
> > Cases 1 and 2 are handled via
> > 
> > 1. SWAPGS fixes/work arounds (unless I misunderstood your suggestion)
> 
> How so? SWAPGS mitigation does not flush L1D. It merily serializes SWAPGS.

Sorry, my bad, I was thinking MDS_CLEAR (via verw), which does flush out
things, which I suspect should be sufficient from a return to user/signal
handling, etc perspective. Right now, reading through 
https://software.intel.com/security-software-guidance/insights/deep-dive-snoop-assisted-l1-data-sampling
, it does seem like we need this during a context switch, specifically since a
dirty cache line can cause snooped reads for the attacker to leak data. Am I
missing anything?

> 
> > 2. L1TF fault handling
> > 
> > This mechanism allows for flushing not restricted to 1 or 2, the idea is
> > to
> > immediately flush L1D for paranoid processes on mm switch.
> 
> Why? To protect the victim task against the malicious kernel?
> 
> The L1D content of the victim is endangered in the following case:
> 
>     victim out -> attacker in
> 
> The attacker can either run in user space or in guest mode. So the flush
> is only interesting when the attacker actually goes back to user space
> or reenters the guest.
> 
> The following is completely uninteresting:
> 
>     victim out -> kernel thread in/out -> victim in
> 
> Even this is uninteresting:
> 
>     victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>     attacker out -> victim in
> 

Not from what I understand from the link above, the attack is a function of
what can be snooped by another core/thread and that is a function of what
modified secrets are in the cache line/store buffer.


> So the point where you want to flush conditionally is when the attacker
> leaves kernel space either to user mode or guest mode.
> 
> So if the victim schedules out it sets a per cpu request to flush L1D
> on the borders.
> 
> And then you have on return to user:
> 
>     if (this_cpu_flush_l1d())
>         flush_l1d()
> 
> and in kvm:
> 
>     if (this_cpu_flush_l1d() || L1TF_flush_L1D)
>         flush_l1d()
> 
> The request does:
> 
>     if (!this_cpu_read(l1d_flush_for_task))
>         this_cpu_write(l1d_flush_for_task, current)
> 
> The check does:
> 
>     p = this_cpu_read(l1d_flush_for_task);
>     if (p) {
>         this_cpu_write(l1d_flush_for_task, NULL);
>         return p != current;
>     }
>     return false;
> 
> Hmm?

On return to user, we already use VERW (verw), but just return to user
protection is not sufficient IMHO. Based on the link above, we need to clear
the L1D cache before it can be snooped.

Thanks for the review,
Balbir Singh