Message-ID: <97b2bffc16257e70b8aa98ee86622dc4178154c4.camel@amazon.com>
Date: Fri, 20 Mar 2020 01:37:58 +0000
From: "Singh, Balbir" <sblbir@...zon.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "keescook@...omium.org" <keescook@...omium.org>,
"Herrenschmidt, Benjamin" <benh@...zon.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch
On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
>
> Balbir,
>
> Balbir Singh <sblbir@...zon.com> writes:
> > This patch is an RFC/PoC to start the discussion on optionally flushing
> > L1D cache. The goal is to allow tasks that are paranoid due to the recent
> > snoop assisted data sampling vulnerabilities, to flush their L1D on being
> > switched out. This protects their data from being snooped or leaked via
> > side channels after the task has context switched out.
>
> What you mean is that L1D is flushed before another task which does not
> belong to the same trust zone returns to user space or enters guest
> mode.
>
> > There are two scenarios we might want to protect against: a task leaving
> > the CPU with data still in L1D (which is the main concern of this
> > patch), and a malicious, less trusted task coming in, for which we want
> > to clean up the cache before it starts execution. The latter was
> > proposed by benh and is not currently
> > addressed by this patch, but can be easily accommodated by the same
> > mechanism.
>
> What's the point? The attack surface is the L1D content of the scheduled
> out task. If the malicious task schedules out, then why would you care?
>
> I might be missing something, but AFAICT this is beyond paranoia.
>
I think there are two cases:
1. A task with sensitive data schedules out
2. A malicious task schedules in
This patch addresses case 1; case 2 is only called out, not yet handled.
> > The points of discussion/review are:
> >
> > 1. Discuss the use case and the right approach to address this
>
> It might be the quick fix for the next not yet known variant of L1D
> based horrors, so yes, it's at least worth discussing.
>
Yes, that is the broader thinking: there may be yet-to-be-discovered attack
vectors, and having something like this in place provides a reasonable
workaround.
> > 2. Does an arch prctl allowing for opt-in flushing make sense, would other
> > arches care about something similar?
>
> No idea, but I assume in the light of L1D based issues that might be the
> case. Though that still is per architecture as the L1D flush mechanisms
> are architecture specific. Aside of that I don't think that more than a
> few will enable / support it.
>
Fair enough
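For concreteness, the opt-in from a paranoid userspace task could look
roughly like the sketch below; the sub-command name and value are just
placeholders, not something the patch defines today:

/*
 * Hypothetical userspace opt-in; ARCH_SET_L1D_FLUSH and its value are
 * placeholders for whatever the final arch_prctl() sub-command becomes.
 */
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_SET_L1D_FLUSH	0x1021	/* placeholder */

int main(void)
{
	/* ask the kernel to flush L1D whenever this task is switched out */
	if (syscall(SYS_arch_prctl, ARCH_SET_L1D_FLUSH, 1UL) < 0)
		return 1;

	/* ... work with sensitive data ... */
	return 0;
}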
> > 3. There is a fallback software L1D load, similar to what L1TF does, but
> > we don't prefetch the TLB, is that sufficient?
>
> If we go there, then the KVM L1D flush code wants to move into general
> x86 code.
OK, we can definitely consider reusing that code, but IIUC the KVM bits
require TLB prefetching before the cache fill, to negate any bad translations
associated with an L1TF fault; the code/comments are not entirely clear on
the need to do so.
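For reference, my reading of that two-pass pattern (TLB touch first, then
the fill) is sketched below in C; this is just how I understand the VMX
software flush, not the literal code, and in the context-switch case we
would skip the first pass:

/*
 * Sketch: pass 1 touches one byte per 4K page so the translations are
 * resident in the TLB, pass 2 then reads every 64-byte line to displace
 * the old L1D contents.
 */
static void l1d_flush_sw(void *flush_buf, int size)
{
	volatile char *p = flush_buf;
	int i;

	/* pass 1: populate the TLB for the flush buffer */
	for (i = 0; i < size; i += PAGE_SIZE)
		(void)p[i];

	/* pass 2: fill L1D from the flush buffer */
	for (i = 0; i < size; i += 64)
		(void)p[i];
}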
>
> > 4. The atomics can be improved and we could use a static key like ibpb
> > does to optimize the code path
>
> Yes to static key.
Sure, will do
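Something along the lines of the sketch below, mirroring how the IBPB code
gates its path; the key name is made up for illustration:

/* Patched-out branch until the first task opts in. */
DEFINE_STATIC_KEY_FALSE(switch_mm_l1d_flush);

void enable_l1d_flush_for_task(struct task_struct *tsk)
{
	set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
	static_branch_enable(&switch_mm_l1d_flush);
}

/* ... and in switch_mm_irqs_off(): */
	if (static_branch_unlikely(&switch_mm_l1d_flush))
		l1d_flush(next, tsk);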
>
> > 5. The code works with a special hack for 64 bit systems (TIF_L1D_FLUSH
> > is bit 32), we could generalize it with some effort
>
> Why so? There are gaps in the TIF flags (18, 23, 26). Why do you want to
> push that to 32?
I must have missed the holes; I can move to using one of them, thanks.
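Something like the sketch below, assuming bit 18 really is free (I'll
double-check against thread_info.h):

/* Sketch: reuse a free TIF slot instead of bit 32 */
#define TIF_L1D_FLUSH		18	/* flush L1D on context switch out */
#define _TIF_L1D_FLUSH		(1 << TIF_L1D_FLUSH)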
>
> > 6. Should we consider cleaning up the L1D on arrival of tasks?
>
> That's the Ben idea, right? If so see above.
Yes
>
> > +#define L1D_CACHE_ORDER 4
> > +static void *l1d_flush_pages;
> > +static DEFINE_MUTEX(l1d_flush_mutex);
> > +
> > +void enable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > + struct page *page;
> > +
> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
> > + goto done;
> > +
> > + mutex_lock(&l1d_flush_mutex);
> > + if (l1d_flush_pages)
> > + goto done;
> > + /*
> > + * These pages are never freed, we use the same
> > + * set of pages across multiple processes/contexts
> > + */
> > + page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
> > + if (!page)
> > + return;
> > +
> > + l1d_flush_pages = page_address(page);
> > + /* I don't think we need to worry about KSM */
>
> Why not? Even if it wouldn't be necessary why would we care as this is a
> once per boot operation in fully preemptible code.
Not sure I understand the question. I was stating that even if KSM were
running, it would not impact us (with dedup), as we'd still be writing 0s
into the cache lines in the fallback case.
>
> Aside of that why do we need another pointlessly different copy of what
> we have in the VMX code?
>
> > +done:
> > + set_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > + mutex_unlock(&l1d_flush_mutex);
> > +}
> > +
> > +void disable_l1d_flush_for_task(struct task_struct *tsk)
> > +{
> > + clear_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH);
> > + smp_mb__after_atomic();
>
> Lacks an explanation / comment what this smp_mb is for and where the
> counterpart lives.
>
> Aside of that, this is pointless AFAICT. Disable/enable is really not a
> concern of being perfect. If you want perfect protection, simply switch
> off your computer.
Yes, the counterpart is the smp_rmb() in l1d_flush(), but the comment there
does point out that the check can and will be lazy, so we can safely remove
this barrier if needed.
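If we do keep the barriers, the comment I should have written above the
smp_mb__after_atomic() would be something like:

	/*
	 * Pairs with the smp_rmb() in l1d_flush(): make sure the
	 * TIF_L1D_FLUSH clear is visible before the next context switch
	 * reads the flag.  The check is deliberately lazy, so a stale
	 * value is harmless; this is belt-and-braces only.
	 */
	smp_mb__after_atomic();

but given it can be lazy anyway, dropping both barriers is fine.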
>
> > +/*
> > + * Flush the L1D cache for this CPU. We want to do this at switch mm time,
> > + * this is a pessimistic security measure and an opt-in for those tasks
> > + * that host sensitive information and there are concerns about spills
> > + * from fill buffers.
> > + */
> > +static void l1d_flush(struct mm_struct *next, struct task_struct *tsk)
> > +{
> > + struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > + int size = PAGE_SIZE << L1D_CACHE_ORDER;
> > +
> > + if (this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush) == 0)
> > + goto check_next;
> > +
> > + if (real_prev == next)
> > + return;
> > +
> > + if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > + goto done;
> > + }
> > +
> > + asm volatile(
> > + /* Fill the cache */
> > + "xorl %%eax, %%eax\n"
> > + ".Lfill_cache:\n"
> > + "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
> > + "addl $64, %%eax\n\t"
> > + "cmpl %%eax, %[size]\n\t"
> > + "jne .Lfill_cache\n\t"
> > + "lfence\n"
> > + :: [flush_pages] "r" (l1d_flush_pages),
> > + [size] "r" (size)
> > + : "eax", "ecx");
>
> Yet moar copied code slightly different for no reason.
>
It's different because here we specifically do not want to prefetch the TLB.
> > +
> > +done:
> > + this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
> > + /* Make sure we clear the values before we set it again */
> > + barrier();
> > +check_next:
> > + if (tsk == NULL)
> > + return;
> > +
> > + /* Match the set/clear_bit barriers */
> > + smp_rmb();
>
> What for again?
Yes, it's not needed (the comment below sort of says that); I can remove it.
>
> > + /* We don't need stringent checks as we opt-in/opt-out */
> > + if (test_ti_thread_flag(&tsk->thread_info, TIF_L1D_FLUSH))
> > + this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);
> > +}
> > +
> > void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > struct task_struct *tsk)
> > {
> > @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> > mm_struct *next,
> > trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
> > }
> >
> > + l1d_flush(next, tsk);
>
> This is really the wrong place. You want to do that:
>
> 1) Just before return to user space
> 2) When entering a guest
>
> and only when the previously running user space task was the one which
> requested this massive protection.
>
Cases 1 and 2 are already handled via
1. The SWAPGS fixes/workarounds (unless I misunderstood your suggestion)
2. The L1TF fault handling
This mechanism allows flushing that is not restricted to 1 or 2; the idea is
to flush L1D immediately on mm switch for paranoid processes.
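That said, if we move to the shape you suggest, a rough sketch could be as
below; the hook placement and names (e.g. l1d_flush_sw()) are illustrative
only, and the real entry/KVM paths would need care:

/*
 * Sketch of the deferred variant: a protected task leaves a "flush owed"
 * mark on the CPU when it switches out, and the flush is paid on the
 * next return to user space (or before VM entry).
 */

/* in switch_mm_irqs_off(): if the outgoing task opted in, mark the CPU */
	if (test_ti_thread_flag(&current->thread_info, TIF_L1D_FLUSH))
		this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 1);

/* on the return-to-user path / before VM entry: perform the flush */
	if (unlikely(this_cpu_read(cpu_tlbstate.last_user_mm_l1d_flush))) {
		if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
			wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
		else
			l1d_flush_sw(l1d_flush_pages, PAGE_SIZE << L1D_CACHE_ORDER);
		this_cpu_write(cpu_tlbstate.last_user_mm_l1d_flush, 0);
	}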
> >
> While it's worth discussing, I'm not yet convinced that this is worth
> the trouble.
>
Are you suggesting the mechanism is not worth building?
Balbir Singh.