Message-ID: <ed837e01-043b-e19b-293c-30d44df6f3a8@linux.intel.com>
Date: Fri, 10 Jul 2020 20:19:24 +0800
From: "Li, Aubrey" <aubrey.li@...ux.intel.com>
To: Vineeth Remanan Pillai <vpillai@...italocean.com>,
Nishanth Aravamudan <naravamudan@...italocean.com>,
Julien Desfossez <jdesfossez@...italocean.com>,
Peter Zijlstra <peterz@...radead.org>,
Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org
Cc: "Joel Fernandes (Google)" <joel@...lfernandes.org>,
linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lwe@...il.com>,
Aubrey Li <aubrey.intel@...il.com>,
Valentin Schneider <valentin.schneider@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Joel Fernandes <joelaf@...gle.com>, vineethrp@...il.com,
Chen Yu <yu.c.chen@...el.com>,
Christian Brauner <christian.brauner@...ntu.com>,
Tim Chen <tim.c.chen@...el.com>,
"Paul E . McKenney" <paulmck@...nel.org>
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of
IRQ and softirq

Hi Joel/Vineeth,

On 2020/7/1 5:32, Vineeth Remanan Pillai wrote:
> From: "Joel Fernandes (Google)" <joel@...lfernandes.org>
>
> With the current core scheduling patchset, non-threaded IRQ and softirq
> victims can leak data from their hyperthread to a sibling hyperthread
> running an attacker.
>
> For MDS, it is possible for the IRQ and softirq handlers to leak data to
> either host or guest attackers. For L1TF, it is possible to leak to
> guest attackers. There is no possible mitigation involving flushing of
> buffers to avoid this, since the attacker and victim execute
> concurrently on 2 or more HTs.
>
> The solution in this patch is to monitor the outer-most core-wide
> irq_enter() and irq_exit() executed by any sibling. In between these
> two, we mark the core to be in a special core-wide IRQ state.
>
> On IRQ entry, if we detect that a sibling is running untrusted code, we
> send it a reschedule IPI so that the sibling transitions through its
> irq_exit() and does any waiting there, until the IRQ being protected
> finishes.
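
For reference, here is a minimal sketch of what the entry-side tracking
described above could look like. This is not the actual patch code; the
names core_irq_state, core_irq, core_irq_enter() and
sibling_running_untrusted() are illustrative placeholders, and per-CPU
irq nesting is deliberately simplified away:

  #include <linux/percpu.h>
  #include <linux/smp.h>
  #include <linux/spinlock.h>
  #include <linux/topology.h>

  /* One instance shared by all SMT siblings of a core (illustrative). */
  struct core_irq_state {
          raw_spinlock_t  lock;
          unsigned int    irq_count;      /* siblings currently in (soft)irq */
  };

  static DEFINE_PER_CPU(struct core_irq_state *, core_irq);

  /* Placeholder, not from the patch: is the task on @cpu untrusted? */
  static bool sibling_running_untrusted(int cpu);

  /* Called from the outermost per-CPU irq_enter() (simplified). */
  static void core_irq_enter(void)
  {
          struct core_irq_state *state = this_cpu_read(core_irq);
          int cpu, this_cpu = smp_processor_id();

          raw_spin_lock(&state->lock);
          if (state->irq_count++ == 0) {
                  /*
                   * Outermost core-wide entry: kick siblings running
                   * untrusted code so they go through irq_exit() and
                   * wait there until the protected IRQ is done.
                   */
                  for_each_cpu(cpu, cpu_smt_mask(this_cpu)) {
                          if (cpu != this_cpu && sibling_running_untrusted(cpu))
                                  smp_send_reschedule(cpu);
                  }
          }
          raw_spin_unlock(&state->lock);
  }
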
>
> We also monitor the per-CPU outer-most irq_exit(). If, during the per-CPU
> outer-most irq_exit(), the core is still in the special core-wide IRQ
> state, we perform a busy-wait until the core exits this state. This
> combination of per-CPU and core-wide IRQ states helps to handle any
> combination of irq_enter()s and irq_exit()s happening on all of the
> siblings of the core, in any order.
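
The matching exit side, again only a sketch built on the hypothetical
definitions above (the real code would additionally skip the wait when
this CPU is about to run trusted code anyway):

  /* Called from the outermost per-CPU irq_exit() (simplified sketch). */
  static void core_irq_exit(void)
  {
          struct core_irq_state *state = this_cpu_read(core_irq);
          bool wait;

          raw_spin_lock(&state->lock);
          /* Is some sibling still inside a (soft)irq after we leave? */
          wait = --state->irq_count != 0;
          raw_spin_unlock(&state->lock);

          /* Busy-wait outside the lock until the core leaves the IRQ state. */
          while (wait && READ_ONCE(state->irq_count))
                  cpu_relax();
  }
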
>
> Lastly, we also check in the schedule loop if we are about to schedule
> an untrusted process while the core is in such a state. This is possible
> if a trusted thread enters the scheduler by yielding the CPU. That path
> involves no transition through the irq_exit() point to do any waiting,
> so we have to do the waiting explicitly there.
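
A corresponding sketch for the scheduler path, with the same caveat that
is_task_untrusted() and sched_core_irq_wait() are invented names used
only to illustrate the idea:

  #include <linux/sched.h>

  /* Placeholder, not from the patch: is @next an untrusted task? */
  static bool is_task_untrusted(struct task_struct *p);

  /*
   * Would be called in the scheduler just before switching to @next:
   * if @next is untrusted and the core is still in the core-wide IRQ
   * state, no irq_exit() will do the waiting for us, so wait here.
   */
  static void sched_core_irq_wait(struct task_struct *next)
  {
          struct core_irq_state *state = this_cpu_read(core_irq);

          if (!is_task_untrusted(next))
                  return;

          while (READ_ONCE(state->irq_count))
                  cpu_relax();
  }
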
>
> Every attempt is made to avoid unnecessary busy-waiting, and in testing
> on real-world ChromeOS usecases, it has not shown a performance drop.
> In ChromeOS, with this and the rest of the core scheduling patchset, we
> see around a 300% improvement in key press latencies into Google docs
> when Camera streaming is running simultaneously (90th percentile latency
> of ~150ms drops to ~50ms).
>
> This feature is controlled by the build time config option
> CONFIG_SCHED_CORE_IRQ_PAUSE and is enabled by default. There is also a
> kernel boot parameter 'sched_core_irq_pause' to enable/disable the
> feature at boot time. Default is enabled at boot time.
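
For illustration only (the exact wiring in the patch may differ), a boot
parameter like this is typically hooked up along these lines:

  #include <linux/init.h>
  #include <linux/kernel.h>

  static bool sched_core_irq_pause_enabled = true;        /* default: on */

  /* Parse "sched_core_irq_pause=0|1" from the kernel command line. */
  static int __init set_sched_core_irq_pause(char *str)
  {
          return kstrtobool(str, &sched_core_irq_pause_enabled) == 0;
  }
  __setup("sched_core_irq_pause=", set_sched_core_irq_pause);
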
We saw a lot of soft lockups on the screen when we tested v6.
[ 186.527883] watchdog: BUG: soft lockup - CPU#86 stuck for 22s! [uperf:5551]
[ 186.535884] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [uperf:5444]
[ 186.555883] watchdog: BUG: soft lockup - CPU#89 stuck for 22s! [uperf:5547]
[ 187.547884] rcu: INFO: rcu_sched self-detected stall on CPU
[ 187.553760] rcu: 40-....: (14997 ticks this GP) idle=49a/1/0x4000000000000002 softirq=1711/1711 fqs=7279
[ 187.564685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 14
[ 187.564723] NMI watchdog: Watchdog detected hard LOCKUP on cpu 38
The problem goes away when we revert this patch. We are running multiple
uperf threads (one per CPU) in a cgroup with coresched enabled.
This is 100% reproducible on our side.
Just wondering if anything is already known before we dig into it.
Thanks,
-Aubrey