Message-ID: <20190219151532.GA40581@gmail.com>
Date: Tue, 19 Feb 2019 16:15:32 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Paul Turner <pjt@...gle.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
subhra.mazumdar@...cle.com,
Frédéric Weisbecker <fweisbec@...il.com>,
Kees Cook <keescook@...omium.org>, kerrnel@...gle.com
Subject: Re: [RFC][PATCH 00/16] sched: Core scheduling

* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Mon, Feb 18, 2019 at 12:40 PM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > If there were close to no VMEXITs, it beat smt=off, if there were lots
> > of VMEXITs it was far far worse. Supposedly hosting people try their
> > very bestest to have no VMEXITs so it mostly works for them (with the
> > obvious exception of single VCPU guests).
> >
> > It's just that people have been bugging me for this crap; and I figure
> > I'd post it now that it's not exploding anymore and let others have at.
>
> The patches didn't look disgusting to me, but I admittedly just
> scanned through them quickly.
>
> Are there downsides (maintenance and/or performance) when core
> scheduling _isn't_ enabled? I guess if it's not a maintenance or
> performance nightmare when off, it's ok to just give people the
> option.
So this bit is the main straight-line performance impact when the
CONFIG_SCHED_CORE Kconfig feature is present (which I expect distros to
enable broadly):
+static inline bool sched_core_enabled(struct rq *rq)
+{
+	return static_branch_unlikely(&__sched_core_enabled) && rq->core_enabled;
+}

 static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 {
+	if (sched_core_enabled(rq))
+		return &rq->core->__lock;
+
 	return &rq->__lock;
 }
This should, at least in principle, keep the runtime overhead down to a few
NOPs and a slightly bigger instruction cache footprint - modulo compiler
shenanigans.
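For readers unfamiliar with static keys, here's a rough userspace sketch of
the pattern (illustrative only - the struct, the _sketch names and the plain
bool are mine; in the kernel, static_branch_unlikely() compiles the test into
a runtime-patched NOP rather than the ordinary conditional shown here):

	#include <stdbool.h>

	/*
	 * Simplified stand-ins for the kernel structures, so the
	 * sketch compiles standalone; field names mirror the patch.
	 */
	struct rq_sketch {
		int __lock;		/* stand-in for the per-rq raw_spinlock_t */
		int core_lock;		/* stand-in for rq->core->__lock */
		bool core_enabled;
	};

	/*
	 * The kernel uses a static key here: with the key off, the test
	 * in rq_lockp_sketch() becomes a NOP and execution falls
	 * straight through to the per-rq lock. A plain bool models only
	 * the semantics, not the code-patching.
	 */
	static bool sched_core_key_sketch;

	static inline int *rq_lockp_sketch(struct rq_sketch *rq)
	{
		if (sched_core_key_sketch && rq->core_enabled)
			return &rq->core_lock;
		return &rq->__lock;
	}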
Here's the code generation impact on x86-64 defconfig:
   text    data     bss     dec     hex filename
    228      48       0     276     114 sched.core.n/cpufreq.o (ex sched.core.n/built-in.a)
    228      48       0     276     114 sched.core.y/cpufreq.o (ex sched.core.y/built-in.a)
   4438      96       0    4534    11b6 sched.core.n/completion.o (ex sched.core.n/built-in.a)
   4438      96       0    4534    11b6 sched.core.y/completion.o (ex sched.core.y/built-in.a)
   2167    2428       0    4595    11f3 sched.core.n/cpuacct.o (ex sched.core.n/built-in.a)
   2167    2428       0    4595    11f3 sched.core.y/cpuacct.o (ex sched.core.y/built-in.a)
  61099   22114     488   83701   146f5 sched.core.n/core.o (ex sched.core.n/built-in.a)
  70541   25370     508   96419   178a3 sched.core.y/core.o (ex sched.core.y/built-in.a)
   3262    6272       0    9534    253e sched.core.n/wait_bit.o (ex sched.core.n/built-in.a)
   3262    6272       0    9534    253e sched.core.y/wait_bit.o (ex sched.core.y/built-in.a)
  12235     341      96   12672    3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073     917      96   14086    3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)
  10293     477    1928   12698    319a sched.core.n/topology.o (ex sched.core.n/built-in.a)
  10363     509    1928   12800    3200 sched.core.y/topology.o (ex sched.core.y/built-in.a)
    886      24       0     910     38e sched.core.n/cpupri.o (ex sched.core.n/built-in.a)
    886      24       0     910     38e sched.core.y/cpupri.o (ex sched.core.y/built-in.a)
   1061      64       0    1125     465 sched.core.n/stop_task.o (ex sched.core.n/built-in.a)
   1077     128       0    1205     4b5 sched.core.y/stop_task.o (ex sched.core.y/built-in.a)
  18443     365      24   18832    4990 sched.core.n/deadline.o (ex sched.core.n/built-in.a)
  20019    2189      24   22232    56d8 sched.core.y/deadline.o (ex sched.core.y/built-in.a)
   1123       8      64    1195     4ab sched.core.n/loadavg.o (ex sched.core.n/built-in.a)
   1123       8      64    1195     4ab sched.core.y/loadavg.o (ex sched.core.y/built-in.a)
   1323       8       0    1331     533 sched.core.n/stats.o (ex sched.core.n/built-in.a)
   1323       8       0    1331     533 sched.core.y/stats.o (ex sched.core.y/built-in.a)
   1282     164      32    1478     5c6 sched.core.n/isolation.o (ex sched.core.n/built-in.a)
   1282     164      32    1478     5c6 sched.core.y/isolation.o (ex sched.core.y/built-in.a)
   1564      36       0    1600     640 sched.core.n/cpudeadline.o (ex sched.core.n/built-in.a)
   1564      36       0    1600     640 sched.core.y/cpudeadline.o (ex sched.core.y/built-in.a)
   1640      56       0    1696     6a0 sched.core.n/swait.o (ex sched.core.n/built-in.a)
   1640      56       0    1696     6a0 sched.core.y/swait.o (ex sched.core.y/built-in.a)
   1859     244      32    2135     857 sched.core.n/clock.o (ex sched.core.n/built-in.a)
   1859     244      32    2135     857 sched.core.y/clock.o (ex sched.core.y/built-in.a)
   2339       8       0    2347     92b sched.core.n/cputime.o (ex sched.core.n/built-in.a)
   2339       8       0    2347     92b sched.core.y/cputime.o (ex sched.core.y/built-in.a)
   3014      32       0    3046     be6 sched.core.n/membarrier.o (ex sched.core.n/built-in.a)
   3014      32       0    3046     be6 sched.core.y/membarrier.o (ex sched.core.y/built-in.a)
  50027     964      96   51087    c78f sched.core.n/fair.o (ex sched.core.n/built-in.a)
  51537    2484      96   54117    d365 sched.core.y/fair.o (ex sched.core.y/built-in.a)
   3192     220       0    3412     d54 sched.core.n/idle.o (ex sched.core.n/built-in.a)
   3276     252       0    3528     dc8 sched.core.y/idle.o (ex sched.core.y/built-in.a)
   3633       0       0    3633     e31 sched.core.n/pelt.o (ex sched.core.n/built-in.a)
   3633       0       0    3633     e31 sched.core.y/pelt.o (ex sched.core.y/built-in.a)
   3794     160       0    3954     f72 sched.core.n/wait.o (ex sched.core.n/built-in.a)
   3794     160       0    3954     f72 sched.core.y/wait.o (ex sched.core.y/built-in.a)
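(For reference, per-object tables like this come from running GNU size on the
two builds' built-in.a archives - size prints each archive member with the
"(ex archive)" suffix seen above. The sched.core.n/sched.core.y directory
names are of course specific to my setup:

  $ size sched.core.n/built-in.a
  $ size sched.core.y/built-in.a
)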
I'd say this one is representative:
   text    data     bss     dec     hex filename
  12235     341      96   12672    3180 sched.core.n/rt.o (ex sched.core.n/built-in.a)
  13073     917      96   14086    3706 sched.core.y/rt.o (ex sched.core.y/built-in.a)
The ~6% text bloat (12235 -> 13073 bytes) is, I believe, primarily due to the
higher rq-lock inlining overhead.
This is roughly what you'd expect from a change that wraps all 350+ inlined
uses of rq->lock - i.e. it might make sense to uninline rq_lockp().
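For illustration, here's my sketch of what that uninlining could look like
(not part of the posted series - the body moves out of line, so every call
site shrinks to a plain function call at the cost of losing the NOP
fast-path):

	/* kernel/sched/sched.h: declaration only, body leaves the header */
	extern raw_spinlock_t *rq_lockp(struct rq *rq);

	/* kernel/sched/core.c: the single out-of-line definition */
	raw_spinlock_t *rq_lockp(struct rq *rq)
	{
		if (sched_core_enabled(rq))
			return &rq->core->__lock;
		return &rq->__lock;
	}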
In terms of long-term maintenance overhead, ignoring the overhead of the
core-scheduling feature itself, the rq-lock wrappery is the biggest ugliness;
the rest is mostly isolated.
So if this actually *works*, improves the performance of some real
VMEXIT-poor SMT workloads, and allows the enabling of HyperThreading with
untrusted VMs without inviting thousands of guest roots, then I'm cautiously
in support of it.
> That all assumes that it works at all for the people who are clamoring
> for this feature, but I guess they can run some loads on it eventually.
> It's a holiday in the US right now ("Presidents' Day"), but maybe we
> can get some numbers this week?
Such numbers would be *very* helpful indeed.
Thanks,
Ingo