Message-ID: <20160916165045.GJ5016@twins.programming.kicks-ass.net>
Date: Fri, 16 Sep 2016 18:50:45 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Andy Lutomirski <luto@...capital.net>
Cc: Ingo Molnar <mingo@...hat.com>,
Mike Galbraith <umgwanakikbuti@...il.com>, kernel-team@...com,
Andrew Morton <akpm@...ux-foundation.org>,
"open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
Paul Turner <pjt@...gle.com>, Li Zefan <lizefan@...wei.com>,
Linux API <linux-api@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2
On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
> > SCHED_DEADLINE, it's a 'Global'-EDF-like scheduler that doesn't support
> > CPU affinities (because that doesn't make sense). The only way to
> > restrict it is to partition.
> >
> > 'Global' because you can partition it. If you reduce your system to
> > single CPU partitions you'll reduce to P-EDF.
> >
> > (The same is true of SCHED_FIFO; that's a 'Global'-FIFO on the same
> > partition scheme. It does, however, support sched_setaffinity(), but
> > using it gives 'interesting' schedulability results -- call it a
> > historic accident.)
>
> Hmm, I didn't realize that the deadline scheduler was global. But
> ISTM requiring the use of "exclusive" to get this working is
> unfortunate. What if a user wants two separate partitions, one using
> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
> non-RT stuff)?
{1,2} {3,4} {5} seem exclusive to me; did I miss something? (Other than
that 5-CPU parts are 'rare'.)
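
To illustrate (a rough sketch only, against the v1 cpuset interface; the
mount point, group names and error handling are all hand-waved), that
partitioning is no more than:

/* Carve the machine into non-overlapping cpuset partitions.  Disabling
 * load balancing in the root and giving the children disjoint CPUs is
 * what actually splits the scheduling domains; cpu_exclusive just makes
 * the kernel enforce the non-overlap against siblings. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CPUSET_ROOT "/sys/fs/cgroup/cpuset"

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        ssize_t ret = write(fd, val, strlen(val));
        close(fd);
        return ret < 0 ? -1 : 0;
}

static void make_partition(const char *name, const char *cpus)
{
        char path[256];

        snprintf(path, sizeof(path), CPUSET_ROOT "/%s", name);
        mkdir(path, 0755);

        snprintf(path, sizeof(path), CPUSET_ROOT "/%s/cpuset.cpus", name);
        write_str(path, cpus);
        snprintf(path, sizeof(path), CPUSET_ROOT "/%s/cpuset.mems", name);
        write_str(path, "0");
        snprintf(path, sizeof(path), CPUSET_ROOT "/%s/cpuset.cpu_exclusive", name);
        write_str(path, "1");
}

int main(void)
{
        /* no balancing across the whole machine ... */
        write_str(CPUSET_ROOT "/cpuset.sched_load_balance", "0");

        /* ... only inside each (disjoint) partition */
        make_partition("rt0",    "1-2");        /* {1,2} */
        make_partition("rt1",    "3-4");        /* {3,4} */
        make_partition("system", "5");          /* {5}, the non-RT stuff */

        return 0;
}
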
> Shouldn't we be able to have a cgroup for each of the
> DL partitions and do something to tell the deadline scheduler "here is
> your domain"?
Somewhat confused; by creating the non-overlapping domains you do exactly
that, no?
You end up with 2 (or more) independent deadline schedulers, but if
you're not running deadline tasks (like in the /system partition) you
don't care that they're there.
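
E.g. a task dropped into one of those partitions and switched to
SCHED_DEADLINE only gets admission-tested against the bandwidth of that
partition. A minimal userspace sketch (glibc has no sched_setattr()
wrapper, so raw syscall; the 10ms/100ms numbers are made up):

/* Switch the calling task to SCHED_DEADLINE with a 10ms budget every
 * 100ms.  struct sched_attr mirrors the kernel's layout since glibc
 * doesn't ship it. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE  6
#endif

struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_policy   = SCHED_DEADLINE,
                .sched_runtime  =  10 * 1000 * 1000,    /*  10ms */
                .sched_deadline = 100 * 1000 * 1000,    /* 100ms */
                .sched_period   = 100 * 1000 * 1000,    /* 100ms */
        };

        /* Admission control checks this request against the root domain
         * (i.e. the cpuset partition) the task currently runs in. */
        if (syscall(__NR_sched_setattr, 0, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }

        for (;;)
                pause();
}
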
> > Note that, relatedly but differently, we have the isolcpus boot
> > parameter, which creates single-CPU partitions for all listed CPUs and
> > gives the rest to the root cpuset. Ideally we'd kill this option, given
> > it's a boot-time setting (for something which is trivial to do at
> > runtime).
> >
> > But this cannot be done, because that would mean we'd have to start with
> > a !0 cpuset layout:
> >
> >                  '/'
> >             load_balance=0
> >             /            \
> >      'system'          'isolated'
> >   cpus=~isolcpus      cpus=isolcpus
> >                       load_balance=0
> >
> > And start with _everything_ in the /system group (including default IRQ
> > affinities).
> >
> > Of course, that will break everything cgroup :-(
> >
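
(FWIW, the runtime equivalent of 'everything into /system, including
default IRQ affinities' is roughly the below. A sketch only: it assumes
a v1 cpuset mount with the 'system' group already set up, and "3f"
standing in for whatever ~isolcpus works out to; per-cpu kthreads refuse
to move and simply get skipped.)

/* Move every task out of the root cpuset into 'system' and point the
 * default IRQ affinity at the same CPUs.  Tasks with PF_NO_SETAFFINITY
 * (per-cpu kthreads) make the write fail; leave those alone. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        FILE *root = fopen("/sys/fs/cgroup/cpuset/tasks", "r");
        int dst = open("/sys/fs/cgroup/cpuset/system/tasks", O_WRONLY);
        char pid[32];

        if (!root || dst < 0)
                return 1;

        while (fgets(pid, sizeof(pid), root)) {
                /* fails with EINVAL for per-cpu kthreads; skip them */
                if (write(dst, pid, strlen(pid)) < 0)
                        continue;
        }

        /* newly requested interrupts default to this mask */
        int irq = open("/proc/irq/default_smp_affinity", O_WRONLY);
        if (irq >= 0) {
                write(irq, "3f", 2);    /* hex cpumask, made up */
                close(irq);
        }

        fclose(root);
        close(dst);
        return 0;
}
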
>
> I would actually *much* prefer this over the status quo. I'm tired of
> my crappy, partially-working script that sits there and creates
> exactly this configuration (minus the isolcpus part because I actually
> want migration to work) on boot. (Actually, it could have two
> automatic cgroups: /kernel and /init -- init and UMH would go in init
> and kernel threads and such would go in /kernel. Userspace would be
> able to request that a different cgroup be used for newly-created
> kernel threads.)
So there's a problem with sticking kernel threads (and esp. kthreadd)
into !root groups. For example, if you place kthreadd in a cpuset that
doesn't have all CPUs, then binding your shiny new kthread to a CPU will
fail. You can fix that, of course, and we used to do exactly that, but we
kept running into 'fun' cases like this one.
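
Userspace trips over the same thing, by the way; roughly (run from a
shell that's been confined to a cpuset without CPU 3 -- the CPU number
is just an example):

/* sched_setaffinity() fails with EINVAL when the requested mask has no
 * overlap with the CPUs the caller's cpuset permits -- the userspace
 * analogue of the wall a freshly bound kthread runs into when kthreadd
 * lives in a restricted cpuset. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);       /* outside our cpuset, by assumption */

        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");    /* EINVAL */

        return 0;
}
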
The unbound workqueue stuff is totally arbitrary borkage though; that
can be made to work just fine. TJ didn't like it for some reason which I
really cannot remember.
Also, UMH?
> Heck, even systemd would probably prefer this. Then it could cleanly
> expose a "slice" or whatever it's called for random kernel shit and at
> least you could configure it meaningfully.
No clue about systemd; I'm still on systems without that virus.