linux-kernel - Re: [Documentation] State of CPU controller in cgroup v2

Date:   Fri, 16 Sep 2016 11:19:38 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Mike Galbraith <umgwanakikbuti@...il.com>, kernel-team@...com,
        Andrew Morton <akpm@...ux-foundation.org>,
        "open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
        Paul Turner <pjt@...gle.com>, Li Zefan <lizefan@...wei.com>,
        Linux API <linux-api@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2

On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
>
>> > SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
>> > CPU affinities (because that doesn't make sense). The only way to
>> > restrict it is to partition.
>> >
>> > 'Global' because you can partition it. If you reduce your system to
>> > single CPU partitions you'll reduce to P-EDF.
>> >
>> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
>> > partition scheme, it however does support sched_affinity, but using it
>> > gives 'interesting' schedulability results -- call it a historic
>> > accident).
>>
>> Hmm, I didn't realize that the deadline scheduler was global.  But
>> ISTM requiring the use of "exclusive" to get this working is
>> unfortunate.  What if a user wants two separate partitions, one using
>> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
>> non-RT stuff)?
>
> {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> cpu parts are 'rare').

There's no overlap, so they're logically exclusive, but it avoids
needing the "cpu_exclusive" parameter.  It always seemed confusing to
me that a setting on a child cgroup would strictly remove a resource
from the parent.  (To be clear: I don't have any particularly strong
objection to cpu_exclusive.  It just always seemed like a bit of a
hack that mostly duplicated what you could get by just setting the
cpusets appropriately throughout the hierarchy.)
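
To make "logically exclusive" concrete: a minimal sketch that checks whether a set of CPU partitions is pairwise disjoint, which is the property the {1,2} {3,4} {5} layout has without any cpu_exclusive flag. The partition names and the helper function are hypothetical; only the CPU numbers come from the example above.

```python
from itertools import combinations


def overlapping_pairs(partitions):
    """Return every pair of partition names whose CPU sets intersect."""
    return [
        (name_a, name_b)
        for (name_a, cpus_a), (name_b, cpus_b) in combinations(partitions.items(), 2)
        if set(cpus_a) & set(cpus_b)
    ]


# Hypothetical names; CPU numbers follow the example in the thread.
layout = {"rt_a": {1, 2}, "rt_b": {3, 4}, "non_rt": {5}}

# An empty list means no two partitions share a CPU.
print(overlapping_pairs(layout))
```

This only models the set arithmetic. It says nothing about how the kernel builds scheduler domains from the cpuset hierarchy.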

>> > Note that related, but differently, we have the isolcpus boot parameter
>> > which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
>> > time setting (for something which is trivial to do at runtime).
>> >
>> > But this cannot be done, because that would mean we'd have to start with
>> > a !0 cpuset layout:
>> >
>> >                 '/'
>> >                 load_balance=0
>> >             /              \
>> >         'system'        'isolated'
>> >         cpus=~isolcpus  cpus=isolcpus
>> >                         load_balance=0
>> >
>> > And start with _everything_ in the /system group (including default IRQ
>> > affinities).
>> >
>> > Of course, that will break everything cgroup :-(
>> >
>>
>> I would actually *much* prefer this over the status quo.  I'm tired of
>> my crappy, partially-working script that sits there and creates
>> exactly this configuration (minus the isolcpus part because I actually
>> want migration to work) on boot.  (Actually, it could have two
>> automatic cgroups: /kernel and /init -- init and UMH would go in init
>> and kernel threads and such would go in /kernel.  Userspace would be
>> able to request that a different cgroup be used for newly-created
>> kernel threads.)
>
> So there's a problem with sticking kernel threads (and esp. kthreadd)
> into !root groups. For example if you place it in a cpuset that doesn't
> have all cpus, then binding your shiny new kthread to a cpu will fail.
>
> You can fix that of course, and we used to do exactly that, but we kept
> running into 'fun' cases like that.

Blech.  But maybe this *should* have that effect.  I'm sick of random
kernel crap being scheduled on my RT CPUs and on the CPUs that I
intend to be kept forcibly idle.
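
For illustration, the 'system' / 'isolated' split from the quoted diagram (cpus=~isolcpus versus cpus=isolcpus) can be computed from an isolcpus-style list. This is a rough sketch, assuming only plain comma-separated CPU numbers and ranges; the real boot parameter accepts more syntax than this handles. The function names and the "0-5" online list are hypothetical.

```python
def parse_cpu_list(text):
    """Parse a simple CPU list such as "0,2-4" into a set of ints."""
    cpus = set()
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def split_cpus(online, isolated):
    """Split the online CPUs into the 'system' and 'isolated' groups."""
    online_set = parse_cpu_list(online)
    # Ignore isolated CPUs that are not actually online.
    isolated_set = parse_cpu_list(isolated) & online_set
    return {
        "system": sorted(online_set - isolated_set),
        "isolated": sorted(isolated_set),
    }


# Hypothetical six-CPU machine with CPUs 1-4 set aside.
print(split_cpus("0-5", "1-4"))
```

The result is just the two CPU lists; writing them into actual cpusets and moving tasks and IRQ affinities is the part the thread describes as hard.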

>
> The unbound workqueue stuff is totally arbitrary borkage though, that
> can be made to work just fine, TJ didn't like it for some reason which I
> really cannot remember.
>
> Also, UMH?

User mode helper.  Fortunately most users are gone now, but it still exists.