linux-kernel - Re: [RFC] Design proposal for upstream core-scheduling interface

Message-ID: <CAPhKKr8V68SGMPkMqqQE+j1dM-7MBD8XPxf1t6s-gUzwoY_BsQ@mail.gmail.com>
Date:   Mon, 24 Aug 2020 13:31:42 -0700
From:   Dhaval Giani <dhaval.giani@...il.com>
To:     Vineeth Pillai <vineethrp@...il.com>
Cc:     Joel Fernandes <joel@...lfernandes.org>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        JulienDesfossez@...gle.com,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        chris hyser <chris.hyser@...cle.com>,
        "Paul E . McKenney" <paulmck@...nel.org>, joshdon@...gle.com,
        xii@...gle.com, haoluo@...gle.com,
        Benjamin Segall <bsegall@...gle.com>
Subject: Re: [RFC] Design proposal for upstream core-scheduling interface

On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai <vineethrp@...il.com> wrote:
>
> > Let me know your thoughts and looking forward to a good LPC MC discussion!
> >
>
> Nice write up Joel, thanks for taking time to compile this with great detail!
>
> After going through the details of the interface proposal using cgroup v2
> controllers, and based on our offline discussion, I would like to note down
> this idea about a new pseudo filesystem interface for core scheduling. We
> could also include this in the API discussion during the core scheduling MC.
>
> coreschedfs: pseudo filesystem interface for Core Scheduling
> ----------------------------------------------------------------------------------
>
> The basic requirement of core scheduling is simple - we need to group a set
> of tasks into a trust group that can share a core. So we don’t really need
> a nested hierarchy for the trust groups. Cgroup v2 follows a unified nested
> hierarchy model, which causes considerable confusion if the trusted tasks
> are at different levels of the hierarchy and we need to allow them to share
> a core. Cgroup v2's single hierarchy model makes it difficult to regroup
> tasks at different levels of nesting for core scheduling. As noted in this
> mail, we could use a multi-file approach and other interfaces like prctl to
> overcome this limitation.
>
> The idea proposed here to overcome the above limitation is to come up with
> a new pseudo filesystem - “coreschedfs”. This is basically a flat
> filesystem with a maximum nesting level of 1. That means the root directory
> can have sub-directories for sub-groups, but those sub-directories cannot
> have further sub-directories representing trust groups. The root directory
> represents the system-wide trust group and the sub-directories represent
> trusted groups. Each directory, including the root directory, has the
> following set of files/directories:
>
> - cookie_id: User-exposed id for a cookie. This can be compared to a file
>   descriptor. This could be used in a programmatic API to join/leave a
>   group.
>
> - properties: An interface to specify how child tasks of this group should
>   behave. Can be used for specifying future flag requirements as well.
>   The current list of properties includes:
>     NEW_COOKIE_FOR_CHILD:  All fork() for tasks in this group will result
>                            in creation of a new trust group
>     SAME_COOKIE_FOR_CHILD: All fork() for tasks in this group will end up
>                            in this same group
>     ROOT_COOKIE_FOR_CHILD: All fork() for tasks in this group go to the
>                            root group
>
> - tasks: Lists the tasks in this group. Main interface for adding and
>   removing tasks in a group.
>
> - <pid>: A directory per task that is a member of this trust group.
> - <pid>/properties: This file is the same as the parent properties file,
>   but is used to override the group setting for that task.
>
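
To make the fork() semantics of the three properties concrete, here is a rough in-memory model in Python. It is only a sketch under assumptions the proposal does not spell out: the class and method names (CoreSchedModel, Group, ForkPolicy) are made up for illustration, the root group is given cookie_id 0 and SAME_COOKIE_FOR_CHILD as its default, a group created through NEW_COOKIE_FOR_CHILD inherits the parent group's property, and a per-task override is not inherited by the child.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count
from typing import Dict, Optional


class ForkPolicy(Enum):
    NEW_COOKIE_FOR_CHILD = "new"
    SAME_COOKIE_FOR_CHILD = "same"
    ROOT_COOKIE_FOR_CHILD = "root"


@dataclass
class Group:
    cookie_id: int
    policy: ForkPolicy = ForkPolicy.SAME_COOKIE_FOR_CHILD
    # pid -> per-task override (models <pid>/properties); None means "use group policy"
    tasks: Dict[int, Optional[ForkPolicy]] = field(default_factory=dict)


class CoreSchedModel:
    """Hypothetical model of a flat set of trust groups (not kernel code)."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.root = Group(cookie_id=0)  # assumption: root uses cookie_id 0
        self.groups: Dict[int, Group] = {0: self.root}

    def new_group(self, policy: ForkPolicy = ForkPolicy.SAME_COOKIE_FOR_CHILD) -> Group:
        group = Group(next(self._ids), policy)
        self.groups[group.cookie_id] = group
        return group

    def group_of(self, pid: int) -> Group:
        for group in self.groups.values():
            if pid in group.tasks:
                return group
        raise KeyError(pid)

    def add_task(self, pid: int, group: Group,
                 override: Optional[ForkPolicy] = None) -> None:
        # A task belongs to exactly one group, so drop it from any other first.
        for other in self.groups.values():
            other.tasks.pop(pid, None)
        group.tasks[pid] = override

    def fork(self, parent_pid: int, child_pid: int) -> Group:
        """Model of the fork() hook: place the child according to the policy."""
        parent = self.group_of(parent_pid)
        override = parent.tasks[parent_pid]
        policy = override if override is not None else parent.policy
        if policy is ForkPolicy.NEW_COOKIE_FOR_CHILD:
            target = self.new_group(parent.policy)  # assumption: policy inherited
        elif policy is ForkPolicy.SAME_COOKIE_FOR_CHILD:
            target = parent
        else:
            target = self.root
        self.add_task(child_pid, target)
        return target

    def exit(self, pid: int) -> None:
        """Model of the exit() hook: drop the task from its group."""
        self.group_of(pid).tasks.pop(pid)
```

With this model, a task in a NEW_COOKIE_FOR_CHILD group that forks gets its child placed in a freshly created group, while a task whose <pid>/properties override is ROOT_COOKIE_FOR_CHILD sends its children to the root group regardless of the group setting.
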
> This pseudo filesystem can be mounted anywhere in the root filesystem. I
> propose the default to be “/sys/kernel/coresched”.
>
> When coresched is enabled, the kernel internally creates the framework for
> this filesystem. The filesystem gets mounted at the default location and
> the admin can change this if needed. All tasks are in the root group by
> default. The admin or programs can then create trusted groups on top of
> this filesystem.
>
> Hooks will be placed in fork() and exit() to make sure that the
> filesystem’s view of tasks is up-to-date with the system. APIs manipulating
> core scheduling trusted groups should also make sure that the filesystem's
> view is updated.
>
> Note: The above idea is very similar to cgroup v1. Since there is no
> unified hierarchy in cgroup v1, most of the features of coreschedfs could
> be implemented as a cgroup v1 controller. As no new v1 controllers are
> allowed, I feel the best alternative for a simple API is to come up with
> a new filesystem - coreschedfs.
>
> The advantages of this approach are:
>
> - Detached from the cgroup unified hierarchy, and hence the very simple
>   requirement of core scheduling can be easily materialized.
> - Admin can have fine-grained control of groups using shell and scripting.
> - Can have programmatic access to this using existing APIs like mkdir,
>   rmdir, write, read. Or we can come up with new APIs using the cookie_id,
>   which can wrap the above Linux APIs or use a new system call for core
>   scheduling.
> - Fine-grained permission control using Linux filesystem permissions and
>   ACLs.
>
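
As an illustration of the "existing APIs like mkdir, rmdir, write, read" point, the following Python sketch drives a directory tree laid out the way the proposal describes. Since coreschedfs does not exist, everything here is an assumption: the helper names are hypothetical, the root path is a parameter (a scratch directory stands in for the proposed “/sys/kernel/coresched”), the helper creates the tasks and properties files itself where a real pseudo filesystem would presumably populate them, and the one-pid-per-line format of the tasks file is borrowed from how cgroup v1 tasks files look.

```python
import tempfile
from pathlib import Path

# Property names taken from the proposal.
VALID_PROPERTIES = {
    "NEW_COOKIE_FOR_CHILD",
    "SAME_COOKIE_FOR_CHILD",
    "ROOT_COOKIE_FOR_CHILD",
}


def create_group(root: Path, name: str) -> Path:
    """Create a trust group as a direct child of root (max nesting level 1)."""
    if "/" in name or name in ("", ".", ".."):
        raise ValueError("trust groups must be direct children of the root")
    group = root / name
    group.mkdir()
    # Stand-in only: a real pseudo filesystem would create these files itself.
    (group / "tasks").touch()
    (group / "properties").write_text("SAME_COOKIE_FOR_CHILD\n")
    return group


def add_task(group: Path, pid: int) -> None:
    """Add a task by appending its pid to the group's tasks file (assumed format)."""
    with open(group / "tasks", "a") as f:
        f.write(f"{pid}\n")


def list_tasks(group: Path) -> list:
    """Return the pids currently listed in the group's tasks file."""
    return [int(tok) for tok in (group / "tasks").read_text().split()]


def set_property(group: Path, value: str) -> None:
    """Set the fork() behaviour for children of tasks in this group."""
    if value not in VALID_PROPERTIES:
        raise ValueError(f"unknown property: {value}")
    (group / "properties").write_text(value + "\n")


def remove_group(group: Path) -> None:
    """Remove a group; the stand-in has to delete its files before rmdir."""
    for entry in group.iterdir():
        entry.unlink()
    group.rmdir()
```

A per-VM setup would then be one create_group() call per VM followed by add_task() for the VM's tasks, with access restricted through ordinary ownership and mode bits on the group directory.
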
> Disadvantages are:
> - Yet another pseudo filesystem.
> - Very similar to cgroup v1 and might be re-implementing features that are
>   already provided by cgroup v1.
>
> Use Cases
> -----------------
>
> Use case 1: Google cloud
> ------------------------
>
> Since we no longer depend on cgroup v2 hierarchies, there will not be any
> issue of nesting and sharing. The main daemon can create trusted groups in
> the filesystem and provide the required permissions for each group. Then
> the init processes for each job can be added to their respective groups so
> that they can create child tasks as needed. Multiple jobs under the same
> customer that need to share a core can be housed in one group.
>
>
> Use case 2: Chrome browser
> --------------------------
>
> We start with one group for the first task and then set properties to
> NEW_COOKIE_FOR_CHILD.
>
> Use case 3: Chrome VMs
> ----------------------
>
> Similar to the Chrome browser, the VM task can put each vcpu in its own group.
>
> Use case 4: Oracle use case
> ---------------------------
> This is also similar to use case 1 with this interface. All tasks that need to
> be in the root group can be easily added by the admin.
>
> Use case 5: General virtualization
> ----------------------------------
>
> The requirement is that each VM should be isolated. This can be easily done
> by creating a new group per VM.
>
>
> Please have a look at the above proposal and let us know your thoughts. We
> shall also include this in the interface discussion at the core scheduling MC.
>

I am inclined to say no to this. Yet another FS interface :-(. We are
just reinventing the wheel here. Let's try to stick within cgroupfs
first and see if we can make it work there.

Dhaval