linux-kernel - Re: [RFC] Design proposal for upstream core-scheduling interface

Message-ID: <cb5432d1-9909-1f16-5e26-ea77efbee713@oracle.com>
Date:   Mon, 24 Aug 2020 17:42:28 -0400
From:   chris hyser <chris.hyser@...cle.com>
To:     Joel Fernandes <joel@...lfernandes.org>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        JulienDesfossez@...gle.com, jdesfossez@...italocean.com,
        Peter Zijlstra <peterz@...radead.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, linux-kernel@...r.kernel.org,
        fweisbec@...il.com, keescook@...omium.org,
        Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>, vineethrp@...il.com,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        dhaval.giani@...il.com, paulmck@...nel.org, joshdon@...gle.com,
        xii@...gle.com, haoluo@...gle.com, bsegall@...gle.com
Subject: Re: [RFC] Design proposal for upstream core-scheduling interface



On 8/24/20 4:53 PM, chris hyser wrote:
> On 8/21/20 11:01 PM, Joel Fernandes wrote:
>> Hello!
>> Core-scheduling aims to make it safe for tasks that trust each other to
>> share hyperthreads within a CPU core [1]. This results
>> in a performance improvement for workloads that can benefit from using
>> hyperthreading safely while limiting core-sharing when it is not safe.
>>
>> Currently no universally agreed-upon interface exists, and companies have
>> been hacking up their own interfaces to make use of the patches. This post
>> lists usecases that I gathered by talking to various people at Google and
>> Oracle, after which actual development of code to add interfaces can follow.
>>
>> The below text uses the terms cookie and tag interchangeably. Further, cookie
>> of 0 is assumed to indicate a trusted process - such as kernel threads or
>> system daemons. By default, if nothing is tagged then everything is
>> considered trusted since the scheduler assumes all tasks are a match for each
>> other.
>>
>> Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
>> int32 is split into 2 parts, the color and the id. The color can only be set
>> by privileged processes and the id can be set by anyone. The CGroup structure
>> looks like:
>>
>>     A         B
>>    / \      / \ \
>>   C   D    E  F  G
>>
>> Here A and B are container CGroups for 2 jobs and are assigned a color by a
>> privileged daemon. The job itself has more sub-CGroups within (for ex, B has
>> E, F and G). When these sub-CGroups are spawned, they inherit the color from
>> the parent. An unprivileged user can then set an id for the sub-CGroup
>> without the knowledge of the privileged daemon if it desires to add further
>> isolation. This setting of id can be an unprivileged operation because the
>> root daemon has already isolated A and B.
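
A minimal sketch of the color/id split described above, not Google's actual
implementation. The 16/16 bit layout and every name here (tag_make,
tag_set_id, ...) are assumptions for illustration only; the proposal does not
state the field widths. It shows why an unprivileged id update cannot cross
the isolation boundary set by the privileged color.

```c
#include <stdint.h>

/* ASSUMPTION: 16 bits of color, 16 bits of id. The proposal only says the
 * int32 is split into two parts; the widths here are hypothetical. */
#define ID_BITS 16u
#define ID_MASK ((UINT32_C(1) << ID_BITS) - 1u)

/* Build a tag from a privileged color and an unprivileged id. */
static inline uint32_t tag_make(uint32_t color, uint32_t id)
{
	return ((color & ID_MASK) << ID_BITS) | (id & ID_MASK);
}

static inline uint32_t tag_color(uint32_t tag)
{
	return tag >> ID_BITS;
}

static inline uint32_t tag_id(uint32_t tag)
{
	return tag & ID_MASK;
}

/* Unprivileged operation: replaces only the id bits, so the color chosen
 * by the privileged daemon is preserved. */
static inline uint32_t tag_set_id(uint32_t tag, uint32_t id)
{
	return (tag & ~ID_MASK) | (id & ID_MASK);
}
```

With this layout, two sub-CGroups that pick the same id under parents of
different colors still end up with different tags.
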
>>
>> Usecase 2: Chrome browser - tagging renderers. In Chrome, each tab opened
>> spawns a renderer. A renderer is a sandboxed process and it is assumed it
>> could run arbitrary code (Javascript etc). When a renderer is created, a
>> prctl call is made to tag the renderer. Every thread that is spawned by the
>> renderer is also tagged. Essentially this turns SMT off for the renderer, but
>> still gives a performance boost due to privileged system threads being able
>> to share a core. The tagging also forbids the renderer from sharing a core
>> with privileged system processes. In the future, we plan to allow threads to
>> share a core as well (especially once we get syscall-isolation upstreamed.
>> Patches were posted recently for the same [2]).
>>
>> Usecase 3: ChromeOS VMs - each vCPU thread that is created by the VMM is
>> tagged thus disallowing core sharing between the vCPU thread and any other
>> thread on the system. This is because such VMs may run arbitrary user code
>> and attack both the guest and the host systems sharing the core.
>>
>> Usecase 4: Oracle - Setting a sub-CGroup as trusted (cookie 0). Chris Hyser
>> told me on IRC that in a CGroup hierarchy, some CGroups should not have
>> to share their parent CGroup's tag. In fact, it should be possible to
>> untag a child CGroup if needed, thus allowing it to share a core with
>> trusted tasks. Others have had similar requirements.
>>
>> Proposal for tagging
>> --------------------
>> We have to support both CGroup and non-CGroup users. CGroup may be overkill
>> for some and the CGroup v2 unified hierarchy may be too inflexible.
>> Regardless, we must support CGroup due to its ease of use and existing users.
>>
>> For Usecase #1
>> --------------
>> Usecase #1 requires a 2-level tagging mechanism. I propose 2 new files
>> to the CPU controller:
>> - tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
>>    tagged. (In the kernel, the cookie will be derived from the pointer value
>>    of a ref-counted cookie object.) If reset, then the CGroup will inherit
>>    the parent CGroup's cookie if there is one.
>>
>> - color : The ref-counted object will be aligned to, say, a 256-byte
>>    boundary, so that the lower 8 bits of the pointer can be used to specify
>>    the color. Together, the pointer and the color form the cookie used by
>>    the scheduler.
>>
>> Note that if 2 CGroups belong to 2 different tagged hierarchies, then setting
>> their color to be the same does not imply that the 2 groups will share a
>> core. This is key.  Also, to support usecase #4, we could add a third tag
>> value -- 2, along with the usual 0 and 1 to suggest that the CGroup can share
>> a core with cookie-0 tasks (Chris Hyser feel free to add any more comments
>> here).
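
A userspace sketch of the pointer-plus-color cookie described above, under
stated assumptions: struct cookie_obj and all function names are
hypothetical, and aligned_alloc() stands in for whatever allocator the kernel
would use. It assumes that converting a pointer to uintptr_t keeps the
alignment visible in the low bits, which holds on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define COOKIE_ALIGN 256u                           /* example alignment from the proposal */
#define COLOR_MASK   ((uintptr_t)COOKIE_ALIGN - 1)  /* low 8 bits hold the color */

/* Hypothetical stand-in for the ref-counted cookie object. */
struct cookie_obj {
	unsigned int refcount;
};

static struct cookie_obj *cookie_obj_alloc(void)
{
	/* C11 aligned_alloc: size must be a multiple of the alignment. */
	struct cookie_obj *obj = aligned_alloc(COOKIE_ALIGN, COOKIE_ALIGN);

	if (obj)
		obj->refcount = 1;
	return obj;
}

/* Cookie = object address (low 8 bits zero) OR'ed with the color. */
static uintptr_t make_cookie(const struct cookie_obj *obj, uint8_t color)
{
	uintptr_t base = (uintptr_t)obj;

	assert((base & COLOR_MASK) == 0);
	return base | color;
}

static uint8_t cookie_color(uintptr_t cookie)
{
	return (uint8_t)(cookie & COLOR_MASK);
}

static struct cookie_obj *cookie_object(uintptr_t cookie)
{
	return (struct cookie_obj *)(cookie & ~COLOR_MASK);
}
```

Because each tagged hierarchy has its own object, two hierarchies that choose
the same color still get different cookies: the pointer bits differ even when
the low 8 bits match.
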
> 
> Let em think about this. This looks like it would support delegation of a cgroup subtree, which I suppose containers are 

s/em/me/