linux-kernel - Re: GPU device resource reservations with cgroups?

Message-ID: <26b36812-cbe6-744d-6fb7-e7aec0bf5496@quicinc.com>
Date:   Tue, 27 Sep 2022 09:13:49 -0600
From:   Jeffrey Hugo <quic_jhugo@...cinc.com>
To:     "T.J. Mercier" <tjmercier@...gle.com>
CC:     Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
        <cgroups@...r.kernel.org>, Johannes Weiner <hannes@...xchg.org>,
        <dri-devel@...ts.freedesktop.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Carl Vanderlip <quic_carlv@...cinc.com>,
        <quic_ajitpals@...cinc.com>, <quic_pkanojiy@...cinc.com>
Subject: Re: GPU device resource reservations with cgroups?

On 9/8/2022 10:44 AM, T.J. Mercier wrote:
> On Tue, Aug 16, 2022 at 1:39 PM Jeffrey Hugo <quic_jhugo@...cinc.com> wrote:
>>
>> Hello cgroup experts,
>>
>> I have a GPU device [1] that supports organizing its resources for the
>> purposes of supporting containers.  I am attempting to determine how to
>> represent this in the upstream kernel, and I wonder if it fits in cgroups.
>>
>> The device itself has a number of resource types – compute cores,
>> memory, bus replicators, semaphores, and dma channels.  Any particular
>> workload may consume some set of these resources.  For example, a
>> workload may consume two compute cores, 1GB of memory, one dma channel,
>> but no semaphores and no bus replicators.
>>
>> By default all of the resources are in a global pool.  This global pool
>> is managed by the device firmware.  Linux makes a request to the
>> firmware to load a workload.  The firmware reads the resource
>> requirements from the workload itself, and then checks the global pool.
>> If the global pool contains sufficient resources to satisfy the needs of
>> the workload, the firmware assigns the required resources from the
>> global pool to the workload.  If there are insufficient resources, the
>> workload request from Linux is rejected.
>>
>> Some users may want to share the device between multiple containers, but
>> provide device level isolation between those containers.  For example, a
>> user may have 4 workloads to run, one per container, and each workload
>> takes 1/4th of the set of compute cores.  The user would like to reserve
>> sets of compute cores for each container so that container X always has
>> the expected set of resources available, and if container Y
>> malfunctions, it cannot “steal” resources from container X.
>>
>> To support this, the firmware supports a concept of partitioning.  A
>> partition is a pool of resources which are removed from the global pool,
>> and pre-assigned to the partition’s pool.  A workload can then be run
>> from within a partition, and it consumes resources from that partition’s
>> pool instead of from the global pool.  The firmware manages creating
>> partitions and assigning resources to them.
>>
>> Partitions do not nest.
>>
> Do partitions have any significance in hardware, or are they just a
> logical concept? Does it matter which compute core / bus replicator /
> dma channel a user gets, or are they interchangeable between uses?

Logical concept.  Resources are interchangeable.

In the future, I think it is possible that NUMA comes into the picture.
Just as a CPU may be closer to a particular bank of memory (DDR), so
that the Linux scheduler ideally keeps a task using that bank on the
associated CPU, a particular compute core may have specific locality to
other resources.

I'm guessing if we were to consider such a scenario, the partition would
be flagged to request resources which are "close" to each other.
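
Since the resources are interchangeable, the global pool and each
partition can be thought of as plain per-type counters.  A rough Python
sketch of the behaviour described in the quoted text (carving a partition
out of the global pool, then loading workloads against either pool) might
look like the following.  Every name and number here is hypothetical and
for illustration only; none of it is taken from the driver or firmware.

```python
from collections import Counter


class Pool:
    """Counted, interchangeable resources (hypothetical model, not firmware code)."""

    def __init__(self, **resources):
        # Free units per resource type, e.g. cores=16, dma=4.
        self.free = Counter(resources)

    def can_satisfy(self, need):
        # A missing resource type counts as zero free units.
        return all(self.free[name] >= count for name, count in need.items())

    def take(self, need):
        """Remove `need` from the pool; return False if any type is short."""
        if not self.can_satisfy(need):
            return False
        self.free.subtract(need)
        return True


def create_partition(global_pool, reserve):
    """Move `reserve` out of the global pool into a new pool, or return None."""
    if not global_pool.take(reserve):
        return None
    return Pool(**reserve)


def load_workload(pool, need):
    """Accept the workload only if `pool` can cover all of its needs."""
    return pool.take(need)
```

With a 16-core global pool, reserving 4 cores for a partition leaves 12
in the global pool, and a workload loaded through the partition can never
draw on those 12, just as a global workload can never draw on the 4.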

>> In the above user example, the user can create 4 partitions, and divide
>> up the compute cores among them.  Then the user can assign each
>> individual container their own individual partition.  Each container
>> would be limited to the resources within its assigned partition, but
>> also that container would have exclusive access to those resources.
>> This essentially provides isolation, and some Quality of Service (QoS).
>>
>> How this is currently implemented (in downstream) is perhaps not ideal.
>> A privileged daemon process reads a configuration file which defines
>> the number of partitions, and the set of resources assigned to each.
>> That daemon makes requests to the firmware to create the partitions, and
>> gets a unique ID for each.  Then the daemon makes a request to the
>> driver to create a “shadow device”, which is a child dev node.  The
>> driver verifies with the firmware that the partition ID is valid, and
>> then creates the dev node.  Internally the driver associates this shadow
>> device with the partition ID so that each request to the firmware is
>> tagged with the partition ID by the driver.  This tagging allows the
>> firmware to determine that a request is targeted for a specific
>> partition.  Finally, the shadow device is passed into the container,
>> instead of the normal dev node.  The userspace within the container
>> operates the shadow device normally.
>>
>> One concern with the current implementation is that it is possible to
>> create a large number of partitions.  Since each partition is
>> represented by a shadow device dev node, this can create a large number
>> of dev nodes and exhaust the minor number space.
>>
>> I wonder if this functionality is better represented by a cgroup.
>> Instead of creating a dev node for the partition, we can just run the
>> container process within the cgroup.  However it doesn’t look like
>> cgroups have a concept of resource reservation.  It is just a limit.  If
>> that impression is accurate, then I struggle to see how to provide the
>> desired isolation as some entity not under the cgroup could consume all
>> of the device resources, leaving the containers unable to perform their
>> tasks.
> 
> Given the top-down resource distribution policy for cgroups, I think
> you'd have to have a cgroup subtree where limits for these resources
> are exclusively passed to, and maintain the placement of processes in
> the appropriate cgroup under this subtree (one per partition +
> global). The limit for these resources in all other subtrees under the
> root would need to be 0. The only trick would be to maintain the
> limit(s) on the global pool based on the sum of the limits for the
> partitions to ensure that the global pool cannot exhaust resources
> "reserved" for the partitions. If partitions don't come and go at
> runtime then that seems pretty straightforward, otherwise I could see
> the maintenance/adjustment of those limits as a source of frustration.
> 
> 
> 
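
One way to read the scheme above is that the limit on the global cgroup
is the device total minus the sum of the partition limits, recomputed
whenever a partition is created or removed.  A minimal sketch, assuming a
single resource type and hypothetical names (no existing cgroup
controller interface is implied):

```python
def global_limit(device_total, partition_limits):
    """Limit for the 'global' cgroup so it cannot eat into partition reservations.

    device_total     -- total units of one resource type on the device
    partition_limits -- mapping of partition name to its limit (hypothetical)
    """
    reserved = sum(partition_limits.values())
    if reserved > device_total:
        raise ValueError("partition limits exceed the device total")
    return device_total - reserved
```

For example, with 16 compute cores and four partitions limited to 4 cores
each, the global limit works out to 0; removing one partition raises it
to 4.  The adjustment has to happen on every partition change, which is
the maintenance burden mentioned above.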
>>
>> So, cgroup experts, does this sound like something that should be
>> represented by a cgroup, or is cgroup the wrong mechanism for this usecase?
>>
>> [1] -
>> https://lore.kernel.org/all/1660588956-24027-1-git-send-email-quic_jhugo@quicinc.com/