Message-ID: <aPtfMFfLV1l/RB0L@e133380.arm.com>
Date: Fri, 24 Oct 2025 12:12:48 +0100
From: Dave Martin <Dave.Martin@....com>
To: linux-kernel@...r.kernel.org
Cc: Tony Luck <tony.luck@...el.com>,
Reinette Chatre <reinette.chatre@...el.com>,
James Morse <james.morse@....com>,
"Chen, Yu C" <yu.c.chen@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>, Jonathan Corbet <corbet@....net>,
x86@...nel.org
Subject: [RFC] fs/resctrl: Generic schema description
Hi all,
Going forward, a single resctrl resource (such as memory bandwidth) is
likely to require multiple schemata, either because we want to add new
schemata that provide finer control, or because the hardware has
multiple controls, covering different aspects of resource allocation.
The fit between MPAM's memory bandwidth controls and the resctrl MB
schema is already awkward, and later Intel RDT features such as Region
Aware Memory Bandwidth Allocation are already pushing past what the MB
schema can describe. Both of these can involve multiple control
values and finer resolution than the 100 steps offered by the current
"MB" schema.
The previous discussion went off in a few different directions [1], so
I want to focus back onto defining an extended schema description that
aims to cover the use cases that we know about or anticipate today, and
allows for future extension as needed.
(A separate discussion is needed on how new schemata interact with
previously-defined schemata (such as the MB percentage schema). I
suggest we pause that discussion for now, in the interests of getting
the schema description nailed down.)
Following on from the previous mail thread, I've tried to refine and
flesh out the proposal for schema descriptions a bit, as follows.
Proposal:
* Split resource names and schema names in resctrlfs.
Resources will be named for the unique, existing schema for each
resource.
The existing schema will keep its name (the same as the resource
name), and new schemata defined for a resource will include that
name as a prefix (at least, by default).
So, for example, we will have an MB resource with a schema called
MB (the schema that we have already). But we may go on to define
additional schemata for the MB resource, with names such as MB_MAX,
etc.
* Stop adding new schema description information in the top-level
info/<resource>/ directory in resctrlfs.
For backwards compatibility, we can keep the existing property
files under the resource info directory to describe the previously
defined resource, but we seem to need something richer going
forward.
* Add a hierarchy to list all the schemata for each resource, along
with their properties. So far, the proposal looks like this,
taking the MB resource as an example:
info/
 └─ MB/
     └─ resource_schemata/
         ├─ MB/
         ├─ MB_MIN/
         ├─ MB_MAX/
         ┆
Here, MB, MB_MIN and MB_MAX are all schemata for the "MB" resource.
In this proposal, these are just dummy schema names for
illustration purposes. The important thing is that they all
control aspects of the "MB" resource, and that there can be more
than one of them.
It may be appropriate to have a nested hierarchy, where some
schemata are presented as children of other schemata if they
affect the same hardware controls. For now, let's put this issue
on one side, and consider what properties should be advertised for
each schema.
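As a rough illustration of how userspace might walk this hierarchy, here
is a minimal Python sketch. It assumes the layout proposed above
(info/<resource>/resource_schemata/<schema>/), which no kernel provides
today. list_schemata() is a hypothetical helper, and the demo runs
against a mock tree in a temporary directory rather than a real resctrl
mount.

```python
import os
import tempfile
from typing import List


def list_schemata(info_root: str, resource: str) -> List[str]:
    """Return the schema names advertised for one resource, sorted by name."""
    base = os.path.join(info_root, resource, "resource_schemata")
    return sorted(entry.name for entry in os.scandir(base) if entry.is_dir())


# Build a mock info/ tree using the dummy schema names from the example above.
with tempfile.TemporaryDirectory() as info_root:
    for schema in ("MB", "MB_MIN", "MB_MAX"):
        os.makedirs(os.path.join(info_root, "MB", "resource_schemata", schema))
    found = list_schemata(info_root, "MB")

print(found)
```

Only directories are treated as schemata here, so any plain files that
end up alongside them would be ignored.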
* Current properties that I think we might want are:
info/
 └─ SOME_RESOURCE/
     └─ resource_schemata/
         ├─ SOME_SCHEMA/
         ┆   ├─ type
             ├─ min
             ├─ max
             ├─ tolerance
             ├─ resolution
             ├─ scale
             └─ unit
(I've tweaked the properties a bit since previous postings.
"type" replaces "map"; "scale" is now the unit multiplier;
"resolution" is now a scaling divisor -- details below.)
I assume that we expose the properties in individual files, but we
could also combine them into a single description file per schema,
per resource or (possibly) a single global file.
(I don't have a strong view on the best option.)
Either way, the following set of properties may be a reasonable
place to start:
type: the schema type, followed by optional flag specifiers:
- "scalar": a single-valued numeric control
A mandatory flag indicates how the control value written to
the schemata file is converted to an amount of resource for
hardware regulation.
The flag "linear" indicates a linear mapping.
In this case, the amount of resource E that is actually
allocated is derived from the control value C written to the
schemata file as follows:
E = C * scale * unit / resolution
Other flag values could be defined later, if we encounter
hardware with non-linear controls.
- "bitmap": a bitmap control
The optional flag "sparse" is present if the control accepts
sparse bitmaps.
For a bitmap control, E = bitmap_weight(C) * scale * unit / resolution.
As before, each bit controls access to a specific chunk of
resource in the hardware, such as a group of cache lines. All
chunks are equally sized.
(Different CTRL_MON groups may still contend within the
allocation E, when they have bits in common between their
bitmaps.)
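To make the two formulas for E concrete, here is a minimal Python sketch.
The function names and the property values are hypothetical, and a
"scalar" schema is assumed to carry the "linear" flag. The result is
expressed as a multiple of the schema's "unit", so the unit itself is
left out of the arithmetic.

```python
from fractions import Fraction


def bitmap_weight(mask: int) -> int:
    """Count the set bits in mask."""
    return bin(mask).count("1")


def effective_amount(schema_type: str, control: int,
                     scale: int, resolution: int) -> Fraction:
    """Return E as an exact multiple of the schema's "unit".

    Assumes a "scalar" schema carries the "linear" flag.
    """
    if schema_type == "scalar":
        quantity = control
    elif schema_type == "bitmap":
        quantity = bitmap_weight(control)
    else:
        raise ValueError(f"unknown schema type: {schema_type!r}")
    return Fraction(quantity * scale, resolution)


# Hypothetical property values, chosen only for illustration.
scalar_share = effective_amount("scalar", 50, scale=1, resolution=100)
bitmap_share = effective_amount("bitmap", 0x00F, scale=1, resolution=12)
print(scalar_share, bitmap_share)
```

With the unit "all", these two results would correspond to a half and a
third of the whole resource respectively.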
min:
- For a scalar schema, the minimum value that can be written to
the control when writing the schemata file.
- For a bitmap schema, a bitmap of the minimum weight that the
schema accepts: if an empty bitmap is accepted, this can be 0.
Otherwise, if bitmaps with a single bit set are acceptable,
this can just have the lowest-order bit set.
Most commonly, the value will probably be "1".
For bitmap schemata, we might report this in hex. In the
interest of generic parsing, we could include a "0x" prefix if
so.
max:
- For a scalar schema, the maximum value that can be written to
the control when writing the schemata file.
- For a bitmap schema, the mask with all bits set.
Possibly reported in hex for bitmap schemata (as for "min").
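If hex values do carry a "0x" prefix, a generic parser becomes very
simple. The sketch below is one way userspace might read "min" and "max"
and sanity-check a bitmap before writing it. The helper names and the
file contents are made up for illustration, and the contiguity rule
implied by the absence of the "sparse" flag is deliberately not checked.

```python
def parse_value(text: str) -> int:
    """Parse a property value; base 0 accepts decimal and "0x"-prefixed hex."""
    return int(text.strip(), 0)


def bitmap_weight(mask: int) -> int:
    """Count the set bits in mask."""
    return bin(mask).count("1")


def bitmap_in_range(value: int, min_mask: int, max_mask: int) -> bool:
    """True if value has no bits outside max and at least min's weight."""
    no_stray_bits = (value & ~max_mask) == 0
    heavy_enough = bitmap_weight(value) >= bitmap_weight(min_mask)
    return no_stray_bits and heavy_enough


# Hypothetical file contents for a 12-bit bitmap schema.
min_mask = parse_value("0x1\n")
max_mask = parse_value("0xfff\n")

print(bitmap_in_range(0x0F0, min_mask, max_mask))
print(bitmap_in_range(0x000, min_mask, max_mask))
```

The same parse_value() would work unchanged for scalar schemata that
report plain decimal values.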
tolerance:
(See below for discussion on this.)
- "0": the control is exact
- "1": the effective control value is within ±1 of the control
value written to the schemata file. (Similarly, positive "n" ->
±n.)
A negative value could be used to indicate that the tolerance
is unknown. (Possibly we could also just omit the property,
though it seems better to warn userspace explicitly if we
don't know.)
Tests might make use of this parameter in order to determine
how picky to be about exact measurement results.
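As a sketch of how a test might consume this, assuming the semantics
described above (non-negative "n" means ±n, negative means unknown),
something like the following would do. within_tolerance() is a
hypothetical helper, and how a test would obtain the effective control
value is left open here.

```python
from typing import Optional


def within_tolerance(written: int, effective: int,
                     tolerance: int) -> Optional[bool]:
    """Compare an effective control value against the value written.

    Returns None when the tolerance is negative (unknown), since no
    verdict is possible in that case.
    """
    if tolerance < 0:
        return None
    return abs(effective - written) <= tolerance


print(within_tolerance(50, 51, 1))
print(within_tolerance(50, 52, 1))
print(within_tolerance(50, 60, -1))
```

Returning None for the unknown case lets a test skip the check
explicitly instead of silently passing or failing.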
resolution:
- For a proportional scalar schema: the number of divisions that
the whole resource is divided into. (See below for
"proportional scalar schema".)
Typically, this will be the same as the "max" value.
- For an absolute scalar schema: the divisor applied to the
control value.
- For a bitmap schema: the size of the bitmap in bits.
scale:
- For a scalar schema: the scale-up multiplier applied to
"unit".
- For a bitmap schema: probably "1".
unit:
- The base unit of the quantity measured by the control value.
The special unit "all" denotes a proportional schema. In this
case, the resource is a finite, physical thing such as a cache
or maxed-out data throughput of a memory controller. The
entire physical resource is available for allocation, and the
control value indicates what proportion of it is allocated.
Bitmap schemata will probably all be proportional and use the
unit "all". (This applies to cache bitmaps, at least.)
Absolute schemata will require specification of the base unit
here, say, "MBps". The "scale" parameter can be used to avoid
proliferation of unit strings:
For example, {scale=1000, unit="MBps"} would be equivalent to
{scale=1, unit="GBps"}.
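The equivalence can be checked with a few lines of Python. This is only
a sketch: the table of unit sizes is my assumption (decimal megabytes
and gigabytes per second), since the proposal does not yet pin down what
each unit string means, and absolute_rate() is a hypothetical helper.

```python
from fractions import Fraction

# Assumed decimal definitions; the proposal does not pin these down.
BYTES_PER_SECOND = {"MBps": 10**6, "GBps": 10**9}


def absolute_rate(control: int, scale: int, unit: str,
                  resolution: int) -> Fraction:
    """Return E in bytes per second for an absolute linear scalar schema."""
    return Fraction(control * scale * BYTES_PER_SECOND[unit], resolution)


# The same control value under the two equivalent descriptions.
via_mbps = absolute_rate(5, scale=1000, unit="MBps", resolution=1)
via_gbps = absolute_rate(5, scale=1, unit="GBps", resolution=1)
print(via_mbps == via_gbps)
```

A proportional schema (unit "all") has no entry in such a table, so a
real parser would need to treat it as a separate case.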
Note on the "tolerance" parameter:
This is a new addition. On the MPAM side, the hardware has a choice
about how to interpret the control value in some edge-case situations.
We may not reasonably be able to probe for this, so it may be useful
to warn software that there is an uncertainty margin.
We might also be able to use the "tolerance" parameter to accommodate
the rounding behaviour of the existing "MB" schema (otherwise, we
might want a special "type" for this schema, if it doesn't comply
closely enough).
If we want to deploy resctrl under virtualisation, resctrl on the host
could dynamically affect the actual amount of resource that is
available for allocation inside a VM.
Whether or not we ever want to do that, it might be useful to have a
way to warn software that the effective control values hitting the
hardware may not be entirely predictable.
Thoughts?
Cheers
---Dave
[1] Re: [PATCH] fs/resctrl,x86/resctrl: Factor mba rounding to be per-arch
https://lore.kernel.org/lkml/aNFliMZTTUiXyZzd@e133380.arm.com/