linux-kernel - Re: [PATCH 12/33] arm_mpam: Add the class and component structures for ris firmware described

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff73df9c-8e77-471d-b0fb-205701b4d6d3@arm.com>
Date: Mon, 8 Sep 2025 18:57:41 +0100
From: James Morse <james.morse@....com>
To: Dave Martin <Dave.Martin@....com>
Cc: linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
 linux-acpi@...r.kernel.org, devicetree@...r.kernel.org,
 D Scott Phillips OS <scott@...amperecomputing.com>,
 carl@...amperecomputing.com, lcherian@...vell.com,
 bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
 baolin.wang@...ux.alibaba.com, Jamie Iles <quic_jiles@...cinc.com>,
 Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
 dfustini@...libre.com, amitsinght@...vell.com,
 David Hildenbrand <david@...hat.com>, Koba Ko <kobak@...dia.com>,
 Shanker Donthineni <sdonthineni@...dia.com>, fenghuay@...dia.com,
 baisheng.gao@...soc.com, Jonathan Cameron <jonathan.cameron@...wei.com>,
 Rob Herring <robh@...nel.org>, Rohit Mathew <rohit.mathew@....com>,
 Rafael Wysocki <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
 Lorenzo Pieralisi <lpieralisi@...nel.org>, Hanjun Guo
 <guohanjun@...wei.com>, Sudeep Holla <sudeep.holla@....com>,
 Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley
 <conor+dt@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
 Will Deacon <will@...nel.org>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
 Danilo Krummrich <dakr@...nel.org>, Ben Horgan <ben.horgan@....com>
Subject: Re: [PATCH 12/33] arm_mpam: Add the class and component structures
 for ris firmware described

Hi Dave,

On 01/09/2025 12:09, Dave Martin wrote:
>> Subject: arm_mpam: Add the class and component structures for ris firmware described
> 
> Mangled subject line?

order words hard is.


> There is a fair intersection between the commit message and what the
> patch does, but they don't quite seem to match up.
> 
> Some key issues like locking / object lifecycle management
> and DT parsing (a bit of which, it appears, lives here too) are not
> mentioned at all.

I don't think everything needs mentioning - you have the diff for that.This should capture
the motivation, what you have and what you need to find, the grouping etc.


> In lieu of a complete rewrite, it might be best to discard the
> explanation of the various object types.  The comment in the code
> speaks for itself, and looks clearer.

Fair enough,

> [...]
> 
> On Fri, Aug 22, 2025 at 03:29:53PM +0000, James Morse wrote:
>> An MSC is a container of resources, each identified by their RIS index.
>> Some RIS are described by firmware to provide their position in the system.
>> Others are discovered when the driver probes the hardware.
>>
>> To configure a resource it needs to be found by its class, e.g. 'L2'.
>> There are two kinds of grouping, a class is a set of components, which
>> are visible to user-space as there are likely to be multiple instances
>> of the L2 cache. (e.g. one per cluster or package)
>>
>> struct mpam_components are a set of struct mpam_vmsc. A vMSC groups the
>> RIS in an MSC that control the same logical piece of hardware. (e.g. L2).
>> This is to allow hardware implementations where two controls are presented
>> as different RIS. Re-combining these RIS allows their feature bits to
>> be or-ed. This structure is not visible outside mpam_devices.c
>>
>> struct mpam_vmsc are then a set of struct mpam_msc_ris, which are not
>> visible as each L2 cache may be composed of individual slices which need
>> to be configured the same as the hardware is not able to distribute the
>> configuration.
>>
>> Add support for creating and destroying these structures.
>> A gfp is passed as the structures may need creating when a new RIS entry
>> is discovered when probing the MSC.

>> diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
>> index 71a1fb1a9c75..5baf2a8786fb 100644
>> --- a/drivers/resctrl/mpam_devices.c
>> +++ b/drivers/resctrl/mpam_devices.c
>> @@ -20,7 +20,6 @@
> 
> [...]
> 
>> @@ -35,11 +34,483 @@
>>  static DEFINE_MUTEX(mpam_list_lock);
>>  static LIST_HEAD(mpam_all_msc);
>>  
>> -static struct srcu_struct mpam_srcu;
>> +struct srcu_struct mpam_srcu;

> Why expose this here?  This patch makes no use of the exposed symbol.

The mpam_resctrl code needs to take it when it walks these lists. I don't want to change
it then because its just additional churn.


>>  /* MPAM isn't available until all the MSC have been probed. */
>>  static u32 mpam_num_msc;
>>  
>> +/*
>> + * An MSC is a physical container for controls and monitors, each identified by
>> + * their RIS index. These share a base-address, interrupts and some MMIO
>> + * registers. A vMSC is a virtual container for RIS in an MSC that control or
>> + * monitor the same thing. Members of a vMSC are all RIS in the same MSC, but
>> + * not all RIS in an MSC share a vMSC.
>> + * Components are a group of vMSC that control or monitor the same thing but
>> + * are from different MSC, so have different base-address, interrupts etc.
>> + * Classes are the set components of the same type.
>> + *
>> + * The features of a vMSC is the union of the RIS it contains.
>> + * The features of a Class and Component are the common subset of the vMSC
>> + * they contain.
>> + *
>> + * e.g. The system cache may have bandwidth controls on multiple interfaces,
>> + * for regulating traffic from devices independently of traffic from CPUs.
>> + * If these are two RIS in one MSC, they will be treated as controlling
>> + * different things, and will not share a vMSC/component/class.
>> + *
>> + * e.g. The L2 may have one MSC and two RIS, one for cache-controls another
>> + * for bandwidth. These two RIS are members of the same vMSC.
>> + *
>> + * e.g. The set of RIS that make up the L2 are grouped as a component. These
>> + * are sometimes termed slices. They should be configured the same, as if there
>> + * were only one.
>> + *
>> + * e.g. The SoC probably has more than one L2, each attached to a distinct set
>> + * of CPUs. All the L2 components are grouped as a class.
>> + *
>> + * When creating an MSC, struct mpam_msc is added to the all mpam_all_msc list,
>> + * then linked via struct mpam_ris to a vmsc, component and class.
>> + * The same MSC may exist under different class->component->vmsc paths, but the
>> + * RIS index will be unique.
>> + */
> 
> This description of the structures and how they relate to each other
> seems OK (bearing in mind that I am already familiar with this stuff --
> I can't speak for other people).

Great!


> [...]
> 
>> +#define add_to_garbage(x)				\
>> +do {							\
>> +	__typeof__(x) _x = x;				\
> 
> Nit:
> 
> = (x)
> 
> (for the paranoid)

Fixed,


>> +	(_x)->garbage.to_free = (_x);			\
> 
> _x->garbage.to_free = _x;
> 
> (_x is an identifier, not a macro argument.  It can't get re-parsed as
> something else -- assuming that there is not a #define for _x, but then
> all bets would be off anyway.)

>> +	llist_add(&(_x)->garbage.llist, &mpam_garbage);	\
> 
> &_x->...

Fixed,


> [...]
> 
>> +static void mpam_ris_destroy(struct mpam_msc_ris *ris)
>> +{
>> +	struct mpam_vmsc *vmsc = ris->vmsc;
>> +	struct mpam_msc *msc = vmsc->msc;
>> +	struct platform_device *pdev = msc->pdev;
>> +	struct mpam_component *comp = vmsc->comp;
>> +	struct mpam_class *class = comp->class;
>> +
>> +	lockdep_assert_held(&mpam_list_lock);
>> +
>> +	cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity);
>> +	cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity);
> 
> This is not the inverse of the cpumask_or()s in mpam_ris_create_locked(),
> unless the the ris associated with each class and each component have
> strictly disjoint affinity masks.  Is that checked anywhere, or should
> it be impossible by construction?

They should be disjoint. These bitmaps are built from firmware description of the cache
hierarchy. I don't think its possible to describe a situation where there are overlaps.

You can build a nonsense cache hierarchy, e.g. where CPU-0's L3 is CPU-6's L2, but if you
do the scheduler is going to complain when it tries to chose the scheduler domains. I
think this should be filed under "you've got bigger problems".  There is a check that
catches this in mpam_resctrl_pick_caches(), to see that all the CPUs are accounted for,
which is to avoid tasks that get lucky with task-placement managing to escape their
resource limit.


> But, thinking about it:
> 
> I wonder why we ever really need to do the teardown.  If we get an
> error interrupt then we can just go into a sulk, spam dmesg a bit, put
> the hardware into the most vanilla state that we can, and refuse to
> manipulate it further.  But this only happens in the case of a software
> or hardware *bug* (or, in a future world where we might implement
> virtualisation, an uncontainable MPAM error triggered by a guest -- for
> which tearing down the host MPAM would be an overreaction).

The good news is guests can't escape the PARTID virtualisation that the CPU does, so any
mess a guest manages to make is confined to that guest's PARTID range.


> Trying to cleanly tear the MPAM driver down after such an error seems a
> bit futile.
> 
> The MPAM resctrl glue could eventually be made into a module (though
> not needed from day 1) -- which would allow for unloading resctrlfs if
> that is eventually module-ised.  I think this wouldn't require the MPAM
> devices backend to be torn down at any point, though (?)

It would certainly be optional. kernfs->resctrl->mpam is the reason all this has to be
built-in. If that changes I'd aim for this to be a module.

All this free()ing was added so that the driver doesn't end up sitting on memory when it
isn't providing any usable feature. I have seen a platform where the error interrupt goes
off during boot, (I suspect firmware configures an out-of-range PARTID). On such a
platform any memory that isn't free()d is a waste.

But I agree its a small amount of memory.


> If we can simplify or eliminate the teardown, does it simplify the
> locking at all?  The garbage collection logic can also be dispensed
> with if there is never any garbage.

It wouldn't simplify the locking, only remove that deferred free()ing which is needed
because of SRCU.


> Since MSCs etc. never disappear from the hardware, it feels like it
> ought not to be necessary ever to remove items from any of these lists
> except when trying to do a teardown (?)

Unbinding the driver from an MSC is another case where this may be triggered via
mpam_msc_drv_remove(). If you look at the whole thing, mpam_ris_destroy() pokes
mpam_resctrl_teardown_class() to see if resctrl needs to be torn down.

I don't anticipate folk actually needing to do that. One Reasons is for VFIO - but this
kind of stuff has a performance impact on the hypervisor, so its unlikely to ever allow a
guest direct access to this kind of thing. Another reason is to load a more specific
driver, which sounds unlikely.


Ultimately this memory free-ing code is here because its the right thing to do.
I'd prefer to keep it as making this a loadable module would mean we have to do this.


> (Putting the hardware into a quiecent state is not the same thing as
> tearing down the data structures -- we do want to quiesce MPAM when
> shutting down the kernel, as least for the kexec scenario.)

> [...]
> 
>> +static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx,
>> +				  enum mpam_class_types type, u8 class_id,
>> +				  int component_id, gfp_t gfp)
>> +{
>> +	int err;
>> +	struct mpam_vmsc *vmsc;
>> +	struct mpam_msc_ris *ris;
>> +	struct mpam_class *class;
>> +	struct mpam_component *comp;
>> +
>> +	lockdep_assert_held(&mpam_list_lock);
>> +
>> +	if (test_and_set_bit(ris_idx, msc->ris_idxs))
>> +		return -EBUSY;
> 
> Is it impossible by construction to get in here with an out-of-range
> ris_idx?
> 
> To avoid the callers (i.e., ACPI) needing to understand the internal
> limitations of this code, maybe it is worth having a check here (even
> if technically redundant).

It's possible - but only if you mess up the firmware tables.
I'll add a check for this as its easy enough.


Thanks,

James