Message-ID: <db59580d-1013-45d4-bf62-b2ed955b22e4@arm.com>
Date: Wed, 10 Sep 2025 20:29:22 +0100
From: James Morse <james.morse@....com>
To: Dave Martin <Dave.Martin@....com>
Cc: linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-acpi@...r.kernel.org, devicetree@...r.kernel.org,
shameerali.kolothum.thodi@...wei.com,
D Scott Phillips OS <scott@...amperecomputing.com>,
carl@...amperecomputing.com, lcherian@...vell.com,
bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
baolin.wang@...ux.alibaba.com, Jamie Iles <quic_jiles@...cinc.com>,
Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
dfustini@...libre.com, amitsinght@...vell.com,
David Hildenbrand <david@...hat.com>, Rex Nie <rex.nie@...uarmicro.com>,
Koba Ko <kobak@...dia.com>, Shanker Donthineni <sdonthineni@...dia.com>,
fenghuay@...dia.com, baisheng.gao@...soc.com,
Jonathan Cameron <jonathan.cameron@...wei.com>, Rob Herring
<robh@...nel.org>, Rohit Mathew <rohit.mathew@....com>,
Rafael Wysocki <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>, Hanjun Guo
<guohanjun@...wei.com>, Sudeep Holla <sudeep.holla@....com>,
Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley
<conor+dt@...nel.org>, Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Danilo Krummrich <dakr@...nel.org>
Subject: Re: [PATCH 05/33] ACPI / PPTT: Find cache level by cache-id
Hi Dave,
On 05/09/2025 17:27, Dave Martin wrote:
> On Thu, Aug 28, 2025 at 04:58:05PM +0100, James Morse wrote:
>> On 27/08/2025 11:50, Dave Martin wrote:
>>> On Fri, Aug 22, 2025 at 03:29:46PM +0000, James Morse wrote:
>>>> The MPAM table identifies caches by id. The MPAM driver also wants to know
>>>> the cache level to determine if the platform is of the shape that can be
>>>> managed via resctrl. Cacheinfo has this information, but only for CPUs that
>>>> are online.
>>>>
>>>> Waiting for all CPUs to come online is a problem for platforms where
>>>> CPUs are brought online late by user-space.
>>>>
>>>> Add a helper that walks every possible cache until it finds the one
>>>> identified by cache-id, then returns the level.
>>>> Add a cleanup-based freeing mechanism for acpi_get_table().
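For anyone reading along without the patch: the "cleanup-based freeing" here is in the
style of the generic DEFINE_FREE()/__free() helpers from <linux/cleanup.h>. A minimal,
illustrative sketch - the name and the example caller below are mine, not necessarily
what the patch does:

| #include <linux/cleanup.h>
|
| /* Drop the table reference automatically when the pointer leaves scope */
| DEFINE_FREE(pptt_table, struct acpi_table_header *,
| 	    if (!IS_ERR_OR_NULL(_T)) acpi_put_table(_T))
|
| static int pptt_example_user(void)
| {
| 	struct acpi_table_header *table __free(pptt_table) = NULL;
| 	acpi_status status;
|
| 	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
| 	if (ACPI_FAILURE(status))
| 		return -ENOENT;
|
| 	/* ... use 'table' ... acpi_put_table() runs on every return path */
| 	return 0;
| }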
>>
>>> Does this mean that the early secondaries must be spread out across the
>>> whole topology so that everything can be probed?
>>>
>>> (i.e., a random subset is no good?)
>>
>> For the mpam driver - it needs to see each cache with mpam hardware, which means a CPU
>> associated with each cache needs to be online. Random is fine - provided you get lucky.
>
> "Fine" = "not dependent on luck". So, random is not fine.
As in it doesn't matter which CPUs - as long as you manage to represent each cache.
I really don't care if people configure their platform such that the mpam driver can't
complete its probing. Everything continues to work, you just don't get to use the unusable
features.
I've only really seen it done in VMs, which are most likely to have a single global MSC
because the thing has to be emulated.
>>> If so, is this documented somewhere, such as in booting.rst?
>>
>> booting.rst is for the bootloader.
>> Late secondaries is a bit of a niche sport, I've only seen it commonly done in VMs.
>> Most platforms so far have their MPAM controls on a global L3, so this requirement doesn't
>> make much of a difference.
>>
>> The concern is that if resctrl gets probed after user-space has started, whatever
>> user-space service is supposed to set it up will have concluded it's not supported. Working
>> with cache-ids for offline CPUs means you don't have to bring all the CPUs online - only
>> enough so that every piece of hardware is reachable.
>>
>>
>>> Maybe this is not a new requirement -- it's not an area that I'm very
>>> familiar with.
>>
>> Hard to say - it's a potentially surprising side effect of glomming OS-accessible registers
>> onto the side of hardware that can be automatically powered off. (PSCI CPU_SUSPEND).
>>
>> I did try getting cacheinfo to populate all the CPUs at boot, regardless of whether they
>> were online. Apparently that doesn't work for PowerPC where the properties of CPUs can
>> change while they are offline. (presumably due to RAS or a firmware update)
> So, it sounds like there is a requirement, but we don't document it,
> and if the requirement is not met then the user is presented with an
> obscure failure in the MPAM driver. This seems a bit unhelpful?
Not a failure. MPAM isn't yet available. Bring the rest of the system up and it will
spring into life. The same goes for any device you keep turned off.
> I'm not saying booting.rst is the right place for this -- maybe the
> appropriate document doesn't exist yet.
>
> I wonder whether the required property is reasonable and general enough
> that it should be treated as a kernel boot requirement.
It's not a boot requirement. Linux boots just fine.
> Or, we require caches to be symmetric for non-early CPUs and reject
> those that don't match when they try to come online (similarly to
> the way cpufeatures deals with mismatches).
MPAM isn't an interesting enough feature to reject CPUs!
We can't check the MSC properties until we can schedule (thank the 'hide it in
firmware' folk for that one), meaning the CPU has to be online before we realise the
MPAM properties are mismatched.
The best approach here is to wait until everything has been seen before declaring that
we know what is going on. The architecture even calls this out as something that needs
doing, meaning the firmware tables have to describe all possible MSCs.
I agree it's not nice - but it is what MPAM is.
I think the people clever enough to manage late onlining of secondaries will work this out.
>>>> diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
>>>> index 8f9b9508acba..660457644a5b 100644
>>>> --- a/drivers/acpi/pptt.c
>>>> +++ b/drivers/acpi/pptt.c
>>>> @@ -907,3 +907,67 @@ int find_acpi_cpu_topology_hetero_id(unsigned int cpu)
>>
>>>> + for_each_possible_cpu(cpu) {
>>>> + acpi_cpu_id = get_acpi_id_for_cpu(cpu);
>>>> + cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
>>>> + if (!cpu_node)
>>>> + return -ENOENT;
>>>> + num_levels = acpi_count_levels(table, cpu_node, NULL);
>>>
>>> Is the initial call to acpi_count_levels() really needed here?
>>>
>>> It feels a bit like we end up enumerating the whole topology two or
>>> three times here; once to count how many levels there are, and then
>>> again to examine the nodes, and once more inside acpi_find_cache_node().
>>>
>>> Why can't we just walk until we run out of levels?
>>
>> This is looking for a unified cache - and we don't know where those start.
>> We could walk the first 100 caches, and stop once we start getting unified caches, then
>> they stop again ... but this seemed simpler.
>
> I'm still a bit confused.
>
> We start at level one, and then trace parents until we hit a unified
> cache or run out of levels.
>
> Why do we need to know a priori how many levels there are, when the
> way to determine that is part of the same procedure we're already doing
> (i.e., start at level one and trace parents until we run out of levels)?
| level = 1;
| acpi_find_cache_node(table, acpi_cpu_id, ACPI_PPTT_CACHE_TYPE_UNIFIED,
| level, &cpu_node);
Fails. Do we stop the loop, or try level=2?
level = 1 fails because the L1 (probably) isn't a unified cache. Is L2? We don't know.
It's simpler to know how many levels there are, and walk the lot, than it is to try and
work out where the unified bit of the hierarchy starts - and start walking from there.
Yes, the PPTT parsing could be different, but this is the kind of shape it has. Doing it
like this is in keeping with the rest of it.
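For reference, the shape this ends up with is roughly the below - a sketch stitched
together from the hunks quoted above (the helper name, and the acpi_count_levels() /
acpi_find_cache_node() calls, follow what's quoted in this thread rather than the final
patch):

| static int pptt_find_cache_level(struct acpi_table_header *table, u32 acpi_cpu_id,
| 				 struct acpi_pptt_processor *cpu_node, u32 cache_id)
| {
| 	struct acpi_pptt_cache_v1 *cache_v1;
| 	struct acpi_pptt_cache *cache;
| 	int num_levels, level;
|
| 	num_levels = acpi_count_levels(table, cpu_node, NULL);
|
| 	/* Level 0 isn't a thing, start the walk at the L1 */
| 	for (level = 1; level <= num_levels; level++) {
| 		cache = acpi_find_cache_node(table, acpi_cpu_id,
| 					     ACPI_PPTT_CACHE_TYPE_UNIFIED,
| 					     level, &cpu_node);
| 		if (!cache || !(cache->flags & ACPI_PPTT_CACHE_ID_VALID))
| 			continue;
|
| 		/* The cache-id lives in the PPTT v3 extension of the structure */
| 		cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, cache,
| 					sizeof(struct acpi_pptt_cache));
| 		if (cache_v1->cache_id == cache_id)
| 			return level;
| 	}
|
| 	return -ENOENT;
| }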
>>> I may be missing some details of how these functions interact -- if
>>> this is only run at probe time, compact, well-factored code is
>>> more important than making things as fast as possible.
>
> (This still stands.)
This is all done at probe time, and never called again.
Nothing else in here caches values - it derives it all from the table every time so that
it's stateless.
James