Message-ID: <5b446254-a2c8-4f01-93bf-3a348d474820@intel.com>
Date: Tue, 27 May 2025 21:49:10 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: qinyuntan <qinyuntan@...ux.alibaba.com>, Tony Luck <tony.luck@...el.com>
CC: "H . Peter Anvin" <hpa@...or.com>, <linux-kernel@...r.kernel.org>,
	<x86@...nel.org>
Subject: Re: [PATCH] x86/resctrl: Remove unnecessary references to cacheinfo
 in the resctrl subsystem.

Hi Qinyun Tan,

On 5/27/25 8:32 PM, qinyuntan wrote:
> On 5/28/25 7:36 AM, Reinette Chatre wrote:
>> On 5/26/25 12:37 AM, Qinyun Tan wrote:


>>> first online CPU of a NUMA node. If this CPU later goes offline, the
>>> shared_cpu_map (managed by the cache subsystem) is cleared, leading to
>>
>> "is cleared" sounds like the shared_cpu_map is empty. Looking at
>> cache_shared_cpu_map_remove() it seems that the worst case is when the
>> CPU goes offline the shared_cpu_map only contains the offline CPU itself.
>> If there remains a reference to that shared_cpu_map then it will thus be
>> to a cpumask with an offline CPU. Are you seeing other scenarios?
>>
> Yes, you are right. Once this CPU goes offline, its shared_cpu_map
> will only include itself. If we then try to call this offline CPU
> using smp_call_on_cpu, it will result in a failure.

Thank you for confirming. Could you please update this changelog to reflect
this detail? Doing so will remove confusion and make the change easier to
consume by making it obvious to the reader what the problem is.
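
For context on the failure mode being described: if I remember kernel/smp.c
correctly, smp_call_on_cpu() rejects a target CPU that is not online before
doing any work, roughly:

	/* Simplified sketch of the early check in smp_call_on_cpu() */
	if (cpu >= nr_cpu_ids || !cpu_online(cpu))
		return -ENXIO;

so a shared_cpu_map whose only remaining CPU has gone offline can never
produce a successful read, and unless that return value is checked the read
silently comes back empty.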
 
>>> incorrect or undefined behavior when accessed via rdt_mon_domain.
>>
>> Could you please elaborate what "incorrect and undefined behavior" you
>> encountered?
>> It looks like reading the top level event would not be able to read
>> an event from one (or more?) of the online domains since the shared_cpu_map
>> will contain an offline CPU causing smp_call_on_cpu() to return with a failure
>> that is not caught ... the symptom may thus be that there are no errors but
>> data is wrong?
> 
> Yes, there won't be any errors here, but when reading the top-level events, it may not retrieve any values.
> 
> For example, in SNC-3 mode, suppose CPUs 0, 40, and 80 are the
> first online CPUs on node0, node1, and node2 respectively. If CPUs
> 0, 40, and 80 all go offline, then reading "the top level
> event" will return zero.

This is an SNC-3, single-socket example in which CPU 0 cannot go offline.
I thus do not see this happening, for the reason you provide below.

I *can* see how this happens on a second socket when the first online CPU
of the first node of that (the second) socket goes offline.
 
> Why hasn't this issue been discovered earlier? The reason is that CPU0 is
> always the first online CPU on node0. The cacheinfo stored in the first
> rdt_mon_domain corresponds to CPU0. When rdtgroup_mondata_show reads the
> top-level event, it iterates through all rdt_mon_domain entries, using
> if (d->ci->id == domid) to find the first rdt_mon_domain that shares the
> resource. It then selects a CPU from the corresponding cacheinfo to
> perform the monitoring group data read. In a single-socket environment,
> all CPUs typically share the L3 cache, which means this traversal usually
> leads directly to CPU0's cacheinfo. Additionally, since the mainline
> kernel no longer supports taking CPU0 offline, that cacheinfo remains
> valid indefinitely.
> 
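
To make the lookup you describe concrete, the pattern at play is roughly the
following (a simplified sketch, not the exact resctrl code; read_domain()
stands in for the per-domain RMID read):

	/*
	 * Sum the "top level" event across all SNC domains that share the
	 * L3 cache identified by domid.
	 */
	list_for_each_entry(d, &r->mon_domains, hdr.list) {
		if (d->ci->id != domid)
			continue;
		/*
		 * If the first online CPU of this node went offline, this
		 * mask may only contain that (now offline) CPU.
		 */
		cpu = cpumask_any(&d->ci->shared_cpu_map);
		smp_call_on_cpu(cpu, read_domain, &rr, false); /* failure not caught */
		sum += rr.val;	/* rr.val stays 0 when the call failed */
	}

which matches the symptom above: nothing reports an error, the read simply
returns zero for the affected domain.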
>>
>>>
>>> 2. Lifecycle dependency: The cacheinfo structure's lifecycle is managed
>>> by the cache subsystem, making it unsafe for resctrl to hold
>>> long-term references.
>>
>> This is not obvious to me. Could you please elaborate how resctrl could
>> have a reference to a removed structure?
> As mentioned above, although the cacheinfo of each CPU is not freed
> in the latest mainline, the shared_cpu_map within the cacheinfo will
> be modified as CPUs go online or offline. Each rdt_mon_domain
> directly references the cacheinfo of the first online CPU in the
> node, and the shared_cpu_map is used in various code paths. This
> situation is bound to lead to some problems.

Claiming that it is unsafe for resctrl to hold a reference implies that
resctrl uses an invalid pointer. This is not the case here. The pointer
is valid, but the data pointed to by it does not support resctrl's usage. I
thus do not believe that this "lifecycle dependency" point is a valid
motivation for this change.
 
>>
>>>
>>> To resolve these issues and align with design principles:
>>>
>>> 1. Replace direct cacheinfo references in struct rdt_mon_domain and struct
>>> rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).
>>>
>>> 2. Use hdr.cpu_mask (already maintained by resctrl) to replace
>>> shared_cpu_map logic for determining valid CPUs for RMID counter reads
>>> via the MSR interface.
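
(For item 1, I assume the end result looks something like the following in
struct rdt_mon_domain, with the ci_id name being illustrative:

	struct rdt_mon_domain {
		struct rdt_domain_hdr	hdr;
	-	struct cacheinfo	*ci;
	+	unsigned int		ci_id;	/* L3 cache ID, cf. get_cpu_cacheinfo_id() */
		...
	};

and the lookups then compare d->ci_id against the domain id instead of
dereferencing d->ci.)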
>>
>> I think it will help to explain why it is ok now to use hdr.cpu_mask instead
>> of the shared_cpu_map. In support of this you could mention that the original
>> solution aimed to read the counters on any CPU associated with the L3 cache
>> that the sub-NUMA domain forms part of, but this solution uses the
>> cpumask of one of the sub-NUMA domains that is known to be associated with
>> the L3 cache. This is a smaller set of CPUs than previously intended, but
>> one that is known to be accurate. An alternative may be to build a local
>> cpumask from all the sub-NUMA domains, but it is not clear to me how much
>> this would improve things.
>>
>> Considering this I do not think the references are "unnecessary" as the
>> subject claims since the solution does not replace the original cpumask with
>> an identical one.
> 
> Thanks a lot, you are correct. hdr.cpu_mask is a subset of
> shared_cpu_map, and we can almost use hdr.cpu_mask to replace the
> functionality of shared_cpu_map.
> 
> In fact, in resctrl, the only purpose of using cpu_mask is to select
> a CPU that shares the cache resource for performing monitoring group
> data reads. Therefore, I think there is no need to build a local
> cpumask from all the sub-NUMA domains in this context.
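
Agreed that for this usage the selection then reduces to something like the
sketch below (again simplified, with read_domain() standing in for the
per-domain read):

	/* Pick a CPU from the SNC domain's own mask instead of ci->shared_cpu_map */
	cpu = cpumask_any(&d->hdr.cpu_mask);
	smp_call_on_cpu(cpu, read_domain, &rr, false);

Every CPU in hdr.cpu_mask shares the L3 cache the domain belongs to, and
resctrl already keeps that mask up to date across CPU hotplug.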

One issue I can think of here is a usage where the user does task
isolation on the NUMA node boundary. Let's consider the SNC-3 example
again with node3, node4, and node5 on the second socket, "L3 cache ID 1".
If all CPUs on node3 are in tick_nohz_full_mask while none of the node4
and node5 CPUs are in tick_nohz_full_mask, then one of node3's CPUs will
get an unnecessary IPI.
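
To illustrate, the alternative mentioned earlier of building a local cpumask
from all the SNC domains of the same L3 would look roughly like this (a
sketch, reusing the illustrative ci_id field from above):

	/* Union of the cpu_masks of all SNC domains sharing this L3 cache */
	cpumask_clear(&allowed);
	list_for_each_entry(d, &r->mon_domains, hdr.list)
		if (d->ci_id == domid)
			cpumask_or(&allowed, &allowed, &d->hdr.cpu_mask);

which would allow the read to land on a housekeeping CPU on node4 or node5
instead of interrupting an isolated node3 CPU.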

> 
> I will provide a more detailed description in my commit log moving
> forward.

Thank you very much.

Reinette

