Message-ID: <176f45a3-76d7-4e35-a668-6e845f168110@linux.alibaba.com>
Date: Wed, 28 May 2025 14:40:32 +0800
From: qinyuntan <qinyuntan@...ux.alibaba.com>
To: Reinette Chatre <reinette.chatre@...el.com>,
 Tony Luck <tony.luck@...el.com>
Cc: "H . Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
 x86@...nel.org
Subject: Re: [PATCH] x86/resctrl: Remove unnecessary references to cacheinfo
 in the resctrl subsystem.

Hi Reinette Chatre,

On 5/28/25 2:37 PM, qinyuntan wrote:
> Hi Reinette Chatre,
> 
> On 5/28/25 12:49 PM, Reinette Chatre wrote:
>> Hi Qinyun Tan,
>>
>> On 5/27/25 8:32 PM, qinyuntan wrote:
>>> On 5/28/25 7:36 AM, Reinette Chatre wrote:
>>>> On 5/26/25 12:37 AM, Qinyun Tan wrote:
>>
>>
>>>>> first online CPU of a NUMA node. If this CPU later goes offline, the
>>>>> shared_cpu_map (managed by the cache subsystem) is cleared, leading to
>>>>
>>>> "is cleared" sounds like the shared_cpu_map is empty. Looking at
>>>> cache_shared_cpu_map_remove() it seems that the worst case is when the
>>>> CPU goes offline the shared_cpu_map only contains the offline CPU 
>>>> itself.
>>>> If there remains a reference to that shared_cpu_map then it will 
>>>> thus be
>>>> to a cpumask with an offline CPU. Are you seeing other scenarios?
>>>>
>>> Yes, you are right. Once this CPU goes offline, its shared_cpu_map
>>> will only include itself. If we then try to run the read on this
>>> offline CPU via smp_call_on_cpu(), the call fails.
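
For reference, a minimal sketch of that failure mode. The helper names
below (read_counter() and read_on_first_cpu()) are placeholders rather
than the real resctrl read path; the point is only that smp_call_on_cpu()
bails out with an error, without running the callback, when the chosen
CPU is offline:

#include <linux/cpumask.h>
#include <linux/smp.h>

/* Placeholder callback; stands in for the real RMID counter read. */
static int read_counter(void *arg)
{
        return 0;
}

/*
 * If @mask has collapsed to just an offline CPU (the worst case described
 * above), cpumask_any() hands back that offline CPU and smp_call_on_cpu()
 * returns -ENXIO without ever invoking read_counter().
 */
static int read_on_first_cpu(const struct cpumask *mask, void *arg)
{
        unsigned int cpu = cpumask_any(mask);

        return smp_call_on_cpu(cpu, read_counter, arg, false);
}
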
>>
>> Thank you for confirming. Could you please update the changelog to
>> reflect this detail? Doing so will remove confusion and make the change
>> easier to consume by making it obvious to the reader what the problem is.
> Thank you for your patient review; I will update the changelog in the 
> next patch.
> 
>>>>> incorrect or undefined behavior when accessed via rdt_mon_domain.
>>>>
>>>> Could you please elaborate on what "incorrect or undefined behavior" you
>>>> encountered?
>>>> It looks like reading the top level event would not be able to read
>>>> an event from one (or more?) of the online domains since the
>>>> shared_cpu_map will contain an offline CPU, causing smp_call_on_cpu()
>>>> to return with a failure that is not caught ... the symptom may thus
>>>> be that there are no errors but the data is wrong?
>>>
>>> Yes, there won't be any errors here, but when reading the top-level 
>>> events, it may not retrieve any values.
>>>
>>> For example, in SNC-3 mode, suppose CPUs 0, 40, and 80 are the
>>> first online CPUs on node0, node1, and node2 respectively. If CPUs
>>> 0, 40, and 80 all go offline, then at that point reading "the top
>>> level event" will return zero.
>>
>> This is an SNC-3, single-socket example, where CPU 0 cannot go offline.
>> I thus do not see this happening, for the reason you provide below.
>>
>> I *can* see how this happens on a second socket when the first online CPU
>> of the first node of that (the second) socket goes offline.
> 
>>> Why hasn’t this issue been discovered earlier? The reason is that 
>>> CPU0 is always the first online CPU on node0. The cacheinfo stored in 
>>> the first rdt_mon_domain corresponds to CPU0. When 
>>> rdtgroup_mondata_show reads the top-level event, it iterates through 
>>> all rdt_mon_domain entries, using if (d->ci->id == domid) to find the 
>>> first rdt_mon_domain that shares the resource. It then selects a CPU 
>>> from the corresponding cacheinfo to perform the monitoring group data 
>>> read operation. In a single-socket environment, all CPUs typically 
>>> share the L3 Cache, which means this traversal action will usually 
>>> lead directly to CPU0's cacheinfo. Additionally, since the mainline 
>>> kernel no longer supports taking CPU0 offline, that cacheinfo remains 
>>> valid indefinitely.
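
To make the traversal concrete, it has roughly the following shape
(paraphrased from memory of the pre-patch code, so it is not a verbatim
copy and field names may differ slightly):

list_for_each_entry(d, &r->mon_domains, hdr.list) {
        if (d->ci->id == domid) {
                rr.ci = d->ci;
                mon_event_read(&rr, r, NULL, rdtgrp,
                               &d->ci->shared_cpu_map, evtid, false);
                goto checkresult;
        }
}

In a single-socket system all domains share the same L3 cache id, so this
traversal matches the first domain, whose cacheinfo is CPU0's, which is
why the problem stays hidden as long as CPU0 cannot go offline.
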
>>>
>>>>
>>>>>
>>>>> 2. Lifecycle dependency: The cacheinfo structure's lifecycle is 
>>>>> managed
>>>>> by the cache subsystem, making it unsafe for resctrl to hold
>>>>> long-term references.
>>>>
>>>> This is not obvious to me. Could you please elaborate how resctrl could
>>>> have a reference to a removed structure?
>>> As mentioned above, although the cacheinfo of each CPU is not freed
>>> in the latest mainline, the shared_cpu_map within the cacheinfo is
>>> modified as CPUs go online or offline. Each rdt_mon_domain directly
>>> references the cacheinfo of the first online CPU in the node, and
>>> that shared_cpu_map is used in various code paths. This is bound to
>>> cause problems.
>>
>> Claiming that it is unsafe for resctrl to hold a reference implies that
>> resctrl uses an invalid pointer. This is not the case here. The pointer
>> is valid, but the data pointed to by it does not support resctrl's 
>> usage. I
>> thus do not believe that this "lifecycle dependency" point is a valid
>> motivation for this change.My description is indeed inaccurate. I will 
>> adjust it in the next patch. 
> Thanks.

Sorry, I accidentally pressed backspace twice. It should be:

'''
My description is indeed inaccurate. I will adjust it in the next patch.
Thanks.
'''

> 
>>>>
>>>>>
>>>>> To resolve these issues and align with design principles:
>>>>>
>>>>> 1. Replace direct cacheinfo references in struct rdt_mon_domain and
>>>>> struct rmid_read with the cacheinfo ID (a unique identifier for the
>>>>> L3 cache).
>>>>>
>>>>> 2. Use hdr.cpu_mask (already maintained by resctrl) to replace
>>>>> shared_cpu_map logic for determining valid CPUs for RMID counter reads
>>>>> via the MSR interface.
>>>>
>>>> I think it will help to explain why it is ok now to use hdr.cpu_mask
>>>> instead of the shared_cpu_map. In support of this you could mention
>>>> that the original solution aimed to read the counters on any CPU
>>>> associated with the L3 cache that the sub-NUMA domain forms part of,
>>>> while this solution uses the cpumask of one of the sub-NUMA domains
>>>> that is known to be associated with the L3 cache. This is a smaller
>>>> set of CPUs than previously intended, but one known to be accurate.
>>>> An alternative may be to build a local cpumask from all the sub-NUMA
>>>> domains, but it is not clear to me how much this would improve things.
>>>>
>>>> Considering this I do not think the references are "unnecessary" as the
>>>> subject claims, since the solution does not replace the original cpumask
>>>> with an identical one.
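
To make that alternative concrete, building such a local cpumask would look
roughly like the following (a sketch only, not proposed code; d, r and domid
are the variables already used in rdtgroup_mondata_show(), and ci_id is the
field introduced by this patch):

cpumask_var_t mask;

if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
        return -ENOMEM;

/* Union of the cpumasks of every SNC domain sharing L3 cache instance domid. */
list_for_each_entry(d, &r->mon_domains, hdr.list) {
        if (d->ci_id == domid)
                cpumask_or(mask, mask, &d->hdr.cpu_mask);
}

/* ...pass mask to mon_event_read() instead of a single domain's mask... */

free_cpumask_var(mask);
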
>>>
>>> Thanks a lot, you are correct. hdr.cpu_mask is a subset of
>>> shared_cpu_map, and hdr.cpu_mask can largely replace the
>>> functionality of shared_cpu_map.
>>>
>>> In fact, in resctrl, the only purpose of using cpu_mask is to select
>>> a CPU that shares the cache resource for performing monitoring group
>>> data reads. Therefore, I think there is no need to build a local
>>> cpumask from all the sub-NUMA domains in this context.
>>
>> One issue I can think of here is a usage where the user does task
>> isolation on the NUMA node boundary. Let's consider the SNC-3 example
>> again with node3, node4, and node5 on the second socket, "L3 cache ID 1".
>> If all CPUs on node3 are in tick_nohz_full_mask while none of the node4
>> and node5 CPUs are in tick_nohz_full_mask, then one of node3's CPUs will
>> get an unnecessary IPI.
>>
> You are right. How about this? First obtain any CPU in hdr.cpu_mask, 
> and then use that CPU's cacheinfo shared_cpu_map:
> 
> diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
> index 9337787461d2d..d43f438465ad0 100644
> --- a/fs/resctrl/ctrlmondata.c
> +++ b/fs/resctrl/ctrlmondata.c
> @@ -596,7 +596,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>          struct rdtgroup *rdtgrp;
>          struct rdt_resource *r;
>          struct mon_data *md;
> -       int domid, ret = 0;
> +       struct cacheinfo *ci;
> +       int domid, cpu, ret = 0;
> 
>          rdtgrp = rdtgroup_kn_lock_live(of->kn);
>          if (!rdtgrp) {
> @@ -625,8 +626,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>                  list_for_each_entry(d, &r->mon_domains, hdr.list) {
>                          if (d->ci_id == domid) {
>                                  rr.ci_id = d->ci_id;
> +                               cpu = cpumask_any(&d->hdr.cpu_mask);
> +                               ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
> +                               if (!ci)
> +                                       continue;
>                                  mon_event_read(&rr, r, NULL, rdtgrp,
> -                                              &d->hdr.cpu_mask, evtid, false);
> +                                              &ci->shared_cpu_map, evtid, false);
>                                  goto checkresult;
>                          }
>                  }
> 
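
As an aside on the isolation concern above: the reason the wider
shared_cpu_map helps is that the read path then has a chance to pick a CPU
outside tick_nohz_full_mask whenever one exists. Roughly (an illustration
of the selection idea only, not the actual mon_event_read() internals):

#include <linux/cpumask.h>
#include <linux/tick.h>

/* Prefer a housekeeping CPU from @mask; fall back to any CPU in it. */
static unsigned int pick_read_cpu(const struct cpumask *mask)
{
        unsigned int cpu;

        for_each_cpu(cpu, mask) {
                if (!tick_nohz_full_cpu(cpu))
                        return cpu;
        }

        return cpumask_any(mask);
}

With only node3's hdr.cpu_mask (all nohz_full in the example) there is no
housekeeping CPU to fall back on, while the full shared_cpu_map still
contains the housekeeping CPUs of node4 and node5.
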
>>>
>>> I will provide a more detailed description in my commit log moving
>>> forward.
>>
>> Thank you very much.
>>
>> Reinette
> 
> Thanks
> Qinyun Tan
> 


