Message-ID: <6863c369-706a-452d-a413-4d55a1c5861e@intel.com>
Date: Wed, 23 Apr 2025 08:47:34 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>
CC: "Luck, Tony" <tony.luck@...el.com>, Fenghua Yu <fenghuay@...dia.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>, James Morse
	<james.morse@....com>, Babu Moger <babu.moger@....com>, Drew Fustini
	<dfustini@...libre.com>, Dave Martin <Dave.Martin@....com>, "Anil
 Keshavamurthy" <anil.s.keshavamurthy@...el.com>,
	<linux-kernel@...r.kernel.org>, <patches@...ts.linux.dev>
Subject: Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can
 be read from any CPU

Hi Peter,

On 4/23/25 6:27 AM, Peter Newman wrote:
> Hi Reinette,
> 
> On Tue, Apr 22, 2025 at 8:20 PM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>>
>> Hi Tony,
>>
>> On 4/21/25 1:28 PM, Luck, Tony wrote:
>>> On Fri, Apr 18, 2025 at 03:54:02PM -0700, Reinette Chatre wrote:
>>>>> @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>>>>>                     goto out;
>>>>>             }
>>>>>             d = container_of(hdr, struct rdt_mon_domain, hdr);
>>>>> -           mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
>>>>> +           mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
>>>>> +           mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
>>>>
>>>> I do not think this accomplishes the goal of this patch. Looking at mon_event_read(), it calls
>>>> cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
>>>>
>>>>      cpumask_any_housekeeping()
>>>>      {
>>>>              ...
>>>>              if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
>>>>                      cpu = cpumask_any(mask);
>>>>              ...
>>>>      }
>>>>
>>>> cpumask_any() is just cpumask_first(), so it will pick the first CPU in the
>>>> online mask, which may not be the current CPU.
>>>>
>>>> fwiw ... there are some optimizations planned in this area that I have not yet studied:
>>>> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
>>>
>>> I remember Peter complaining[1] about extra context switches when
>>> cpumask_any_housekeeping() was introduced, but it seems that the
>>> discussion died with no fix applied.
>>
>> The initial complaint was indeed that reading individual events is slower.
>>
>> The issue is that the intended use case reads from many files at frequent
>> intervals and thus becomes sensitive to any changes in this area, which
>> really is already a slow path (reading from a file ... taking a mutex ...).
>>
>> Instead of working on shaving cycles off this path, the discussion transitioned
>> to having resctrl provide better support for the underlying use case. My
>> understanding is that this is being experimented with [2], and last I heard it
>> looks promising.
>>
>>>
>>> The blocking problem is that ARM may not be able to read a counter
>>> on a tick_nohz CPU because the read may need to sleep.
> 
> If I hadn't already turned my attention to optimizing bulk counter
> reads, I might have mentioned that the change Tony referred to is
> broken on MPAM implementations because the MPAM
> resctrl_arch_rmid_read() cannot wait for its internal mutex with
> preemption disabled.
> 
>>>
>>> Do we need more options for events:
>>>
>>> 1) Must be read on a CPU in the right domain  // Legacy
>>> 2) Can be read from any CPU                   // My addition
>>> 3) Must be read on a "housekeeping" CPU               // James' code in upstream
>>> 4) Cannot be read on a tick_nohz CPU          // Could be combined with 1 or 2?
>>
>> I do not see a need for additional complexity here. I think it will be simpler
>> to just replace the use of cpumask_any_housekeeping() in mon_event_read() with
>> open code that supports this particular usage. As I understand it, it is prohibited
>> for all CPUs to be in tick_nohz_full_mask, so it looks to me as though the
>> existing "if (tick_nohz_full_cpu(cpu))" check should never be true (since no CPU is being excluded).
>> Also, since mon_event_read() has no need to exclude CPUs, a simple cpumask_andnot()
>> should suffice to determine what remains of the given mask after accounting for all the
>> NO_HZ CPUs if tick_nohz_full_enabled().
> 
> Can you clarify what you mean by "all CPUs"? It's not difficult for

I mentioned this in the context of this patch, which adds support for
events that can be read from *any* CPU. The CPU reading the event data
need not be in the domain for which data is being read, so all CPUs
on the system are available to the flow supporting these events. Since
not all CPUs on the system can be in tick_nohz_full_mask, there will always
be a CPU available to read this type of event.
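
For reference, the open-coded selection I was picturing looked roughly like
the below (an untested sketch only; the temporary mask handling and the exact
placement inside mon_event_read() are illustrative):

	/*
	 * Sketch: pick a CPU to read an "any CPU" event from, preferring
	 * a CPU that is not in tick_nohz_full_mask. Not all CPUs can be
	 * nohz_full, so the andnot result is expected to be non-empty,
	 * but fall back to any online CPU just in case.
	 */
	cpumask_var_t tmp;
	int cpu;

	if (tick_nohz_full_enabled() && alloc_cpumask_var(&tmp, GFP_KERNEL)) {
		if (cpumask_andnot(tmp, cpu_online_mask, tick_nohz_full_mask))
			cpu = cpumask_any(tmp);
		else
			cpu = cpumask_any(cpu_online_mask);
		free_cpumask_var(tmp);
	} else {
		cpu = cpumask_any(cpu_online_mask);
	}
	/* ... then kick off the read on @cpu as mon_event_read() does today. */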

I made it way too complicated with this though. Tony proposed something
much better and simpler [1].

> all CPUs in an L3 domain to be in tick_nohz_full_mask on AMD
> implementations, where there are many small L3 domains (~8 CPUs each)
> in a socket.
> 
> Google makes use of isolation along this domain boundary on AMD
> platforms in some products and these users prefer to read counters
> using IPIs because they are concerned about introducing context
> switches to the isolated part of the system. In these configurations,
> there is typically only one RMID in that domain, so few of these IPIs
> are needed. (Note that these are different users from the ones I had
> described before who spawn large numbers of containers not limited to
> any domains and want to read the MBM counters for all the RMIDs on all
> the domains frequently.)
> 

Thank you for this insight. There is no change planned for reading
event counters for those events that need to be read from their
domain. Tony's recent proposal [1] moves the handling of this new
style of event to a separate branch.
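
To make sure we are talking about the same thing, the split I picture is
roughly the below. This is only my reading of the direction, not Tony's
actual diff, and mon_event_read_any_cpu() is just a placeholder name:

	d = container_of(hdr, struct rdt_mon_domain, hdr);
	if (md->any_cpu) {
		/*
		 * New-style event: no domain affinity, so it does not
		 * need to consult the domain's CPU mask at all.
		 * (Placeholder helper name.)
		 */
		mon_event_read_any_cpu(&rr, r, d, rdtgrp, evtid);
	} else {
		/* Legacy event: must be read from a CPU in its domain. */
		mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask,
			       evtid, false);
	}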

Reinette

[1] https://lore.kernel.org/lkml/DS7PR11MB607763D8B912A60A3574D2BAFCBA2@DS7PR11MB6077.namprd11.prod.outlook.com/
