Message-ID: <34fd8713-3430-4e27-a2c2-fd8839f90f5a@intel.com>
Date: Thu, 7 Nov 2024 11:14:54 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>
CC: <fenghua.yu@...el.com>, <babu.moger@....com>, <bp@...en8.de>,
	<dave.hansen@...ux.intel.com>, <eranian@...gle.com>, <hpa@...or.com>,
	<james.morse@....com>, <linux-kernel@...r.kernel.org>, <mingo@...hat.com>,
	<nert.pinx@...il.com>, <tan.shaopeng@...itsu.com>, <tglx@...utronix.de>,
	<tony.luck@...el.com>, <x86@...nel.org>
Subject: Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter
 reads

Hi Peter,

On 11/7/24 6:26 AM, Peter Newman wrote:
> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@...gle.com> wrote:
>>
>> Hi Reinette,
>>
>> On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre <reinette.chatre@...el.com> wrote:
> 
>>> This sounds as though user space is essentially duplicating what the
>>> MBM overflow handler currently does, which is to run a worker in each domain
>>> to collect MBM data every second from every RMID for both MBM events.
>>>
>>> * What are the requirements of this use case?
>>
>> Accurate, per-RMID MBps data, ideally at 1-second resolution if the
>> overhead can be tolerable.
> 
> Sorry, forgot about the assignable counters issue...
> 
> On AMD we'll have to cycle the available event counters through the
> groups in order to get valid bandwidth counts.

ack

> 
>>
>>> * Is there some benefit to user space in reading individual counters?

My question was poorly worded. What I would like to understand is whether
there is a requirement for user space to keep being able to quickly query
individual counters, as in "query <event> of <monitor group>", or whether it
would be acceptable for user space to get a dump of the event values of all
monitor groups in a single query and then post-process that dump to dig out
the individual counters of interest.
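
To illustrate what I mean by the second option: the post-processing burden on
user space should be small, roughly along the lines of the sketch below, where
the file name, the "<group name> <count>" line format, and the group name are
only illustrative:

#include <stdio.h>
#include <string.h>

/*
 * Illustration only: scan a hypothetical per-domain dump where every line
 * is "<group name> <event count>" and pick out one group of interest.
 */
static int read_group_count(const char *path, const char *group,
			    unsigned long long *count)
{
	char name[128];
	unsigned long long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	while (fscanf(f, "%127s %llu", name, &val) == 2) {
		if (!strcmp(name, group)) {
			*count = val;
			fclose(f);
			return 0;
		}
	}

	fclose(f);
	return -1;
}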

>> The interface is available today and can retrieve the data with
>> somewhat acceptable performance. After applying this change (or
>> selecting a kernel version before the MPAM changes), call-graph
>> profiling showed that the remaining overhead of reading one counter at
>> a time from userspace on AMD Zen2 was around 20%, spacing each series
>> of 250 local reads by 1 second.
>>
>> With the 1.22M figure I quoted below for a single domain from
>> userspace, this comes out to 488 usec at 2.5 Ghz, which is manageable.
>> Even if userspace serializes all of its per-domain reads manually to
>> avoid contention on the global rdtgroup_mutex, the results should be
>> consistent as long as the domains are always serialized in the same
>> order.
>>
>> (apologies that my figures focus on AMD, but its high MBM domain
>> counts make it the most sensitive to read performance)
>>
>> Also, if all of the counter-reading work is performed directly by the
>> thread, the time spent collecting the information is easier to
>> attribute to the management software rather than appearing as an
>> increase in kernel overhead.
>>
>> Clearly not optimal, but an acceptable baseline.

Thank you for providing this insight.

>>
>>> * Would it perhaps be acceptable to provide user space with a cached snapshot
>>>   of all the MBM counters in a single query?
>>>
>>> User space can use a single read to obtain values of an event for all RMIDs
>>> on a domain without a single IPI if the architectural state from the overflow handler
>>> is exposed. It could also be possible to present data from all domains in that single
>>> read.
>>>
>>> One complication I can think of is that data from the different domains may have
>>> been collected up to a second apart. Is this something this use case can tolerate?
>>
>> This +/- 1-second drift would apply to the denominator of the mbps
>> calculation, so it could cause some very large errors. To produce
>> accurate mbps numbers from cached MBM count data, each sample would
>> need to be accompanied by a timestamp.

Reading further, it does seem that making the raw cached state available to
user space would introduce too many problematic corner cases.
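
To make that concern concrete: to turn cached counts into a rate, user space
would effectively compute

	mbps = (count_now - count_prev) / (time_now - time_prev)

and with the per-domain caches refreshed at unknown points within the last
second, the denominator can be off by up to a second in either direction.
At a 1-second sampling interval that can roughly halve or double the
reported rate.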

>>> For example,
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>> If the use case cannot tolerate data up to a second old then resctrl could add new files
>>> in info/L3_MON that will take the mutex once and trigger a *single* IPI to a CPU
>>> from each domain and read all events sequentially (which is essentially mbm_overflow()).
>>>
>>> For example,
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_00
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>>>           <rdtgroup nameA> <MBM total count>
>>>           <rdtgroup nameB> <MBM total count>
>>>           ...
>>>
>>> As I understand an interface like above would be an improvement over
>>> what can be achieved by user space by optimizing queries to existing interface.
>>>
>>
>> Yes, this is an option now that the ABMC work is establishing a naming
>> scheme for groups. It would also significantly cut down on the number
>> of open file descriptors needed.

ack.

>>
>> Tony had prototyped an MBM rate event[1], which cached a MBps value
>> per group/domain produced by the overflow workers. If the workers
>> produce the mbps samples immediately after directly reading the
>> counters, then this should be the best case in terms of precision and
>> introduce very little additional system overhead. Also, userspace
>> reading a rate sample that's 1 second old would be a lot less harmful
>> than performing an MBps calculation from a count sample that's 1
>> second old.

I looked at that implementation. Please do note that it appears to be a PoC
that does not handle the corner cases where issues may arise. For example,
when the overflow handler reads the event values, the rate is updated even if
the counter read returned an error. The moment a counter read encounters an
error it corrupts the measured rate, so this will need more thought.
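
To illustrate the kind of guard I have in mind (sketch only, all names are
made up, this is not actual resctrl code):

/*
 * Sketch: per group/domain/event rate state updated from the 1s overflow
 * tick. A failed counter read invalidates the cached rate instead of
 * being folded into it; the next two good reads re-establish the rate.
 */
struct mbm_rate_state {
	u64	prev_bytes;
	u64	mbytes_per_sec;
	bool	prev_valid;
	bool	rate_valid;
};

static void mbm_update_rate(struct mbm_rate_state *state, u32 rmid,
			    enum resctrl_event_id evtid)
{
	u64 now_bytes;

	/* read_mbm_counter() is a stand-in for the real counter read. */
	if (read_mbm_counter(rmid, evtid, &now_bytes)) {
		/* Do not report a rate derived from a bad read. */
		state->prev_valid = false;
		state->rate_valid = false;
		return;
	}

	if (state->prev_valid) {
		/* Delta over one 1s tick, scaled from bytes to MiB. */
		state->mbytes_per_sec = (now_bytes - state->prev_bytes) >> 20;
		state->rate_valid = true;
	} else {
		state->rate_valid = false;
	}

	state->prev_bytes = now_bytes;
	state->prev_valid = true;
}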

>> Perhaps combining the per-second bandwidth rate cache with a
>> per-domain file for each MBM event to aggregate the data for all
>> groups will be the optimally-performing solution?

I do not see a problem with exposing a per-domain file for each MBM event
that aggregates the data of all groups. For best accuracy I expect that this
file's contents will be generated on demand, querying the hardware counters
at the time user space makes the request. This may indeed result in output
like below (the naming used is just a sample for this discussion):

         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
           <rdtgroup nameA> <MBM total count>
           <rdtgroup nameB> <MBM total count>
           <rdtgroup nameC> Error
           <rdtgroup nameD> Unavailable
           <rdtgroup nameE> Unassigned
           ...

As I understand from your earlier description, this essentially moves the
counter reading work from a user space thread to the kernel. There are a few
options for how to do that work: the most disruptive is an
smp_call_function_any() that reads all of the domain's counters from a CPU in
the domain without any preemption, while a less disruptive alternative is
smp_call_on_cpu(). Some cond_resched() calls could be added if the number of
events to read starts to impact other parts of the kernel. You already
indicated that reading the counters from user space takes 488 usec, which is
very far from what would trigger the softlockup detector, though.
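
A rough sketch of the less disruptive variant, with the domain walk and the
counter read reduced to placeholders (this is not actual resctrl code):

/*
 * Sketch: dump one MBM event for all monitor groups of one domain by
 * running the reads on a CPU of that domain. smp_call_on_cpu() runs the
 * callback in process context, so cond_resched() is legal there.
 */
struct mon_dump_ctx {
	struct rdt_mon_domain	*d;
	enum resctrl_event_id	evtid;
	struct seq_file		*seq;
};

static int mon_dump_domain(void *arg)
{
	struct mon_dump_ctx *ctx = arg;
	struct rdtgroup *rdtg;

	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
		u64 val;

		/* read_group_event() stands in for the real counter read. */
		if (!read_group_event(ctx->d, rdtg, ctx->evtid, &val))
			seq_printf(ctx->seq, "%s %llu\n", rdtg->kn->name, val);
		else
			seq_printf(ctx->seq, "%s Error\n", rdtg->kn->name);

		cond_resched();
	}
	return 0;
}

/* Caller holds rdtgroup_mutex; one call like this per domain: */
ret = smp_call_on_cpu(cpumask_any(&d->hdr.cpu_mask), mon_dump_domain,
		      &ctx, false);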
 
> Factoring ABMC into this, we'd have to decide the interval at which
> we're comfortable with cycling the available event counters through
> the group list in order to get valid MBps samples for every group
> often enough.
> 
> A counter will have to stay assigned long enough to get two valid
> counts before an MBps value can be reported. If the regular MBps
> samples is what we want, maybe we just want a mode where the overflow
> workers would also drive the counter assignments too in order to
> produce regular samples from all groups.

The assignable counter implementation currently depends on user space
assigning counters. In this design the events that do not have a counter
assigned return "Unassigned", and the "rate" value can also return
"Unassigned". "Unassigned" is one scenario that needs to be handled; there
are also hardware errors and the MSR returning "Unavailable".

We can surely consider a new mode in which user space does not assign
counters and resctrl instead manages the counter assignments, so that rate
events can be supported on systems with assignable counters.

I see a few features being discussed in parallel:
- A new per-domain file for each MBM event that aggregates the data of all groups.
- A new "rate" event built on top of user-assigned counters.
- A new "rate" event built on top of resctrl-assigned counters.

Which of these best supports the use case you have in mind?

Reinette



