linux-kernel - Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7ee63634-3b55-4427-8283-8e3d38105f41@intel.com>
Date: Wed, 13 Nov 2024 21:40:35 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: Peter Newman <peternewman@...gle.com>
CC: <fenghua.yu@...el.com>, <babu.moger@....com>, <bp@...en8.de>,
	<dave.hansen@...ux.intel.com>, <eranian@...gle.com>, <hpa@...or.com>,
	<james.morse@....com>, <linux-kernel@...r.kernel.org>, <mingo@...hat.com>,
	<nert.pinx@...il.com>, <tan.shaopeng@...itsu.com>, <tglx@...utronix.de>,
	<tony.luck@...el.com>, <x86@...nel.org>
Subject: Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter
 reads

Hi Peter,

On 11/13/24 5:28 AM, Peter Newman wrote:
> On Thu, Nov 7, 2024 at 8:15 PM Reinette Chatre
> <reinette.chatre@...el.com> wrote:
>> On 11/7/24 6:26 AM, Peter Newman wrote:
>>> On Thu, Nov 7, 2024 at 12:01 PM Peter Newman <peternewman@...gle.com> wrote:
>>>>
> 
>>>> Tony had prototyped an MBM rate event[1], which cached a MBps value
>>>> per group/domain produced by the overflow workers. If the workers
>>>> produce the mbps samples immediately after directly reading the
>>>> counters, then this should be the best case in terms of precision and
>>>> introduce very little additional system overhead. Also, userspace
>>>> reading a rate sample that's 1 second old would be a lot less harmful
>>>> than performing an MBps calculation from a count sample that's 1
>>>> second old.
>>
>> I looked at that implementation. Please do note that the implementation
>> appears to be a PoC that does not handle the corner cases where issues may
>> arise. For example, when it reads the event values in the overflow handler
>> the rate is updated even if there was an error during the counter read.
>> The moment a counter read encounters an error it impacts the measured
>> rate so this will need more thought.
>>
>>>> Perhaps combining the per-second bandwidth rate cache with a
>>>> per-domain file for each MBM event to aggregate the data for all
>>>> groups will be the optimally-performing solution?
>>
>> I do not see a problem with exposing a per-domain file for each MBM event
>> that aggregates the data of all groups. For best accuracy I expect that
>> this file will be created on demand, querying hardware counters at the time
>> user space makes the request. This may indeed result in output like below
>> (naming used is just a sample for this discussion):
>>
>>          # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>>            <rdtgroup nameA> <MBM total count>
>>            <rdtgroup nameB> <MBM total count>
>>            <rdtgroup nameC> Error
>>            <rdtgroup nameD> Unavailable
>>            <rdtgroup nameE> Unassigned
>>            ...
>>
>> As I understand from your earlier description this essentially moves the
>> counter reading work from a user space thread to the kernel. There are
>> options to do this work, most disruptive can be done with a
>> smp_call_function_any() to read all the domain's counters from a CPU
>> in the domain without any preemption, less disruptive can be done
>> with smp_call_on_cpu(). Some cond_resched() could be included if
>> the number of events needing to be read starts impacting other parts of
>> the kernel. You already indicated that reading the counters from user space
>> takes 488usec, which is very far from what will trigger the softlockup
>> detector though.
>>
>>> Factoring ABMC into this, we'd have to decide the interval at which
>>> we're comfortable with cycling the available event counters through
>>> the group list in order to get valid MBps samples for every group
>>> often enough.
>>>
>>> A counter will have to stay assigned long enough to get two valid
>>> counts before an MBps value can be reported. If the regular MBps
>>> samples is what we want, maybe we just want a mode where the overflow
>>> workers would also drive the counter assignments too in order to
>>> produce regular samples from all groups.
>>
>> The assignable counter implementation currently depends on user space
>> assigning counters. In this design the events that do not have counters
>> assigned return "Unassigned". The "rate" value can also return
>> "Unassigned" in this existing design. "Unassigned" is one scenario that
>> needs to be handled, there is also hardware errors and MSR returning
>> "Unavailable".
>>
>> We can surely consider a new mode that does not allow user space to assign
>> counters but instead lets resctrl manage counter assignments to support
>> rate events when assignable counters are supported.
>>
>> I see a couple of features being discussed in parallel:
>> - A new per-domain file for each MBM event that aggregates the data of all groups.
>> - A new "rate" event built on top of user assigned counters.
>> - A new "rate" event built on top of resctrl assigned counters.
>>
>> Which best support the use case you have in mind?
> 
> After discussing with my users, it sounds like "all of the above".
> 
> They like the idea of resctrl automatically dealing with counter
> assignments for them, but they would also like to retain enough
> control to provide higher resolution data for groups that concern them
> more. The auto-assignment is nice in that they would get reliable
> bandwidth numbers on AMD with very little development effort and the
> assigning work will be done very efficiently, but eventually they will
> want more control.

It is not obvious to me what the expectations/assumptions are regarding
efficiency. One efficiency benefit I can think of is related to what you
suggested earlier where it is the overflow handler that can do the assignment.
By doing so resctrl can ensure counters are assigned at edges of time range being
measured by overflow handler.

"eventually they will want more control" sounds like there are more requirements
that we need to be made aware of and makes me hesitant to add complicated
automation for nothing.

> They also asked whether they would be able to switch between
> resctrl-assigned and user-assigned at runtime. I think the importance
> of this would depend on how efficient the mbm_assign_control interface
> ends up being. If it will necessarily result in floods of IPIs when
> cycling counters on a large system, they will be eager to not use it
> whenever they can avoid it.

It could be possible to use both (resctrl-assigned and user-assigned)
at the same time. For example, mbm_assign_control could use flag to indicate
which event needs to be counted and whether counter can be shared or not.
For example, we already have "t" for dedicated "MBM total", and there
could theoretically by "T" for shared "MBM total". A very basic
solution could be for "T" counters to be assigned for two runs
of the overflow handler and then round-robin to the next group of
"T" counters.

There are some sharp corners when implementing something like this.
Tony already pointed out one scenario where reading a "shared counter"
monitor group's mbm_local_bytes/mbm_total_bytes
becomes unreliable due to counter being reset at each re-assignment.

On the more complicated side resctrl could perhaps learn more about
how perf multiplexes the limited PMC.

I do think this will be an enhancement of existing counter assignment
that Babu is working on and I do not know how much your users know
about it or experimented with it. You already mentioned that the
users would want more control so it may be most flexible
to leave counter assignment to user space with the planned
mbm_assign_control interface?
 
> The rate assignment events are easier to deal with because they can
> simply be retrieved from memory without the caller needing to worry
> about what domain they're reading from, so I don't believe we will
> ever want to deal with cached count values paired with timestamps. On
> systems with assignable counters, I don't know how much of a problem
> the varying update rate will be. The aggregation files reading from
> memory are cheap to poll every second, so I don't know how big of an
> issue it is for most of the groups to report some sort of N/A-value
> for bandwidth most of the time. They won't have any difficulty
> remembering how long ago they last saw a valid mbps value because
> they're already histogramming all the metrics they watch.

This part seems most related to issue at hand. I understand that
an aggregate file with rate data is most ideal. You snipped my question
whether reading individual counters are required but thinking about this
more it may indeed also be useful to have a per monitor group rate file
containing the rate of that monitor group.

Would existing struct mbm_state.prev_bw be sufficient as a rate exposed to user
space? At this time it is only computed for consumption by the software controller
but we can explore calling mbm_bw_count() at every iteration of the overflow
handler irrespective of software controller.
Please note two things:
- Tony is already adding enhancement for bandwidth to be computed for total as well as
  local MBM bandwidth ([2])
- Computing running bandwidth irrespective of software controller will need mbm_bw_count()
  to be more robust since it (as Tony highligted in [1]) assumes that 
  the preceding counter read always succeeds. 

Reinette

[1] https://lore.kernel.org/all/ZzUvA2XE01U25A38@agluck-desk3/
[2] https://lore.kernel.org/all/20241029172832.93963-3-tony.luck@intel.com/