Message-ID: <CALPaoCg97cLVVAcacnarp+880xjsedEWGJPXhYpy4P7=ky4MZw@mail.gmail.com>
Date: Tue, 25 Feb 2025 18:11:44 +0100
From: Peter Newman <peternewman@...gle.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: "Moger, Babu" <bmoger@....com>, Dave Martin <Dave.Martin@....com>, Babu Moger <babu.moger@....com>,
corbet@....net, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, tony.luck@...el.com, x86@...nel.org,
hpa@...or.com, paulmck@...nel.org, akpm@...ux-foundation.org,
thuth@...hat.com, rostedt@...dmis.org, xiongwei.song@...driver.com,
pawan.kumar.gupta@...ux.intel.com, daniel.sneddon@...ux.intel.com,
jpoimboe@...nel.org, perry.yuan@....com, sandipan.das@....com,
kai.huang@...el.com, xiaoyao.li@...el.com, seanjc@...gle.com,
xin3.li@...el.com, andrew.cooper3@...rix.com, ebiggers@...gle.com,
mario.limonciello@....com, james.morse@....com, tan.shaopeng@...itsu.com,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
maciej.wieczor-retman@...el.com, eranian@...gle.com
Subject: Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)

Hi Reinette,

On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
<reinette.chatre@...el.com> wrote:
>
> Hi Peter,
>
> On 2/21/25 5:12 AM, Peter Newman wrote:
> > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> > <reinette.chatre@...el.com> wrote:
> >> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>> <reinette.chatre@...el.com> wrote:
> >>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>> <reinette.chatre@...el.com> wrote:
> >>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>> <reinette.chatre@...el.com> wrote:
> >>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>
> >>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>
> >>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>
> >>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>> <value>
> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>
> >>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>
> >>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>> is low enough to be of concern.
> >>>>>>>
> >>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>> for.
> >>>>>>
> >>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>
> >>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>> customers.
> >>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>
> >>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>> event-set for applying to a single counter rather than as individual
> >>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>> event names.
> >>>>
> >>>> Thank you for clarifying.
> >>>>
> >>>>>
> >>>>> In the letters as events model, choosing the events assigned to a
> >>>>> group wouldn't be enough information, since we would want to control
> >>>>> which events should share a counter and which should be counted by
> >>>>> separate counters. I think the amount of information that would need
> >>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>
> >>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>> writes in ABMC would look like...
> >>>>>
> >>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>
> >>>>> (per domain)
> >>>>> group 0:
> >>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>> group 1:
> >>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>> ...
> >>>>>
> >>>>
> >>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>> example and above the counter configuration appears to be global. You do mention
> >>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>> configuration is a requirement?
> >>>
> >>> If it's global and we want a particular group to be watched by more
> >>> counters, I wouldn't want this to result in allocating more counters
> >>> for that group in all domains, or allocating counters in domains where
> >>> they're not needed. I want to encourage my users to avoid allocating
> >>> monitoring resources in domains where a job is not allowed to run so
> >>> there's less pressure on the counters.
> >>>
> >>> In Dave's proposal it looks like global configuration means
> >>> globally-defined "named counter configurations", which works because
> >>> it's really per-domain assignment of the configurations to however
> >>> many counters the group needs in each domain.
> >>
> >> I think I am becoming lost. Would a global configuration not break your
> >> view of "event-set applied to a single counter"? If a counter is configured
> >> globally then it would not make it possible to support the full configurability
> >> of the hardware.
> >> Before I add more confusion, let me try with an example that builds on your
> >> earlier example copied below:
> >>
> >>>>> (per domain)
> >>>>> group 0:
> >>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>> group 1:
> >>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>> ...
> >>
> >> Since the above states "per domain" I rewrite the example to highlight that as
> >> I understand it:
> >>
> >> group 0:
> >> domain 0:
> >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 1: VictimBW,LclNTWr,RmtNTWr
> >> domain 1:
> >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >> domain 0:
> >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 3: VictimBW,LclNTWr,RmtNTWr
> >> domain 1:
> >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>
> >> You mention that you do not want counters to be allocated in domains that they
> >> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >> in domain 1, resulting in:
> >>
> >> group 0:
> >> domain 0:
> >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >> domain 0:
> >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 3: VictimBW,LclNTWr,RmtNTWr
> >> domain 1:
> >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>
> >> With counter 0 and counter 1 available in domain 1, these counters could
> >> theoretically be configured to give group 1 more data in domain 1:
> >>
> >> group 0:
> >> domain 0:
> >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 1: VictimBW,LclNTWr,RmtNTWr
> >> group 1:
> >> domain 0:
> >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> counter 3: VictimBW,LclNTWr,RmtNTWr
> >> domain 1:
> >> counter 0: LclFill,RmtFill
> >> counter 1: LclNTWr,RmtNTWr
> >> counter 2: LclSlowFill,RmtSlowFill
> >> counter 3: VictimBW
> >>
> >> The counters are shown with different per-domain configurations that seem to
> >> match with earlier goals of (a) choose events counted by each counter and
> >> (b) do not allocate counters in domains where they are not needed. As I
> >> understand the above does contradict global counter configuration though.
> >> Or do you mean that only the *name* of the counter is global and then
> >> that it is reconfigured as part of every assignment?
> >
> > Yes, I meant only the *name* is global. I assume based on a particular
> > system configuration, the user will settle on a handful of useful
> > groupings to count.
> >
> > Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >
> > # define global configurations (in ABMC terms), not necessarily in this
> > # syntax and probably not in the mbm_assign_control file.
> >
> > r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > w=VictimBW,LclNTWr,RmtNTWr
> >
> > # legacy "total" configuration, effectively r+w
> > t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >
> > /group0/0=t;1=t
> > /group1/0=t;1=t
> > /group2/0=_;1=t
> > /group3/0=rw;1=_
> >
> > - group2 is restricted to domain 1
> > - group3 is restricted to domain 0
> > - the rest are unrestricted
> > - In group3, we decided we need to separate read and write traffic
> >
> > This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >
>
> I see. Thank you for the example.
>
> resctrl supports per-domain configurations with the following possible when
> using mbm_total_bytes_config and mbm_local_bytes_config:
>
> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>
> /group0/0=t;1=t
> /group1/0=t;1=t
>
> Even though the flags are identical in all domains, the assigned counters will
> be configured differently in each domain.
>
> With this supported by hardware and currently also supported by resctrl it seems
> reasonable to carry this forward to what will be supported next.

The hardware supports both a per-domain mode, where all groups in a
domain use the same configurations and are limited to two events per
group, and a per-group mode, where every group can be configured and
assigned freely. This series uses the legacy counter access mode,
where only counters whose BwType matches an instance of QOS_EVT_CFG_n
in the domain can be read. If we chose to read the assigned counter
directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
rather than asking the hardware to find the counter by RMID, we would
not be limited to 2 counters per group/domain and the hardware would
have the same flexibility as on MPAM.

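For reference, here is a rough Python sketch of how the r/w/t
configurations from my earlier example would look as BwType bitmasks.
The bit positions are an assumption on my part (taken to follow the
bit layout documented for mbm_total_bytes_config) and should be checked
against the spec; BW_TYPE_BITS, bw_type_mask() and CONFIGS are
hypothetical names used only for illustration.

```python
# Assumed BwType bit positions (hypothetical mapping; verify against the spec).
BW_TYPE_BITS = {
    "LclFill": 0,
    "RmtFill": 1,
    "LclNTWr": 2,
    "RmtNTWr": 3,
    "LclSlowFill": 4,
    "RmtSlowFill": 5,
    "VictimBW": 6,
}


def bw_type_mask(events):
    """Return the bitmask for a comma-separated list of event names."""
    mask = 0
    for name in events.split(","):
        mask |= 1 << BW_TYPE_BITS[name.strip()]
    return mask


# Named configurations from the example: reads, writes, legacy "total".
CONFIGS = {
    "r": bw_type_mask("LclFill,RmtFill,LclSlowFill,RmtSlowFill"),
    "w": bw_type_mask("VictimBW,LclNTWr,RmtNTWr"),
    "t": bw_type_mask(
        "LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr"
    ),
}

for flag, mask in CONFIGS.items():
    print(f"{flag} = {mask:#04x}")
```

Under that assumed layout, r and w do not overlap and together cover
the same events as t, which is why t is "effectively r+w".
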
(I might have said something confusing in my last messages because I
had forgotten that I switched to the extended assignment mode when
prototyping with soft-ABMC and MPAM.)

Forcing all groups on a domain to share the same 2 counter
configurations would not be acceptable for us, as the example I gave
earlier is one I've already been asked about.

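To make the counter arithmetic in that example concrete, here is a
minimal Python sketch that parses the assignment lines I quoted and
counts the counters consumed in each domain. It assumes that every
flag letter costs one counter and that "_" means no assignment; the
helper names are hypothetical and this is not how the kernel parses
mbm_assign_control.

```python
from collections import Counter

# Assignment lines from the earlier example.
ASSIGNMENTS = [
    "/group0/0=t;1=t",
    "/group1/0=t;1=t",
    "/group2/0=_;1=t",
    "/group3/0=rw;1=_",
]


def parse_assignment(line):
    """Split '<group path>/<dom>=<flags>;...' into (group, {dom: flags})."""
    group, _, domains = line.rpartition("/")
    flags_by_domain = {}
    for entry in domains.split(";"):
        domain_id, _, flags = entry.partition("=")
        # "_" marks a domain where the group has no counters assigned.
        flags_by_domain[int(domain_id)] = "" if flags == "_" else flags
    return group, flags_by_domain


def counters_per_domain(lines):
    """Count counters used per domain, assuming one counter per flag."""
    usage = Counter()
    for line in lines:
        _, flags_by_domain = parse_assignment(line)
        for domain_id, flags in flags_by_domain.items():
            usage[domain_id] += len(flags)
    return dict(usage)


print(counters_per_domain(ASSIGNMENTS))  # {0: 4, 1: 3}
```

That matches the 4 counters in domain 0 and 3 counters in domain 1
mentioned with the example.
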
I'm worried about requiring support for domain-level
mbm_total_bytes_config and mbm_local_bytes_config files to be carried
forward, because this conflicts with the configuration being per
group/domain. (i.e., what would be read back from the domain files if
per-group customizations have already been applied?)

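A small Python sketch of the readback problem, purely as an
illustration: the state table and domain_readback() are hypothetical,
and the flag strings reuse the r/w/t names from my earlier example. A
domain-level file can only report one value per domain, so it has
nothing sensible to return once groups in that domain differ.

```python
# Hypothetical per-group/domain state: (group, domain) -> assigned flags.
state = {
    ("group0", 0): "t",
    ("group1", 0): "t",
    ("group3", 0): "rw",
    ("group0", 1): "t",
    ("group1", 1): "t",
    ("group2", 1): "t",
}


def domain_readback(per_group_state, domain_id):
    """Return the single configuration shared by all groups in a domain,
    or None when the groups disagree and no single value exists."""
    values = {
        flags
        for (_group, dom), flags in per_group_state.items()
        if dom == domain_id
    }
    if len(values) == 1:
        return values.pop()
    return None


print(domain_readback(state, 0))  # None: group3 differs from the others
print(domain_readback(state, 1))  # t
```
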
>
> >>
> >>>> Until now I viewed counter configuration separate from counter assignment,
> >>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and
> >>>> mbm_local_bytes_config before they are assigned. That is still per-domain
> >>>> counter configuration though, not per-counter.
> >>>>
> >>>>> I assume packing all of this info for a group's desired counter
> >>>>> configuration into a single line (with 32 domains per line on many
> >>>>> dual-socket AMD configurations I see) would be difficult to look at,
> >>>>> even if we could settle on a single letter to represent each
> >>>>> universally.
> >>>>>
> >>>>>>
> >>>>>> My goal is for resctrl to have a user interface that can as much as possible
> >>>>>> be ready for whatever may be required from it years down the line. Of course,
> >>>>>> I may be wrong and resctrl would never need to support more than 26 events per
> >>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events
> >>>>>> and how could resctrl support that?
> >>>>>>
> >>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier
> >>>>>> the interface I used as demonstration may become unwieldy to parse on a system
> >>>>>> with many domains that supports many events. This is a concern for me. Any suggestions
> >>>>>> will be appreciated, especially from you since I know that you are very familiar with
> >>>>>> issues related to large scale use of resctrl interfaces.
> >>>>>
> >>>>> It's mainly just the unwieldiness of all the information in one file.
> >>>>> It's already at the limit of what I can visually look through.
> >>>>
> >>>> I agree.
> >>>>
> >>>>>
> >>>>> I believe that shared assignments will take care of all the
> >>>>> high-frequency and performance-intensive batch configuration updates I
> >>>>> was originally concerned about, so I no longer see much benefit in
> >>>>> finding ways to textually encode all this information in a single file
> >>>>> when it would be more manageable to distribute it around the
> >>>>> filesystem hierarchy.
> >>>>
> >>>> This is significant. The motivation for the single file was to support
> >>>> the "high-frequency and performance-intensive" usage. Would "shared assignments"
> >>>> not also depend on the same files that, if distributed, will require many
> >>>> filesystem operations?
> >>>> Having the files distributed will be significantly simpler while also
> >>>> avoiding the file size issue that Dave Martin exposed.
> >>>
> >>> The remaining filesystem operations will be assigning or removing
> >>> shared counter assignments in the applicable domains, which would
> >>> normally correspond to mkdir/rmdir of groups or changing their CPU
> >>> affinity. The shared assignments are more "program and forget", while
> >>> the exclusive assignment approach requires updates for every counter
> >>> (in every domain) every few seconds to cover a large number of groups.
> >>>
> >>> When they want to pay extra attention to a particular group, I expect
> >>> they'll ask for exclusive counters and leave them assigned for a while
> >>> as they collect extra data.
> >>
> >> The single file approach is already unwieldy. The demands that will be
> >> placed on it to support the usages currently being discussed would make this
> >> interface even harder to use and manage. If the single file is not required
> >> then I think we should go back to smaller files distributed in resctrl.
> >> This may not even be an either/or argument. One way to view mbm_assign_control
> >> could be as a way for user to interact with the distributed counter
> >> related files with a single file system operation. Although, without
> >> knowing how counter configuration is expected to work this remains unclear.
> >
> > If we do both interfaces and the multi-file model gives us more
> > capability to express configurations, we could find situations where
> > there are configurations we cannot represent when reading back from
> > mbm_assign_control, or updates through mbm_assign_control have
> > ambiguous effects on existing configurations which were created with
> > other files.
>
> Right. My assumption was that the syntax would be identical.
>
> >
> > However, the example I gave above seems to be adequately represented
> > by a minor extension to mbm_assign_control and we all seem to
>
> To confirm what you mean with "minor extension to mbm_assign_control",
> is this where the flags are associated with counter configurations? At this
> time this is done separately from mbm_assign_control with the hardcoded "t"
> and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes_config
> respectively. I think it would be simpler to keep these configurations
> separate from mbm_assign_control. How it would look without better
> understanding of MPAM is not clear to me at this time, unless the
> requirement is to enhance support for ABMC and BMEC. I do see that
> this can be added later to build on what is supported by mbm_assign_control
> with the syntax in this version.

As I explained above, I was looking at this from the perspective of
the extended event assignment mode.

Thanks,
-Peter