[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <04e47d0e-6447-451e-98e4-7ea65187d370@amd.com>
Date: Mon, 10 Mar 2025 20:44:36 -0500
From: "Moger, Babu" <bmoger@....com>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: babu.moger@....com, Peter Newman <peternewman@...gle.com>,
"Chatre, Reinette" <reinette.chatre@...el.com>,
Dave Martin <Dave.Martin@....com>, corbet@....net, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, paulmck@...nel.org, akpm@...ux-foundation.org,
thuth@...hat.com, rostedt@...dmis.org, xiongwei.song@...driver.com,
pawan.kumar.gupta@...ux.intel.com, daniel.sneddon@...ux.intel.com,
jpoimboe@...nel.org, perry.yuan@....com, sandipan.das@....com,
kai.huang@...el.com, xiaoyao.li@...el.com, seanjc@...gle.com,
xin3.li@...el.com, andrew.cooper3@...rix.com, ebiggers@...gle.com,
mario.limonciello@....com, james.morse@....com, tan.shaopeng@...itsu.com,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
maciej.wieczor-retman@...el.com, eranian@...gle.com
Subject: Re: [PATCH v11 00/23] x86/resctrl : Support AMD Assignable Bandwidth
Monitoring Counters (ABMC)
Hi Tony,
On 3/10/2025 6:22 PM, Luck, Tony wrote:
> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>> Hi Peter,
>>>
>>> On 3/5/25 04:40, Peter Newman wrote:
>>>> Hi Babu,
>>>>
>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@....com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@....com> wrote:
>>>>>>>
>>>>>>> Hi Peter/Reinette,
>>>>>>>
>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@....com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>> Hi Reinette,
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@...el.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>
>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>
>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>
>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>
>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>
>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>
>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>
>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>
>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>
>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>
>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>
>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>
>>>>>>>>> It is documented below.
>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>
>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>
>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>> just two.
>>>>>>>>>
>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>
>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>> using it as is.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>
>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>
>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>
>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>> supported.
>>>>>>>>
>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>
>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>
>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>
>>>>>>> Here’s what we’ve learned so far:
>>>>>>>
>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>> read, write, victimBM, etc.). This is also called shared counter
>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>> evolves.
>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>
>>>>>>>
>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>
>>>>>>> # define event configurations
>>>>>>>
>>>>>>> ========================================================
>>>>>>> Bits Mnemonics Description
>>>>>>> ==== ========================================================
>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>> 1 mtFill Reads to memory in the non-local NUMA domain
>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>> ==== ========================================================
>>>>>>>
>>>>>>> #Define flags based on combination of above event types.
>>>>>>>
>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>> v = VictimBW
>>>>>>>
>>>>>>> Peter suggested the following format earlier :
>>>>>>>
>>>>>>> /group0/0=t;1=t
>>>>>>> /group1/0=t;1=t
>>>>>>> /group2/0=_;1=t
>>>>>>> /group3/0=rw;1=_
>>>>>>
>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>> and cleaner to implement.
>>>>>>
>>>>>> Roughly what I had in mind:
>>>>>>
>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>> names for the assignable configurations rather than being restricted
>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>> we can specify the set of events the config should represent. I think
>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>> values. Moving forward we could come up with portable names for common
>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>> want specific events and don't care about portability.
>>>>>
>>>>>
>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>
>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>
>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>
>>>>> How many configurations should we allow? Do we know?
>>>>
>>>> Do we need an upper limit?
>>>
>>> I think so. This needs to be maintained in some data structure. We can
>>> start with 2 default configurations for now.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>> configure all domains the same for a group.
>>>>>
>>>>> What is the difference between shared and exclusive?
>>>>
>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>> each domain will be scheduled round-robin to the groups requesting
>>>> shared access to a counter. In my tests, I assigned the counters long
>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>> aggregation files[2].
>>>>
>>>> These do not need to be implemented immediately, but knowing that they
>>>> work addresses the overhead and scalability concerns of reassigning
>>>> counters and reading their values.
>>>
>>> Ok. Lets focus on exclusive assignments for now.
>>>
>>>>
>>>>>
>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>
>>>>> There should be a more efficient way to handle this.
>>>>>
>>>>> Initially, we started with a group-level file for this interface, but it
>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>
>>>> I had rejected it due to the high-frequency of access of a large
>>>> number of files, which has since been addressed by shared assignment
>>>> (or automatic reassignment) and aggregated mbps files.
>>>
>>> I think we should address this as well. Creating three extra files for
>>> each group isn’t ideal when there are more efficient alternatives.
>>>
>>>>
>>>>>
>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>
>>>>> That was another problem we need to address.
>>>>
>>>> This is not a requirement I was aware of. If the user forgot where
>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>> can read multiple sysfs nodes to remind themselves.
>>>
>>> I suggest, we should provide users with an option to list the assignments
>>> of all groups in a single command. As the number of groups increases, it
>>> becomes cumbersome to query each group individually.
>>>
>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>> for this purpose. More details on this below.
>>>
>>>>>
>>>>>
>>>>>>
>>>>>> The configuration names listed in assign_* would result in files of
>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>> which the count values can be read.
>>>>>>
>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> LclFill
>>>>>> LclNTWr
>>>>>> LclSlowFill
>>>>>
>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>
>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>> only looking at struct kernfs_syscall_ops
>>>>
>>>>>
>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>> LclFill <-rename these to generic names.
>>>>> LclNTWr
>>>>> LclSlowFill
>>>>>
>>>>
>>>> I think portable and non-portable event names should both be available
>>>> as options. There are simple bandwidth measurement mechanisms that
>>>> will be applied in general, but when they turn up an issue, it can
>>>> often lead to a more focused investigation, requiring more precise
>>>> events.
>>>
>>> I aggree. We should provide both portable and non-portable event names.
>>>
>>> Here is my draft proposal based on the discussion so far and reusing some
>>> of the current interface. Idea here is to start with basic assigment
>>> feature with options to enhance it in the future. Feel free to
>>> comment/suggest.
>>>
>>> 1. Event configurations will be in
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>
>>> There will be two pre-defined configurations by default.
>>>
>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>> LclFill, LclNTWr,LclSlowFill,VictimBM,RmtSlowFill,LclSlowFill,RmtFill
>>>
>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>> LclFill, LclNTWr, LclSlowFill
>>>
>>> 2. Users will have options to update these configurations.
>>>
>>> #echo "LclFill, LclNTWr, RmtFill" >
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>
> This part seems odd to me. Now the "mbm_local_bytes" files aren't
> reporting "local_bytes" any more. They report something different,
> and users only know if they come to check the options currently
> configured in this file. Changing the contents without changing
> the name seems confusing to me.
It is the same behaviour right now with BMEC. It is configurable.
By default it is mbm_local_bytes, but users can configure whatever they
want to monitor using /info/L3_MON/mbm_local_bytes_config.
We can continue the same behaviour with ABMC, but the configuration will
be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>
>>>
>>> # #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>> LclFill, LclNTWr, RmtFill
>>>
>>> 3. The default configurations will be used when user mounts the resctrl.
>>>
>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>> mkdir /sys/fs/resctrl/test/
>>>
>>> 4. The resctrl group/domains can be in one of these assingnment states.
>>> e: Exclusive
>>> s: Shared
>>> u: Unassigned
>>>
>>> Exclusive mode is supported now. Shared mode will be supported in the
>>> future.
>>>
>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> to list the assignment state of all the groups.
>>>
>>> Format:
>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> test//mbm_total_bytes:0=e;1=e
>>> test//mbm_local_bytes:0=e;1=e
>>> //mbm_total_bytes:0=e;1=e
>>> //mbm_local_bytes:0=e;1=e
>>>
>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>
>>> Format:
>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>
>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> test//mbm_total_bytes:0=u;1=u
>>> test//mbm_local_bytes:0=u;1=u
>>> //mbm_total_bytes:0=e;1=e
>>> //mbm_local_bytes:0=e;1=e
>>>
>>> The corresponding events will be read in
>>>
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>
>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>> mbm_local_bytes) will be supported.
>>>
>>> 8. In the future, there will be options to create multiple configurations
>>> and corresponding directory will be created in
>>> /sysf/fs/resctrl/test/mon_data/mon_L3_00/<configation name>.
>
> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
> directory? Like this:
>
> # echo "LclFill, LclNTWr, RmtFill" >
> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>
> This seems OK (dependent on the user picking meaningful names for
> the set of attributes picked ... but if they want to name this
> monitor file "brian" then they have to live with any confusion
> that they bring on themselves).
>
> Would this involve an extension to kernfs? I don't see a function
> pointer callback for file creation in kernfs_syscall_ops.
>
>>>
>>
>> I know you are all busy with multiple series going on parallel. I am still
>> waiting for the inputs on this. It will be great if you can spend some time
>> on this to see if we can find common ground on the interface.
>>
>> Thanks
>> Babu
>
> -Tony
>
thanks
Babu
Powered by blists - more mailing lists