Message-ID: <CALPaoCgfi8qgJi_yijJ5u933faMvCZCjdBOGCKxOd4uqwEJhyw@mail.gmail.com>
Date: Fri, 30 Jan 2026 14:04:29 +0100
From: Peter Newman <peternewman@...gle.com>
To: Ben Horgan <ben.horgan@....com>
Cc: James Morse <james.morse@....com>, amitsinght@...vell.com, baisheng.gao@...soc.com,
baolin.wang@...ux.alibaba.com, carl@...amperecomputing.com,
dave.martin@....com, david@...nel.org, dfustini@...libre.com,
fenghuay@...dia.com, gshan@...hat.com, jonathan.cameron@...wei.com,
kobak@...dia.com, lcherian@...vell.com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, punit.agrawal@....qualcomm.com,
quic_jiles@...cinc.com, reinette.chatre@...el.com, rohit.mathew@....com,
scott@...amperecomputing.com, sdonthineni@...dia.com,
tan.shaopeng@...itsu.com, xhao@...ux.alibaba.com, catalin.marinas@....com,
will@...nel.org, corbet@....net, maz@...nel.org, oupton@...nel.org,
joey.gouly@....com, suzuki.poulose@....com, kvmarm@...ts.linux.dev
Subject: Re: [PATCH v3 29/47] arm_mpam: resctrl: Pick classes for use as mbm counters

Hi Ben,

On Mon, Jan 26, 2026 at 5:00 PM Ben Horgan <ben.horgan@....com> wrote:
>
> Hi Peter, James,
>
> On 1/19/26 12:47, Peter Newman wrote:
> > Hi James,
> >
> > On Mon, Jan 19, 2026 at 1:04 PM James Morse <james.morse@....com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 15/01/2026 15:49, Peter Newman wrote:
> >>> On Mon, Jan 12, 2026 at 6:02 PM Ben Horgan <ben.horgan@....com> wrote:
> >>>> From: James Morse <james.morse@....com>
> >>>>
> >>>> resctrl has two types of counters, NUMA-local and global. MPAM has only
> >>>> bandwidth counters, but the position of the MSC may mean it counts
> >>>> NUMA-local, or global traffic.
> >>>>
> >>>> But the topology information is not available.
> >>>>
> >>>> Apply a heuristic: if the L2 or L3 supports bandwidth monitors, these are
> >>>> probably NUMA-local. If the memory controller supports bandwidth monitors,
> >>>> they are probably global.
> >>
> >>> Are remote memory accesses not cached? How do we know an MBWU monitor
> >>> residing on a cache won't count remote traffic?
> >>
> >> It will, yes you get double counting. Is forbidding both mbm_total and mbm_local preferable?
> >>
> >> I think this comes from 'total' in mbm_total not really having the obvious meaning of the
> >> word:
> >> Suppose NUMA-A has CPUs but no memory controllers, and NUMA-B has no CPUs but all the
> >> memory controllers.
> >> With MPAM: we've only got one bandwidth counter, it doesn't know where the traffic goes
> >> after the MSC. mbm-local on the L3 would reflect all the bandwidth, and mbm-total on the
> >> memory-controllers would have the same number.
> >> I think on x86 mbm_local on the CPUs would read zero as zero traffic went to the 'local'
> >> memory controller, and mbm_total would reflect all the memory bandwidth. (so 'total'
> >> really means 'other')
> >
> > Our software is going off the definition from the Intel SDM:
> >
> > "This event monitors the L3 external bandwidth satisfied by the local
> > memory. In most platforms that support this event, L3 requests are
> > likely serviced by a memory system with non-uniform memory
> > architecture. This allows bandwidth to off-package memory resources to
> > be tracked by subtracting local from total bandwidth (for instance,
> > bandwidth over QPI to a memory controller on another physical
> > processor could be tracked by subtraction)."
>
> Indeed we should base our discussion on the event definition in the
> Intel SDM. For our reference, the description for the external bandwidth
> monitoring event (mbm_total) is:
>
> "This event monitors the L3 total external bandwidth to the next level
> of the cache hierarchy, including all demand and prefetch misses from
> the L3 to the next hierarchy of the memory system. In most platforms,
> this represents memory bandwidth."
>
> >
> > On NUMA-capable hardware that can support this event where all memory
> > is local, mbm_local == mbm_total, but in practice you can't read them
> > at the same time from userspace, so if you read mbm_total first,
> > you'll probably get a small negative result for remote bandwidth.
> >
> >>
> >> I think what MPAM is doing here is still useful as a system normally has both CPUs and
> >> memory controllers in the NUMA nodes, and you can use this to spot a control/monitor group
> >> on a NUMA-node that is hammering all the memory (outlier mbm_local), or the same where a
> >> NUMA-node's memory controller is getting hammered by all the NUMA nodes (outlier
> >> mbm_total)
> >>
> >> I've not heard of a platform with both memory bandwidth monitors at L3 and the memory
> >> controller, so this may be a theoretical issue.
> >>
> >> Shall we only expose one of mbm-local/total to prevent this being seen by user-space?
> >
> > I believe in the current software design, MPAM is only able to support
> > mbm_total, as an individual MSC (or class of MSCs with the same
> > configuration) can't separate traffic by destination, so it must be
> > the combined value. On a hardware design where MSCs were placed such
> > that one only counts local traffic and another only counts remote, the
> > resctrl MPAM driver would have to understand the hardware
> > configuration well enough to be able to produce counts following
> > Intel's definition of mbm_local and mbm_total.
>
> On a system with MSC measuring memory bandwidth on the L3 caches these
> MSC will measure all bandwidth to the next level of the memory hierarchy
> which matches the definition of mbm_total. (We assume any MSC on an L3
> is at the egress even though ACPI/DT doesn't distinguish ingress and
> egress.)
>
> MSCs on memory controllers don't distinguish which L3 cache the traffic
> came from, so unless there is a single L3 we can't use these memory
> bandwidth monitors, as they count neither mbm_local nor mbm_total. When
> there is a single L3 (and no higher level caches), they would match both
> mbm_total and mbm_local.
The text you quoted from Intel was in the context of the L3. I assume
if such an event were implemented at a different level of the memory
system, it would continue to refer to downstream bandwidth.
>
> Hence, I agree we should just use mbm_total and update the heuristics
> such that if the MSC are at the memory only consider them if there are
> no higher caches and a single L3.
That should be ok for now. If I see a system where this makes MBWU
counters inaccessible, we'll continue the discussion then.
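
As a rough sketch of the heuristic as I understand it from the above
(every name here is hypothetical and nothing is taken from the arm_mpam
driver): an L3 class with MBWU monitors is assumed to count at the
egress and can back mbm_total, while a memory-controller class only
qualifies when there is a single L3 and no cache above it.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; not the arm_mpam structures. */
enum msc_level { MSC_AT_L3, MSC_AT_MEMORY, MSC_AT_OTHER };

struct mon_class {
	enum msc_level level;
	bool has_mbwu;		/* class has bandwidth (MBWU) monitors */
};

struct topo {
	unsigned int nr_l3;	/* number of L3 caches in the system */
	bool cache_above_l3;	/* any cache between the L3 and memory */
};

/* Can this class back mbm_total under the heuristic discussed above? */
static bool class_usable_for_mbm_total(const struct mon_class *c,
				       const struct topo *t)
{
	if (!c->has_mbwu)
		return false;

	switch (c->level) {
	case MSC_AT_L3:
		/* Assumed to sit at the L3 egress: sees all downstream traffic. */
		return true;
	case MSC_AT_MEMORY:
		/* Cannot attribute traffic to an L3 unless there is only one. */
		return t->nr_l3 == 1 && !t->cache_above_l3;
	default:
		return false;
	}
}
```

With this shape, a multi-L3 system whose only MBWU monitors are at the
memory controllers ends up with no usable class, which is the case
where the discussion would need to be reopened.
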
>
> The introduction of ABMC muddies the waters as the "event_filter" file
> defines the meaning of mbm_local and mbm_total. In order to handle this
> file properly with MPAM, fs/resctrl changes are needed. We could either
> make "event_filter" show the bits that correspond to the mbm counter and
> unchangeable or decouple the "event_filter" part of ABMC from the
> counter assignment part. As more work is needed to avoid breaking the ABI here,
> I'll drop the ABMC patches from the next respin of this series.
I would prefer it if you could just leave out the event_filter or make it
unconfigurable on MPAM. The rest of the counter assignment seems to
work well.

Longer term, the event_filter interface is supposed to give us the
ability to define and name our own counter events, but we'll have to
find a way past the decision to define the event filters in terms
copy-pasted from an AMD manual.

Thanks,
-Peter