Message-ID: <CALPaoCioRrjwZPYDdkAApHAecqZVA_Z4rLctjmcpEaV04eq9Ag@mail.gmail.com>
Date: Thu, 7 Nov 2024 12:01:50 +0100
From: Peter Newman <peternewman@...gle.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: fenghua.yu@...el.com, babu.moger@....com, bp@...en8.de, 
	dave.hansen@...ux.intel.com, eranian@...gle.com, hpa@...or.com, 
	james.morse@....com, linux-kernel@...r.kernel.org, mingo@...hat.com, 
	nert.pinx@...il.com, tan.shaopeng@...itsu.com, tglx@...utronix.de, 
	tony.luck@...el.com, x86@...nel.org
Subject: Re: [PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads

Hi Reinette,

On Thu, Nov 7, 2024 at 2:10 AM Reinette Chatre
<reinette.chatre@...el.com> wrote:
>
> Hi Peter,
>
> On 11/6/24 7:43 AM, Peter Newman wrote:
> > Performance-conscious users may use threads bound to CPUs within a
> > specific monitoring domain to ensure that all bandwidth counters can be
> > read efficiently. The hardware counters are only accessible to CPUs
> > within the domain, so requests from CPUs outside the domain are
> > forwarded to a kernel worker or IPI handler, incurring a substantial
> > performance penalty on each read.  Recently, this penalty was observed
> > to be paid by local reads as well.
> >
> > To support blocking implementations of resctrl_arch_rmid_read(),
> > mon_event_read() switched to smp_call_on_cpu() in most cases to read
> > event counters using a kernel worker thread. Unlike
> > smp_call_function_any(), which optimizes to a local function call when
> > the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
> > work unconditionally.
> >
> > Introduce resctrl_arch_event_read_blocks() to allow the implementation
> > to indicate whether reading a particular event counter blocks. Use this
> > to limit the usage of smp_call_on_cpu() to only the counters where it is
> > actually needed. This reverts to the previous behavior of always using
> > smp_call_function_any() for all x86 implementations.
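> >
> > Roughly, the resulting dispatch in mon_event_read() looks like the
> > following sketch (variable and helper names here are approximate, not
> > the exact code):
> >
> >	if (resctrl_arch_event_read_blocks(rr->evtid))
> >		/* may sleep: run in a kernel worker on a domain CPU */
> >		smp_call_on_cpu(cpumask_any(cpumask), smp_mon_event_count,
> >				rr, false);
> >	else
> >		/* never sleeps: runs locally when this CPU is in cpumask */
> >		smp_call_function_any(cpumask, mon_event_count, rr, 1);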
> >
> > This is significant when supporting configurations such as a dual-socket
> > AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
> > counters for all groups on all domains requires 16384 (32*256*2) counter
> > reads. The resolution of global, per-group MBM data which can be
> > provided is therefore sensitive to the cost of each counter read.
> > Furthermore, redirecting this much work to IPI handlers or worker
> > threads at a regular interval is disruptive to the present workload.
> >
> > The test program fastcat, which was introduced in an earlier patch, was
> > used to simulate the impact of this change on an optimized event
> > counter-reading procedure. The goal is to maximize the frequency at
> > which MBM counters can be dumped, so the benchmark determines the cost
> > of an additional global MBM counter sample.
> >
> > The total number of cycles needed to read all local and total MBM
>
> This optimization proposal aims to reduce the number of cycles used by
> the code that is eventually called after user space reads from a file.
> As you already noted, this causes a regression in this exact optimization area
> in one out of the four scenarios tested.
>
> As I understand it, this is a slow path, and there are many things that can
> impact the number of cycles such a query takes.
>
> I wonder if we could instead shift focus from shaving cycles off a slow path
> (after obtaining a mutex ... ) to understanding what the use case is, so that
> resctrl could perhaps gain a better interface that supports the use case
> better overall?

Sounds great.

>
> > counters for a large number of monitoring groups was collected using the
> > perf tool. The average over 100 iterations is given, with a 1-second
> > sleep between iterations to better represent the intended use case. The
> > test was run bound to the CPUs of a single MBM domain, once targeting
> > counters in the local domain and again for counters in a remote domain.
>
> This sounds as though user space is essentially duplicating what the
> MBM overflow handler currently does, which is to run a worker in each domain
> to collect MBM data every second from every RMID for both MBM events.
>
> * What are the requirements of this use case?

Accurate, per-RMID MBps data, ideally at 1-second resolution if the
overhead is tolerable.

> * Is there some benefit to user space in reading individual counters?

The interface is available today and can retrieve the data with
somewhat acceptable performance. After applying this change (or
selecting a kernel version before the MPAM changes), call-graph
profiling showed that the remaining overhead of reading one counter at
a time from userspace on AMD Zen2 was around 20% when each series of
250 local reads is spaced 1 second apart.

With the 1.22M-cycle figure I quoted below for reading a single domain
from userspace, this comes out to 488 usec at 2.5 GHz, which is manageable.
Even if userspace serializes all of its per-domain reads manually to
avoid contention on the global rdtgroup_mutex, the results should be
consistent as long as the domains are always serialized in the same
order.
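
For concreteness, a minimal sketch of that per-counter reading pattern
(the group name and domain number below are made up; the path layout is
the standard resctrl mon_data one):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one MBM counter file and return its value, or -1 on error. */
static long long read_counter(const char *path)
{
	char buf[64];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	return atoll(buf);
}

int main(void)
{
	/* One local-domain read; a real tool would loop over all groups
	 * and domains, serialized in a fixed order. */
	long long v = read_counter("/sys/fs/resctrl/mon_groups/g0/"
				   "mon_data/mon_L3_00/mbm_total_bytes");

	if (v >= 0)
		printf("mbm_total_bytes = %lld\n", v);
	return 0;
}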

(apologies that my figures focus on AMD, but its high MBM domain
counts make it the most sensitive to read performance)

Also, if all of the counter-reading work is performed directly by the
thread, the time spent collecting the information is easier to
attribute to the management software rather than appearing as an
increase in kernel overhead.

Clearly not optimal, but an acceptable baseline.

> * Would it perhaps be acceptable to provide user space with a cached snapshot
>   of all the MBM counters in a single query?
>
> User space can use a single read to obtain values of an event for all RMIDs
> on a domain without a single IPI if the architectural state from the overflow handler
> is exposed. It could also be possible to present data from all domains in that single
> read.
>
> One complication I can think of is that data from the different domains may have
> been collected up to a second apart. Is this something this use case can tolerate?

This +/- 1-second drift would apply to the denominator of the MBps
calculation, so it could cause some very large errors. To produce
accurate MBps numbers from cached MBM count data, each sample would
need to be accompanied by a timestamp.
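
As a rough illustration of why (struct and field names here are
hypothetical):

#include <stdint.h>

struct mbm_sample {
	uint64_t bytes;	/* cumulative event count */
	uint64_t ns;	/* when the count was actually read */
};

/* Bandwidth between two timestamped samples. Without the real
 * timestamps, a +/- 1-second error in the assumed interval is a large
 * fraction of a 1-second sampling period. */
static double sample_mbps(struct mbm_sample prev, struct mbm_sample cur)
{
	double secs = (cur.ns - prev.ns) / 1e9;

	return (cur.bytes - prev.bytes) / secs / 1e6;
}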

>
>
> For example,
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_00
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_snapshot/mbm_total_bytes_01
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
> If the use case cannot tolerate data up to a second old then resctrl could add new files
> in info/L3_MON that will take the mutex once and trigger a *single* IPI to a CPU
> from each domain and read all events sequentially (which is essentially mbm_overflow()).
>
> For example,
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_00
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
>         # cat /sys/fs/resctrl/info/L3_MON/mbm_current/mbm_total_bytes_01
>           <rdtgroup nameA> <MBM total count>
>           <rdtgroup nameB> <MBM total count>
>           ...
>
> As I understand it, an interface like the above would be an improvement over
> what user space can achieve by optimizing queries to the existing interface.
>

Yes, this is an option now that the ABMC work is establishing a naming
scheme for groups. It would also significantly cut down on the number
of open file descriptors needed.

Tony had prototyped an MBM rate event[1], which cached an MBps value
per group/domain produced by the overflow workers. If the workers
produce the MBps samples immediately after directly reading the
counters, then this should be the best case in terms of precision and
introduce very little additional system overhead. Also, userspace
reading a rate sample that's 1 second old would be a lot less harmful
than performing an MBps calculation from a count sample that's 1
second old.
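
Roughly (names are made up; this is not Tony's prototype), the
per-domain overflow worker would do something like this right after
each hardware read:

#include <stdint.h>

/* Hypothetical per-group, per-domain state owned by the overflow
 * worker; the cached rate is what a later file read would return. */
struct mbm_rate_state {
	uint64_t prev_bytes;
	uint64_t prev_ns;
	uint64_t mbps;
};

static void mbm_update_rate(struct mbm_rate_state *s,
			    uint64_t bytes, uint64_t now_ns)
{
	uint64_t delta_ns = now_ns - s->prev_ns;

	if (delta_ns)
		/* bytes per ns, scaled to MB per second */
		s->mbps = (bytes - s->prev_bytes) * 1000 / delta_ns;
	s->prev_bytes = bytes;
	s->prev_ns = now_ns;
}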

Perhaps combining the per-second bandwidth rate cache with a
per-domain file for each MBM event, aggregating the data for all
groups, would be the best-performing solution?
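
For example (file name and layout hypothetical, following the
mbm_snapshot layout you sketched above):

        # cat /sys/fs/resctrl/info/L3_MON/mbm_rate/mbm_total_MBps_00
          <rdtgroup nameA> <MBps>
          <rdtgroup nameB> <MBps>
          ...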

Thanks!
-Peter

[1] https://lore.kernel.org/lkml/20230620033702.33344-3-tony.luck@intel.com/#Z31fs:resctrl2:arch:x86:rdt_mbm_total_rate.c
