linux-kernel - Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can be read from any CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALPaoCimCmSyeejR9FCLcitquwenmOo0-0PVngUMtmSr_syy-A@mail.gmail.com>
Date: Wed, 23 Apr 2025 15:27:01 +0200
From: Peter Newman <peternewman@...gle.com>
To: Reinette Chatre <reinette.chatre@...el.com>
Cc: "Luck, Tony" <tony.luck@...el.com>, Fenghua Yu <fenghuay@...dia.com>, 
	Maciej Wieczor-Retman <maciej.wieczor-retman@...el.com>, James Morse <james.morse@....com>, 
	Babu Moger <babu.moger@....com>, Drew Fustini <dfustini@...libre.com>, 
	Dave Martin <Dave.Martin@....com>, Anil Keshavamurthy <anil.s.keshavamurthy@...el.com>, 
	linux-kernel@...r.kernel.org, patches@...ts.linux.dev
Subject: Re: [PATCH v3 10/26] fs/resctrl: Improve handling for events that can
 be read from any CPU

Hi Reinette,

On Tue, Apr 22, 2025 at 8:20 PM Reinette Chatre
<reinette.chatre@...el.com> wrote:
>
> Hi Tony,
>
> On 4/21/25 1:28 PM, Luck, Tony wrote:
> > On Fri, Apr 18, 2025 at 03:54:02PM -0700, Reinette Chatre wrote:
> >>> @@ -619,7 +622,8 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> >>>                     goto out;
> >>>             }
> >>>             d = container_of(hdr, struct rdt_mon_domain, hdr);
> >>> -           mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
> >>> +           mask = md->any_cpu ? cpu_online_mask : &d->hdr.cpu_mask;
> >>> +           mon_event_read(&rr, r, d, rdtgrp, mask, evtid, false);
> >>
> >> I do not think this accomplishes the goal of this patch. Looking at mon_event_read() it calls
> >> cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU) before any of the smp_*() calls.
> >>
> >>      cpumask_any_housekeeping()
> >>      {
> >>              ...
> >>              if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
> >>                      cpu = cpumask_any(mask);
> >>              ...
> >>      }
> >>
> >> cpumask_any() is just cpumask_first() so it will pick the first CPU in the
> >> online mask that may not be the current CPU.
> >>
> >> fwiw ... there are some optimizations planned in this area that I have not yet studied:
> >> https://lore.kernel.org/lkml/20250407153856.133093-1-yury.norov@gmail.com/
> >
> > I remember Peter complaining[1] about extra context switches when
> > cpumask_any_housekeeping() was introduced, but it seems that the
> > discussion died with no fix applied.
>
> The initial complaint was indeed that reading individual events is slower.
>
> The issue is that the intended use case read from many files at frequent
> intervals and thus becomes vulnerable to any changes in this area that
> really is already a slow path (reading from a file ... taking a mutex ...).
>
> Instead of working on shaving cycles off this path the discussion transitioned
> to resctrl providing better support for the underlying use case. I
> understood that this is being experimented with [2] and last I heard it
> looks promising.
>
> >
> > The blocking problem is that ARM may not be able to read a counter
> > on a tick_nohz CPU because it may need to sleep.

If I hadn't already turned my attention to optimizing bulk counter
reads, I might have mentioned that the change Tony referred to is
broken on MPAM implementations because the MPAM
resctrl_arch_rmid_read() cannot wait for its internal mutex with
preemption disabled.

> >
> > Do we need more options for events:
> >
> > 1) Must be read on a CPU in the right domain  // Legacy
> > 2) Can be read from any CPU                   // My addtion
> > 3) Must be read on a "housekeeping" CPU               // James' code in upstream
> > 4) Cannot be read on a tick_nohz CPU          // Could be combined with 1 or 2?
>
> I do not see needing additional complexity here. I think it will be simpler
> to just replace use of cpumask_any_housekeeping() in mon_event_read() with
> open code that supports the particular usage. As I understand it is prohibited
> for all CPUs to be in tick_nohz_full_mask so it looks to me as though the
> existing "if (tick_nohz_full_cpu(cpu))" should never be true (since no CPU is being excluded).
> Also, since mon_event_read() has no need to exclude CPUs, just a cpumask_andnot()
> should suffice to determine what remains of given mask after accounting for all the
> NO_HZ CPUs if tick_nohz_full_enabled().

Can you clarify what you mean by "all CPUs"? It's not difficult for
all CPUs in an L3 domain to be in tick_nohz_full_mask on AMD
implementations, where there are many small L3 domains (~8 CPUs each)
in a socket.

Google makes use of isolation along this domain boundary on AMD
platforms in some products and these users prefer to read counters
using IPIs because they are concerned about introducing context
switches to the isolated part of the system. In these configurations,
there is typically only one RMID in that domain, so few of these IPIs
are needed. (Note that these are different users from the ones I had
described before who spawn large numbers of containers not limited to
any domains and want to read the MBM counters for all the RMIDs on all
the domains frequently.)

Thanks,
-Peter