Message-ID: <CALPaoCjwrp+5B3pQ6C43Mv=x9ZtsoQbvjAqXR+Ujtqb_AQVdcw@mail.gmail.com>
Date: Wed, 6 Nov 2024 16:52:42 +0100
From: Peter Newman <peternewman@...gle.com>
To: Fenghua Yu <fenghua.yu@...el.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>, Tony Luck <tony.luck@...el.com>, Babu Moger <babu.moger@....com>,
James Morse <james.morse@....com>, Martin Kletzander <nert.pinx@...il.com>,
Shaopeng Tan <tan.shaopeng@...itsu.com>, linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter reads

Hi Fenghua,

On Tue, Nov 5, 2024 at 12:25 PM Peter Newman <peternewman@...gle.com> wrote:
>
> Hi Fenghua,
>
> On Mon, Nov 4, 2024 at 11:36 PM Fenghua Yu <fenghua.yu@...el.com> wrote:
> >
> > Hi, Peter,
> >
> > On 10/31/24 07:25, Peter Newman wrote:
>
> > > diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > > index 200d89a640270..daaff1cfd3f24 100644
> > > --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > > +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > > @@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > > return;
> > > }
> > >
> > > + /*
> > > + * If a performance-conscious caller has gone to the trouble of binding
> > > + * their thread to the monitoring domain of the event counter, ensure
> > > + * that the counters are read directly. smp_call_on_cpu()
> > > + * unconditionally uses a work queue to read the counter, substantially
> > > + * increasing the cost of the read.
> > > + *
> > > + * Preemption must be disabled to prevent a migration out of the domain
> > > + * after the CPU is checked, which would result in reading the wrong
> > > + * counters. Note that this makes the (slow) remote path a little slower
> > > + * by requiring preemption to be reenabled when redirecting the request
> > > + * to another domain was in fact necessary.
> > > + *
> > > + * In the case where all eligible target CPUs are nohz_full and
> > > + * smp_call_function_any() is used, keep preemption disabled to avoid
> > > + * the cost of reenabling it twice in the same read.
> > > + */
> > > + cpu = get_cpu();
> > > + if (cpumask_test_cpu(cpu, cpumask)) {
> > > + mon_event_count(rr);
> > > + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> > > + put_cpu();
> > > + return;
> > > + }
> >
> > This fast path code duplicates part of smp_call_function_any().
> >
> > In the nohz_full() case, the fast path doesn't gain much and even
> > hurts remote-domain performance:
> > 1. On the local domain, it may gain a little because it runs a few
> > lines less than directly calling smp_call_function_any(). But the
> > gain is minor, coming from the slightly shorter code path rather than
> > from avoiding heavyweight queued work.
> >
> > 2. On a remote domain, it degrades performance because get_cpu() and
> > put_cpu() are each called twice: once in the fast path code and once
> > in smp_call_function_any(). As you mentioned earlier, put_cpu()
> > impacts performance. I think get_cpu() has the same impact too.
>
> get_cpu() and put_cpu() nest, so only the put_cpu() that reduces the
> preempt count to 0 will call into the scheduler. See the source
> comment I had added below.
>
> But... note that the smp_call_on_cpu() below is now called with
> preemption disabled. (Looks like I only benchmarked and never ran a
> debug build...) I will have to change the patch to make sure put_cpu()
> is called before smp_call_on_cpu().
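
Following up here on my own quoted point: the fix I have in mind is
roughly the below. This is an untested sketch written against the hunk
quoted above (same variable names, nothing new introduced), not the
actual v2, and for simplicity it gives up keeping preemption disabled
across the nohz_full IPI path:

	cpu = get_cpu();	/* preempt_count 0 -> 1 */
	if (cpumask_test_cpu(cpu, cpumask)) {
		mon_event_count(rr);
		resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
		put_cpu();	/* 1 -> 0: only this can reschedule */
		return;
	}

	/*
	 * Fall back to the existing slow path. smp_call_on_cpu() sleeps
	 * on a completion, so preemption must be reenabled before
	 * reaching it. The smp_call_function_any() path is fine either
	 * way: its internal get_cpu()/put_cpu() would nest inside ours,
	 * and an inner put_cpu() never drops the preempt count to zero,
	 * so it can't enter the scheduler.
	 */
	put_cpu();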
>
>
> >
> > The fast path only gains in the non-nohz_full() case.
> >
> > So maybe it's better to move the fast path code into the
> > non-nohz_full() case? With this change, you may get the following
> > benefits:
> > 1. No performance impact in the nohz_full() case (either local or
> > remote domain).
> > 2. Improved performance in the non-nohz_full() case, as you intended
> > in this patch.
> > 3. The fast path focuses on fixing the right performance bottleneck.
>
> The consequence of reusing the current-cpu-in-mask check in
> smp_call_function_any() is that, if the check fails,
> resctrl_arch_rmid_read() could be invoked in an IPI handler and fail
> where it would have succeeded on a kernel worker, undoing James's
> original work.

I noted in my earlier email to Reinette that this is already a problem
with the fast path I added.

I ended up creating a resctrl_arch_event_read_blocks() hook for asking
the implementation whether a particular event actually blocks, so that
the workqueue is only used when it's really necessary:

https://lore.kernel.org/lkml/20241106154306.2721688-2-peternewman@google.com/

With this, I was able to follow your suggestion of using
smp_call_function_any() for the local counter read.
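
To save digging through the link, the dispatch ends up looking roughly
like the below. This is a heavily simplified sketch, not the patch
itself: the housekeeping/nohz_full CPU selection and error handling are
elided, and I'm writing the new hook's arguments from memory, so treat
the linked patch as authoritative:

	if (resctrl_arch_event_read_blocks(r, evtid)) {
		/*
		 * The counter read may block (e.g. MPAM waiting on the
		 * hardware), so it can't run in IPI context: keep using
		 * the workqueue for this event. (Hook arguments are
		 * approximate here.)
		 */
		smp_call_on_cpu(cpumask_any(cpumask), smp_mon_event_count,
				rr, false);
	} else {
		/*
		 * Non-blocking reads can run in IPI context, and
		 * smp_call_function_any() already prefers the current CPU
		 * when it is in the mask, so the local fast path comes
		 * for free without open-coding it.
		 */
		smp_call_function_any(cpumask, mon_event_count, rr, 1);
	}

	resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);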

However, it's worth noting that this ended up making remote reads
faster on AMD but slower on Intel for some reason. I suspect this is an
artifact of my benchmark setup: the IPIs were cross-socket on Intel but
only cross-chiplet on AMD.

Thanks!
-Peter