Message-ID: <CALPaoCjwrp+5B3pQ6C43Mv=x9ZtsoQbvjAqXR+Ujtqb_AQVdcw@mail.gmail.com>
Date: Wed, 6 Nov 2024 16:52:42 +0100
From: Peter Newman <peternewman@...gle.com>
To: Fenghua Yu <fenghua.yu@...el.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>, Tony Luck <tony.luck@...el.com>, Babu Moger <babu.moger@....com>,
James Morse <james.morse@....com>, Martin Kletzander <nert.pinx@...il.com>,
Shaopeng Tan <tan.shaopeng@...itsu.com>, linux-kernel@...r.kernel.org, eranian@...gle.com
Subject: Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter reads

Hi Fenghua,

On Tue, Nov 5, 2024 at 12:25 PM Peter Newman <peternewman@...gle.com> wrote:
>
> Hi Fenghua,
>
> On Mon, Nov 4, 2024 at 11:36 PM Fenghua Yu <fenghua.yu@...el.com> wrote:
> >
> > Hi, Peter,
> >
> > On 10/31/24 07:25, Peter Newman wrote:
>
> > > diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > > index 200d89a640270..daaff1cfd3f24 100644
> > > --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > > +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> > > @@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> > > return;
> > > }
> > >
> > > + /*
> > > + * If a performance-conscious caller has gone to the trouble of binding
> > > + * their thread to the monitoring domain of the event counter, ensure
> > > + * that the counters are read directly. smp_call_on_cpu()
> > > + * unconditionally uses a work queue to read the counter, substantially
> > > + * increasing the cost of the read.
> > > + *
> > > + * Preemption must be disabled to prevent a migration out of the domain
> > > + * after the CPU is checked, which would result in reading the wrong
> > > + * counters. Note that this makes the (slow) remote path a little slower
> > > + * by requiring preemption to be reenabled when redirecting the request
> > > + * to another domain was in fact necessary.
> > > + *
> > > + * In the case where all eligible target CPUs are nohz_full and
> > > + * smp_call_function_any() is used, keep preemption disabled to avoid
> > > + * the cost of reenabling it twice in the same read.
> > > + */
> > > + cpu = get_cpu();
> > > + if (cpumask_test_cpu(cpu, cpumask)) {
> > > + mon_event_count(rr);
> > > + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
> > > + put_cpu();
> > > + return;
> > > + }
> >
> > This fast path code duplicates part of smp_call_function_any().
> >
> > In the nohz_full() case, the fast path doesn't gain much and even
> > hurts remote-domain performance:
> > 1. On the local domain, it may gain a little because it runs a few
> > lines less than directly calling smp_call_function_any(). But the
> > gain is minor, coming from the slightly shorter code path rather than
> > from avoiding heavyweight queued work.
> >
> > 2. On a remote domain, it degrades performance because get_cpu() and
> > put_cpu() are each called twice: once in the fast path code and once
> > in smp_call_function_any(). As you mentioned earlier, put_cpu()
> > impacts performance. I think get_cpu() has the same impact too.
>
> get_cpu() and put_cpu() nest, so only the put_cpu() that reduces the
> preempt count to 0 will call into the scheduler. See the source
> comment I had added below.
>
> But... note that the smp_call_on_cpu() below is now called with
> preemption disabled. (Looks like I only benchmarked and never ran a
> debug build...) I will have to change the patch to make sure put_cpu()
> is called before smp_call_on_cpu().
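
Following up here on my own quoted point: the fix I have in mind is
roughly the below. This is an untested sketch written against the hunk
quoted above (same variable names, nothing new introduced), not the
actual v2, and for simplicity it gives up keeping preemption disabled
across the nohz_full IPI path:

	cpu = get_cpu();	/* preempt_count 0 -> 1 */
	if (cpumask_test_cpu(cpu, cpumask)) {
		mon_event_count(rr);
		resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
		put_cpu();	/* 1 -> 0: only this can reschedule */
		return;
	}

	/*
	 * Fall back to the existing slow path. smp_call_on_cpu() sleeps
	 * on a completion, so preemption must be reenabled before
	 * reaching it. The smp_call_function_any() path is fine either
	 * way: its internal get_cpu()/put_cpu() would nest inside ours,
	 * and an inner put_cpu() never drops the preempt count to zero,
	 * so it can't enter the scheduler.
	 */
	put_cpu();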
>
>
> >
> > The fast path only gains in the non-nohz_full() case.
> >
> > So maybe it's better to move the fast path code into the
> > non-nohz_full() case? With this change, you may get the following
> > benefits:
> > 1. No performance impact in the nohz_full() case (either local or
> > remote domain).
> > 2. Improved performance in the non-nohz_full() case, as you intended
> > in this patch.
> > 3. The fast path focuses on fixing the right performance bottleneck.
>
> The consequence of reusing the current-cpu-in-mask check in
> smp_call_function_any() is that, if the check fails,
> resctrl_arch_rmid_read() could be invoked in an IPI handler and fail
> where it would have succeeded on a kernel worker, undoing James's
> original work.

I noted in my earlier email to Reinette that this is already a problem
with the fast path I added.

I ended up creating a resctrl_arch_event_read_blocks() hook for asking
the implementation whether a particular event actually blocks, so that
the workqueue is only used when it's really necessary:

https://lore.kernel.org/lkml/20241106154306.2721688-2-peternewman@google.com/

With this, I was able to follow your suggestion of using
smp_call_function_any() for the local counter read.
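
To save digging through the link, the dispatch ends up looking roughly
like the below. This is a heavily simplified sketch, not the patch
itself: the housekeeping/nohz_full CPU selection and error handling are
elided, and I'm writing the new hook's arguments from memory, so treat
the linked patch as authoritative:

	if (resctrl_arch_event_read_blocks(r, evtid)) {
		/*
		 * The counter read may block (e.g. MPAM waiting on the
		 * hardware), so it can't run in IPI context: keep using
		 * the workqueue for this event. (Hook arguments are
		 * approximate here.)
		 */
		smp_call_on_cpu(cpumask_any(cpumask), smp_mon_event_count,
				rr, false);
	} else {
		/*
		 * Non-blocking reads can run in IPI context, and
		 * smp_call_function_any() already prefers the current CPU
		 * when it is in the mask, so the local fast path comes
		 * for free without open-coding it.
		 */
		smp_call_function_any(cpumask, mon_event_count, rr, 1);
	}

	resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);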

However, it's worth noting that this ended up making remote reads
faster on AMD but slower on Intel for some reason. I suspect this is an
artifact of my benchmark setup: the IPIs were cross-socket on Intel but
only cross-chiplet on AMD.

Thanks!
-Peter