Message-ID: <8d05bce5-b145-3df3-7445-02aa31ca877c@arm.com>
Date: Wed, 8 Mar 2023 16:09:46 +0000
From: James Morse <james.morse@....com>
To: Reinette Chatre <reinette.chatre@...el.com>, x86@...nel.org,
linux-kernel@...r.kernel.org
Cc: Fenghua Yu <fenghua.yu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
H Peter Anvin <hpa@...or.com>,
Babu Moger <Babu.Moger@....com>,
shameerali.kolothum.thodi@...wei.com,
D Scott Phillips OS <scott@...amperecomputing.com>,
carl@...amperecomputing.com, lcherian@...vell.com,
bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
xingxin.hx@...nanolis.org, baolin.wang@...ux.alibaba.com,
Jamie Iles <quic_jiles@...cinc.com>,
Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com
Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of
sending an IPI
Hi Reinette,
On 06/03/2023 11:33, James Morse wrote:
> On 02/02/2023 23:47, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>> read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC,
>>> and the number implemented is up to the manufacturer. This means that when
>>> there are fewer monitors than needed, they have to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>> for each slice may need to be performed from different CPUs.
>>>
>>> These two details mean MPAM's monitor code needs to be able to sleep, and
>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>
>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>> this isn't possible.
>>>
>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>> wait, instead of sending an IPI. This function is only used in response to
>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>
>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>> the monitor-allocation in monitor.c.
>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> index 1df0e3262bca..4ee3da6dced7 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>> rr->val = 0;
>>> rr->first = first;
>>>
>>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>
>> This would be problematic for the use cases where single tasks are run on
>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>> it may never run. Real-time environments are a target use case of resctrl (with
>> examples in the documentation).
>
> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
> this breaks!
>
> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
> limbo or overflow timer work.
>
> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
> nohz-full.
>
> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
> If I can reproduce this problem, ...
... I haven't been able to reproduce this.
With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
user-space as much as possible.
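
A minimal sketch of that sort of spinner (illustrative only, not the exact test; it
just pins itself to cpu-1 and burns cycles in user-space as SCHED_FIFO):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 1 };
	cpu_set_t set;

	/* Pin to the nohz_full/isolated CPU, cpu-1 here. */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	if (sched_setscheduler(0, SCHED_FIFO, &sp))
		perror("sched_setscheduler");

	/* Spin entirely in user-space, never entering the kernel. */
	for (;;)
		;
}
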
This looks to be down to "sched: RT throttling activated", which seems to be there to
prevent RT CPU hogs from blocking kernel work. From Peter's comments at [0], it looks
like running tasks 100% in user-space isn't a realistic use-case.
Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting
nohz_full CPUs, and the limbo/overflow code should equally avoid these CPUs. If work
does get scheduled on those CPUs, it is expected to run eventually.
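
To make that concrete, such a cpumask_any_housekeeping() helper could look something
like the sketch below (illustrative only: it assumes cpumask_nth_andnot() and
tick_nohz_full_mask are usable from resctrl's context, and it only does the extra
work when CONFIG_NO_HZ_FULL is enabled):

#include <linux/cpumask.h>
#include <linux/tick.h>

/*
 * Pick any CPU from @mask, preferring one that isn't nohz_full.
 * Falls back to a nohz_full CPU if the whole domain is nohz_full.
 */
static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
{
	unsigned int cpu = cpumask_any(mask);

#ifdef CONFIG_NO_HZ_FULL
	if (tick_nohz_full_cpu(cpu)) {
		unsigned int hk_cpu;

		/* First CPU in @mask that isn't nohz_full, if there is one. */
		hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
		if (hk_cpu < nr_cpu_ids)
			cpu = hk_cpu;
	}
#endif

	return cpu;
}

mon_event_read() and the limbo/overflow timers could then pick their target CPU with
this instead of cpumask_any(), and only land on a nohz_full CPU when the domain has
no alternative.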
Thanks,
James
[0] https://lore.kernel.org/all/20130823110254.GU31370@twins.programming.kicks-ass.net/