Message-ID: <0814c380-b5f1-be8b-f03f-e6fcb8fa0821@intel.com>
Date: Fri, 10 Mar 2023 12:06:17 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: James Morse <james.morse@....com>, <x86@...nel.org>,
<linux-kernel@...r.kernel.org>
CC: Fenghua Yu <fenghua.yu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
H Peter Anvin <hpa@...or.com>,
Babu Moger <Babu.Moger@....com>,
<shameerali.kolothum.thodi@...wei.com>,
D Scott Phillips OS <scott@...amperecomputing.com>,
<carl@...amperecomputing.com>, <lcherian@...vell.com>,
<bobo.shaobowang@...wei.com>, <tan.shaopeng@...itsu.com>,
<xingxin.hx@...nanolis.org>, <baolin.wang@...ux.alibaba.com>,
Jamie Iles <quic_jiles@...cinc.com>,
Xin Hao <xhao@...ux.alibaba.com>, <peternewman@...gle.com>
Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of
sending an IPI
Hi James,
On 3/8/2023 8:09 AM, James Morse wrote:
> Hi Reinette,
>
> On 06/03/2023 11:33, James Morse wrote:
>> On 02/02/2023 23:47, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>>> read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC,
>>>> and the number implemented is up to the manufacturer. This means that when
>>>> there are fewer monitors than needed, they need to be allocated and freed.
>>>>
>>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>>> for each slice may need performing from different CPUs.
>>>>
>>>> These two details mean MPAM's monitor code needs to be able to sleep, and
>>>> to IPI another CPU in the domain to read from a resource that has been sliced.
>>>>
>>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>>> this isn't possible.
>>>>
>>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>>> wait, instead of sending an IPI. This function is only used in response to
>>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>>
>>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>>> the monitor-allocation in monitor.c.
>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> index 1df0e3262bca..4ee3da6dced7 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>>> rr->val = 0;
>>>> rr->first = first;
>>>>
>>>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>>> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>>
>>> This would be problematic for the use cases where single tasks are run on
>>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function, then
>>> it may never run. Real-time environments are a target use case of resctrl (with
>>> examples in the documentation).
>>
>> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
>> this breaks!
>>
>> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
>> limbo or overflow timer work.
>>
>> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
>> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
>> nohz-full.
>>
>> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
>> If I can reproduce this problem, ...
>
> ... I haven't been able to reproduce this.
>
> With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
> smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
> user-space as much as possible.
>
> This looks to be down to "sched: RT throttling activated", which seems to be to prevent RT
> CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
> tasks 100% in user-space isn't a realistic use-case.
>
> Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting
> nohz_full CPUs, and the limbo/overflow code should equally avoid these CPUs. If work
> does get scheduled on those CPUs, it is expected to run eventually.
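
Regarding the cpumask_any_housekeeping() helper you describe, a minimal sketch
of that idea could look something like the below (illustrative only: the
for_each_cpu() scan and the caller-side IPI fallback are my assumptions, not
taken from your series):

#include <linux/cpumask.h>
#include <linux/tick.h>

/*
 * Pick any CPU from @mask, preferring one that is not nohz_full. If every
 * CPU in @mask is nohz_full, return one of those and let the caller decide
 * whether sending an IPI is acceptable.
 */
static unsigned int cpumask_any_housekeeping(const struct cpumask *mask)
{
        unsigned int cpu, hk_cpu;

        cpu = cpumask_any(mask);
        if (!tick_nohz_full_cpu(cpu))
                return cpu;

        /* Look for a housekeeping CPU in @mask instead. */
        for_each_cpu(hk_cpu, mask) {
                if (!tick_nohz_full_cpu(hk_cpu))
                        return hk_cpu;
        }

        /* Every CPU in @mask is nohz_full. */
        return cpu;
}

With such a helper, mon_event_read() could use smp_call_on_cpu() only when the
returned CPU is a housekeeping CPU, and keep the existing smp_call_function_any()
IPI as the fallback when the whole domain is nohz_full.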
From what I understand, the email you point to, and I assume your testing,
used the system defaults (SCHED_FIFO gets 0.95s out of every 1s).

Users are not constrained by these defaults. Please see
Documentation/scheduler/sched-rt-group.rst; for example, writing -1 to
/proc/sys/kernel/sched_rt_runtime_us removes the RT runtime limit entirely.

It is thus possible for a tightly controlled task to have a CPU dedicated to
it for long stretches, or even forever, ideally with the task written in a way
that manages power and thermal constraints.
With the current behavior, users can use resctrl to read the data at any time
and are expected to understand the consequences of doing so.
It seems to me that there may be scenarios under which this change could
result in a data read never returning. As you indicated, it is expected to run
eventually, but that is dictated by the RT scheduling period, which can be up
to about 35 minutes (a period of INT_MAX microseconds), or "no limit", which
is what prompted my "never return" statement.
I do not see at this time that the limbo/overflow handlers should avoid these
CPUs. Limbo can be avoided from user space, and I have not heard about the
overflow handler impacting such workloads negatively.
Reinette