lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cabf6ebd-1fba-d73f-b281-12f70d53ef55@arm.com>
Date:   Fri, 28 Jul 2023 17:32:33 +0100
From:   James Morse <james.morse@....com>
To:     Reinette Chatre <reinette.chatre@...el.com>, x86@...nel.org,
        linux-kernel@...r.kernel.org
Cc:     Fenghua Yu <fenghua.yu@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        H Peter Anvin <hpa@...or.com>,
        Babu Moger <Babu.Moger@....com>,
        shameerali.kolothum.thodi@...wei.com,
        D Scott Phillips OS <scott@...amperecomputing.com>,
        carl@...amperecomputing.com, lcherian@...vell.com,
        bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
        xingxin.hx@...nanolis.org, baolin.wang@...ux.alibaba.com,
        Jamie Iles <quic_jiles@...cinc.com>,
        Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
        dfustini@...libre.com
Subject: Re: [PATCH v4 14/24] x86/resctrl: Allow resctrl_arch_rmid_read() to
 sleep

Hi Reinette,

On 15/06/2023 23:13, Reinette Chatre wrote:
> On 5/25/2023 11:01 AM, James Morse wrote:
>> MPAM's cache occupancy counters can take a little while to settle once
>> the monitor has been configured. The maximum settling time is described
>> to the driver via a firmware table. The value could be large enough
>> that it makes sense to sleep. To avoid exposing this to resctrl, it
>> should be hidden behind MPAM's resctrl_arch_rmid_read().
>>
>> resctrl_arch_rmid_read() may be called via IPI meaning it is unable
>> to sleep. In this case resctrl_arch_rmid_read() should return an error
>> if it needs to sleep. This will only affect MPAM platforms where
>> the cache occupancy counter isn't available immediately, nohz_full is
>> in use, and there are there are no housekeeping CPUs in the necessary
>> domain.
>>
>> There are three callers of resctrl_arch_rmid_read():
>> __mon_event_count() and __check_limbo() are both called from a
>> non-migrateable context. mon_event_read() invokes __mon_event_count()
>> using smp_call_on_cpu(), which adds work to the target CPUs workqueue.
>> rdtgroup_mutex() is held, meaning this cannot race with the resctrl
>> cpuhp callback. __check_limbo() is invoked via schedule_delayed_work_on()
>> also adds work to a per-cpu workqueue.
>>
>> The remaining call is add_rmid_to_limbo() which is called in response
>> to a user-space syscall that frees an rmid. This opportunistically
>> reads the llc occupancy counter on the current domain to see if the
>> RMID is over the dirty threshold. This has to disable preemption to
>> avoid reading the wrong domain's value. Disabling pre-emption here
>> prevents resctrl_arch_rmid_read() from sleeping.
>>
>> add_rmid_to_limbo() walks each domain, but only reads the counter
>> on one domain. If the system has more than one domain, the RMID will
>> always be added to the limbo list. If the RMIDs usage was not over the
>> threshold, it will be removed from the list when __check_limbo() runs.
>> Make this the default behaviour. Free RMIDs are always added to the
>> limbo list for each domain.
>>
>> The user visible effect of this is that a clean RMID is not available
>> for re-allocation immediately after 'rmdir()' completes, this behaviour
>> was never portable as it never happened on a machine with multiple
>> domains.
>>
>> Removing this path allows resctrl_arch_rmid_read() to sleep if its called
>> with interrupts unmasked. Document this is the expected behaviour, and
>> add a might_sleep() annotation to catch changes that won't work on arm64.


>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 6ba40495589a..fb33100e172b 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -491,7 +481,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>>  		set_bit(idx, d->rmid_busy_llc);
>>  		entry->busy++;
>>  	}
>> -	put_cpu();
>>  
>>  	if (entry->busy)
>>  		rmid_limbo_count++;
> 
> Would entry->busy ever be 0 after this change?

Good point - the domains list can never be empty!


>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index ff7452f644e4..b961936decfa 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -234,7 +234,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
>>   * @eventid:		eventid to read, e.g. L3 occupancy.
>>   * @val:		result of the counter read in bytes.
>>   *
>> - * Call from process context on a CPU that belongs to domain @d.
>> + * Some architectures need to sleep when first programming some of the counters.
>> + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
>> + *  for a short period of time). Call from a non-migrateable process context on
>> + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
>> + * schedule_work_on(). This function can be called with interrupts masked,
>> + * e.g. using smp_call_function_any(), but may concistently return an error.
>>   *
> 
> concistently -> consistently?
> 
>>   * Return:
>>   * 0 on success, or -EIO, -EINVAL etc on error.
>> @@ -243,6 +248,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>>  			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
>>  			   u64 *val);
>>  
>> +/**
>> + * resctrl_arch_rmid_read_context_check()  - warn about invalid contexts
>> + *
>> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP, this function will generate a
>> + * warning when resctrl_arch_rmid_read() is called from an invalid context.
> 
> No need to say "this function" 
> Could this comment be more specific about which contexts are invalid instead of
> just referring to it as "invalid context". This seems to expect the reader to
> already know what an invalid context would be. 

I thought the implementation was enough - the comment was intended to say this does
nothing unless you select that build option. I've changed it to:
| * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
| * resctrl_arch_rmid_read() is called with preemption disabled.


>> + */
>> +static inline void resctrl_arch_rmid_read_context_check(void)
>> +{
>> +	if (!irqs_disabled())
>> +		might_sleep();
>> +}

> Could you please elaborate why the "!irqs_disabled()" is needed?

It's a question of what the interface should support.
It needs to support being called via IPI because that is going to happen on NOHZ_FULL
platforms. In this case the platforms where the MPAM code needs to sleep will have to
return an error. For folk that see this in practice: they need at least one housekeeping
CPU per domain.

Outside that the function shouldn't be called with pre-emption disabled, as that would
prevent it from sleeping, but the lockdep annotations that check for pre-emption also go
off when interrupts are masked, hence the slightly funny check.

It's documented against resctrl_arch_rmid_read():
| * Some architectures need to sleep when first programming some of the counters.
| * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
| *  for a short period of time). Call from a non-migrateable process context on
| * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
| * schedule_work_on(). This function can be called with interrupts masked,
| * e.g. using smp_call_function_any(), but may consistently return an error.


Thanks,

James

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ