Message-ID: <9d4282af-7237-0c26-7c48-e2b9e2fb7795@intel.com>
Date:   Tue, 3 Oct 2023 14:28:42 -0700
From:   Reinette Chatre <reinette.chatre@...el.com>
To:     James Morse <james.morse@....com>, <x86@...nel.org>,
        <linux-kernel@...r.kernel.org>
CC:     Fenghua Yu <fenghua.yu@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        H Peter Anvin <hpa@...or.com>,
        Babu Moger <Babu.Moger@....com>,
        <shameerali.kolothum.thodi@...wei.com>,
        D Scott Phillips OS <scott@...amperecomputing.com>,
        <carl@...amperecomputing.com>, <lcherian@...vell.com>,
        <bobo.shaobowang@...wei.com>, <tan.shaopeng@...itsu.com>,
        <xingxin.hx@...nanolis.org>, <baolin.wang@...ux.alibaba.com>,
        Jamie Iles <quic_jiles@...cinc.com>,
        Xin Hao <xhao@...ux.alibaba.com>, <peternewman@...gle.com>,
        <dfustini@...libre.com>, <amitsinght@...vell.com>
Subject: Re: [PATCH v6 24/24] x86/resctrl: Separate arch and fs resctrl locks

Hi James,

On 9/14/2023 10:21 AM, James Morse wrote:
> resctrl has one mutex that is taken by the architecture specific code,
> and the filesystem parts. The two interact via cpuhp, where the
> architecture code updates the domain list. Filesystem handlers that
> walk the domains list should not run concurrently with the cpuhp
> callback modifying the list.
> 
> Exposing a lock from the filesystem code means the interface is not
> cleanly defined, and creates the possibility of cross-architecture
> lock ordering headaches. The interaction only exists so that certain
> filesystem paths are serialised against CPU hotplug. The CPU hotplug
> code already has a mechanism to do this using cpus_read_lock().
> 
> MPAM's monitors have an overflow interrupt, so it needs to be possible
> to walk the domains list in irq context. RCU is ideal for this,
> but some paths need to be able to sleep to allocate memory.
> 
> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
> of a cpuhp callback, cpus_read_lock() must always be taken first.
> rdtgroup_schemata_write() already does this.
> 
> Most of the filesystem code's domain list walkers are currently
> protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
> The exceptions are rdt_bit_usage_show() and the mon_config helpers
> which take the lock directly.
> 
> Make the domain list protected by RCU. An architecture-specific
> lock prevents concurrent writers. rdt_bit_usage_show() could
> walk the domain list using RCU, but to keep all the filesystem
> operations the same, this is changed to call cpus_read_lock().
> The mon_config helpers send multiple IPIs, so take cpus_read_lock()
> in these cases.
> 
> The other filesystem list walkers need to be able to sleep.
> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
> cpuhp callbacks can't be invoked when file system operations are
> occurring.
> 
> Add lockdep_assert_cpus_held() in the cases where the
> rdtgroup_kn_lock_live() call isn't obvious.

One place that seems to be missing this annotation, but I think
needs it, is get_domain_from_cpu(). Starting with this series it is
called from resctrl_offline_cpu(), which is invoked via the CPU
hotplug code. From now on extra care needs to be taken when calling
it from anywhere else.
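
Something like this perhaps (untested, just to show where I expect the
annotation to go, going from memory of what get_domain_from_cpu()
currently looks like):

	struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
	{
		struct rdt_domain *d;

		/* Walking r->domains, ensure it can't race with cpuhp */
		lockdep_assert_cpus_held();

		list_for_each_entry(d, &r->domains, list) {
			/* Find the domain that contains this CPU */
			if (cpumask_test_cpu(cpu, &d->cpu_mask))
				return d;
		}

		return NULL;
	}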

> 
> Resctrl's domain online/offline calls now need to take the
> rdtgroup_mutex themselves.
> 
> Reviewed-by: Shaopeng Tan <tan.shaopeng@...itsu.com>
> Tested-by: Shaopeng Tan <tan.shaopeng@...itsu.com>
> Tested-By: Peter Newman <peternewman@...gle.com>
> Signed-off-by: James Morse <james.morse@....com>
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 1a10f567bbe5..8fd0510d767b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -25,8 +25,15 @@
>  #include <asm/resctrl.h>
>  #include "internal.h"
>  
> -/* Mutex to protect rdtgroup access. */
> -DEFINE_MUTEX(rdtgroup_mutex);
> +/*
> + * rdt_domain structures are kfree()d when their last CPU goes offline,
> + * and allocated when the first CPU in a new domain comes online.
> + * The rdt_resource's domain list is updated when this happens. Readers of
> + * the domain list must either take cpus_read_lock(), or rely on an RCU
> + * read-side critical section, to avoid observing concurrent modification.
> + * All writers take this mutex:
> + */
> +static DEFINE_MUTEX(domain_list_lock);
>  

I assume that you have not followed the SNC work. Please note that in
that work the domain list is split into a monitoring domain list and a
control domain list. I expect this lock would cover both, and that both
would become RCU lists?
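
For reference, I am assuming the read side of such a list walk would look
the same no matter which of the two lists ends up being walked, roughly
(illustrative only, using the names from the current single-list code):

	rcu_read_lock();
	list_for_each_entry_rcu(d, &r->domains, list) {
		if (cpumask_test_cpu(cpu, &d->cpu_mask))
			break;
	}
	rcu_read_unlock();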


...

> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index b4ed4e1b4938..0620dfc72036 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -209,6 +209,9 @@ static int parse_line(char *line, struct resctrl_schema *s,
>  	struct rdt_domain *d;
>  	unsigned long dom_id;
>  
> +	/* Walking r->domains, ensure it can't race with cpuhp */
> +	lockdep_assert_cpus_held();
> +
>  	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
>  	    (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
>  		rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
> @@ -313,6 +316,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
>  	struct rdt_domain *d;
>  	u32 idx;
>  
> +	/* Walking r->domains, ensure it can't race with cpuhp */
> +	lockdep_assert_cpus_held();
> +
>  	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
>  		return -ENOMEM;
>  
> @@ -378,11 +384,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  		return -EINVAL;
>  	buf[nbytes - 1] = '\0';
>  
> -	cpus_read_lock();
>  	rdtgrp = rdtgroup_kn_lock_live(of->kn);
>  	if (!rdtgrp) {
>  		rdtgroup_kn_unlock(of->kn);
> -		cpus_read_unlock();
>  		return -ENOENT;
>  	}
>  	rdt_last_cmd_clear();
> @@ -444,7 +448,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
>  out:
>  	rdt_staged_configs_clear();
>  	rdtgroup_kn_unlock(of->kn);
> -	cpus_read_unlock();
>  	return ret ?: nbytes;
>  }
>  
> @@ -464,6 +467,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
>  	bool sep = false;
>  	u32 ctrl_val;
>  
> +	/* Walking r->domains, ensure it can't race with cpuhp */
> +	lockdep_assert_cpus_held();
> +
>  	seq_printf(s, "%*s:", max_name_width, schema->name);
>  	list_for_each_entry(dom, &r->domains, list) {
>  		if (sep)
> @@ -535,7 +541,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  	int cpu;
>  
>  	/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> -	lockdep_assert_held(&rdtgroup_mutex);
> +	lockdep_assert_cpus_held();
>  

Only now is that comment accurate. Could it be moved to this patch?

...

> @@ -2948,6 +2975,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
>  	struct rdt_domain *dom;
>  	int ret;
>  
> +	/* Walking r->domains, ensure it can't race with cpuhp */
> +	lockdep_assert_cpus_held();
> +
>  	list_for_each_entry(dom, &r->domains, list) {
>  		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
>  		if (ret)
> @@ -3766,7 +3796,8 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
>  	kfree(d->mbm_local);
>  }
>  
> -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +static void _resctrl_offline_domain(struct rdt_resource *r,
> +				    struct rdt_domain *d)
>  {
>  	lockdep_assert_held(&rdtgroup_mutex);
>  
> @@ -3801,6 +3832,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
>  	domain_destroy_mon_state(d);
>  }
>  
> +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> +	mutex_lock(&rdtgroup_mutex);
> +	_resctrl_offline_domain(r, d);
> +	mutex_unlock(&rdtgroup_mutex);
> +}
> +

This seems unnecessary. Why not keep resctrl_offline_domain() as-is and just
take the lock within it?
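
That is, something like the below (untested sketch, from memory of the
current body, with the early return for the !r->mon_capable case turned
into a goto so the mutex is still released):

	void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
	{
		mutex_lock(&rdtgroup_mutex);

		if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
			mba_sc_domain_destroy(r, d);

		if (!r->mon_capable)
			goto out_unlock;

		/* ... existing monitoring teardown ... */
		domain_destroy_mon_state(d);

	out_unlock:
		mutex_unlock(&rdtgroup_mutex);
	}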

>  static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
>  {
>  	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
> @@ -3832,7 +3870,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
>  	return 0;
>  }
>  
> -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +static int _resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
>  {
>  	int err;
>  
> @@ -3870,12 +3908,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
>  	return 0;
>  }
>  
> +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
> +{
> +	int err;
> +
> +	mutex_lock(&rdtgroup_mutex);
> +	err = _resctrl_online_domain(r, d);
> +	mutex_unlock(&rdtgroup_mutex);
> +
> +	return err;
> +}
> +

Same here.

>  void resctrl_online_cpu(unsigned int cpu)
>  {
> -	lockdep_assert_held(&rdtgroup_mutex);
> -
> +	mutex_lock(&rdtgroup_mutex);
>  	/* The CPU is set in default rdtgroup after online. */
>  	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
> +	mutex_unlock(&rdtgroup_mutex);
>  }
>  
>  static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)

...

Reinette
