linux-kernel - Re: [PATCH v6 24/24] x86/resctrl: Separate arch and fs resctrl locks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d5842ad2-12d2-5f1f-2fcc-0bec49646d06@arm.com>
Date:   Wed, 25 Oct 2023 18:55:54 +0100
From:   James Morse <james.morse@....com>
To:     Reinette Chatre <reinette.chatre@...el.com>, x86@...nel.org,
        linux-kernel@...r.kernel.org
Cc:     Fenghua Yu <fenghua.yu@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        H Peter Anvin <hpa@...or.com>,
        Babu Moger <Babu.Moger@....com>,
        shameerali.kolothum.thodi@...wei.com,
        D Scott Phillips OS <scott@...amperecomputing.com>,
        carl@...amperecomputing.com, lcherian@...vell.com,
        bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
        xingxin.hx@...nanolis.org, baolin.wang@...ux.alibaba.com,
        Jamie Iles <quic_jiles@...cinc.com>,
        Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
        dfustini@...libre.com, amitsinght@...vell.com
Subject: Re: [PATCH v6 24/24] x86/resctrl: Separate arch and fs resctrl locks

Hi Reinette,

On 03/10/2023 22:28, Reinette Chatre wrote:
> On 9/14/2023 10:21 AM, James Morse wrote:
>> resctrl has one mutex that is taken by the architecture specific code,
>> and the filesystem parts. The two interact via cpuhp, where the
>> architecture code updates the domain list. Filesystem handlers that
>> walk the domains list should not run concurrently with the cpuhp
>> callback modifying the list.
>>
>> Exposing a lock from the filesystem code means the interface is not
>> cleanly defined, and creates the possibility of cross-architecture
>> lock ordering headaches. The interaction only exists so that certain
>> filesystem paths are serialised against CPU hotplug. The CPU hotplug
>> code already has a mechanism to do this using cpus_read_lock().
>>
>> MPAM's monitors have an overflow interrupt, so it needs to be possible
>> to walk the domains list in irq context. RCU is ideal for this,
>> but some paths need to be able to sleep to allocate memory.
>>
>> Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part
>> of a cpuhp callback, cpus_read_lock() must always be taken first.
>> rdtgroup_schemata_write() already does this.
>>
>> Most of the filesystem code's domain list walkers are currently
>> protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live().
>> The exceptions are rdt_bit_usage_show() and the mon_config helpers
>> which take the lock directly.
>>
>> Make the domain list protected by RCU. An architecture-specific
>> lock prevents concurrent writers. rdt_bit_usage_show() could
>> walk the domain list using RCU, but to keep all the filesystem
>> operations the same, this is changed to call cpus_read_lock().
>> The mon_config helpers send multiple IPIs, take the cpus_read_lock()
>> in these cases.
>>
>> The other filesystem list walkers need to be able to sleep.
>> Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the
>> cpuhp callbacks can't be invoked when file system operations are
>> occurring.
>>
>> Add lockdep_assert_cpus_held() in the cases where the
>> rdtgroup_kn_lock_live() call isn't obvious.

> One place that does not seem to have this annotation that
> I think is needed is within get_domain_from_cpu(). Starting
> with this series it is called from resctrl_offline_cpu()
> called via CPU hotplug code. From now on extra care needs to be
> taken when trying to call it from anywhere else.

Excellent! This shows that the overflow/limbo threads are now exposed to CPUs going
offline while they run - I'll fix that.

But, this gets called via IPI from rdt_ctrl_update(), and lockdep can't know who the IPI
came from to check the lock was held, so it triggers false positives. This one will look a
bit funny:
|       /*
|        * Walking r->domains, ensure it can't race with cpuhp.
|        * Because this is called via IPI by rdt_ctrl_update(), assertions
|        * about locks this thread holds will lead to false positives. Check
|        * someone is holding the CPUs lock.
|        */
|       if (IS_ENABLED(CONFIG_LOCKDEP))
|               lockdep_is_cpus_held();


>> Resctrl's domain online/offline calls now need to take the
>> rdtgroup_mutex themselves.

>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>> index 1a10f567bbe5..8fd0510d767b 100644
>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>> @@ -25,8 +25,15 @@
>>  #include <asm/resctrl.h>
>>  #include "internal.h"
>>  
>> -/* Mutex to protect rdtgroup access. */
>> -DEFINE_MUTEX(rdtgroup_mutex);
>> +/*
>> + * rdt_domain structures are kfree()d when their last CPU goes offline,
>> + * and allocated when the first CPU in a new domain comes online.
>> + * The rdt_resource's domain list is updated when this happens. Readers of
>> + * the domain list must either take cpus_read_lock(), or rely on an RCU
>> + * read-side critical section, to avoid observing concurrent modification.
>> + * All writers take this mutex:
>> + */
>> +static DEFINE_MUTEX(domain_list_lock);
>>  
> 
> I assume that you have not followed the SNC work. Please note that in 
> that work the domain list is split between a monitoring domain list and
> control domain list. I expect this lock would cover both and both would
> be rcu lists?

It's on my list to read through, but too much arm stuff comes up for me to get to it.
I agree that one write-lock to protect both RCU lists makes sense, those would only ever
be modified together. The case I have for needing to walk the list without taking a lock
only applies to the monitors - but keeping the rules the same makes it easier to think about.


>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> index b4ed4e1b4938..0620dfc72036 100644
>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c

>> @@ -535,7 +541,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>  	int cpu;
>>  
>>  	/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
>> -	lockdep_assert_held(&rdtgroup_mutex);
>> +	lockdep_assert_cpus_held();
>>  
> 
> Only now is that comment accurate. Could it be moved to this patch?

Before this patch resctrl_arch_offline_cpu() took the mutex, if this thread held the
mutex, then cpuhp would get blocked in resctrl_arch_offline_cpu() until it was released.
What has changed is how that mutual-exclusion is provided, but the comment describes why
mutual-exclusion is needed.



>> @@ -3801,6 +3832,13 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
>>  	domain_destroy_mon_state(d);
>>  }
>>  
>> +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
>> +{
>> +	mutex_lock(&rdtgroup_mutex);
>> +	_resctrl_offline_domain(r, d);
>> +	mutex_unlock(&rdtgroup_mutex);
>> +}
>> +

> This seems unnecessary. Why not keep resctrl_offline_domain() as-is and just
> take the lock within it?

For offline there is nothing in it, but ....


>> @@ -3870,12 +3908,23 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
>>  	return 0;
>>  }
>>  
>> +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
>> +{
>> +	int err;
>> +
>> +	mutex_lock(&rdtgroup_mutex);
>> +	err = _resctrl_online_domain(r, d);
>> +	mutex_unlock(&rdtgroup_mutex);
>> +
>> +	return err;
>> +}
>> +
> 
> Same here.

resctrl_online_domain() has four exit paths, like this they can just return an error, and
the locking is taken care of here to keep the churn down.
But it's just preference - I've changed it to do this with a handful of gotos.


Thanks,

James