linux-kernel - Re: [PATCH v7 33/49] x86/resctrl: resctrl_exit() teardown resctrl but leave the mount point

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6c9c4103-45e9-4fc9-93bd-5d6d48b27d79@intel.com>
Date: Fri, 7 Mar 2025 11:04:17 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: James Morse <james.morse@....com>, <x86@...nel.org>,
	<linux-kernel@...r.kernel.org>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>, H Peter Anvin <hpa@...or.com>, Babu Moger
	<Babu.Moger@....com>, <shameerali.kolothum.thodi@...wei.com>, "D Scott
 Phillips OS" <scott@...amperecomputing.com>, <carl@...amperecomputing.com>,
	<lcherian@...vell.com>, <bobo.shaobowang@...wei.com>,
	<tan.shaopeng@...itsu.com>, <baolin.wang@...ux.alibaba.com>, Jamie Iles
	<quic_jiles@...cinc.com>, Xin Hao <xhao@...ux.alibaba.com>,
	<peternewman@...gle.com>, <dfustini@...libre.com>, <amitsinght@...vell.com>,
	David Hildenbrand <david@...hat.com>, Rex Nie <rex.nie@...uarmicro.com>,
	"Dave Martin" <dave.martin@....com>, Koba Ko <kobak@...dia.com>, Shanker
 Donthineni <sdonthineni@...dia.com>, <fenghuay@...dia.com>, Shaopeng Tan
	<tan.shaopeng@...fujitsu.com>, Tony Luck <tony.luck@...el.com>
Subject: Re: [PATCH v7 33/49] x86/resctrl: resctrl_exit() teardown resctrl but
 leave the mount point

Hi James,

On 3/7/25 9:54 AM, James Morse wrote:
> Hi Reinette,
> 
> On 07/03/2025 04:45, Reinette Chatre wrote:
>> On 2/28/25 11:58 AM, James Morse wrote:
>>> resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
>>> resctrl can't be built as a module, and the kernfs helpers are not exported
>>> so this is unlikely to change. MPAM has an error interrupt which indicates
>>> the MPAM driver has gone haywire. Should this occur tasks could run with
>>> the wrong control values, leading to bad performance for important tasks.
>>> The MPAM driver needs a way to tell resctrl that no further configuration
>>> should be attempted.
>>>
>>> Using resctrl_exit() for this leaves the system in a funny state as
>>> resctrl is still mounted, but cannot be un-mounted because the sysfs
>>> directory that is typically used has been removed. Dave Martin suggests
>>> this may cause systemd trouble in the future as not all filesystems
>>> can be unmounted.
>>>
>>> Add calls to remove all the files and directories in resctrl, and
>>> remove the sysfs_remove_mount_point() call that leaves the system
>>> in a funny state. When triggered, this causes all the resctrl files
>>> to disappear. resctrl can be unmounted, but not mounted again.
> 
>> (copying v6 discussion here)
> 
>> On 3/6/25 11:28 AM, James Morse wrote:
>>> On 01/03/2025 02:35, Reinette Chatre wrote:
>>>> On 2/28/25 11:54 AM, James Morse wrote:
>>>>> On 20/02/2025 04:42, Reinette Chatre wrote:
>>
>>>>>> It is difficult for me to follow the kernfs reference counting required
>>>>>> to make this work. Specifically, the root kn is "destroyed" here but it
>>>>>> is required to stick around until unmount when the rest of the files
>>>>>> are removed.
>>>>>
>>>>> This drops resctrl's reference to all of the files, which would make the files disappear.
>>>>> unmount is what calls kernfs_kill_sb(), which gets rid of the root of the filesystem.
>>>>
>>>> My concern is mostly with the kernfs_remove() calls in the rdt_kill_sb()->rmdir_all_sub()
>>>> flow. For example:
>>>> 	kernfs_remove(kn_info);
>>>> 	kernfs_remove(kn_mongrp);
>>>> 	kernfs_remove(kn_mondata);
>>>>
>>>> As I understand the above require the destroyed root to still be around.
> 
>>> Right - because rdt_get_tree() has these global pointers into the hierarchy, but doesn't
>>> take a reference. rmdir_all_sub() relies on always being called before
>>> rdtgroup_destroy_root().
> 
>> Is this a known issue then?
> 
> Its just a subtle thing that resctrl was relying on, and I didn't spot. Thanks for
> pointing it out!
> 
> 
>> Since I am not able to use your test I created something new
>> after thinking there would be no response to my comment and indeed on unmount:
>> [  293.707228] BUG: KASAN: slab-use-after-free in kernfs_remove+0x87/0xa0
>> [  293.714718] Read of size 8 at addr ff11000309d88f30 by task umount/3793
>>> The point hack would be for rdtgroup_destroy_root() to NULL out those global pointers, (I
>>> note they are left dangling) - that would make a subsequent call to rmdir_all_sub() harmless.
>>>
>>> A better fix would be to pull out all the filesystem relevant parts from rdt_kill_sb(),
>>> make that safe for multiple calls and get resctrl_exit() to call that.
>>> A call to rdt_kill_sb() after resctrl_exit() would just cleanup the super-block.
>>> This will leave things in a more predictable state.
> 
> 
>> Why just the filesystem relevant parts?
> 
> The stated aim is to prevent any further configuration of the hardware.
> We can change that to 'unmount as much as possible'. I didn't think of it like that as I
> couldn't find an in-kernel way of triggering a umount(). (and the mount may be copied into
> numerous namespaces)

deactivate_locked_super() looked promising when I started digging but I
quickly found myself in unfamiliar territory.

> 
> 
>> Although, you also state "resctrl_exit() would just
>> cleanup the super-block" that sounds like you are thinking about pulling out all reset work.
>> This sounds reasonable to me. It really feels more appropriate to do proper cleanup and
>> not just wipe the root while leaving everything else underneath it.
> 
> After this discussion, my new proposal is to do this:
> -------%<-------
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 0d74a6d98dba..cee4604a59c0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3063,6 +3063,21 @@ static void rmdir_all_sub(void)
>         kernfs_remove(kn_mondata);
>  }
> 
> +static void resctrl_fs_teardown(void)
> +{
> +       lockdep_assert_held(&rdtgroup_mutex);
> +
> +       /* Cleared by rdtgroup_destroy_root() */
> +       if (!rdtgroup_default.kn)
> +               return;
> +
> +       rmdir_all_sub();
> +       rdt_pseudo_lock_release();
> +       rdtgroup_default.mode = RDT_MODE_SHAREABLE;
> +       schemata_list_destroy();
> +       closid_exit();
> +       rdtgroup_destroy_root();
> +}
> +
>  static void rdt_kill_sb(struct super_block *sb)
>  {
>         struct rdt_resource *r;
> @@ -3076,11 +3091,8 @@ static void rdt_kill_sb(struct super_block *sb)
>         for_each_alloc_capable_rdt_resource(r)
>                 resctrl_arch_reset_all_ctrls(r);
> 
> -       rmdir_all_sub();
> -       rdt_pseudo_lock_release();
> -       rdtgroup_default.mode = RDT_MODE_SHAREABLE;
> -       schemata_list_destroy();
> -       rdtgroup_destroy_root();
> +       resctrl_fs_teardown();
> +
>         if (resctrl_arch_alloc_capable())
>                 resctrl_arch_disable_alloc();
>         if (resctrl_arch_mon_capable())
>                 resctrl_arch_disable_mon();
> -       closid_exit();
>         resctrl_mounted = false;
>         kernfs_kill_sb(sb);
>         mutex_unlock(&rdtgroup_mutex);
> -------%<-------
> 
> and call resctrl_fs_teardown() from resctrl_exit().
> 
> This leaves a bunch of resctrl_arch_ calls behind, the reason is its the arch code that
> calls resctrl_exit(), so it probably doesn't need to be told to disable things. For MPAM,
> the work of resctrl_arch_reset_all_ctrls() will already have been done - and all these
> calls will fail because the MPAM driver is blocking further access to the hardware.

As I understand it the overflow and limbo handlers will keep running after a resctrl fs teardown
done on error interrupt. As you mention in changelog this work is done because "The MPAM
driver needs a way to tell resctrl that no further configuration should be attempted.".
I believe these handlers may thus still try to interact (but not technically "configure"?)
with the hardware at this point ... is this ok?

rdt_disable_ctx() is an odd one to keep with the resctrl_arch_ calls that remain in rdt_kill_sb().
It may be a candidate for resctrl_fs_teardown() but unfortunately rdt_disable_ctx()
blurs fs/arch parts. What I am focusing on is the set_mba_sc(false) within it. Seems like this
may mean that the software controller will keep running unnecessarily. 

Thank you

Reinette