Message-ID: <0aff0325-e410-4b14-aa69-adfabd0ac0b0@arm.com>
Date: Fri, 28 Feb 2025 19:53:21 +0000
From: James Morse <james.morse@....com>
To: Reinette Chatre <reinette.chatre@...el.com>, x86@...nel.org,
linux-kernel@...r.kernel.org
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, H Peter Anvin <hpa@...or.com>,
Babu Moger <Babu.Moger@....com>, shameerali.kolothum.thodi@...wei.com,
D Scott Phillips OS <scott@...amperecomputing.com>,
carl@...amperecomputing.com, lcherian@...vell.com,
bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
baolin.wang@...ux.alibaba.com, Jamie Iles <quic_jiles@...cinc.com>,
Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
dfustini@...libre.com, amitsinght@...vell.com,
David Hildenbrand <david@...hat.com>, Rex Nie <rex.nie@...uarmicro.com>,
Dave Martin <dave.martin@....com>, Koba Ko <kobak@...dia.com>,
Shanker Donthineni <sdonthineni@...dia.com>
Subject: Re: [PATCH v6 37/42] x86/resctrl: Expand the width of dom_id by
replacing mon_data_bits
Hi Reinette,
On 20/02/2025 05:40, Reinette Chatre wrote:
> On 2/7/25 10:18 AM, James Morse wrote:
>> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
>> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
>> the domain-id, and is packed into the mon_data_bits union bitfield.
>> The width of cache-id in this field is 14 bits.
>>
>> Expanding the union would break 32bit x86 platforms as this union is
>> stored as the kernfs kn->priv pointer. This saved allocating memory
>> for the priv data storage.
>>
>> The firmware on MPAM platforms has used the PPTT cache-id field to
>> expose the interconnect's id for the cache, which is sparse and uses
>> more than 14 bits. This id is used to enable PCIe direct cache
>> injection hints. Using this feature with VFIO means the value provided
>> by the ACPI table should be exposed to user-space.
>>
>> To support cache-id values greater than 14 bits, convert the
>> mon_data_bits union to a structure. This is allocated when the kernfs
>> file is created, and free'd when the monitor directory is rmdir'd.
>> Readers and writers must hold the rdtgroup_mutex, and readers should
>> check for a NULL pointer to protect against an open file preventing
>> the kernfs file from being free'd immediately after the rmdir call.
> The last sentence is difficult to parse and took me many reads. I see
> two major parts to this statement and if I understand correctly the current
> implementation combined with this patch does not support either.
> (a) "checking for a NULL pointer from readers"
> The reader is rdtgroup_mondata_show() and it starts by calling:
> rdtgrp = rdtgroup_kn_lock_live(of->kn);
> As I understand, on return of rdtgroup_kn_lock_live() the kernfs node
> "of->kn" may no longer exist. This seems to be an issue with current code
> also.
> Considering this, it seems to me that checking if of->kn->priv is NULL
> may be futile if of->kn may no longer exist.
Certainly true.
Because the lifetime is different to the existing pointer-abuse version, I just added the
checks to be on the safe side.
I'll rip this out.
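(For anyone reading along without the patch in front of them, a rough before/after sketch of
the shape being discussed. The 'before' union is roughly the current bitfield layout - the
commit message above quotes the 14-bit domid - and the 'after' field names are only
illustrative, not quoted from the series:)

/* Before: rid/evtid/domid are packed into a pointer-sized union that is
 * stored directly in kn->priv - the 14-bit domid is what can't hold a
 * full 32-bit PPTT cache-id. */
union mon_data_bits {
	void *priv;
	struct {
		unsigned int rid		: 10;
		enum resctrl_event_id evtid	: 8;
		unsigned int domid		: 14;
	} u;
};

/* After (sketch): a real allocation hung off kn->priv, so domid can be a
 * full 32 bits.  Allocated when the kernfs file is created, freed when
 * the monitor directory is removed. */
struct mon_data {
	enum resctrl_res_level	rid;
	enum resctrl_event_id	evtid;
	u32			domid;	/* the PPTT cache-id, no longer truncated */
};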
> I think this also needs a reference to the data needed by the file or
> the data needs to be stashed away before the call to
> kernfs_break_active_protection().
I've tried to hit this problem and have been unable to. I'm happy to write it off as theoretical.
In particular:
* rmdir a control group while holding the mbm_local_bytes file open for reading. Any read
after the parent control group has been destroyed gets -ENODEV, even though
/proc/<pid>/fd shows the fd as open for reading. (The kernel in question had lockdep and
KASAN enabled.)
* take all the CPUs in a domain offline while holding the mbm_local_bytes file open for
reading. Again, read attempts get -ENODEV.
> (b) "...being free'd immediately after the rmdir call"
> I believe this refers to expectation that one task may have the file open
> while another removes the resource group directory ("rmdir") with the
> assumption that the associated struct mon_data is removed during handling
> of rmdir.
This is what I was worried about - and it seemed worth chucking in a NULL check just in
case. Trying a bit harder to hit it - it now seems theoretical.
> In this implementation the monitoring data file's struct mon_data
> is only removed when a monitoring domain goes offline.
> That is, when the
> resource group remains intact while the monitoring data files associated with
> one domain are removed (for example when all CPUs associated with that domain
> go offline). The "rmdir" to remove a resource group does not call this code
> (mon_rmdir_one_subdir()), nor does the cleanup of the default resource group's
> "kn_mondata".
Huh, it's the path via user-space calling rmdir() that I was worried about. I hadn't
spotted that there are two of these and they aren't joined up!
This would leak the priv pointer when the user-space path via rmdir() just leaves the
cleanup to kernfs.
Fixing this produces even more spaghetti as domain-offline manipulates one domain in all
rdtgroups, whereas rmdir manipulates all domains in one rdtgroup. It's going to be noisy to
merge these two paths.
A simpler approach is to use the event kn->priv pointers in the default control group as
the canonical copy, which also saves memory. For mbm_total in a domain, every control and
monitor group has the same values in struct mon_data_bits - the RMID is found by walking
up the tree to find the struct rdtgroup.
As user-space can't rmdir the default control group, we only need to free it on
domain-offline, when we know all the files for that domain are going to be removed - which
means we don't have to do the freeing in any particular order.
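Roughly, the reader side of that would look something like this (a sketch only - the shared
structure and its field names are placeholders, the locking/lookup helpers are the existing
ones):

/* Sketch: of->kn->priv points at a struct mon_data owned by the default
 * control group's file for this event/domain, shared by every control and
 * monitor group.  The only per-group values needed are the CLOSID/RMID,
 * which come from the rdtgroup that rdtgroup_kn_lock_live() finds by
 * walking up the kernfs tree. */
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
	struct kernfs_open_file *of = m->private;
	struct rdtgroup *rdtgrp;
	struct mon_data *md;
	int ret = 0;

	rdtgrp = rdtgroup_kn_lock_live(of->kn);
	if (!rdtgrp) {
		ret = -ENOENT;
		goto out;
	}

	md = of->kn->priv;	/* canonical copy, freed on domain-offline */

	/* ... use md->rid, md->evtid and md->domid to find the domain, and
	 * rdtgrp->closid / rdtgrp->mon.rmid to pick the counter to read ... */
out:
	rdtgroup_kn_unlock(of->kn);
	return ret;
}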
> I am trying to get a handle on the different lifetimes and if I understand
> correctly this implementation does not attempt to keep the struct mon_data
> accessible as long as the file is open.
No, but I think that concern is theoretical...
> I do not think I've discovered all the implications of this yet.
Thanks,
James