Message-ID: <0aff0325-e410-4b14-aa69-adfabd0ac0b0@arm.com>
Date: Fri, 28 Feb 2025 19:53:21 +0000
From: James Morse <james.morse@....com>
To: Reinette Chatre <reinette.chatre@...el.com>, x86@...nel.org,
linux-kernel@...r.kernel.org
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, H Peter Anvin <hpa@...or.com>,
Babu Moger <Babu.Moger@....com>, shameerali.kolothum.thodi@...wei.com,
D Scott Phillips OS <scott@...amperecomputing.com>,
carl@...amperecomputing.com, lcherian@...vell.com,
bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
baolin.wang@...ux.alibaba.com, Jamie Iles <quic_jiles@...cinc.com>,
Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
dfustini@...libre.com, amitsinght@...vell.com,
David Hildenbrand <david@...hat.com>, Rex Nie <rex.nie@...uarmicro.com>,
Dave Martin <dave.martin@....com>, Koba Ko <kobak@...dia.com>,
Shanker Donthineni <sdonthineni@...dia.com>
Subject: Re: [PATCH v6 37/42] x86/resctrl: Expand the width of dom_id by
replacing mon_data_bits
Hi Reinette,
On 20/02/2025 05:40, Reinette Chatre wrote:
> On 2/7/25 10:18 AM, James Morse wrote:
>> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
>> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
>> the domain-id, and is packed into the mon_data_bits union bitfield.
>> The width of cache-id in this field is 14 bits.
>>
>> Expanding the union would break 32bit x86 platforms as this union is
>> stored as the kernfs kn->priv pointer. This saved allocating memory
>> for the priv data storage.
>>
>> The firmware on MPAM platforms has used the PPTT cache-id field to
>> expose the interconnect's id for the cache, which is sparse and uses
>> more than 14 bits. This id is used to enable PCIe direct cache
>> injection hints. Using this feature with VFIO means the value provided
>> by the ACPI table should be exposed to user-space.
>>
>> To support cache-id values greater than 14 bits, convert the
>> mon_data_bits union to a structure. This is allocated when the kernfs
>> file is created, and free'd when the monitor directory is rmdir'd.
>> Readers and writers must hold the rdtgroup_mutex, and readers should
>> check for a NULL pointer to protect against an open file preventing
>> the kernfs file from being free'd immediately after the rmdir call.
> The last sentence is difficult to parse and took me many reads. I see
> two major parts to this statement and if I understand correctly the current
> implementation combined with this patch does not support either.
> (a) "checking for a NULL pointer from readers"
> The reader is rdtgroup_mondata_show() and it starts by calling:
> rdtgrp = rdtgroup_kn_lock_live(of->kn);
> As I understand, on return of rdtgroup_kn_lock_live() the kernfs node
> "of->kn" may no longer exist. This seems to be an issue with current code
> also.
> Considering this, it seems to me that checking if of->kn->priv is NULL
> may be futile if of->kn may no longer exist.
Certainly true.
Because the lifetime is different to the existing pointer-abuse version, I just added the
checks to be on the safe side.
I'll rip this out.
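(For anyone reading along without the patch in front of them, a rough before/after sketch of
the shape being discussed. The 'before' union is roughly the current bitfield layout - the
commit message above quotes the 14-bit domid - and the 'after' field names are only
illustrative, not quoted from the series:)

/* Before: rid/evtid/domid are packed into a pointer-sized union that is
 * stored directly in kn->priv - the 14-bit domid is what can't hold a
 * full 32-bit PPTT cache-id. */
union mon_data_bits {
	void *priv;
	struct {
		unsigned int rid		: 10;
		enum resctrl_event_id evtid	: 8;
		unsigned int domid		: 14;
	} u;
};

/* After (sketch): a real allocation hung off kn->priv, so domid can be a
 * full 32 bits.  Allocated when the kernfs file is created, freed when
 * the monitor directory is removed. */
struct mon_data {
	enum resctrl_res_level	rid;
	enum resctrl_event_id	evtid;
	u32			domid;	/* the PPTT cache-id, no longer truncated */
};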
> I think this also needs a reference to the data needed by the file or
> the data needs to be stashed away before the call to
> kernfs_break_active_protection().
I've tried to hit this problem and have been unable to. I'm happy to write it off as theoretical.
In particular:
* rmdir a control group while holding the mbm_local_bytes file open for reading. Any read
after the parent control group has been destroyed gets -ENODEV, even though
/proc/<pid>/fd shows the fd as open for reading. (The kernel in question had lockdep and
KASAN enabled.)
* take all the CPUs in a domain offline while holding the mbm_local_bytes file open for
reading. Again, read attempts get -ENODEV.
> (b) "...being free'd immediately after the rmdir call"
> I believe this refers to expectation that one task may have the file open
> while another removes the resource group directory ("rmdir") with the
> assumption that the associated struct mon_data is removed during handling
> of rmdir.
This is what I was worried about - and it seemed worth chucking in a NULL check just in
case. Trying a bit harder to hit it - it now seems theoretical.
> In this implementation the monitoring data file's struct mon_data
> is only removed when a monitoring domain goes offline.
> That is, when the
> resource group remains intact while the monitoring data files associated with
> one domain are removed (for example when all CPUs associated with that domain
> go offline). The "rmdir" to remove a resource group does not call this code
> (mon_rmdir_one_subdir()), nor does the cleanup of the default resource group's
> "kn_mondata".
Huh, it's the path via user-space calling rmdir() that I was worried about. I hadn't
spotted that there are two of these and they aren't joined up!
This would leak the priv pointer when the user-space path via rmdir() just leaves the
cleanup to kernfs.
Fixing this produces even more spaghetti as domain-offline manipulates one domain in all
rdtgroups, whereas rmdir manipulates all domains in one rdtgroup. It's going to be noisy to
merge these two paths.
A simpler approach is to use the event kn->priv pointers in the default control group as
the canonical copy, which also saves memory. For mbm_total in a domain, every control and
monitor group has the same values in struct mon_data_bits - the RMID is found by walking
up the tree to find the struct rdtgroup.
As user-space can't rmdir the default control group, we only need to free it on
domain-offline, when we know all the files for that domain are going to be removed - which
means we don't have to do the freeing in any particular order.
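Roughly, the reader side of that would look something like this (a sketch only - the shared
structure and its field names are placeholders, the locking/lookup helpers are the existing
ones):

/* Sketch: of->kn->priv points at a struct mon_data owned by the default
 * control group's file for this event/domain, shared by every control and
 * monitor group.  The only per-group values needed are the CLOSID/RMID,
 * which come from the rdtgroup that rdtgroup_kn_lock_live() finds by
 * walking up the kernfs tree. */
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
	struct kernfs_open_file *of = m->private;
	struct rdtgroup *rdtgrp;
	struct mon_data *md;
	int ret = 0;

	rdtgrp = rdtgroup_kn_lock_live(of->kn);
	if (!rdtgrp) {
		ret = -ENOENT;
		goto out;
	}

	md = of->kn->priv;	/* canonical copy, freed on domain-offline */

	/* ... use md->rid, md->evtid and md->domid to find the domain, and
	 * rdtgrp->closid / rdtgrp->mon.rmid to pick the counter to read ... */
out:
	rdtgroup_kn_unlock(of->kn);
	return ret;
}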
> I am trying to get a handle on the different lifetimes and if I understand
> correctly this implementation does not attempt to keep the struct mon_data
> accessible as long as the file is open.
No, but I think that concern is theoretical...
> I do not think I've discovered all the implications of this yet.
Thanks,
James