lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e1816cbf-e2a7-44cf-92f9-bbd24d9e264b@intel.com>
Date: Thu, 6 Mar 2025 21:03:56 -0800
From: Reinette Chatre <reinette.chatre@...el.com>
To: James Morse <james.morse@....com>, <x86@...nel.org>,
	<linux-kernel@...r.kernel.org>
CC: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
	Borislav Petkov <bp@...en8.de>, H Peter Anvin <hpa@...or.com>, Babu Moger
	<Babu.Moger@....com>, <shameerali.kolothum.thodi@...wei.com>, "D Scott
 Phillips OS" <scott@...amperecomputing.com>, <carl@...amperecomputing.com>,
	<lcherian@...vell.com>, <bobo.shaobowang@...wei.com>,
	<tan.shaopeng@...itsu.com>, <baolin.wang@...ux.alibaba.com>, Jamie Iles
	<quic_jiles@...cinc.com>, Xin Hao <xhao@...ux.alibaba.com>,
	<peternewman@...gle.com>, <dfustini@...libre.com>, <amitsinght@...vell.com>,
	David Hildenbrand <david@...hat.com>, Rex Nie <rex.nie@...uarmicro.com>,
	"Dave Martin" <dave.martin@....com>, Koba Ko <kobak@...dia.com>, Shanker
 Donthineni <sdonthineni@...dia.com>, <fenghuay@...dia.com>
Subject: Re: [PATCH v7 37/49] x86/resctrl: Expand the width of dom_id by
 replacing mon_data_bits

Hi James,

On 2/28/25 11:59 AM, James Morse wrote:
> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
> the domain-id, and is packed into the mon_data_bits union bitfield.
> The width of cache-id in this field is 14 bits.
> 
> Expanding the union would break 32bit x86 platforms as this union is
> stored as the kernfs kn->priv pointer. This saved allocating memory
> for the priv data storage.
> 
> The firmware on MPAM platforms have used the PPTT cache-id field to
> expose the interconnect's id for the cache, which is sparse and uses
> more than 14 bits. Use of this id is to enable PCIe direct cache
> injection hints. Using this feature with VFIO means the value provided
> by the ACPI table should be exposed to user-space.
> 
> To support cache-id values greater than 14 bits, convert the
> mon_data_bits union to a structure. This is allocated for the default
> control group when the kernfs event files are created, and free'd when
> the monitor directory is rmdir'd when the domain goes offline.
> All other control and monitor groups lookup the struct mon_data allocated
> for the default control group, and use this.
> This simplifies the lifecycle of this structure as the default control
> group cannot be rmdir()d by user-space, so only needs to consider
> domain-offline, which removes all the event files corresponding to a
> domain while holding rdtgroup_mutex - which prevents concurrent
> readers. mkdir_mondata_subdir_allrdtgrp() must special case the default
> control group to ensure it is created first.

This is novel.

> 
> Signed-off-by: James Morse <james.morse@....com>
> ---
> Previously the MPAM tree repainted the cache-id to compact them,
> argue-ing there was no other user. With VFIO use of this PCIe feature,
> this is no longer an option.
> 
> Changes since v6:
>  * Added the get/put helpers.
>  * Special case the creation of the mondata files for the default control
>    group.
>  * Removed wording about files living longer than expected, the corresponding
>    error handling is wrapped in WARN_ON_ONCE() as this indicates a bug.
> ---

...

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aecd3fa734cd..443635d195f0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -3114,6 +3114,110 @@ static struct file_system_type rdt_fs_type = {
>  	.kill_sb		= rdt_kill_sb,
>  };
>  
> +/**
> + * mon_get_default_kn_priv() - Get the mon_data priv data for this event from
> + *                             the default control group.

Since this involves monitoring this would technically be the default "monitoring"
group (throughout). 

> + * Called when monitor event files are created for a domain.
> + * When called with the default control group, the structure will be allocated.

A bit difficult to parse. Assuming there is a re-spin, how about something like:
"Allocate the structure when @rdtgrp is the default group."


> + * This happens at mount time, before other control or monitor groups are
> + * created.
> + * This simplifies the lifetime management for rmdir() versus domain-offline
> + * as the default control group lives forever, and only one group needs to be
> + * special cased.
> + *
> + * @r:      The resource for the event type being created.
> + * @d:	    The domain for the event type being created.

Stray tab makes for inconsisent spacing.

> + * @mevt:   The event type being created.
> + * @rdtgrp: The rdtgroup for which the monitor file is being created,
> + *          used to determine if this is the default control group.
> + * @do_sum: Whether the SNC sub-numa node monitors are being created.

do_sum can be true or false when it comes to the SNC files (more below). 

> + */
> +static struct mon_data *mon_get_default_kn_priv(struct rdt_resource *r,
> +						struct rdt_mon_domain *d,
> +						struct mon_evt *mevt,
> +						struct rdtgroup *rdtgrp,
> +						bool do_sum)
> +{
> +	struct kernfs_node *kn_dom, *kn_evt;
> +	struct mon_data *priv;
> +	bool snc_mode;
> +	char name[32];
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	snc_mode = r->mon_scope == RESCTRL_L3_NODE;
> +	if (!do_sum)
> +		sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);

When in SNC mode the "mon_<resource name>_ files always use d->ci->id as the domain id.

> +	else
> +		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
> +


The mon_sub_<resource name>_ directories are always created when in SNC mode, they do
not exist on non SNC enabled systems. The mon_<resource name>_ directories exists on
both SNC enabled and non-SNC/SNC disabled systems. The mon_<resource name>_ directories
on SNC enabled system will have "do_sum" set. I think what you may be trying to do
here is something like:

	if (!snc_mode) { /* do_sum is not relevant */
		sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
	} else if (snc_mode && do_sum) {
		sprintf(name, "mon_%s_%02d", r->name, d->ci->id);
	} else { /* snc_mode && !do_sum */
		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
	}

> +	kn_dom = kernfs_find_and_get(kn_mondata, name);
> +	if (!kn_dom)
> +		return NULL;

It seems like this either assumes the directories are on the same level or assumes
kernfs_find_and_get() does a recursive find. As I understand kernfs_find_and_get()
does not do a recursive find.  On SNC enabled systems the mon_sub_<resource name>_
directories are subdirectories of the mon_<resource name>_ directories.
Example of how hierarchy may look is at:
https://lore.kernel.org/all/20240628215619.76401-9-tony.luck@intel.com/

With all of the above I do not think this will work on an SNC enabled
system ... to confirm this I tried it out and it is not possible to mount
resctrl on an SNC enabled system and the WARN_ON_ONCE() this patch adds to
mon_add_all_files() is hit.

> +
> +	kn_evt = kernfs_find_and_get(kn_dom, mevt->name);

Note the "...and_get..." in kernfs_find_and_get() that gets a reference to
the kn before returning it. I expect that this work will have symmetry when it
comes to get/put of references but I see four new calls to kernfs_find_and_get() but
no new matching kernfs_put() to release the new references. It looks like
kernfs_find_and_get() is just used to figure out what the kn is so the references
need not be kept around for long.

> +
> +	/* Is this the creation of the default groups monitor files? */
> +	if (!kn_evt && rdtgrp == &rdtgroup_default) {
> +		priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +		if (!priv)
> +			return NULL;
> +		priv->rid = r->rid;
> +		priv->domid = do_sum ? d->ci->id : d->hdr.id;
> +		priv->sum = do_sum;
> +		priv->evtid = mevt->evtid;
> +		return priv;
> +	}
> +
> +	if (!kn_evt)
> +		return NULL;
> +
> +	return kn_evt->priv;
> +}
> +
> +/**
> + * mon_put_default_kn_priv_all() - Potentially free the mon_data priv data for
> + *                                 all events from the default control group.
> + * Put the mon_data priv data for all events for a particular domain.
> + * When called with the default control group, the priv structure previously
> + * allocated will be kfree()d. This should only be done as part of taking a
> + * domain offline.
> + * Only a domain offline will 'rmdir' monitor files in the default control
> + * group. After domain offline releases rdtgrp_mutex, all references will
> + * have been removed.
> + *
> + * @rdtgrp:  The rdtgroup for which the monitor files are being removed,
> + *           used to determine if this is the default control group.
> + * @name:    The name of the domain or SNC sub-numa domain which is being
> + *           taken offline.

This is a bit confusing since domains do not have names. How about (please feel
free to improve):
"Name of directory containing monitoring files that is in process of being removed."

> + */
> +static void mon_put_default_kn_priv_all(struct rdtgroup *rdtgrp, char *name)
> +{
> +	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
> +	struct kernfs_node *kn_dom, *kn_evt;
> +	struct mon_evt *mevt;
> +
> +	lockdep_assert_held(&rdtgroup_mutex);
> +
> +	if (rdtgrp != &rdtgroup_default)
> +		return;
> +
> +	kn_dom = kernfs_find_and_get(kn_mondata, name);
> +	if (!kn_dom)
> +		return;

I expect this will always fail when @name is a mon_sub_* directory.

> +
> +	list_for_each_entry(mevt, &r->evt_list, list) {
> +		kn_evt = kernfs_find_and_get(kn_dom, mevt->name);
> +		if (!kn_evt)
> +			continue;
> +		if (!kn_evt->priv)
> +			continue;
> +
> +		kfree(kn_evt->priv);
> +		kn_evt->priv = NULL;
> +	}
> +}
> +
>  static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>  		       void *priv)
>  {
> @@ -3135,19 +3239,25 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>  	return ret;
>  }
>  
> -static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname)
> +static void mon_rmdir_one_subdir(struct rdtgroup *rdtgrp, char *name, char *subname)
>  {
> +	struct kernfs_node *pkn = rdtgrp->mon.mon_data_kn;
>  	struct kernfs_node *kn;
>  
>  	kn = kernfs_find_and_get(pkn, name);
>  	if (!kn)
>  		return;
> +
> +	mon_put_default_kn_priv_all(rdtgrp, name);
> +
>  	kernfs_put(kn);
>  
> -	if (kn->dir.subdirs <= 1)
> +	if (kn->dir.subdirs <= 1) {
>  		kernfs_remove(kn);
> -	else
> +	} else {
> +		mon_put_default_kn_priv_all(rdtgrp, subname);
>  		kernfs_remove_by_name(kn, subname);
> +	}
>  }
>  
>  /*
> @@ -3170,10 +3280,10 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>  		sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
>  
>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> -		mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
> +		mon_rmdir_one_subdir(prgrp, name, subname);
>  
>  		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
> -			mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname);
> +			mon_rmdir_one_subdir(crgrp, name, subname);
>  	}
>  }
>  
> @@ -3182,19 +3292,19 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
>  			     bool do_sum)
>  {
>  	struct rmid_read rr = {0};
> -	union mon_data_bits priv;
> +	struct mon_data *priv;
>  	struct mon_evt *mevt;
>  	int ret;
>  
>  	if (WARN_ON(list_empty(&r->evt_list)))
>  		return -EPERM;
>  
> -	priv.u.rid = r->rid;
> -	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
> -	priv.u.sum = do_sum;
>  	list_for_each_entry(mevt, &r->evt_list, list) {
> -		priv.u.evtid = mevt->evtid;
> -		ret = mon_addfile(kn, mevt->name, priv.priv);
> +		priv = mon_get_default_kn_priv(r, d, mevt, prgrp, do_sum);
> +		if (WARN_ON_ONCE(!priv))
> +			return -EINVAL;
> +

This is the warning I hit on the SNC system.

> +		ret = mon_addfile(kn, mevt->name, priv);
>  		if (ret)
>  			return ret;
>  
> @@ -3274,9 +3384,17 @@ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
>  	struct rdtgroup *prgrp, *crgrp;
>  	struct list_head *head;
>  
> +	/*
> +	 * During domain-online create the default control group first
> +	 * so that mon_get_default_kn_priv() can find the allocated structure
> +	 * on subsequent calls.
> +	 */
> +	mkdir_mondata_subdir(kn_mondata, d, r, &rdtgroup_default);
> +
>  	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
>  		parent_kn = prgrp->mon.mon_data_kn;
> -		mkdir_mondata_subdir(parent_kn, d, r, prgrp);
> +		if (prgrp != &rdtgroup_default)
> +			mkdir_mondata_subdir(parent_kn, d, r, prgrp);
>  
>  		head = &prgrp->mon.crdtgrp_list;
>  		list_for_each_entry(crgrp, head, mon.crdtgrp_list) {

Reinette

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ