Message-ID: <alpine.DEB.2.20.1610172157360.6407@nanos>
Date:   Mon, 17 Oct 2016 23:14:55 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Fenghua Yu <fenghua.yu@...el.com>
cc:     "H. Peter Anvin" <h.peter.anvin@...el.com>,
        Ingo Molnar <mingo@...e.hu>, Tony Luck <tony.luck@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Stephane Eranian <eranian@...gle.com>,
        Borislav Petkov <bp@...e.de>,
        Dave Hansen <dave.hansen@...el.com>,
        Nilay Vaish <nilayvaish@...il.com>, Shaohua Li <shli@...com>,
        David Carrillo-Cisneros <davidcc@...gle.com>,
        Ravi V Shankar <ravi.v.shankar@...el.com>,
        Sai Prakhya <sai.praneeth.prakhya@...el.com>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        linux-kernel <linux-kernel@...r.kernel.org>, x86 <x86@...nel.org>
Subject: Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file
 system

On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +/*
> + * Trivial allocator for CLOSIDs. Since h/w only supports a small number,
> + * we can keep a bitmap of free CLOSIDs in a single integer.
> + *
> + * Please note: This only supports global CLOSID across multiple
> + * resources and multiple sockets. User can create rdtgroups including root
> + * rdtgroup up to the number of CLOSIDs, which is 16 on Broadwell. When
> + * number of caches is big or number of supported resources sharing CLOSID
> + * is growing, it's getting harder to find usable rdtgroups which is limited
> + * by the small number of CLOSIDs.
> + *
> + * In the future, if it's necessary, we can implement more complex CLOSID
> + * allocation per socket/per resource domain and utilize CLOSIDs as many
> + * as possible. E.g. on 2-socket Broadwell, user can create upto 16x16=256
> + * rdtgroups and each rdtgroup has different combination of two L3 CBMs.

I'm confused as usual, but a two-socket Broadwell has exactly two L3 cache
domains and exactly 16 CLOSIDs per cache domain.

If you take CDP into account then the number of CLOSIDs is reduced to 8 per
cache domain.

So we can never have more than nr(CLOSIDs) * nr(L3 cache domains) unique
settings. For a two-socket Broadwell that's 32 for !CDP and 16 for CDP.

With the proposed user interface the number of unique rdtgroups is simply
the number of CLOSIDs, because we already handle the cache domains per
resource, i.e. the meaning of a CLOSID can be set independently per cache
domain.

Can you please explain why you think that we can have 16x16 unique
rdtgroups if we just have 16 resp. 8 CLOSIDs available?
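
For illustration, this is roughly what the per-domain data looks like
(sketch only, field names made up):

	struct rdt_domain {
		struct list_head	list;
		int			id;		/* cache domain id */
		u32			cbm[16];	/* one CBM per CLOSID */
	};

Each cache domain carries its own array of CBMs indexed by CLOSID, so a
single CLOSID already selects an independent mask per domain. 16 CLOSIDs
therefore give at most 16 rdtgroups, each with its own combination of two
L3 CBMs on a two-socket machine.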

> +static int closid_free_map;
> +
> +static void closid_init(void)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
> +	int rdt_max_closid;
> +
> +	/* Enabling L3 CDP halves the number of CLOSIDs */
> +	if (r->cdp_enabled)
> +		r->num_closid = r->max_closid / 2;
> +	else
> +		r->num_closid = r->max_closid;

If you do the L3Data/L3Code thingy then this can go away.
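
I.e. enumerate CDP as two separate resources, each with half the CLOSID
space, instead of fiddling with num_closid at mount time. Untested sketch,
resource names illustrative:

	/* max_closid as enumerated for L3 via CPUID(0x10) */
	if (rdt_cdp_enabled) {
		rdt_resources_all[RDT_RESOURCE_L3DATA].num_closid = max_closid / 2;
		rdt_resources_all[RDT_RESOURCE_L3CODE].num_closid = max_closid / 2;
	}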

> +	/* Compute rdt_max_closid across all resources */
> +	rdt_max_closid = 0;
> +	for_each_rdt_resource(r)
> +		rdt_max_closid = max(rdt_max_closid, r->num_closid);

Oh no! This needs to be min().

Assume you have a system with both L3 and L2 CAT. L2 reports COS_MAX=16 and
L3 reports COS_MAX=16 as well. Then you enable CDP, which cuts L3 COS_MAX in
half. So the real usable number of CLOSIDs is 8 for both L2 and L3, simply
because there is no separation of L2 and L3 in MSR_PQR_ASSOC. And if you
allow 16, then any CLOSID >= 8 will result in undefined behaviour. See SDM:

 "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
  of the lower half of the COS space will cause undefined performance impact
  to code and data fetches due to MSR space re-indexing into code/data masks
  when CDP is enabled."
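
So the computation wants to be something like this (untested sketch):

	int rdt_max_closid = INT_MAX;

	/*
	 * MSR_PQR_ASSOC.COS is shared by all resources, so the usable
	 * CLOSID space is capped by the most constrained resource.
	 */
	for_each_rdt_resource(r)
		rdt_max_closid = min(rdt_max_closid, r->num_closid);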

> +	if (rdt_max_closid > 32) {
> +		pr_warn_once("Only using 32/%d CLOSIDs\n", rdt_max_closid);

You have a fancy for pr_*_once(). How often is this going to be executed?
Once per mount, so we really can print it every time; mount is hardly a
fast-path operation.

Also please spell it out: "Using only 32 of %d CLOSIDs", because "Only
using 32/64 CLOSIDs" does not make any sense.

> +		rdt_max_closid = 32;
> +	}
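
I.e. something like:

	if (rdt_max_closid > 32) {
		pr_warn("Using only 32 of %d CLOSIDs\n", rdt_max_closid);
		rdt_max_closid = 32;
	}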

> +static void rdt_reset_pqr_assoc_closid(void *v)
> +{
> +	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> +	state->closid = 0;
> +	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);

So this is called with preemption disabled, but is thread context the only
context in which pqr_state can be modified? If yes, please add a comment
explaining this clearly; otherwise the protection is not sufficient. I'm
too lazy to stare into the monitoring code ...
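
If it is, a comment along these lines would do:

	/*
	 * pqr_state is strictly per CPU and only ever modified from
	 * thread context with preemption disabled, so the plain
	 * this_cpu_ptr() access and the MSR write are safe here.
	 */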

> +/*
> + * Forcibly remove all of subdirectories under root.
> + */
> +static void rmdir_all_sub(void)
> +{
> +	struct rdtgroup *rdtgrp;
> +	struct list_head *l, *next;
> +
> +	get_cpu();
> +	/* Reset PQR_ASSOC MSR on this cpu. */
> +	rdt_reset_pqr_assoc_closid(NULL);
> +	/* Reset PQR_ASSOC MSR on the rest of cpus. */
> +	smp_call_function_many(cpu_online_mask, rdt_reset_pqr_assoc_closid,
> +			       NULL, 1);

So you have reset the CLOSIDs to 0, but what resets L3/2_QOS_MASK_0 to the
default value (all valid bits set)? Further, what clears CDP?
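
Restoring the masks wants something along these lines (untested sketch;
assumes MSR_IA32_L3_CBM_BASE, i.e. IA32_L3_QOS_MASK_0, and a cbm_len
field). It needs to run on one CPU per cache domain, and clearing CDP
additionally requires clearing the enable bit in IA32_L3_QOS_CFG:

	static void rdt_reset_l3_cbms(struct rdt_resource *r)
	{
		u32 closid;

		/* Restore the default all-ones CBM for every CLOSID */
		for (closid = 0; closid < r->max_closid; closid++)
			wrmsrl(MSR_IA32_L3_CBM_BASE + closid,
			       (1U << r->cbm_len) - 1);
	}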

> +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> +			  umode_t mode)
> +{
    ....
> +	/* allocate the rdtgroup. */
> +	rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
> +	if (!rdtgrp) {
> +		ret = -ENOSPC;
> +		goto out_closid_free;
> +	}
> +	rdtgrp->closid = closid;
> +	list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
> +
> +	/* kernfs creates the directory for rdtgrp */
> +	kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
> +	if (IS_ERR(kn)) {
> +		ret = PTR_ERR(kn);
> +		goto out_cancel_ref;

Yuck! You are going to free rdtgrp, which is still enqueued in
rdt_all_groups...
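
The list_add() has to be undone on that error path, or done only after
kernfs_create_dir() succeeded. Sketch:

	out_cancel_ref:
		/* Unhook from rdt_all_groups before freeing */
		list_del(&rdtgrp->rdtgroup_list);
		kfree(rdtgrp);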

> +static int rdtgroup_rmdir(struct kernfs_node *kn)
> +{
> +	struct rdtgroup *rdtgrp;
> +	int ret = 0;
> +
> +	rdtgrp = rdtgroup_kn_lock_live(kn);
> +	if (!rdtgrp) {
> +		rdtgroup_kn_unlock(kn);
> +		return -ENOENT;
> +	}
> +
> +	/*
> +	 * rmdir is for deleting resource groups. Don't
> +	 * allow deletion of "info" or any of its subdirectories
> +	 */
> +	if (!rdtgrp) {

And how are we going to reach this? You already checked !rdtgrp above, so
this is dead code ...

> +		mutex_unlock(&rdtgroup_mutex);
> +		kernfs_unbreak_active_protection(kn);
> +		return -EPERM;
> +	}
> +
> +	rdtgrp->flags = RDT_DELETED;
> +	closid_free(rdtgrp->closid);
> +	list_del(&rdtgrp->rdtgroup_list);
> +
> +	/*
> +	 * one extra hold on this, will drop when we kfree(rdtgrp)
> +	 * in rdtgroup_kn_unlock()
> +	 */
> +	kernfs_get(kn);

So in rdtgrp_mkdir you have this: 

> +	/*
> +	 * This extra ref will be put in kernfs_remove() and guarantees
> +	 * that @rdtgrp->kn is always accessible.
> +	 */
> +	kernfs_get(kn);
> +

That smells fishy ...

Thanks,

	tglx
