Message-ID: <7adb6707-4101-8f47-6497-fe086dcc11f3@amazon.de>
Date: Mon, 10 Aug 2020 09:33:42 +0200
From: Alexander Graf <graf@...zon.de>
To: Andra Paraschiv <andraprs@...zon.com>,
linux-kernel <linux-kernel@...r.kernel.org>
CC: Anthony Liguori <aliguori@...zon.com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Colm MacCarthaigh <colmmacc@...zon.com>,
"David Duncan" <davdunc@...zon.com>,
Bjoern Doebel <doebel@...zon.de>,
"David Woodhouse" <dwmw@...zon.co.uk>,
Frank van der Linden <fllinden@...zon.com>,
Greg KH <gregkh@...uxfoundation.org>,
Karen Noel <knoel@...hat.com>,
"Martin Pohlack" <mpohlack@...zon.de>,
Matt Wilson <msw@...zon.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Balbir Singh <sblbir@...zon.com>,
Stefano Garzarella <sgarzare@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>,
Stewart Smith <trawets@...zon.com>,
Uwe Dannowski <uwed@...zon.de>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
kvm <kvm@...r.kernel.org>,
ne-devel-upstream <ne-devel-upstream@...zon.com>
Subject: Re: [PATCH v6 09/18] nitro_enclaves: Add logic for setting an enclave vCPU
On 05.08.20 11:10, Andra Paraschiv wrote:
> An enclave, before being started, has its resources set. One of its
> resources is CPU.
>
> A NE CPU pool is set and enclave CPUs are chosen from it. Offline the
> CPUs from the NE CPU pool during the pool setup and online them back
> during the NE CPU pool teardown. The CPU offline is necessary so that
> there would not be more vCPUs than physical CPUs available to the
> primary / parent VM. In that case the CPUs would be overcommitted and
> would change the initial configuration of the primary / parent VM of
> having dedicated vCPUs to physical CPUs.
>
> The enclave CPUs need to be full cores and from the same NUMA node. CPU
> 0 and its siblings have to remain available to the primary / parent VM.
>
> Add ioctl command logic for setting an enclave vCPU.
>
> Signed-off-by: Alexandru Vasile <lexnv@...zon.com>
> Signed-off-by: Andra Paraschiv <andraprs@...zon.com>
> ---
> Changelog
>
> v5 -> v6
>
> * Check CPUs are from the same NUMA node before going through CPU
> siblings during the NE CPU pool setup.
> * Update documentation to kernel-doc format.
>
> v4 -> v5
>
> * Set empty string in case of invalid NE CPU pool.
> * Clear NE CPU pool mask on pool setup failure.
> * Setup NE CPU cores out of the NE CPU pool.
> * Early exit on NE CPU pool setup if enclave(s) already running.
> * Remove sanity checks for situations that shouldn't happen, unless the
> system is buggy or the logic is broken.
> * Add check for maximum vCPU id possible before looking into the CPU
> pool.
> * Remove log on copy_from_user() / copy_to_user() failure and on admin
> capability check for setting the NE CPU pool.
> * Update the ioctl call to not create a file descriptor for the vCPU.
> * Split the CPU pool usage logic in 2 separate functions - one to get a
> CPU from the pool and the other to check the given CPU is available in
> the pool.
>
> v3 -> v4
>
> * Setup the NE CPU pool at runtime via a sysfs file for the kernel
> parameter.
> * Check enclave CPUs to be from the same NUMA node.
> * Use dev_err instead of custom NE log pattern.
> * Update the NE ioctl call to match the decoupling from the KVM API.
>
> v2 -> v3
>
> * Remove the WARN_ON calls.
> * Update static calls sanity checks.
> * Update kzfree() calls to kfree().
> * Remove file ops that do nothing for now - open, ioctl and release.
>
> v1 -> v2
>
> * Add log pattern for NE.
> * Update goto labels to match their purpose.
> * Remove the BUG_ON calls.
> * Check if enclave state is init when setting enclave vCPU.
> ---
> drivers/virt/nitro_enclaves/ne_misc_dev.c | 575 ++++++++++++++++++++++
> 1 file changed, 575 insertions(+)
>
> diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
> index 6c8c12f65666..4787bc59d39d 100644
> --- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
> +++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
> @@ -57,8 +57,11 @@
> * TODO: Update logic to create new sysfs entries instead of using
> * a kernel parameter e.g. if multiple sysfs files needed.
> */
> +static int ne_set_kernel_param(const char *val, const struct kernel_param *kp);
> +
> static const struct kernel_param_ops ne_cpu_pool_ops = {
> .get = param_get_string,
> + .set = ne_set_kernel_param,
> };
>
> static char ne_cpus[NE_CPUS_SIZE];
> @@ -87,6 +90,575 @@ struct ne_cpu_pool {
>
> static struct ne_cpu_pool ne_cpu_pool;
>
> +/**
> + * ne_check_enclaves_created() - Verify if at least one enclave has been created.
> + * @void: No parameters provided.
> + *
> + * Context: Process context.
> + * Return:
> + * * True if at least one enclave is created.
> + * * False otherwise.
> + */
> +static bool ne_check_enclaves_created(void)
> +{
> + struct ne_pci_dev *ne_pci_dev = NULL;
> + /* TODO: Find another way to get the NE PCI device reference. */
> + struct pci_dev *pdev = pci_get_device(PCI_VENDOR_ID_AMAZON, PCI_DEVICE_ID_NE, NULL);
Can we just make this a global ref count instead?
> + bool ret = false;
> +
> + if (!pdev)
> + return ret;
> +
> + ne_pci_dev = pci_get_drvdata(pdev);
> + if (!ne_pci_dev) {
> + pci_dev_put(pdev);
> +
> + return ret;
> + }
> +
> + mutex_lock(&ne_pci_dev->enclaves_list_mutex);
> +
> + if (!list_empty(&ne_pci_dev->enclaves_list))
> + ret = true;
> +
> + mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
> +
> + pci_dev_put(pdev);
> +
> + return ret;
> +}
> +
> +/**
> + * ne_setup_cpu_pool() - Set the NE CPU pool after handling sanity checks such
> + * as not sharing CPU cores with the primary / parent VM
> + * or not using CPU 0, which should remain available for
> + * the primary / parent VM. Offline the CPUs from the
> + * pool after the checks passed.
> + * @ne_cpu_list: The CPU list used for setting NE CPU pool.
> + *
> + * Context: Process context.
> + * Return:
> + * * 0 on success.
> + * * Negative return value on failure.
> + */
> +static int ne_setup_cpu_pool(const char *ne_cpu_list)
> +{
> + int core_id = -1;
> + unsigned int cpu = 0;
> + cpumask_var_t cpu_pool = NULL;
> + unsigned int cpu_sibling = 0;
> + unsigned int i = 0;
> + int numa_node = -1;
> + int rc = -EINVAL;
> +
> + if (!ne_cpu_list)
> + return 0;
> +
> + if (!zalloc_cpumask_var(&cpu_pool, GFP_KERNEL))
> + return -ENOMEM;
> +
> + mutex_lock(&ne_cpu_pool.mutex);
> +
> + rc = cpulist_parse(ne_cpu_list, cpu_pool);
> + if (rc < 0) {
> + pr_err("%s: Error in cpulist parse [rc=%d]\n", ne_misc_dev.name, rc);
> +
> + goto free_pool_cpumask;
> + }
> +
> + cpu = cpumask_any(cpu_pool);
> + if (cpu >= nr_cpu_ids) {
> + pr_err("%s: No CPUs available in CPU pool\n", ne_misc_dev.name);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> +
> + /*
> + * Check if the CPUs from the NE CPU pool are from the same NUMA node.
> + */
> + for_each_cpu(cpu, cpu_pool) {
> + if (numa_node < 0) {
> + numa_node = cpu_to_node(cpu);
> + if (numa_node < 0) {
> + pr_err("%s: Invalid NUMA node %d\n",
> + ne_misc_dev.name, numa_node);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + } else {
> + if (numa_node != cpu_to_node(cpu)) {
> + pr_err("%s: CPUs with different NUMA nodes\n",
> + ne_misc_dev.name);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + }
> + }
> +
> + /*
> + * Check if CPU 0 and its siblings are included in the provided CPU pool.
> + * They should remain available for the primary / parent VM.
> + */
> + if (cpumask_test_cpu(0, cpu_pool)) {
> + pr_err("%s: CPU 0 has to remain available\n", ne_misc_dev.name);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> +
> + for_each_cpu(cpu_sibling, topology_sibling_cpumask(0)) {
> + if (cpumask_test_cpu(cpu_sibling, cpu_pool)) {
> + pr_err("%s: CPU sibling %d for CPU 0 is in CPU pool\n",
> + ne_misc_dev.name, cpu_sibling);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + }
> +
> + /*
> + * Check if CPU siblings are included in the provided CPU pool. The
> + * expectation is that CPU cores are made available in the CPU pool for
> + * enclaves.
> + */
> + for_each_cpu(cpu, cpu_pool) {
> + for_each_cpu(cpu_sibling, topology_sibling_cpumask(cpu)) {
> + if (!cpumask_test_cpu(cpu_sibling, cpu_pool)) {
> + pr_err("%s: CPU %d is not in CPU pool\n",
> + ne_misc_dev.name, cpu_sibling);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + }
> + }
> +
> + ne_cpu_pool.avail_cores_size = nr_cpu_ids / smp_num_siblings;
> +
> + ne_cpu_pool.avail_cores = kcalloc(ne_cpu_pool.avail_cores_size,
> + sizeof(*ne_cpu_pool.avail_cores),
> + GFP_KERNEL);
> + if (!ne_cpu_pool.avail_cores) {
> + rc = -ENOMEM;
> +
> + goto free_pool_cpumask;
> + }
> +
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++)
> + if (!zalloc_cpumask_var(&ne_cpu_pool.avail_cores[i], GFP_KERNEL)) {
> + rc = -ENOMEM;
> +
> + goto free_cores_cpumask;
> + }
> +
> + /* Split the NE CPU pool in CPU cores. */
> + for_each_cpu(cpu, cpu_pool) {
> + core_id = topology_core_id(cpu);
> + if (core_id < 0 || core_id >= ne_cpu_pool.avail_cores_size) {
> + pr_err("%s: Invalid core id %d\n",
> + ne_misc_dev.name, core_id);
> +
> + rc = -EINVAL;
> +
> + goto clear_cpumask;
> + }
> +
> + cpumask_set_cpu(cpu, ne_cpu_pool.avail_cores[core_id]);
> + }
> +
> + /*
> + * CPUs that are given to enclave(s) should not be considered online
> + * by Linux anymore, as the hypervisor will degrade them to floating.
> + * The physical CPUs (full cores) are carved out of the primary / parent
> + * VM and given to the enclave VM. The same number of vCPUs would run
> + * on less pCPUs for the primary / parent VM.
> + *
> + * We offline them here, to not degrade performance and expose correct
> + * topology to Linux and user space.
> + */
> + for_each_cpu(cpu, cpu_pool) {
> + rc = remove_cpu(cpu);
> + if (rc != 0) {
> + pr_err("%s: CPU %d is not offlined [rc=%d]\n",
> + ne_misc_dev.name, cpu, rc);
> +
> + goto online_cpus;
> + }
> + }
> +
> + free_cpumask_var(cpu_pool);
> +
> + ne_cpu_pool.numa_node = numa_node;
> +
> + mutex_unlock(&ne_cpu_pool.mutex);
> +
> + return 0;
> +
> +online_cpus:
> + for_each_cpu(cpu, cpu_pool)
> + add_cpu(cpu);
> +clear_cpumask:
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++)
> + cpumask_clear(ne_cpu_pool.avail_cores[i]);
> +free_cores_cpumask:
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++)
> + free_cpumask_var(ne_cpu_pool.avail_cores[i]);
> + kfree(ne_cpu_pool.avail_cores);
> + ne_cpu_pool.avail_cores_size = 0;
> +free_pool_cpumask:
> + free_cpumask_var(cpu_pool);
> + mutex_unlock(&ne_cpu_pool.mutex);
> +
> + return rc;
> +}
> +
> +/**
> + * ne_teardown_cpu_pool() - Online the CPUs from the NE CPU pool and cleanup the
> + * CPU pool.
> + * @void: No parameters provided.
> + *
> + * Context: Process context.
> + */
> +static void ne_teardown_cpu_pool(void)
> +{
> + unsigned int cpu = 0;
> + unsigned int i = 0;
> + int rc = -EINVAL;
> +
> + mutex_lock(&ne_cpu_pool.mutex);
> +
> + if (!ne_cpu_pool.avail_cores_size) {
> + mutex_unlock(&ne_cpu_pool.mutex);
> +
> + return;
> + }
> +
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++) {
> + for_each_cpu(cpu, ne_cpu_pool.avail_cores[i]) {
> + rc = add_cpu(cpu);
> + if (rc != 0)
> + pr_err("%s: CPU %d is not onlined [rc=%d]\n",
> + ne_misc_dev.name, cpu, rc);
> + }
> +
> + cpumask_clear(ne_cpu_pool.avail_cores[i]);
> +
> + free_cpumask_var(ne_cpu_pool.avail_cores[i]);
> + }
> +
> + kfree(ne_cpu_pool.avail_cores);
> + ne_cpu_pool.avail_cores_size = 0;
> +
> + mutex_unlock(&ne_cpu_pool.mutex);
> +}
> +
> +/**
> + * ne_set_kernel_param() - Set the NE CPU pool value via the NE kernel parameter.
> + * @val: NE CPU pool string value.
> + * @kp : NE kernel parameter associated with the NE CPU pool.
> + *
> + * Context: Process context.
> + * Return:
> + * * 0 on success.
> + * * Negative return value on failure.
> + */
> +static int ne_set_kernel_param(const char *val, const struct kernel_param *kp)
> +{
> + char error_val[] = "";
> + int rc = -EINVAL;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (ne_check_enclaves_created()) {
> + pr_err("%s: The CPU pool is used by enclave(s)\n", ne_misc_dev.name);
> +
> + return -EPERM;
> + }
> +
> + ne_teardown_cpu_pool();
> +
> + rc = ne_setup_cpu_pool(val);
> + if (rc < 0) {
> + pr_err("%s: Error in setup CPU pool [rc=%d]\n", ne_misc_dev.name, rc);
> +
> + param_set_copystring(error_val, kp);
> +
> + return rc;
> + }
> +
> + return param_set_copystring(val, kp);
> +}
> +
> +/**
> + * ne_get_cpu_from_cpu_pool() - Get a CPU from the NE CPU pool, either from the
> + * remaining sibling(s) of a CPU core or the first
> + * sibling of a new CPU core.
> + * @ne_enclave : Private data associated with the current enclave.
> + *
> + * Context: Process context. This function is called with the ne_enclave mutex held.
> + * Return:
> + * * vCPU id.
> + * * 0, if no CPU available in the pool.
> + */
> +static unsigned int ne_get_cpu_from_cpu_pool(struct ne_enclave *ne_enclave)
> +{
> + int core_id = -1;
> + unsigned int cpu = 0;
> + unsigned int i = 0;
> + unsigned int vcpu_id = 0;
> +
> + /* There are CPU siblings available to choose from. */
> + for (i = 0; i < ne_enclave->avail_cpu_cores_size; i++)
> + for_each_cpu(cpu, ne_enclave->avail_cpu_cores[i])
This is really hard to read. I think what you want to say here is
something along these lines:
/*
 * If we previously allocated a thread of a core to this enclave, we
 * first return its sibling(s) for new allocations, so that we can
 * donate a full core.
 */
for (i = 0; i < ne_enclave->nr_parent_cores; i++) {
        for_each_cpu(cpu, ne_enclave->parent_cores[i].reserved_siblings) {
                if (!ne_cpu_donated(cpu)) {
                        [...]
                }
        }
}
Similarly for other places in this file where you do cpu mask logic. It
always needs to be very obvious what the respective mask is for, because
you have ~3 semantically different ones.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879