Message-ID: <7adb6707-4101-8f47-6497-fe086dcc11f3@amazon.de>
Date: Mon, 10 Aug 2020 09:33:42 +0200
From: Alexander Graf <graf@...zon.de>
To: Andra Paraschiv <andraprs@...zon.com>,
linux-kernel <linux-kernel@...r.kernel.org>
CC: Anthony Liguori <aliguori@...zon.com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Colm MacCarthaigh <colmmacc@...zon.com>,
"David Duncan" <davdunc@...zon.com>,
Bjoern Doebel <doebel@...zon.de>,
"David Woodhouse" <dwmw@...zon.co.uk>,
Frank van der Linden <fllinden@...zon.com>,
Greg KH <gregkh@...uxfoundation.org>,
Karen Noel <knoel@...hat.com>,
"Martin Pohlack" <mpohlack@...zon.de>,
Matt Wilson <msw@...zon.com>,
Paolo Bonzini <pbonzini@...hat.com>,
Balbir Singh <sblbir@...zon.com>,
Stefano Garzarella <sgarzare@...hat.com>,
Stefan Hajnoczi <stefanha@...hat.com>,
Stewart Smith <trawets@...zon.com>,
Uwe Dannowski <uwed@...zon.de>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
kvm <kvm@...r.kernel.org>,
ne-devel-upstream <ne-devel-upstream@...zon.com>
Subject: Re: [PATCH v6 09/18] nitro_enclaves: Add logic for setting an enclave vCPU
On 05.08.20 11:10, Andra Paraschiv wrote:
> An enclave, before being started, has its resources set. One of its
> resources is CPU.
>
> A NE CPU pool is set and enclave CPUs are chosen from it. Offline the
> CPUs from the NE CPU pool during the pool setup and online them back
> during the NE CPU pool teardown. The CPU offline is necessary so that
> there would not be more vCPUs than physical CPUs available to the
> primary / parent VM. In that case the CPUs would be overcommitted and
> would change the initial configuration of the primary / parent VM of
> having dedicated vCPUs to physical CPUs.
>
> The enclave CPUs need to be full cores and from the same NUMA node. CPU
> 0 and its siblings have to remain available to the primary / parent VM.
>
> Add ioctl command logic for setting an enclave vCPU.
>
> Signed-off-by: Alexandru Vasile <lexnv@...zon.com>
> Signed-off-by: Andra Paraschiv <andraprs@...zon.com>
> ---
> Changelog
>
> v5 -> v6
>
> * Check CPUs are from the same NUMA node before going through CPU
> siblings during the NE CPU pool setup.
> * Update documentation to kernel-doc format.
>
> v4 -> v5
>
> * Set empty string in case of invalid NE CPU pool.
> * Clear NE CPU pool mask on pool setup failure.
> * Setup NE CPU cores out of the NE CPU pool.
> * Early exit on NE CPU pool setup if enclave(s) already running.
> * Remove sanity checks for situations that shouldn't happen, unless the
> system is buggy or the logic is broken.
> * Add check for maximum vCPU id possible before looking into the CPU
> pool.
> * Remove log on copy_from_user() / copy_to_user() failure and on admin
> capability check for setting the NE CPU pool.
> * Update the ioctl call to not create a file descriptor for the vCPU.
> * Split the CPU pool usage logic in 2 separate functions - one to get a
> CPU from the pool and the other to check the given CPU is available in
> the pool.
>
> v3 -> v4
>
> * Setup the NE CPU pool at runtime via a sysfs file for the kernel
> parameter.
> * Check enclave CPUs to be from the same NUMA node.
> * Use dev_err instead of custom NE log pattern.
> * Update the NE ioctl call to match the decoupling from the KVM API.
>
> v2 -> v3
>
> * Remove the WARN_ON calls.
> * Update static calls sanity checks.
> * Update kzfree() calls to kfree().
> * Remove file ops that do nothing for now - open, ioctl and release.
>
> v1 -> v2
>
> * Add log pattern for NE.
> * Update goto labels to match their purpose.
> * Remove the BUG_ON calls.
> * Check if enclave state is init when setting enclave vCPU.
> ---
> drivers/virt/nitro_enclaves/ne_misc_dev.c | 575 ++++++++++++++++++++++
> 1 file changed, 575 insertions(+)
>
> diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
> index 6c8c12f65666..4787bc59d39d 100644
> --- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
> +++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
> @@ -57,8 +57,11 @@
> * TODO: Update logic to create new sysfs entries instead of using
> * a kernel parameter e.g. if multiple sysfs files needed.
> */
> +static int ne_set_kernel_param(const char *val, const struct kernel_param *kp);
> +
> static const struct kernel_param_ops ne_cpu_pool_ops = {
> .get = param_get_string,
> + .set = ne_set_kernel_param,
> };
>
> static char ne_cpus[NE_CPUS_SIZE];
> @@ -87,6 +90,575 @@ struct ne_cpu_pool {
>
> static struct ne_cpu_pool ne_cpu_pool;
>
> +/**
> + * ne_check_enclaves_created() - Verify if at least one enclave has been created.
> + * @void: No parameters provided.
> + *
> + * Context: Process context.
> + * Return:
> + * * True if at least one enclave is created.
> + * * False otherwise.
> + */
> +static bool ne_check_enclaves_created(void)
> +{
> + struct ne_pci_dev *ne_pci_dev = NULL;
> + /* TODO: Find another way to get the NE PCI device reference. */
> + struct pci_dev *pdev = pci_get_device(PCI_VENDOR_ID_AMAZON, PCI_DEVICE_ID_NE, NULL);
Can we just make this a global ref count instead?
> + bool ret = false;
> +
> + if (!pdev)
> + return ret;
> +
> + ne_pci_dev = pci_get_drvdata(pdev);
> + if (!ne_pci_dev) {
> + pci_dev_put(pdev);
> +
> + return ret;
> + }
> +
> + mutex_lock(&ne_pci_dev->enclaves_list_mutex);
> +
> + if (!list_empty(&ne_pci_dev->enclaves_list))
> + ret = true;
> +
> + mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
> +
> + pci_dev_put(pdev);
> +
> + return ret;
> +}
> +
> +/**
> + * ne_setup_cpu_pool() - Set the NE CPU pool after handling sanity checks such
> + * as not sharing CPU cores with the primary / parent VM
> + * or not using CPU 0, which should remain available for
> + * the primary / parent VM. Offline the CPUs from the
> + * pool after the checks passed.
> + * @ne_cpu_list: The CPU list used for setting NE CPU pool.
> + *
> + * Context: Process context.
> + * Return:
> + * * 0 on success.
> + * * Negative return value on failure.
> + */
> +static int ne_setup_cpu_pool(const char *ne_cpu_list)
> +{
> + int core_id = -1;
> + unsigned int cpu = 0;
> + cpumask_var_t cpu_pool = NULL;
> + unsigned int cpu_sibling = 0;
> + unsigned int i = 0;
> + int numa_node = -1;
> + int rc = -EINVAL;
> +
> + if (!ne_cpu_list)
> + return 0;
> +
> + if (!zalloc_cpumask_var(&cpu_pool, GFP_KERNEL))
> + return -ENOMEM;
> +
> + mutex_lock(&ne_cpu_pool.mutex);
> +
> + rc = cpulist_parse(ne_cpu_list, cpu_pool);
> + if (rc < 0) {
> + pr_err("%s: Error in cpulist parse [rc=%d]\n", ne_misc_dev.name, rc);
> +
> + goto free_pool_cpumask;
> + }
> +
> + cpu = cpumask_any(cpu_pool);
> + if (cpu >= nr_cpu_ids) {
> + pr_err("%s: No CPUs available in CPU pool\n", ne_misc_dev.name);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> +
> + /*
> + * Check if the CPUs from the NE CPU pool are from the same NUMA node.
> + */
> + for_each_cpu(cpu, cpu_pool) {
> + if (numa_node < 0) {
> + numa_node = cpu_to_node(cpu);
> + if (numa_node < 0) {
> + pr_err("%s: Invalid NUMA node %d\n",
> + ne_misc_dev.name, numa_node);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + } else {
> + if (numa_node != cpu_to_node(cpu)) {
> + pr_err("%s: CPUs with different NUMA nodes\n",
> + ne_misc_dev.name);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + }
> + }
> +
> + /*
> + * Check if CPU 0 and its siblings are included in the provided CPU pool.
> + * They should remain available for the primary / parent VM.
> + */
> + if (cpumask_test_cpu(0, cpu_pool)) {
> + pr_err("%s: CPU 0 has to remain available\n", ne_misc_dev.name);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> +
> + for_each_cpu(cpu_sibling, topology_sibling_cpumask(0)) {
> + if (cpumask_test_cpu(cpu_sibling, cpu_pool)) {
> + pr_err("%s: CPU sibling %d for CPU 0 is in CPU pool\n",
> + ne_misc_dev.name, cpu_sibling);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + }
> +
> + /*
> + * Check if CPU siblings are included in the provided CPU pool. The
> + * expectation is that CPU cores are made available in the CPU pool for
> + * enclaves.
> + */
> + for_each_cpu(cpu, cpu_pool) {
> + for_each_cpu(cpu_sibling, topology_sibling_cpumask(cpu)) {
> + if (!cpumask_test_cpu(cpu_sibling, cpu_pool)) {
> + pr_err("%s: CPU %d is not in CPU pool\n",
> + ne_misc_dev.name, cpu_sibling);
> +
> + rc = -EINVAL;
> +
> + goto free_pool_cpumask;
> + }
> + }
> + }
> +
> + ne_cpu_pool.avail_cores_size = nr_cpu_ids / smp_num_siblings;
> +
> + ne_cpu_pool.avail_cores = kcalloc(ne_cpu_pool.avail_cores_size,
> + sizeof(*ne_cpu_pool.avail_cores),
> + GFP_KERNEL);
> + if (!ne_cpu_pool.avail_cores) {
> + rc = -ENOMEM;
> +
> + goto free_pool_cpumask;
> + }
> +
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++)
> + if (!zalloc_cpumask_var(&ne_cpu_pool.avail_cores[i], GFP_KERNEL)) {
> + rc = -ENOMEM;
> +
> + goto free_cores_cpumask;
> + }
> +
> + /* Split the NE CPU pool in CPU cores. */
> + for_each_cpu(cpu, cpu_pool) {
> + core_id = topology_core_id(cpu);
> + if (core_id < 0 || core_id >= ne_cpu_pool.avail_cores_size) {
> + pr_err("%s: Invalid core id %d\n",
> + ne_misc_dev.name, core_id);
> +
> + rc = -EINVAL;
> +
> + goto clear_cpumask;
> + }
> +
> + cpumask_set_cpu(cpu, ne_cpu_pool.avail_cores[core_id]);
> + }
> +
> + /*
> + * CPUs that are given to enclave(s) should not be considered online
> + * by Linux anymore, as the hypervisor will degrade them to floating.
> + * The physical CPUs (full cores) are carved out of the primary / parent
> + * VM and given to the enclave VM. The same number of vCPUs would run
> + * on less pCPUs for the primary / parent VM.
> + *
> + * We offline them here, to not degrade performance and expose correct
> + * topology to Linux and user space.
> + */
> + for_each_cpu(cpu, cpu_pool) {
> + rc = remove_cpu(cpu);
> + if (rc != 0) {
> + pr_err("%s: CPU %d is not offlined [rc=%d]\n",
> + ne_misc_dev.name, cpu, rc);
> +
> + goto online_cpus;
> + }
> + }
> +
> + free_cpumask_var(cpu_pool);
> +
> + ne_cpu_pool.numa_node = numa_node;
> +
> + mutex_unlock(&ne_cpu_pool.mutex);
> +
> + return 0;
> +
> +online_cpus:
> + for_each_cpu(cpu, cpu_pool)
> + add_cpu(cpu);
> +clear_cpumask:
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++)
> + cpumask_clear(ne_cpu_pool.avail_cores[i]);
> +free_cores_cpumask:
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++)
> + free_cpumask_var(ne_cpu_pool.avail_cores[i]);
> + kfree(ne_cpu_pool.avail_cores);
> + ne_cpu_pool.avail_cores_size = 0;
> +free_pool_cpumask:
> + free_cpumask_var(cpu_pool);
> + mutex_unlock(&ne_cpu_pool.mutex);
> +
> + return rc;
> +}
> +
> +/**
> + * ne_teardown_cpu_pool() - Online the CPUs from the NE CPU pool and cleanup the
> + * CPU pool.
> + * @void: No parameters provided.
> + *
> + * Context: Process context.
> + */
> +static void ne_teardown_cpu_pool(void)
> +{
> + unsigned int cpu = 0;
> + unsigned int i = 0;
> + int rc = -EINVAL;
> +
> + mutex_lock(&ne_cpu_pool.mutex);
> +
> + if (!ne_cpu_pool.avail_cores_size) {
> + mutex_unlock(&ne_cpu_pool.mutex);
> +
> + return;
> + }
> +
> + for (i = 0; i < ne_cpu_pool.avail_cores_size; i++) {
> + for_each_cpu(cpu, ne_cpu_pool.avail_cores[i]) {
> + rc = add_cpu(cpu);
> + if (rc != 0)
> + pr_err("%s: CPU %d is not onlined [rc=%d]\n",
> + ne_misc_dev.name, cpu, rc);
> + }
> +
> + cpumask_clear(ne_cpu_pool.avail_cores[i]);
> +
> + free_cpumask_var(ne_cpu_pool.avail_cores[i]);
> + }
> +
> + kfree(ne_cpu_pool.avail_cores);
> + ne_cpu_pool.avail_cores_size = 0;
> +
> + mutex_unlock(&ne_cpu_pool.mutex);
> +}
> +
> +/**
> + * ne_set_kernel_param() - Set the NE CPU pool value via the NE kernel parameter.
> + * @val: NE CPU pool string value.
> + * @kp : NE kernel parameter associated with the NE CPU pool.
> + *
> + * Context: Process context.
> + * Return:
> + * * 0 on success.
> + * * Negative return value on failure.
> + */
> +static int ne_set_kernel_param(const char *val, const struct kernel_param *kp)
> +{
> + char error_val[] = "";
> + int rc = -EINVAL;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (ne_check_enclaves_created()) {
> + pr_err("%s: The CPU pool is used by enclave(s)\n", ne_misc_dev.name);
> +
> + return -EPERM;
> + }
> +
> + ne_teardown_cpu_pool();
> +
> + rc = ne_setup_cpu_pool(val);
> + if (rc < 0) {
> + pr_err("%s: Error in setup CPU pool [rc=%d]\n", ne_misc_dev.name, rc);
> +
> + param_set_copystring(error_val, kp);
> +
> + return rc;
> + }
> +
> + return param_set_copystring(val, kp);
> +}
> +
> +/**
> + * ne_get_cpu_from_cpu_pool() - Get a CPU from the NE CPU pool, either from the
> + * remaining sibling(s) of a CPU core or the first
> + * sibling of a new CPU core.
> + * @ne_enclave : Private data associated with the current enclave.
> + *
> + * Context: Process context. This function is called with the ne_enclave mutex held.
> + * Return:
> + * * vCPU id.
> + * * 0, if no CPU available in the pool.
> + */
> +static unsigned int ne_get_cpu_from_cpu_pool(struct ne_enclave *ne_enclave)
> +{
> + int core_id = -1;
> + unsigned int cpu = 0;
> + unsigned int i = 0;
> + unsigned int vcpu_id = 0;
> +
> + /* There are CPU siblings available to choose from. */
> + for (i = 0; i < ne_enclave->avail_cpu_cores_size; i++)
> + for_each_cpu(cpu, ne_enclave->avail_cpu_cores[i])
This is really hard to read. I think what you want to say here is
something along these lines:
/*
 * If we previously allocated a thread of a core to this enclave, we
 * first return its sibling(s) for new allocations, so that we can
 * donate a full core.
 */
for (i = 0; i < ne_enclave->nr_parent_cores; i++) {
        for_each_cpu(cpu, ne_enclave->parent_cores[i].reserved_siblings) {
                if (!ne_cpu_donated(cpu)) {
                        [...]
                }
        }
}
Similarly for other places in this file where you do cpu mask logic. It
always needs to be very obvious what the respective mask is for, because
you have ~3 semantically different ones.
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879