Message-ID: <CAJj2-QHKLHTPn-Qoih_=5RbOf3eyenKMmFn0jz6ALQ9KDAbisw@mail.gmail.com>
Date: Mon, 30 Sep 2024 18:14:55 -0700
From: Yuanchu Xie <yuanchu@...gle.com>
To: Wei Liu <liuwe@...rosoft.com>, Rob Bradford <rbradford@...osinc.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: "Theodore Ts'o" <tytso@....edu>, Pasha Tatashin <pasha.tatashin@...een.com>,
Jonathan Corbet <corbet@....net>, Thomas Zimmermann <tzimmermann@...e.de>,
Dan Williams <dan.j.williams@...el.com>, Tom Lendacky <thomas.lendacky@....com>,
Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, virtualization@...ts.linux.dev,
dev@...ts.cloudhypervisor.org
Subject: Re: [PATCH v2 1/2] virt: pvmemcontrol: control guest physical memory properties
I made a mistake. This is supposed to be v3.
On Mon, Sep 30, 2024 at 6:13 PM Yuanchu Xie <yuanchu@...gle.com> wrote:
>
> Pvmemcontrol provides a way for the guest to control its physical memory
> properties and enables optimizations and security features. For example,
> the guest can tell the host which parts of a hugepage may be unbacked,
> or that sensitive data should not be swapped out, etc.
>
> Pvmemcontrol allows a guest to manipulate its gPTE entries in the SLAT,
> as well as other properties of the host memory that backs its memory
> map. This is achieved by using the KVM_CAP_SYNC_MMU capability. When
> this capability is available, changes to the backing of a memory region
> on the host are automatically reflected into the guest. For example, an
> mmap() or madvise() that affects the region becomes visible to the guest
> immediately.
>
> The implementation has two components: the guest Linux driver and the
> Virtual Machine Monitor (VMM) device. A guest-allocated shared buffer is
> negotiated per CPU through a few PCI MMIO registers; the VMM device
> assigns a unique command number to each per-CPU buffer. The guest writes
> its pvmemcontrol request into the per-CPU buffer, then writes the
> corresponding command number into the command register, calling into the
> VMM device to perform the pvmemcontrol request.
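>
> For reference, the negotiation boils down to the following handshake,
> condensed from pvmemcontrol_connect() in this patch (the ACK check after
> each command write is omitted here):
>
>     pvmemcontrol_write_command(base, PVMEMCONTROL_TRANSPORT_RESET);
>     for_each_possible_cpu(cpu) {
>         channel = per_cpu_ptr(dev->pcpu_channels, cpu);
>         /* share the guest-physical address of this CPU's buffer */
>         pvmemcontrol_write_reg(base, channel->buf_phys_addr);
>         pvmemcontrol_write_command(base, PVMEMCONTROL_TRANSPORT_REGISTER);
>         /* the VMM hands back a unique command number for this buffer */
>         channel->command = pvmemcontrol_read_resp(base);
>     }
>     pvmemcontrol_write_command(base, PVMEMCONTROL_TRANSPORT_READY);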
>
> The synchronous per-CPU shared buffer approach avoids the kick and
> busy-waiting that the guest would have to do with a virtio virtqueue
> transport.
>
> User API
> From userland, the pvmemcontrol guest driver is controlled via an
> ioctl(2) call. It requires CAP_SYS_ADMIN.
>
> ioctl(fd, PVMEMCONTROL_IOCTL_VMM, struct pvmemcontrol_buf *buf);
>
> Guest userland applications can tag VMAs and guest hugepages, or advise
> the host on how to handle sensitive guest pages.
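>
> A minimal sketch of the call sequence (illustrative only; it assumes the
> misc device appears as /dev/pvmemcontrol, which follows from the
> KBUILD_MODNAME-based registration below, and that the caller has
> CAP_SYS_ADMIN):
>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/ioctl.h>
>     #include <unistd.h>
>     #include <linux/pvmemcontrol.h>
>
>     int main(void)
>     {
>         struct pvmemcontrol_buf buf = { 0 };
>         int fd = open("/dev/pvmemcontrol", O_RDWR);
>
>         if (fd < 0)
>             return 1;
>
>         /* PVMEMCONTROL_INFO: host page size and interface version */
>         buf.call.func_code = PVMEMCONTROL_INFO;
>         if (!ioctl(fd, PVMEMCONTROL_IOCTL_VMM, &buf))
>             printf("page size %llu, version %llu.%llu\n",
>                    (unsigned long long)buf.ret.ret_value,
>                    (unsigned long long)buf.ret.arg0,
>                    (unsigned long long)buf.ret.arg1);
>
>         close(fd);
>         return 0;
>     }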
>
> Supported function codes and their use cases:
> PVMEMCONTROL_FREE/REMOVE/DONTNEED/PAGEOUT: in the guest, one can reduce
> the struct page and page table lookup overhead by using hugepages backed
> by smaller pages on the host. These pvmemcontrol commands allow partial
> freeing of private guest hugepages to save memory. They would also allow
> kernel memory, such as kernel stacks and task_structs, to be
> paravirtualized if we expose kernel APIs.
>
> PVMEMCONTROL_MERGEABLE informs host KSM that it may deduplicate the VM's
> pages.
>
> PVMEMCONTROL_UNMERGEABLE is useful for security, when the VM does not
> want to share its backing pages.
> The same goes for PVMEMCONTROL_DONTDUMP, which keeps sensitive pages out
> of host dumps.
> PVMEMCONTROL_MLOCK/MUNLOCK advise the host not to swap out pages that
> hold sensitive guest data.
>
> PVMEMCONTROL_MPROTECT_NONE/R/W/RW: for guest stacks backed by hugepages,
> stack guard pages can be handled on the host and memory can be saved
> within the hugepage.
>
> PVMEMCONTROL_SET_VMA_ANON_NAME is useful for observability and debugging
> how guest memory is being mapped on the host.
>
> Sample program making use of PVMEMCONTROL_DONTNEED:
> https://github.com/Dummyc0m/pvmemcontrol-user
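>
> In the same spirit, a DONTNEED request reuses the buffer from the sketch
> above; note that addr and length are hypervisor-page-size-aligned guest
> physical values (see the uapi header below), so gpa and len here are
> placeholders only:
>
>     buf.call.func_code = PVMEMCONTROL_DONTNEED;
>     buf.call.addr = gpa;    /* guest-physical, aligned to host page size */
>     buf.call.length = len;  /* aligned to host page size */
>     buf.call.arg = 0;
>     ioctl(fd, PVMEMCONTROL_IOCTL_VMM, &buf);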
>
> The VMM implementation is part of Cloud Hypervisor; with the
> pvmemcontrol feature enabled, the VMM can provide the device to a
> supporting guest.
> https://github.com/cloud-hypervisor/cloud-hypervisor
>
> -
> Changelog
> PATCH v2 -> v3
> - added PVMEMCONTROL_MERGEABLE for memory dedupe.
> - updated the link to the upstream Cloud Hypervisor repo, and specified
> the feature required to enable the device.
> PATCH v1 -> v2
> - fixed byte order sparse warning; ioread/iowrite already do
> little-endian accesses.
> - added include for linux/percpu.h
> RFC v1 -> PATCH v1
> - renamed memctl to pvmemcontrol
> - defined device endianness as little endian
>
> v1:
> https://lore.kernel.org/linux-mm/20240518072422.771698-1-yuanchu@google.com/
> v2:
> https://lore.kernel.org/linux-mm/20240612021207.3314369-1-yuanchu@google.com/
>
> Signed-off-by: Yuanchu Xie <yuanchu@...gle.com>
> ---
> .../userspace-api/ioctl/ioctl-number.rst | 2 +
> drivers/virt/Kconfig | 2 +
> drivers/virt/Makefile | 1 +
> drivers/virt/pvmemcontrol/Kconfig | 10 +
> drivers/virt/pvmemcontrol/Makefile | 2 +
> drivers/virt/pvmemcontrol/pvmemcontrol.c | 459 ++++++++++++++++++
> include/uapi/linux/pvmemcontrol.h | 76 +++
> 7 files changed, 552 insertions(+)
> create mode 100644 drivers/virt/pvmemcontrol/Kconfig
> create mode 100644 drivers/virt/pvmemcontrol/Makefile
> create mode 100644 drivers/virt/pvmemcontrol/pvmemcontrol.c
> create mode 100644 include/uapi/linux/pvmemcontrol.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index a141e8e65c5d..34a9954cafc7 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -372,6 +372,8 @@ Code Seq# Include File Comments
> 0xCD 01 linux/reiserfs_fs.h
> 0xCE 01-02 uapi/linux/cxl_mem.h Compute Express Link Memory Devices
> 0xCF 02 fs/smb/client/cifs_ioctl.h
> +0xDA 00 uapi/linux/pvmemcontrol.h Pvmemcontrol Device
> + <mailto:yuanchu@...gle.com>
> 0xDB 00-0F drivers/char/mwave/mwavepub.h
> 0xDD 00-3F ZFCP device driver see drivers/s390/scsi/
> <mailto:aherrman@...ibm.com>
> diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
> index d8c848cf09a6..454e347a90cf 100644
> --- a/drivers/virt/Kconfig
> +++ b/drivers/virt/Kconfig
> @@ -49,4 +49,6 @@ source "drivers/virt/acrn/Kconfig"
>
> source "drivers/virt/coco/Kconfig"
>
> +source "drivers/virt/pvmemcontrol/Kconfig"
> +
> endif
> diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
> index f29901bd7820..3a1fd6e076ad 100644
> --- a/drivers/virt/Makefile
> +++ b/drivers/virt/Makefile
> @@ -10,3 +10,4 @@ obj-y += vboxguest/
> obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
> obj-$(CONFIG_ACRN_HSM) += acrn/
> obj-y += coco/
> +obj-$(CONFIG_PVMEMCONTROL) += pvmemcontrol/
> diff --git a/drivers/virt/pvmemcontrol/Kconfig b/drivers/virt/pvmemcontrol/Kconfig
> new file mode 100644
> index 000000000000..9fe16da23bd8
> --- /dev/null
> +++ b/drivers/virt/pvmemcontrol/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config PVMEMCONTROL
> + tristate "pvmemcontrol Guest Service Module"
> + depends on KVM_GUEST
> + help
> + pvmemcontrol is a guest kernel module that allows the guest to
> + communicate with the hypervisor / VMM and control the guest memory backing.
> +
> + To compile this driver as a module, choose M here: the module will
> + be called pvmemcontrol. If unsure, say N.
> diff --git a/drivers/virt/pvmemcontrol/Makefile b/drivers/virt/pvmemcontrol/Makefile
> new file mode 100644
> index 000000000000..2fc087ef3ef5
> --- /dev/null
> +++ b/drivers/virt/pvmemcontrol/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_PVMEMCONTROL) := pvmemcontrol.o
> diff --git a/drivers/virt/pvmemcontrol/pvmemcontrol.c b/drivers/virt/pvmemcontrol/pvmemcontrol.c
> new file mode 100644
> index 000000000000..f8a07114fad8
> --- /dev/null
> +++ b/drivers/virt/pvmemcontrol/pvmemcontrol.c
> @@ -0,0 +1,459 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Control guest physical memory properties by sending
> + * madvise-esque requests to the host VMM.
> + *
> + * Author: Yuanchu Xie <yuanchu@...gle.com>
> + * Author: Pasha Tatashin <pasha.tatashin@...een.com>
> + */
> +#include <linux/spinlock.h>
> +#include <linux/cpumask.h>
> +#include <linux/percpu-defs.h>
> +#include <linux/percpu.h>
> +#include <linux/types.h>
> +#include <linux/gfp.h>
> +#include <linux/compiler.h>
> +#include <linux/fs.h>
> +#include <linux/sched/clock.h>
> +#include <linux/wait.h>
> +#include <linux/printk.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/proc_fs.h>
> +#include <linux/resource_ext.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/percpu.h>
> +#include <linux/byteorder/generic.h>
> +#include <linux/io-64-nonatomic-lo-hi.h>
> +#include <uapi/linux/pvmemcontrol.h>
> +
> +#define PCI_VENDOR_ID_GOOGLE 0x1ae0
> +#define PCI_DEVICE_ID_GOOGLE_PVMEMCONTROL 0x0087
> +
> +#define PVMEMCONTROL_COMMAND_OFFSET 0x08
> +#define PVMEMCONTROL_REQUEST_OFFSET 0x00
> +#define PVMEMCONTROL_RESPONSE_OFFSET 0x00
> +
> +/*
> + * Magic values that perform the action specified when written to
> + * the command register.
> + */
> +enum pvmemcontrol_transport_command {
> + PVMEMCONTROL_TRANSPORT_RESET = 0x060FE6D2,
> + PVMEMCONTROL_TRANSPORT_REGISTER = 0x0E359539,
> + PVMEMCONTROL_TRANSPORT_READY = 0x0CA8D227,
> + PVMEMCONTROL_TRANSPORT_DISCONNECT = 0x030F5DA0,
> + PVMEMCONTROL_TRANSPORT_ACK = 0x03CF5196,
> + PVMEMCONTROL_TRANSPORT_ERROR = 0x01FBA249,
> +};
> +
> +/* Contains the function code and arguments for specific function */
> +struct pvmemcontrol_vmm_call_le {
> + __le64 func_code; /* pvmemcontrol set function code */
> + __le64 addr; /* hyper. page size aligned guest phys. addr */
> + __le64 length; /* hyper. page size aligned length */
> + __le64 arg; /* function code specific argument */
> +};
> +
> +/* Is filled on return to guest from VMM from most function calls */
> +struct pvmemcontrol_vmm_ret_le {
> + __le32 ret_errno; /* on error, value of errno */
> + __le32 ret_code; /* pvmemcontrol internal error code, on success 0 */
> + __le64 ret_value; /* return value from the function call */
> + __le64 arg0; /* currently unused */
> + __le64 arg1; /* currently unused */
> +};
> +
> +struct pvmemcontrol_buf_le {
> + union {
> + struct pvmemcontrol_vmm_call_le call;
> + struct pvmemcontrol_vmm_ret_le ret;
> + };
> +};
> +
> +struct pvmemcontrol_percpu_channel {
> + struct pvmemcontrol_buf_le buf;
> + u64 buf_phys_addr;
> + u32 command;
> +};
> +
> +struct pvmemcontrol {
> + void __iomem *base_addr;
> + struct device *device;
> + /* cache the info call */
> + struct pvmemcontrol_vmm_ret pvmemcontrol_vmm_info;
> + struct pvmemcontrol_percpu_channel __percpu *pcpu_channels;
> +};
> +
> +static DEFINE_RWLOCK(pvmemcontrol_lock);
> +static struct pvmemcontrol *pvmemcontrol __read_mostly;
> +
> +static void pvmemcontrol_write_command(void __iomem *base_addr, u32 command)
> +{
> + iowrite32(command, base_addr + PVMEMCONTROL_COMMAND_OFFSET);
> +}
> +
> +static u32 pvmemcontrol_read_command(void __iomem *base_addr)
> +{
> + return ioread32(base_addr + PVMEMCONTROL_COMMAND_OFFSET);
> +}
> +
> +static void pvmemcontrol_write_reg(void __iomem *base_addr, u64 buf_phys_addr)
> +{
> + iowrite64_lo_hi(buf_phys_addr, base_addr + PVMEMCONTROL_REQUEST_OFFSET);
> +}
> +
> +static u32 pvmemcontrol_read_resp(void __iomem *base_addr)
> +{
> + return ioread32(base_addr + PVMEMCONTROL_RESPONSE_OFFSET);
> +}
> +
> +static void pvmemcontrol_buf_call_to_le(struct pvmemcontrol_buf_le *le,
> + const struct pvmemcontrol_buf *buf)
> +{
> + le->call.func_code = cpu_to_le64(buf->call.func_code);
> + le->call.addr = cpu_to_le64(buf->call.addr);
> + le->call.length = cpu_to_le64(buf->call.length);
> + le->call.arg = cpu_to_le64(buf->call.arg);
> +}
> +
> +static void pvmemcontrol_buf_ret_from_le(struct pvmemcontrol_buf *buf,
> + const struct pvmemcontrol_buf_le *le)
> +{
> + buf->ret.ret_errno = le32_to_cpu(le->ret.ret_errno);
> + buf->ret.ret_code = le32_to_cpu(le->ret.ret_code);
> + buf->ret.ret_value = le64_to_cpu(le->ret.ret_value);
> + buf->ret.arg0 = le64_to_cpu(le->ret.arg0);
> + buf->ret.arg1 = le64_to_cpu(le->ret.arg1);
> +}
> +
> +static void pvmemcontrol_send_request(struct pvmemcontrol *pvmemcontrol,
> + struct pvmemcontrol_buf *buf)
> +{
> + struct pvmemcontrol_percpu_channel *channel;
> +
> + preempt_disable();
> + channel = this_cpu_ptr(pvmemcontrol->pcpu_channels);
> +
> + pvmemcontrol_buf_call_to_le(&channel->buf, buf);
> + pvmemcontrol_write_command(pvmemcontrol->base_addr, channel->command);
> + pvmemcontrol_buf_ret_from_le(buf, &channel->buf);
> +
> + preempt_enable();
> +}
> +
> +static int __pvmemcontrol_vmm_call(struct pvmemcontrol_buf *buf)
> +{
> + int err = 0;
> +
> + if (!pvmemcontrol)
> + return -EINVAL;
> +
> + read_lock(&pvmemcontrol_lock);
> + if (!pvmemcontrol) {
> + err = -EINVAL;
> + goto unlock;
> + }
> + if (buf->call.func_code == PVMEMCONTROL_INFO) {
> + memcpy(&buf->ret, &pvmemcontrol->pvmemcontrol_vmm_info,
> + sizeof(buf->ret));
> + goto unlock;
> + }
> +
> + pvmemcontrol_send_request(pvmemcontrol, buf);
> +
> +unlock:
> + read_unlock(&pvmemcontrol_lock);
> + return err;
> +}
> +
> +static int pvmemcontrol_init_info(struct pvmemcontrol *dev,
> + struct pvmemcontrol_buf *buf)
> +{
> + buf->call.func_code = PVMEMCONTROL_INFO;
> +
> + pvmemcontrol_send_request(dev, buf);
> + if (buf->ret.ret_code)
> + return buf->ret.ret_code;
> +
> + /* Initialize global pvmemcontrol_vmm_info */
> + memcpy(&dev->pvmemcontrol_vmm_info, &buf->ret,
> + sizeof(dev->pvmemcontrol_vmm_info));
> + dev_info(dev->device,
> + "pvmemcontrol_vmm_info.ret_errno = %u\n"
> + "pvmemcontrol_vmm_info.ret_code = %u\n"
> + "pvmemcontrol_vmm_info.major_version = %llu\n"
> + "pvmemcontrol_vmm_info.minor_version = %llu\n"
> + "pvmemcontrol_vmm_info.page_size = %llu\n",
> + dev->pvmemcontrol_vmm_info.ret_errno,
> + dev->pvmemcontrol_vmm_info.ret_code,
> + dev->pvmemcontrol_vmm_info.arg0,
> + dev->pvmemcontrol_vmm_info.arg1,
> + dev->pvmemcontrol_vmm_info.ret_value);
> +
> + return 0;
> +}
> +
> +static int pvmemcontrol_open(struct inode *inode, struct file *filp)
> +{
> + struct pvmemcontrol_buf *buf = NULL;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EACCES;
> +
> + /* Do not allow exclusive open */
> + if (filp->f_flags & O_EXCL)
> + return -EINVAL;
> +
> + buf = kzalloc(sizeof(struct pvmemcontrol_buf), GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + /* Overwrite the misc device set by misc_register */
> + filp->private_data = buf;
> + return 0;
> +}
> +
> +static int pvmemcontrol_release(struct inode *inode, struct file *filp)
> +{
> + kfree(filp->private_data);
> + filp->private_data = NULL;
> + return 0;
> +}
> +
> +static long pvmemcontrol_ioctl(struct file *filp, unsigned int cmd,
> + unsigned long ioctl_param)
> +{
> + struct pvmemcontrol_buf *buf = filp->private_data;
> + int err;
> +
> + if (cmd != PVMEMCONTROL_IOCTL_VMM)
> + return -EINVAL;
> +
> + if (copy_from_user(&buf->call, (void __user *)ioctl_param,
> + sizeof(struct pvmemcontrol_buf)))
> + return -EFAULT;
> +
> + err = __pvmemcontrol_vmm_call(buf);
> + if (err)
> + return err;
> +
> + if (copy_to_user((void __user *)ioctl_param, &buf->ret,
> + sizeof(struct pvmemcontrol_buf)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static const struct file_operations pvmemcontrol_fops = {
> + .owner = THIS_MODULE,
> + .open = pvmemcontrol_open,
> + .release = pvmemcontrol_release,
> + .unlocked_ioctl = pvmemcontrol_ioctl,
> + .compat_ioctl = compat_ptr_ioctl,
> +};
> +
> +static struct miscdevice pvmemcontrol_dev = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = KBUILD_MODNAME,
> + .fops = &pvmemcontrol_fops,
> +};
> +
> +static int pvmemcontrol_connect(struct pvmemcontrol *pvmemcontrol)
> +{
> + int cpu;
> + u32 cmd;
> +
> + pvmemcontrol_write_command(pvmemcontrol->base_addr,
> + PVMEMCONTROL_TRANSPORT_RESET);
> + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> + if (cmd != PVMEMCONTROL_TRANSPORT_ACK) {
> + dev_err(pvmemcontrol->device,
> + "failed to reset device, cmd 0x%x\n", cmd);
> + return -EINVAL;
> + }
> +
> + for_each_possible_cpu(cpu) {
> + struct pvmemcontrol_percpu_channel *channel =
> + per_cpu_ptr(pvmemcontrol->pcpu_channels, cpu);
> +
> + pvmemcontrol_write_reg(pvmemcontrol->base_addr,
> + channel->buf_phys_addr);
> + pvmemcontrol_write_command(pvmemcontrol->base_addr,
> + PVMEMCONTROL_TRANSPORT_REGISTER);
> +
> + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> + if (cmd != PVMEMCONTROL_TRANSPORT_ACK) {
> + dev_err(pvmemcontrol->device,
> + "failed to register pcpu buf, cmd 0x%x\n", cmd);
> + return -EINVAL;
> + }
> + channel->command =
> + pvmemcontrol_read_resp(pvmemcontrol->base_addr);
> + }
> +
> + pvmemcontrol_write_command(pvmemcontrol->base_addr,
> + PVMEMCONTROL_TRANSPORT_READY);
> + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> + if (cmd != PVMEMCONTROL_TRANSPORT_ACK) {
> + dev_err(pvmemcontrol->device,
> + "failed to ready device, cmd 0x%x\n", cmd);
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +static int pvmemcontrol_disconnect(struct pvmemcontrol *pvmemcontrol)
> +{
> + u32 cmd;
> +
> + pvmemcontrol_write_command(pvmemcontrol->base_addr,
> + PVMEMCONTROL_TRANSPORT_DISCONNECT);
> +
> + cmd = pvmemcontrol_read_command(pvmemcontrol->base_addr);
> + if (cmd != PVMEMCONTROL_TRANSPORT_ERROR) {
> + dev_err(pvmemcontrol->device,
> + "failed to disconnect device, cmd 0x%x\n", cmd);
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +static int pvmemcontrol_alloc_percpu_channels(struct pvmemcontrol *pvmemcontrol)
> +{
> + int cpu;
> +
> + pvmemcontrol->pcpu_channels = alloc_percpu_gfp(
> + struct pvmemcontrol_percpu_channel, GFP_ATOMIC | __GFP_ZERO);
> + if (!pvmemcontrol->pcpu_channels)
> + return -ENOMEM;
> +
> + for_each_possible_cpu(cpu) {
> + struct pvmemcontrol_percpu_channel *channel =
> + per_cpu_ptr(pvmemcontrol->pcpu_channels, cpu);
> + phys_addr_t buf_phys = per_cpu_ptr_to_phys(&channel->buf);
> +
> + channel->buf_phys_addr = buf_phys;
> + }
> + return 0;
> +}
> +
> +static int pvmemcontrol_init(struct device *device, void __iomem *base_addr)
> +{
> + struct pvmemcontrol_buf *buf = NULL;
> + struct pvmemcontrol *dev = NULL;
> + int err = 0;
> +
> + err = misc_register(&pvmemcontrol_dev);
> + if (err)
> + return err;
> +
> + /* We take a spinlock for a long time, but this is only during init. */
> + write_lock(&pvmemcontrol_lock);
> + if (READ_ONCE(pvmemcontrol)) {
> + dev_warn(device, "multiple pvmemcontrol devices present\n");
> + err = -EEXIST;
> + goto fail_free;
> + }
> +
> + dev = kzalloc(sizeof(struct pvmemcontrol), GFP_ATOMIC);
> + buf = kzalloc(sizeof(struct pvmemcontrol_buf), GFP_ATOMIC);
> + if (!dev || !buf) {
> + err = -ENOMEM;
> + goto fail_free;
> + }
> +
> + dev->base_addr = base_addr;
> + dev->device = device;
> +
> + err = pvmemcontrol_alloc_percpu_channels(dev);
> + if (err)
> + goto fail_free;
> +
> + err = pvmemcontrol_connect(dev);
> + if (err)
> + goto fail_free;
> +
> + err = pvmemcontrol_init_info(dev, buf);
> + if (err)
> + goto fail_free;
> +
> + WRITE_ONCE(pvmemcontrol, dev);
> + write_unlock(&pvmemcontrol_lock);
> + return 0;
> +
> +fail_free:
> + write_unlock(&pvmemcontrol_lock);
> + kfree(dev);
> + kfree(buf);
> + misc_deregister(&pvmemcontrol_dev);
> + return err;
> +}
> +
> +static int pvmemcontrol_pci_probe(struct pci_dev *dev,
> + const struct pci_device_id *id)
> +{
> + void __iomem *base_addr;
> + int err;
> +
> + err = pcim_enable_device(dev);
> + if (err < 0)
> + return err;
> +
> + base_addr = pcim_iomap(dev, 0, 0);
> + if (!base_addr)
> + return -ENOMEM;
> +
> + err = pvmemcontrol_init(&dev->dev, base_addr);
> + if (err)
> + pci_disable_device(dev);
> +
> + return err;
> +}
> +
> +static void pvmemcontrol_pci_remove(struct pci_dev *pci_dev)
> +{
> + int err;
> + struct pvmemcontrol *dev;
> +
> + write_lock(&pvmemcontrol_lock);
> + dev = READ_ONCE(pvmemcontrol);
> + if (!dev) {
> + err = -EINVAL;
> + dev_err(&pci_dev->dev, "cleanup called when uninitialized\n");
> + write_unlock(&pvmemcontrol_lock);
> + return;
> + }
> +
> + /* disconnect */
> + err = pvmemcontrol_disconnect(dev);
> + if (err)
> + dev_err(&pci_dev->dev, "device did not ack disconnect\n");
> + /* free percpu channels */
> + free_percpu(dev->pcpu_channels);
> +
> + kfree(dev);
> + WRITE_ONCE(pvmemcontrol, NULL);
> + write_unlock(&pvmemcontrol_lock);
> + misc_deregister(&pvmemcontrol_dev);
> +}
> +
> +static const struct pci_device_id pvmemcontrol_pci_id_tbl[] = {
> + { PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEVICE_ID_GOOGLE_PVMEMCONTROL) },
> + { 0 }
> +};
> +MODULE_DEVICE_TABLE(pci, pvmemcontrol_pci_id_tbl);
> +
> +static struct pci_driver pvmemcontrol_pci_driver = {
> + .name = "pvmemcontrol",
> + .id_table = pvmemcontrol_pci_id_tbl,
> + .probe = pvmemcontrol_pci_probe,
> + .remove = pvmemcontrol_pci_remove,
> +};
> +module_pci_driver(pvmemcontrol_pci_driver);
> +
> +MODULE_AUTHOR("Yuanchu Xie <yuanchu@...gle.com>");
> +MODULE_DESCRIPTION("pvmemcontrol Guest Service Module");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/pvmemcontrol.h b/include/uapi/linux/pvmemcontrol.h
> new file mode 100644
> index 000000000000..31b366dee796
> --- /dev/null
> +++ b/include/uapi/linux/pvmemcontrol.h
> @@ -0,0 +1,76 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Userspace interface for /dev/pvmemcontrol
> + * pvmemcontrol Guest Memory Service Module
> + *
> + * Copyright (c) 2024, Google LLC.
> + * Yuanchu Xie <yuanchu@...gle.com>
> + * Pasha Tatashin <pasha.tatashin@...een.com>
> + */
> +
> +#ifndef _UAPI_PVMEMCONTROL_H
> +#define _UAPI_PVMEMCONTROL_H
> +
> +#include <linux/wait.h>
> +#include <linux/types.h>
> +#include <asm/param.h>
> +
> +/* Contains the function code and arguments for specific function */
> +struct pvmemcontrol_vmm_call {
> + __u64 func_code; /* pvmemcontrol set function code */
> + __u64 addr; /* hyper. page size aligned guest phys. addr */
> + __u64 length; /* hyper. page size aligned length */
> + __u64 arg; /* function code specific argument */
> +};
> +
> +/* Is filled on return to guest from VMM from most function calls */
> +struct pvmemcontrol_vmm_ret {
> + __u32 ret_errno; /* on error, value of errno */
> + __u32 ret_code; /* pvmemcontrol internal error code, on success 0 */
> + __u64 ret_value; /* return value from the function call */
> + __u64 arg0; /* major version for func_code INFO */
> + __u64 arg1; /* minor version for func_code INFO */
> +};
> +
> +struct pvmemcontrol_buf {
> + union {
> + struct pvmemcontrol_vmm_call call;
> + struct pvmemcontrol_vmm_ret ret;
> + };
> +};
> +
> +/* The ioctl type, documented in ioctl-number.rst */
> +#define PVMEMCONTROL_IOCTL_TYPE 0xDA
> +
> +#define PVMEMCONTROL_IOCTL_VMM _IOWR(PVMEMCONTROL_IOCTL_TYPE, 0x00, struct pvmemcontrol_buf)
> +
> +/*
> + * Returns the host page size in ret_value.
> + * major version in arg0.
> + * minor version in arg1.
> + */
> +#define PVMEMCONTROL_INFO 0
> +
> +/* Pvmemcontrol calls, pvmemcontrol_vmm_return is returned */
> +#define PVMEMCONTROL_DONTNEED 1 /* madvise(addr, len, MADV_DONTNEED); */
> +#define PVMEMCONTROL_REMOVE 2 /* madvise(addr, len, MADV_REMOVE); */
> +#define PVMEMCONTROL_FREE 3 /* madvise(addr, len, MADV_FREE); */
> +#define PVMEMCONTROL_PAGEOUT 4 /* madvise(addr, len, MADV_PAGEOUT); */
> +#define PVMEMCONTROL_DONTDUMP 5 /* madvise(addr, len, MADV_DONTDUMP); */
> +
> +/* prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, addr, len, arg) */
> +#define PVMEMCONTROL_SET_VMA_ANON_NAME 6
> +
> +#define PVMEMCONTROL_MLOCK 7 /* mlock2(addr, len, 0) */
> +#define PVMEMCONTROL_MUNLOCK 8 /* munlock(addr, len) */
> +
> +#define PVMEMCONTROL_MPROTECT_NONE 9 /* mprotect(addr, len, PROT_NONE) */
> +#define PVMEMCONTROL_MPROTECT_R 10 /* mprotect(addr, len, PROT_READ) */
> +#define PVMEMCONTROL_MPROTECT_W 11 /* mprotect(addr, len, PROT_WRITE) */
> +/* mprotect(addr, len, PROT_READ | PROT_WRITE) */
> +#define PVMEMCONTROL_MPROTECT_RW 12
> +
> +#define PVMEMCONTROL_MERGEABLE 13 /* madvise(addr, len, MADV_MERGEABLE); */
> +#define PVMEMCONTROL_UNMERGEABLE 14 /* madvise(addr, len, MADV_UNMERGEABLE); */
> +
> +#endif /* _UAPI_PVMEMCONTROL_H */
> --
> 2.46.1.824.gd892dcdcdd-goog
>