linux-kernel - RE: [PATCH 2/2] Drivers: hv: add XStore Fastpath driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <MWHPR21MB15931C10916D27DB73CA5026D70A9@MWHPR21MB1593.namprd21.prod.outlook.com>
Date:   Mon, 21 Jun 2021 18:09:56 +0000
From:   Michael Kelley <mikelley@...rosoft.com>
To:     "longli@...uxonhyperv.com" <longli@...uxonhyperv.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>
CC:     Long Li <longli@...rosoft.com>, KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Wei Liu <wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>
Subject: RE: [PATCH 2/2] Drivers: hv: add XStore Fastpath driver

From: longli@...uxonhyperv.com <longli@...uxonhyperv.com> Sent: Monday, June 7, 2021 6:05 PM
> 
> Add XStore Fastpath driver to support accelerated access to Azure Blob
> Storage Services.
> 
> Cc: K. Y. Srinivasan <kys@...rosoft.com>
> Cc: Haiyang Zhang <haiyangz@...rosoft.com>
> Cc: Stephen Hemminger <sthemmin@...rosoft.com>
> Cc: Wei Liu <wei.liu@...nel.org>
> Cc: Dexuan Cui <decui@...rosoft.com>
> Signed-off-by: Long Li <longli@...rosoft.com>
> ---
>  MAINTAINERS                 |   1 +
>  drivers/hv/Kconfig          |  11 +
>  drivers/hv/Makefile         |   1 +
>  drivers/hv/channel_mgmt.c   |   6 +
>  drivers/hv/hv_xs_fastpath.c | 579
> ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/hv/hv_xs_fastpath.h |  82 +++++++
>  include/linux/hyperv.h      |   9 +
>  7 files changed, 689 insertions(+)
>  create mode 100644 drivers/hv/hv_xs_fastpath.c
>  create mode 100644 drivers/hv/hv_xs_fastpath.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9487061..b547eb9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8440,6 +8440,7 @@ M:	Haiyang Zhang <haiyangz@...rosoft.com>
>  M:	Stephen Hemminger <sthemmin@...rosoft.com>
>  M:	Wei Liu <wei.liu@...nel.org>
>  M:	Dexuan Cui <decui@...rosoft.com>
> +M:	Long Li <longli@...rosoft.com>
>  L:	linux-hyperv@...r.kernel.org
>  S:	Supported
>  T:	git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git

Seems like adding you to the Hyper-V maintainers list should be a
separate patch.  Having "maintainer" status is unrelated to adding
this new driver.

> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 66c794d..2128a8b 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -27,4 +27,15 @@ config HYPERV_BALLOON
>  	help
>  	  Select this option to enable Hyper-V Balloon driver.
> 
> +config HYPERV_XS_FASTPATH
> +	tristate "Microsoft Azure XStore Fastpath driver"
> +	depends on HYPERV && X86_64
> +	help
> +	  Select this option to enable Microsoft Azure XStore Fastpath driver.
> +
> +	  This driver supports accelerated Microsoft Azure XStore Blob access.
> +	  To compile this driver as a module, choose M here. The module will be
> +	  called hv_xs_fastpath.
> +
> +
>  endmenu
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index 94daf82..080fe88 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -2,6 +2,7 @@
>  obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
>  obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
>  obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
> +obj-$(CONFIG_HYPERV_XS_FASTPATH)+= hv_xs_fastpath.o
> 
>  CFLAGS_hv_trace.o = -I$(src)
>  CFLAGS_hv_balloon.o = -I$(src)
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 6fd7ae5..a3f156e 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -153,6 +153,12 @@
>  	  .allowed_in_isolated = false,
>  	},
> 
> +	/* XStore Fastpath */
> +	{ .dev_type = HV_XS_FASTPATH,
> +	  HV_XS_FASTPATH_GUID,
> +	  .perf_device = true,

I'm curious about setting .perf_device=true.  The other
perf devices are those that support multiple VMbus channels,
but this device has only a single channel.  Is there another
reason to set .perf_device=true?

Also set .allowed_in_isolated to "false" so that it is
explicitly called out like the other entries in the table.

> +	},
> +
>  	/* Unknown GUID */
>  	{ .dev_type = HV_UNKNOWN,
>  	  .perf_device = false,
> diff --git a/drivers/hv/hv_xs_fastpath.c b/drivers/hv/hv_xs_fastpath.c
> new file mode 100644
> index 0000000..ee4152e
> --- /dev/null
> +++ b/drivers/hv/hv_xs_fastpath.c
> @@ -0,0 +1,579 @@
> +// SPDX-License-Identifier: GPL-2.0-only

Let's have an offline discussion about the license choice here
and in hv_xs_fastpath.h.

> +/*
> + * Copyright (c) 2021, Microsoft Corporation.
> + *
> + * Authors:
> + *   Long Li <longli@...rosoft.com>
> + */
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/cred.h>
> +#include <linux/debugfs.h>
> +#include <linux/pagemap.h>
> +#include "hv_xs_fastpath.h"
> +
> +#ifdef CONFIG_DEBUG_FS
> +struct dentry *xs_fastpath_debugfs_root;
> +#endif
> +
> +static struct xs_fastpath_device xs_fastpath_dev;
> +
> +static int xs_fastpath_ringbuffer_size = (128 * 1024);
> +module_param(xs_fastpath_ringbuffer_size, int, 0444);
> +MODULE_PARM_DESC(xs_fastpath_ringbuffer_size, "Ring buffer size (bytes)");
> +
> +static const struct hv_vmbus_device_id id_table[] = {
> +	{ HV_XS_FASTPATH_GUID,
> +	  .driver_data = 0
> +	},
> +	{ },
> +};
> +
> +#define XS_ERR 0
> +#define XS_WARN 1
> +#define XS_DBG 2
> +static int log_level = XS_WARN;
> +module_param(log_level, int, 0644);
> +MODULE_PARM_DESC(logging_level,

s/logging_level/log_level/

> +	"Log level, 0 - Error, 1 - Warning (default), 3 - Debug.");

s/Log level, 0/Log level: 0/
s/3 - Debug/2 - Debug/

> +
> +#define xs_fastpath_log(level, fmt, args...)	\
> +do {	\
> +	if (level <= log_level)	\
> +		pr_err("%s:%d " fmt, __func__, __LINE__, ##args);	\
> +} while (0)
> +
> +#define xs_fastpath_dbg(fmt, args...) xs_fastpath_log(XS_DBG, fmt, ##args)
> +#define xs_fastpath_warn(fmt, args...) xs_fastpath_log(XS_WARN, fmt, ##args)
> +#define xs_fastpath_err(fmt, args...) xs_fastpath_log(XS_ERR, fmt, ##args)
> +
> +struct xs_fastpath_vsp_request_ctx {
> +	struct list_head list;
> +	struct completion wait_vsp;
> +	struct xs_fastpath_request_sync *request;
> +};

Any reason for this structure definition to be here instead of
in hv_xs_fastpath.h?

> +
> +static void xs_fastpath_on_channel_callback(void *context)
> +{
> +	struct vmbus_channel *channel = (struct vmbus_channel *)context;
> +	const struct vmpacket_descriptor *desc;
> +
> +	xs_fastpath_dbg("entering interrupt from vmbus\n");
> +	foreach_vmbus_pkt(desc, channel) {
> +		struct xs_fastpath_vsp_request_ctx *request_ctx;
> +		struct xs_fastpath_vsp_response *response;
> +		u64 cmd_rqst = vmbus_request_addr(&channel->requestor,
> +					desc->trans_id);
> +		if (cmd_rqst == VMBUS_RQST_ERROR) {
> +			xs_fastpath_err("Incorrect transaction id %llu\n",
> +				desc->trans_id);
> +			continue;
> +		}
> +		request_ctx = (struct xs_fastpath_vsp_request_ctx *) cmd_rqst;
> +		response = hv_pkt_data(desc);
> +
> +		xs_fastpath_dbg("got response for request %pUb status %u "
> +			"response_len %u\n",
> +			&request_ctx->request->guid, response->error,
> +			response->response_len);
> +		request_ctx->request->response.status = response->error;
> +		request_ctx->request->response.response_len =
> +			response->response_len;
> +		complete(&request_ctx->wait_vsp);
> +	}

This for loop could potentially go on for a long time if there were lots
of responses in the ring buffer, or responses being added while this
loop is running.  It seems like there should be some timeout, and the
remaining work rescheduled to prevent it from running too long.  I know
this code is just like the code in storvsc, but storvsc_on_channel_callback()
has the same problem, which is on my list to fix.

> +
> +}
> +
> +static int xs_fastpath_fop_open(struct inode *inode, struct file *file)
> +{
> +	atomic_inc(&xs_fastpath_dev.file_count);
> +	if (xs_fastpath_dev.removing) {
> +		if (atomic_dec_and_test(&xs_fastpath_dev.file_count))
> +			wake_up(&xs_fastpath_dev.wait_files);
> +		return -ENODEV;
> +	}
> +
> +	file->private_data = &xs_fastpath_dev;
> +
> +	return 0;
> +}
> +
> +static int xs_fastpath_fop_release(struct inode *inode, struct file *file)
> +{
> +	if (atomic_dec_and_test(&xs_fastpath_dev.file_count))
> +		wake_up(&xs_fastpath_dev.wait_files);
> +	return 0;
> +}
> +
> +static inline bool xs_fastpath_safe_file_access(struct file *file)
> +{
> +	return file->f_cred == current_cred() && !uaccess_kernel();
> +}
> +
> +static int get_buffer_pages(int rw, void __user *buffer, u32 buffer_len,
> +	struct page ***ppages, size_t *start, size_t *num_pages)
> +{
> +	struct iovec iov;
> +	struct iov_iter iter;
> +	int ret;
> +	ssize_t result;
> +	struct page **pages;
> +
> +	ret = import_single_range(rw, buffer, buffer_len, &iov, &iter);
> +	if (ret) {
> +		xs_fastpath_err("request buffer access error %d\n", ret);
> +		return ret;
> +	}
> +	xs_fastpath_dbg("iov_iter type %d offset %lu count %lu nr_segs %lu\n",
> +		iter.type, iter.iov_offset, iter.count, iter.nr_segs);
> +
> +	result = iov_iter_get_pages_alloc(&iter, &pages, buffer_len, start);
> +	if (result < 0) {
> +		xs_fastpath_err("failed to ping user pages result=%ld\n", result);

s/ping/pin/   [??]

> +		return result;
> +	}
> +	if (result != buffer_len) {
> +		xs_fastpath_err("can't ping user pages requested %d got %ld\n",

s/ping/pin/   [??]

> +			buffer_len, result);
> +		return -EFAULT;
> +	}
> +
> +	*ppages = pages;
> +	*num_pages = (result + *start + PAGE_SIZE - 1) / PAGE_SIZE;
> +	return 0;
> +}

Can any of the xs_fastpath_err() calls be caused by a user error, such
as passing in bad values on an ioctl call?  If so, we probably want to
not print a console message in those cases.

> +
> +static void fill_in_page_buffer(u64 *pfn_array,
> +	int *index, struct page **pages, unsigned long num_pages)
> +{
> +	int i, page_idx = *index;
> +
> +	for (i = 0; i < num_pages; i++)
> +		pfn_array[page_idx++] = page_to_pfn(pages[i]);
> +	*index = page_idx;
> +}

See notes below.  Filling in the pfn_array needs to account for
running on an ARM64 system where PAGE_SIZE != HV_HYP_PAGE_SIZE.

> +
> +static void free_buffer_pages(size_t num_pages, struct page **pages)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < num_pages; i++)
> +		if (pages[i])
> +			put_page(pages[i]);
> +	kfree(pages);

Doesn't this need to be kvfree()?  'pages' is allocated with kvmalloc(),
which could return vmalloc memory.

> +}
> +
> +static long xs_fastpath_ioctl_user_request(struct file *filp, unsigned long arg)
> +{
> +	struct xs_fastpath_device *dev = filp->private_data;
> +	char __user *argp = (char __user *) arg;
> +	struct xs_fastpath_request_sync request;
> +	struct xs_fastpath_vsp_request_ctx *request_ctx;
> +	unsigned long flags;
> +	int ret;
> +	size_t request_start, request_num_pages = 0;
> +	size_t response_start, response_num_pages = 0;
> +	size_t data_start, data_num_pages = 0, total_num_pages;
> +	struct page **request_pages = NULL, **response_pages = NULL;
> +	struct page **data_pages = NULL;
> +	struct vmbus_packet_mpb_array *desc;
> +	u64 *pfn_array;
> +	int desc_size;
> +	int page_idx;
> +	struct xs_fastpath_vsp_request *vsp_request;
> +
> +	if (dev->removing)
> +		return -ENODEV;

I think this test is just to avoid starting a new request if we know that the
device has been marked for removing.  But because there's no
synchronization, it's possible for the "removing" flag to be set
immediately after the test has passed.

> +
> +	if (!xs_fastpath_safe_file_access(filp)) {
> +		xs_fastpath_err("process %d(%s) changed security contexts after"
> +			" opening file descriptor\n",
> +			task_tgid_vnr(current), current->comm);

I don't think we should produce a console message, at least not a "err"
level.  It gives a user the ability to generate an arbitrary number of console
messages.  Maybe "dbg" level would be OK.

> +		return -EACCES;
> +	}
> +
> +	if (!access_ok(argp, sizeof(struct xs_fastpath_request_sync))) {
> +
> +		xs_fastpath_err("don't have permission to user provided buffer\n");
> +		return -EFAULT;
> +	}

Is this check needed at all?  Again, we don't want to generate a
console message due to a user error, and copy_from_user() below
will validate that the access is OK.   A "dbg" level message would be
OK.

> +
> +	if (copy_from_user(&request, argp, sizeof(request)))
> +		return -EFAULT;
> +
> +	xs_fastpath_dbg("xs_fastpath ioctl request guid %pUb timeout %u request_len %u"
> +		" response_len %u data_len %u request_buffer %llx "
> +		"response_buffer %llx data_buffer %llx\n",
> +		&request.guid, request.timeout, request.request_len,
> +		request.response_len, request.data_len, request.request_buffer,
> +		request.response_buffer, request.data_buffer);
> +
> +	if (!request.request_len || !request.response_len)
> +		return -EINVAL;
> +
> +	request_ctx = kzalloc(sizeof(*request_ctx), GFP_KERNEL);
> +	if (!request_ctx)
> +		return -ENOMEM;

Could the request_ctx just be on the stack?  Below, it stores
a pointer to the request, which is on the stack, and the
request is accessed on the onchannel_callback function.
So having request_ctx on the stack wouldn't be the only
usage of a stack variable in the onchannel_callback function.

> +
> +	init_completion(&request_ctx->wait_vsp);
> +	request_ctx->request = &request;
> +
> +	/*
> +	 * Need to set rw to READ to have page table set up for passing to VSP.
> +	 * Setting it to WRITE will cause the page table entry not allocated
> +	 * as it's waiting on Copy-On-Write on next page fault. This doesn't
> +	 * work in this scenario because VSP wants all the pages to be present.
> +	 */
> +	ret = get_buffer_pages(READ, (void __user *) request.request_buffer,
> +		request.request_len, &request_pages, &request_start,
> +		&request_num_pages);
> +	if (ret)
> +		goto get_user_page_failed;
> +
> +	ret = get_buffer_pages(READ, (void __user *) request.response_buffer,
> +		request.response_len, &response_pages, &response_start,
> +		&response_num_pages);
> +	if (ret)
> +		goto get_user_page_failed;
> +
> +	if (request.data_len) {
> +		ret = get_buffer_pages(READ,
> +			(void __user *) request.data_buffer, request.data_len,
> +			&data_pages, &data_start, &data_num_pages);
> +		if (ret)
> +			goto get_user_page_failed;
> +	}

Do the request_len, response_len, and data_len need to be validated
before being passed to get_buffer_pages() since they are passed in from
user space?  The fields are 32 bits, so the max value is 4 Gbytes, which
maybe is OK.  But with XS_FASTPATH_MAX_PAGES set to 8192, the
maximum size is 32 Mbytes with a 4K page size, and 512 Mbytes with
a 64K page size.

> +
> +	total_num_pages = request_num_pages + response_num_pages +
> +				data_num_pages;
> +	if (total_num_pages > XS_FASTPATH_MAX_PAGES) {

Note that this check imposes different limits depending on PAGE_SIZE.
Is the value XS_FASTPATH_MAX_PAGES governed by the largest
vmbus_packet_mpb_array that Hyper-V will accept?  If not, does there
need to be a check against the max vmbus_packet_mpb_array size?
(I don't know what the Hyper-V limit is.)

> +		xs_fastpath_err("number of DMA pages %lu buffer exceeding %u\n",
> +			total_num_pages, XS_FASTPATH_MAX_PAGES);
> +		ret = -EINVAL;
> +		goto get_user_page_failed;
> +	}
> +
> +	/* Construct a VMBUS packet and send it over to VSP */
> +	desc_size = sizeof(struct vmbus_packet_mpb_array) +
> +			sizeof(u64) * total_num_pages;

The above doesn't handle the case where the guest PAGE_SIZE
is different from the Hyper-V page size (HV_HYP_PAGE_SIZE), as
can happen on ARM64.

> +	desc = kzalloc(desc_size, GFP_KERNEL);
> +	vsp_request = kzalloc(sizeof(*vsp_request), GFP_KERNEL);
> +	if (!desc || !vsp_request) {
> +		kfree(desc);
> +		kfree(vsp_request);
> +		ret = -ENOMEM;
> +		goto get_user_page_failed;
> +	}
> +
> +	desc->range.offset = 0;
> +	desc->range.len = total_num_pages * PAGE_SIZE;
> +	pfn_array = desc->range.pfn_array;
> +	page_idx = 0;
> +
> +	if (request.data_len) {
> +		fill_in_page_buffer(pfn_array, &page_idx, data_pages,
> +			data_num_pages);
> +		vsp_request->data_buffer_offset = data_start;
> +		vsp_request->data_buffer_length = request.data_len;
> +		vsp_request->data_buffer_valid = 1;
> +	}
> +
> +	fill_in_page_buffer(pfn_array, &page_idx, request_pages,
> +		request_num_pages);
> +	vsp_request->request_buffer_offset = request_start +
> +						data_num_pages * PAGE_SIZE;
> +	vsp_request->request_buffer_length = request.request_len;
> +
> +	fill_in_page_buffer(pfn_array, &page_idx, response_pages,
> +		response_num_pages);
> +	vsp_request->response_buffer_offset = response_start +
> +		(data_num_pages + request_num_pages) * PAGE_SIZE;
> +	vsp_request->response_buffer_length = request.response_len;

All of the above that is filling in the pfn_array needs to handle the case
where PAGE_SIZE != HV_HYP_PAGE_SIZE.

> +
> +	vsp_request->version = 0;
> +	vsp_request->timeout_ms = request.timeout;

Does the request.timeout value from user space need to be validated?
I'm thinking about a value of 0, or a really small value like 1 ms, or a
really large value that effectively means no timeout. 

> +	vsp_request->operation_type = XS_FASTPATH_DRIVER_USER_REQUEST;
> +	guid_copy(&vsp_request->transaction_id, &request.guid);
> +
> +	spin_lock_irqsave(&dev->vsp_pending_lock, flags);
> +	list_add_tail(&request_ctx->list, &dev->vsp_pending_list);
> +	spin_unlock_irqrestore(&dev->vsp_pending_lock, flags);
> +
> +	xs_fastpath_dbg("sending request to VSP\n");
> +	xs_fastpath_dbg("desc_size %u desc->range.len %u desc->range.offset %u\n",
> +		desc_size, desc->range.len, desc->range.offset);
> +	xs_fastpath_dbg("vsp_request data_buffer_offset %u data_buffer_length %u "
> +		"data_buffer_valid %u request_buffer_offset %u "
> +		"request_buffer_length %u response_buffer_offset %u "
> +		"response_buffer_length %u\n",
> +		vsp_request->data_buffer_offset,
> +		vsp_request->data_buffer_length,
> +		vsp_request->data_buffer_valid,
> +		vsp_request->request_buffer_offset,
> +		vsp_request->request_buffer_length,
> +		vsp_request->response_buffer_offset,
> +		vsp_request->response_buffer_length);
> +
> +	ret = vmbus_sendpacket_mpb_desc(dev->device->channel, desc, desc_size,
> +		vsp_request, sizeof(*vsp_request), (u64) request_ctx);
> +
> +	kfree(desc);
> +	kfree(vsp_request);
> +	if (ret)
> +		goto vmbus_send_failed;
> +
> +	wait_for_completion(&request_ctx->wait_vsp);

This is a "wait forever", which exists in other places in the Hyper-V drivers,
but we've learned that might not be a good idea.  It's more complicated
to deal with a timeout, and the potential for the Hyper-V response to come in
after the timeout, but that would be the most robust approach.  Maybe we
can live with the "wait forever" approach for now, but in the long run we'll
probably want something like a 30 second or 60 second timeout.

> +
> +	/*
> +	 * At this point, the response is already written to request
> +	 * by VMBUS completion handler, copy them to user-mode buffers
> +	 * and return to user-mode
> +	 */
> +	if (copy_to_user(argp +
> +			offsetof(struct xs_fastpath_request_sync,
> +				response.status),
> +			&request.response.status,
> +			sizeof(request.response.status))) {
> +		ret = -EFAULT;
> +		goto vmbus_send_failed;
> +	}
> +
> +	if (copy_to_user(argp +
> +			offsetof(struct xs_fastpath_request_sync,
> +				response.response_len),
> +			&request.response.response_len,
> +			sizeof(request.response.response_len)))
> +		ret = -EFAULT;
> +
> +vmbus_send_failed:
> +	spin_lock_irqsave(&dev->vsp_pending_lock, flags);
> +	list_del(&request_ctx->list);
> +	if (list_empty(&dev->vsp_pending_list))
> +		wake_up(&dev->wait_vsp);
> +	spin_unlock_irqrestore(&dev->vsp_pending_lock, flags);
> +
> +get_user_page_failed:
> +	free_buffer_pages(request_num_pages, request_pages);
> +	free_buffer_pages(response_num_pages, response_pages);
> +	free_buffer_pages(data_num_pages, data_pages);
> +
> +	kfree(request_ctx);
> +	return ret;
> +}
> +
> +static long xs_fastpath_fop_ioctl(struct file *filp, unsigned int cmd,
> +	unsigned long arg)
> +{
> +	long ret = -EIO;
> +
> +	switch (cmd) {
> +	case IOCTL_XS_FASTPATH_DRIVER_USER_REQUEST:
> +		if (_IOC_SIZE(cmd) != sizeof(struct xs_fastpath_request_sync))
> +			return -EINVAL;
> +		ret = xs_fastpath_ioctl_user_request(filp, arg);
> +		break;
> +
> +	default:
> +		xs_fastpath_err("unrecognized IOCTL code %u\n", cmd);

Again, probably don't want to output a console message just because
user space passed a bad ioctl code.

> +	}
> +
> +	return ret;
> +}
> +
> +static const struct file_operations xs_fastpath_client_fops = {
> +	.owner	= THIS_MODULE,
> +	.open	= xs_fastpath_fop_open,
> +	.unlocked_ioctl = xs_fastpath_fop_ioctl,
> +	.release = xs_fastpath_fop_release,
> +};
> +
> +static struct miscdevice xs_fastpath_misc_device = {
> +	MISC_DYNAMIC_MINOR,
> +	"azure_xs_fastpath",
> +	&xs_fastpath_client_fops,
> +};
> +
> +static int xs_fastpath_show_pending_requests(struct seq_file *m, void *v)
> +{
> +	unsigned long flags;
> +	struct xs_fastpath_vsp_request_ctx *request_ctx;
> +
> +	seq_puts(m, "List of pending requests\n");
> +	seq_puts(m, "UUID request_len response_len data_len "
> +		"request_buffer response_buffer data_buffer\n");
> +	spin_lock_irqsave(&xs_fastpath_dev.vsp_pending_lock, flags);
> +	list_for_each_entry(request_ctx, &xs_fastpath_dev.vsp_pending_list, list) {
> +		seq_printf(m, "%pUb ", &request_ctx->request->guid);
> +		seq_printf(m, "%u ", request_ctx->request->request_len);
> +		seq_printf(m, "%u ", request_ctx->request->response_len);
> +		seq_printf(m, "%u ", request_ctx->request->data_len);
> +		seq_printf(m, "%llx ", request_ctx->request->request_buffer);
> +		seq_printf(m, "%llx ", request_ctx->request->response_buffer);
> +		seq_printf(m, "%llx\n", request_ctx->request->data_buffer);
> +	}
> +	spin_unlock_irqrestore(&xs_fastpath_dev.vsp_pending_lock, flags);
> +
> +	return 0;
> +}
> +
> +static int xs_fastpath_debugfs_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, xs_fastpath_show_pending_requests, NULL);
> +}
> +
> +static const struct file_operations xs_fastpath_debugfs_fops = {
> +	.open		= xs_fastpath_debugfs_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release
> +};
> +
> +static void xs_fastpath_remove_device(struct xs_fastpath_device *dev)
> +{
> +	wait_event(dev->wait_files, atomic_read(&dev->file_count) == 0);

After this wait_event() completes, but before misc_deregister() completes,
the device can still be found and opened.  Because the removing flag is
set, xs_fastpath_fop_open() won't do anything but increment the
file_count, then decrement it again, and call wake_up().  That all works
with xs_fastpath_dev as a static variable.  But it might not work if
xs_fastpath_dev was dynamically allocated, as would be the case if
multiple such devices were supported.  So I'm wondering if this code
really should do the misc_deregister() and debugfs_remove_recursive()
first, and then to the wait_event().  When done in that order, you
know that the device can't be found again after the wait_event()
completes.  Then the "removing" flag is not even needed.

> +	misc_deregister(&xs_fastpath_misc_device);
> +#ifdef CONFIG_DEBUG_FS
> +	debugfs_remove_recursive(xs_fastpath_debugfs_root);
> +#endif
> +	/* At this point, we won't get any requests from user-mode */
> +}
> +
> +static int xs_fastpath_create_device(struct xs_fastpath_device *dev)
> +{
> +	int rc;
> +	struct dentry *d;
> +
> +	atomic_set(&xs_fastpath_dev.file_count, 0);
> +	init_waitqueue_head(&xs_fastpath_dev.wait_files);
> +	rc = misc_register(&xs_fastpath_misc_device);
> +	if (rc) {
> +		xs_fastpath_err("misc_register failed rc %d\n", rc);
> +		return rc;
> +	}
> +
> +#ifdef CONFIG_DEBUG_FS
> +	xs_fastpath_debugfs_root = debugfs_create_dir("xs_fastpath", NULL);
> +	if (!IS_ERR_OR_NULL(xs_fastpath_debugfs_root)) {
> +		d = debugfs_create_file("pending_requests", 0400,
> +			xs_fastpath_debugfs_root, NULL,
> +			&xs_fastpath_debugfs_fops);
> +		if (IS_ERR_OR_NULL(d)) {
> +			xs_fastpath_err("failed to create debugfs file\n");
> +			debugfs_remove_recursive(xs_fastpath_debugfs_root);
> +			xs_fastpath_debugfs_root = NULL;
> +		}
> +	} else
> +		xs_fastpath_err("failed to create debugfs root\n");
> +#endif
> +
> +	return 0;
> +}
> +
> +static int xs_fastpath_connect_to_vsp(struct hv_device *device, u32 ring_size)
> +{
> +	int ret;
> +
> +	spin_lock_init(&xs_fastpath_dev.vsp_pending_lock);
> +	INIT_LIST_HEAD(&xs_fastpath_dev.vsp_pending_list);
> +	init_waitqueue_head(&xs_fastpath_dev.wait_vsp);
> +	xs_fastpath_dev.removing = false;
> +
> +	xs_fastpath_dev.device = device;
> +	device->channel->rqstor_size = xs_fastpath_ringbuffer_size /
> +		sizeof(struct xs_fastpath_vsp_request);

That setting probably isn't correct.  Presumably the Hyper-V VSP can
hold a certain number of requests that it has already removed the
guest->host ring buffer.  So the max number of in-flight requests
would be that Hyper-V VSP count, plus the number of requests that
will fit in the ring buffer.

Independent of this setting, the ioctl's are synchronous, so the
total number of in-flight requests will be the number of threads
making requests.   Note that these ioctl calls include all blobs
being accessed using this method by all users running in the VM.
Nonetheless, it might be reasonable to just put a fixed cap
on that number.  Perhaps something like 1024?  If the limit
is exceeded, the calls to vmbus_sendpacket_mpb_desc()
will start failing because of being unable to allocate a
requestID.

Presumably the memory management subsystem will also
limit the amount of memory being pinned by the ioctl calls,
which could also cause the ioctl calls to return an error.


> +
> +	ret = vmbus_open(device->channel, ring_size, ring_size, NULL, 0,
> +			xs_fastpath_on_channel_callback, device->channel);
> +
> +	if (ret) {
> +		xs_fastpath_err("failed to connect to VSP ret %d\n", ret);
> +		return ret;
> +	}
> +
> +	hv_set_drvdata(device, &xs_fastpath_dev);
> +
> +	return ret;
> +}
> +
> +static void xs_fastpath_remove_vmbus(struct hv_device *device)
> +{
> +	struct xs_fastpath_device *dev = hv_get_drvdata(device);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&dev->vsp_pending_lock, flags);
> +	if (!list_empty(&dev->vsp_pending_list)) {

I don't think this can ever happen.  If the device has already been
removed by xs_fastpath_remove_device(), then we know that the
device isn't open in any processes, so there can't be any ioctl's
in progress that would have entries in the vsp_pending_list.

> +		spin_unlock_irqrestore(&dev->vsp_pending_lock, flags);
> +		xs_fastpath_dbg("wait for vsp_pending_list\n");
> +		wait_event(dev->wait_vsp,
> +			list_empty(&dev->vsp_pending_list));
> +	} else
> +		spin_unlock_irqrestore(&dev->vsp_pending_lock, flags);
> +
> +	/* At this point, no VSC/VSP traffic is possible over vmbus */
> +	hv_set_drvdata(device, NULL);
> +	vmbus_close(device->channel);
> +}
> +
> +static int xs_fastpath_probe(struct hv_device *device,
> +			const struct hv_vmbus_device_id *dev_id)
> +{
> +	int rc;
> +
> +	xs_fastpath_dbg("probing device\n");
> +
> +	rc = xs_fastpath_connect_to_vsp(device, xs_fastpath_ringbuffer_size);
> +	if (rc) {
> +		xs_fastpath_err("error connecting to VSP rc %d\n", rc);
> +		return rc;
> +	}
> +
> +	// create user-mode client library facing device
> +	rc = xs_fastpath_create_device(&xs_fastpath_dev);
> +	if (rc) {
> +		xs_fastpath_remove_vmbus(device);
> +		return rc;
> +	}
> +
> +	xs_fastpath_dbg("successfully probed device\n");
> +	return 0;
> +}
> +
> +static int xs_fastpath_remove(struct hv_device *dev)
> +{
> +	struct xs_fastpath_device *device = hv_get_drvdata(dev);
> +
> +	device->removing = true;
> +	xs_fastpath_remove_device(device);
> +	xs_fastpath_remove_vmbus(dev);
> +	return 0;
> +}
> +
> +static struct hv_driver xs_fastpath_drv = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = id_table,
> +	.probe = xs_fastpath_probe,
> +	.remove = xs_fastpath_remove,
> +	.driver = {
> +		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
> +	},
> +};
> +
> +static int __init xs_fastpath_drv_init(void)
> +{
> +	int ret;
> +
> +	ret = vmbus_driver_register(&xs_fastpath_drv);
> +	return ret;
> +}
> +
> +static void __exit xs_fastpath_drv_exit(void)
> +{
> +	vmbus_driver_unregister(&xs_fastpath_drv);
> +}
> +
> +MODULE_LICENSE("GPL");

Let's discuss the license choice offline.

> +MODULE_DESCRIPTION("Microsoft Azure XStore Storage Fastpath driver");

The word "Storage" seems unnecessary.  Everywhere else the name is just
"Azure Xstore Fastpath".

> +module_init(xs_fastpath_drv_init);
> +module_exit(xs_fastpath_drv_exit);
> diff --git a/drivers/hv/hv_xs_fastpath.h b/drivers/hv/hv_xs_fastpath.h
> new file mode 100644
> index 0000000..6db1904
> --- /dev/null
> +++ b/drivers/hv/hv_xs_fastpath.h
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2021, Microsoft Corporation.
> + *
> + * Authors:
> + *   Long Li <longli@...rosoft.com>
> + */
> +#ifndef _XS_FASTPATH_H
> +#define _XS_FASTPATH_H
> +
> +#include <linux/hyperv.h>
> +#include <linux/miscdevice.h>
> +#include <linux/hashtable.h>
> +#include <linux/uio.h>
> +
> +struct xs_fastpath_device {
> +	struct hv_device *device;
> +	bool removing;
> +
> +	struct list_head vsp_pending_list;
> +	spinlock_t vsp_pending_lock;
> +	wait_queue_head_t wait_vsp;
> +
> +	wait_queue_head_t wait_files;
> +	atomic_t file_count;
> +};
> +
> +/* user-mode sync request sent through ioctl */
> +struct xs_fastpath_request_sync_response {
> +	__u32 status;
> +	__u32 response_len;
> +};
> +
> +struct xs_fastpath_request_sync {
> +	guid_t guid;
> +	__u32 timeout;
> +	__u32 request_len;
> +	__u32 response_len;
> +	__u32 data_len;
> +	__aligned_u64 request_buffer;
> +	__aligned_u64 response_buffer;
> +	__aligned_u64 data_buffer;
> +	struct xs_fastpath_request_sync_response response;
> +};

Wouldn't the structures used by user space need to go in a
uapi header file?

> +
> +/* VSP messages */
> +enum xs_fastpath_vsp_request_type {
> +	XS_FASTPATH_DRIVER_REQUEST_FIRST     = 0x100,
> +	XS_FASTPATH_DRIVER_USER_REQUEST      = 0x100,
> +	XS_FASTPATH_DRIVER_REGISTER_BUFFER   = 0x101,
> +	XS_FASTPATH_DRIVER_DEREGISTER_BUFFER = 0x102,
> +	XS_FASTPATH_DRIVER_REQUEST_MAX       = 0x103

Most of these don't seem to be used in the code.  And it seems
a bit unexpected for the MAX value to not be the same as the
last value (DEREGISTER_BUFFER).

> +};
> +
> +/* VSC->VSP request */
> +struct xs_fastpath_vsp_request {
> +	u32 version;
> +	u32 timeout_ms;
> +	u32 data_buffer_offset;
> +	u32 data_buffer_length;
> +	u32 data_buffer_valid;
> +	u32 operation_type;
> +	u32 request_buffer_offset;
> +	u32 request_buffer_length;
> +	u32 response_buffer_offset;
> +	u32 response_buffer_length;
> +	guid_t transaction_id;
> +} __packed;
> +
> +/* VSP->VSC response */
> +struct xs_fastpath_vsp_response {
> +	u32 length;
> +	u32 error;
> +	u32 response_len;
> +} __packed;
> +
> +#define IOCTL_XS_FASTPATH_DRIVER_USER_REQUEST \
> +		_IOWR('R', 10, struct xs_fastpath_request_sync)

How was the ioctl code decided?  Using 'R' and '10' seems to
be in the range assigned to /dev/random, per the file
Documentation/userspace-api/ioctl/ioctl-number.rst.

Does this also need to go in a uapi header file?

> +
> +#define XS_FASTPATH_MAX_PAGES 8192

How was this value determined?  Would be good to leave
a comment, and the comment should speak to how the
limit is handled when PAGE_SIZE varies on different
architectures.

> +
> +#endif /* define _XS_FASTPATH_H */
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index d1e59db..a1730c4 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -772,6 +772,7 @@ enum vmbus_device_type {
>  	HV_FCOPY,
>  	HV_BACKUP,
>  	HV_DM,
> +	HV_XS_FASTPATH,
>  	HV_UNKNOWN,
>  };
> 
> @@ -1350,6 +1351,14 @@ int vmbus_allocate_mmio(struct resource **new, struct
> hv_device *device_obj,
>  			  0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f)
> 
>  /*
> + * XStore Fastpath GUID
> + * {0590b792-db64-45cc-81db-b8d70c577c9e}
> + */
> +#define HV_XS_FASTPATH_GUID \
> +	.guid = GUID_INIT(0x0590b792, 0xdb64, 0x45cc, 0x81, 0xdb, \
> +			  0xb8, 0xd7, 0x0c, 0x57, 0x7c, 0x9e)
> +
> +/*
>   * Shutdown GUID
>   * {0e0b6031-5213-4934-818b-38d90ced39db}
>   */
> --
> 1.8.3.1