[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a7716784-face-44ff-837d-50f7ca79f0e9@linux.microsoft.com>
Date: Tue, 11 Mar 2025 10:15:25 +0530
From: Naman Jain <namjain@...ux.microsoft.com>
To: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: "K . Y . Srinivasan" <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>,
Stephen Hemminger <stephen@...workplumber.org>,
linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...nel.org, Saurabh Sengar <ssengar@...ux.microsoft.com>,
Michael Kelley <mhklinux@...look.com>, Long Li <longli@...rosoft.com>
Subject: Re: [PATCH] uio_hv_generic: Fix sysfs creation path for ring buffer
On 3/5/2025 12:36 PM, Naman Jain wrote:
>
>
> On 2/27/2025 11:54 AM, Naman Jain wrote:
>>
>>
>> On 2/26/2025 8:03 PM, Greg Kroah-Hartman wrote:
>>> On Wed, Feb 26, 2025 at 05:51:46PM +0530, Naman Jain wrote:
>>>>
>>>>
>>>> On 2/26/2025 3:33 PM, Greg Kroah-Hartman wrote:
>>>>> On Wed, Feb 26, 2025 at 10:43:41AM +0530, Naman Jain wrote:
>>>>>>
>>>>>>
>>>>>> On 2/25/2025 2:09 PM, Greg Kroah-Hartman wrote:
>>>>>>> On Tue, Feb 25, 2025 at 02:04:43PM +0530, Naman Jain wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/25/2025 11:42 AM, Greg Kroah-Hartman wrote:
>>>>>>>>> On Tue, Feb 25, 2025 at 10:50:01AM +0530, Naman Jain wrote:
>>>>>>>>>> On regular bootup, devices get registered to vmbus first, so when
>>>>>>>>>> uio_hv_generic driver for a particular device type is probed,
>>>>>>>>>> the device is already initialized and added, so sysfs creation in
>>>>>>>>>> uio_hv_generic probe works fine. However, when device is removed
>>>>>>>>>> and brought back, the channel rescinds and device again gets
>>>>>>>>>> registered to vmbus. However this time, the uio_hv_generic
>>>>>>>>>> driver is
>>>>>>>>>> already registered to probe for that device and in this case
>>>>>>>>>> sysfs
>>>>>>>>>> creation is tried before the device gets initialized completely.
>>>>>>>>>>
>>>>>>>>>> Fix this by moving the core logic of sysfs creation for ring
>>>>>>>>>> buffer,
>>>>>>>>>> from uio_hv_generic to HyperV's vmbus driver, where rest of
>>>>>>>>>> the sysfs
>>>>>>>>>> attributes for the channels are defined. While doing that,
>>>>>>>>>> make use
>>>>>>>>>> of attribute groups and macros, instead of creating sysfs
>>>>>>>>>> directly,
>>>>>>>>>> to ensure better error handling and code flow.
>>>>
>>>> <snip>
>>>>
>>>>>>>>>> +static int hv_mmap_ring_buffer_wrapper(struct file *filp,
>>>>>>>>>> struct kobject *kobj,
>>>>>>>>>> + const struct bin_attribute *attr,
>>>>>>>>>> + struct vm_area_struct *vma)
>>>>>>>>>> +{
>>>>>>>>>> + struct vmbus_channel *channel = container_of(kobj, struct
>>>>>>>>>> vmbus_channel, kobj);
>>>>>>>>>> +
>>>>>>>>>> + if (!channel->mmap_ring_buffer)
>>>>>>>>>> + return -ENODEV;
>>>>>>>>>> + return channel->mmap_ring_buffer(channel, vma);
>>>>>>>>>
>>>>>>>>> What is preventing mmap_ring_buffer from being set to NULL
>>>>>>>>> right after
>>>>>>>>> checking it and then calling it here? I see no locks here or
>>>>>>>>> where you
>>>>>>>>> are assigning this variable at all, so what is preventing these
>>>>>>>>> types of
>>>>>>>>> races?
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>>
>>>>>>>>> greg k-h
>>>>>>>>
>>>>>>>> Thank you so much for reviewing.
>>>>>>>> I spent some time to understand if this race condition can
>>>>>>>> happen and it
>>>>>>>> seems execution flow is pretty sequential, for a particular
>>>>>>>> channel of a
>>>>>>>> device.
>>>>>>>>
>>>>>>>> Unless hv_uio_remove (which makes channel->mmap_ring_buffer
>>>>>>>> NULL) can be
>>>>>>>> called in parallel to hv_uio_probe (which had set
>>>>>>>> channel->mmap_ring_buffer to non NULL), I doubt race can happen
>>>>>>>> here.
>>>>>>>>
>>>>>>>> Code Flow: (R, W-> Read, Write to channel->mmap_ring_buffer)
>>>>>>>>
>>>>>>>> vmbus_device_register
>>>>>>>> device_register
>>>>>>>> hv_uio_probe
>>>>>>>> hv_create_ring_sysfs (W to non NULL)
>>>>>>>> sysfs_update_group
>>>>>>>> vmbus_chan_attr_is_visible (R)
>>>>>>>> vmbus_add_channel_kobj
>>>>>>>> sysfs_create_group
>>>>>>>> vmbus_chan_attr_is_visible (R)
>>>>>>>> hv_mmap_ring_buffer_wrapper (critical section)
>>>>>>>>
>>>>>>>> hv_uio_remove
>>>>>>>> hv_remove_ring_sysfs (W to NULL)
>>>>>>>
>>>>>>> Yes, and right in here someone mmaps the file.
>>>>>>>
>>>>>>> I think you can race here, no locks at all feels wrong.
>>>>>>>
>>>>>>> Messing with sysfs groups and files like this is rough, and
>>>>>>> almost never
>>>>>>> a good idea, why can't you just do this all at once with the default
>>>>>>> groups, why is this being added/removed out-of-band?
>>>>>>>
>>>>>>> thanks,
>>>>>>>
>>>>>>> greg k-h
>>>>>>
>>>>>> The decision to avoid creating a "ring" sysfs attribute by default
>>>>>> likely stems from a specific use case where it wasn't needed for
>>>>>> every
>>>>>> device. By creating it automatically, it keeps the uio_hv_generic
>>>>>> driver simpler and helps prevent potential race conditions.
>>>>>> However, it
>>>>>> has an added cost of having ring buffer for all the channels,
>>>>>> where it
>>>>>> is not required. I am trying to find if there are any more
>>>>>> implications
>>>>>> of it.
>>>>>
>>>>> You do know about the "is_visable" attribute callback, right? Why not
>>>>> just use that instead of manually mucking around with the
>>>>> adding/removing of sysfs attributes at all? That is what it was
>>>>> designed for.
>>>>>
>>>>> thanks,
>>>>>
>>>>> greg k-h
>>>>
>>>> Hi Greg,
>>>> Yes, I am utilizing that in my patch. For differentiating channels of a
>>>> uio_hv_generic device, and for *selectively* creating sysfs, we
>>>> introduced this field in channel struct "channel->mmap_ring_buffer",
>>>> which we were setting only in uio_hv_generic. But, by the time we set
>>>> this in uio_hv_generic driver, the sysfs creation has already gone
>>>> through and sysfs doesn't get updated dynamically. That's where there
>>>> was a need to call sysfs_update_group. I thought the better place to
>>>> keep sysfs_update_group would be in vmbus driver, where we are creating
>>>> the original sysfs entries, hence I had to add the wrapper functions.
>>>> This led us to the race condition we are trying to address now.
>>>>
>>>>
>>>> @@ -1838,6 +1872,10 @@ static umode_t vmbus_chan_attr_is_visible(struct
>>>> kobject *kobj,
>>>> attr == &chan_attr_monitor_id.attr))
>>>> return 0;
>>>>
>>>> + /* Hide ring attribute if channel's mmap_ring_buffer function
>>>> is not yet
>>>> initialised */
>>>> + if (attr == &chan_attr_ring_buffer.attr && !channel-
>>>> >mmap_ring_buffer)
>>>> + return 0;
>>>> +
>>>> return attr->mode;
>>>
>>> Ok, that's good. BUT you need to change the detection of this to be
>>> before the device is set up in the driver. Why can't you do it in the
>>> device creation logic itself instead of after-the-fact when you will
>>> ALWAYS race with userspace?
>>>
>>> thanks,
>>>
>>> greg k-h
>>
>> Sure, will check more on this. Thanks.
>>
>> Regards,
>> Naman
>>
>
> Hi Greg,
> I understand this is deviating from the discussions that we have had
> till now, but I wanted to kindly request your opinion on the following
> approach, which came up in one of our internal discussions.
>
> By moving the sysfs creation logic from hv_uio_probe to hv_uio_open, I
> believe we can address this problem. Here are some of the benefits of
> this approach:
>
> * This approach involves minimal changes and provides a localized
> solution.
>
> * Since the use-case of ring sysfs is specific to uio_hv_generic and
> DPDK, this will give us the flexibility to modify it based on
> requirements. For example, ring_buffer_bin_attr.size should depend on
> the ring buffer's allocated size, which is easier to manage if the
> current code resides in uio_hv_generic.
>
> * The use-case of DPDK is such that at any given time, either the
> hv_netvsc kernel driver or the userspace (DPDK) will be controlling this
> HV_NIC device. We do not want to expose the ring buffer to userspace
> when hv_netvsc is using the device. This is where the "awareness" of the
> current user comes into play, and we need a way to control the
> visibility of the "ring" sysfs from uio_hv_generic.
>
>
> Alternatively, I could focus on resolving the race condition you
> mentioned and proceed with refining the patch. This approach addresses
> most of the general practice concerns you highlighted.
>
> Regards,
> Naman
Hello Greg,
Here are the two approaches that we discussed.
Can you please suggest the approach which looks better to you
for next version.
**Approach 1: move sysfs creation to hv_uio_open**
---
drivers/uio/uio_hv_generic.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c
index 1b19b5647495..fec78c90279e 100644
--- a/drivers/uio/uio_hv_generic.c
+++ b/drivers/uio/uio_hv_generic.c
@@ -218,6 +218,11 @@ hv_uio_open(struct uio_info *info, struct inode *inode)
vmbus_set_chn_rescind_callback(dev->channel, hv_uio_rescind);
vmbus_set_sc_create_callback(dev->channel, hv_uio_new_channel);
+ ret = sysfs_create_bin_file(&dev->channel->kobj, &ring_buffer_bin_attr);
+ if (ret)
+ dev_notice(&dev->device,
+ "sysfs create ring bin file failed; %d\n", ret);
+
ret = vmbus_connect_ring(dev->channel,
hv_uio_channel_cb, dev->channel);
if (ret == 0)
@@ -350,11 +355,6 @@ hv_uio_probe(struct hv_device *dev,
goto fail_close;
}
- ret = sysfs_create_bin_file(&channel->kobj, &ring_buffer_bin_attr);
- if (ret)
- dev_notice(&dev->device,
- "sysfs create ring bin file failed; %d\n", ret);
-
hv_set_drvdata(dev, pdata);
return 0;
--
=====================================================================
**Approach 2: Move sysfs creation to hyperv drivers**
---
drivers/hv/hyperv_vmbus.h | 6 +++
drivers/hv/vmbus_drv.c | 73 +++++++++++++++++++++++++++++++++++-
drivers/uio/uio_hv_generic.c | 33 ++++++----------
include/linux/hyperv.h | 6 +++
4 files changed, 95 insertions(+), 23 deletions(-)
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 29780f3a7478..0b450e53161e 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -477,4 +477,10 @@ static inline int hv_debug_add_dev_dir(struct
hv_device *dev)
#endif /* CONFIG_HYPERV_TESTING */
+/* Create and remove sysfs entry for memory mapped ring buffers for a
channel */
+int hv_create_ring_sysfs(struct vmbus_channel *channel,
+ int (*hv_mmap_ring_buffer)(struct vmbus_channel *channel,
+ struct vm_area_struct *vma));
+int hv_remove_ring_sysfs(struct vmbus_channel *channel);
+
#endif /* _HYPERV_VMBUS_H */
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 22afebfc28ff..44ecbca83c9e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1802,6 +1802,27 @@ static ssize_t subchannel_id_show(struct
vmbus_channel *channel,
}
static VMBUS_CHAN_ATTR_RO(subchannel_id);
+static int hv_mmap_ring_buffer_wrapper(struct file *filp, struct
kobject *kobj,
+ const struct bin_attribute *attr,
+ struct vm_area_struct *vma)
+{
+ struct vmbus_channel *channel = container_of(kobj, struct
vmbus_channel, kobj);
+
+ /*
+ * hv_(create|remove)_ring_sysfs implementation ensures that
mmap_ring_buffer
+ * is not NULL.
+ */
+ return channel->mmap_ring_buffer(channel, vma);
+}
+
+static struct bin_attribute chan_attr_ring_buffer = {
+ .attr = {
+ .name = "ring",
+ .mode = 0600,
+ },
+ .size = 2 * SZ_2M,
+ .mmap = hv_mmap_ring_buffer_wrapper,
+};
static struct attribute *vmbus_chan_attrs[] = {
&chan_attr_out_mask.attr,
&chan_attr_in_mask.attr,
@@ -1821,6 +1842,11 @@ static struct attribute *vmbus_chan_attrs[] = {
NULL
};
+static struct bin_attribute *vmbus_chan_bin_attrs[] = {
+ &chan_attr_ring_buffer,
+ NULL
+};
+
/*
* Channel-level attribute_group callback function. Returns the
permission for
* each attribute, and returns 0 if an attribute is not visible.
@@ -1841,9 +1867,24 @@ static umode_t vmbus_chan_attr_is_visible(struct
kobject *kobj,
return attr->mode;
}
+static umode_t vmbus_chan_bin_attr_is_visible(struct kobject *kobj,
+ const struct bin_attribute *attr, int idx)
+{
+ const struct vmbus_channel *channel =
+ container_of(kobj, struct vmbus_channel, kobj);
+
+ /* Hide ring attribute if channel's ring_sysfs_visible is set to false */
+ if (attr == &chan_attr_ring_buffer && !channel->ring_sysfs_visible)
+ return 0;
+
+ return attr->attr.mode;
+}
+
static const struct attribute_group vmbus_chan_group = {
.attrs = vmbus_chan_attrs,
- .is_visible = vmbus_chan_attr_is_visible
+ .bin_attrs = vmbus_chan_bin_attrs,
+ .is_visible = vmbus_chan_attr_is_visible,
+ .is_bin_visible = vmbus_chan_bin_attr_is_visible,
};
static const struct kobj_type vmbus_chan_ktype = {
@@ -1851,6 +1892,36 @@ static const struct kobj_type vmbus_chan_ktype = {
.release = vmbus_chan_release,
};
+/*
+ * hv_create_ring_sysfs - create ring sysfs entry corresponding to ring
buffers for a channel
+ */
+int hv_create_ring_sysfs(struct vmbus_channel *channel,
+ int (*hv_mmap_ring_buffer)(struct vmbus_channel *channel,
+ struct vm_area_struct *vma))
+{
+ struct kobject *kobj = &channel->kobj;
+
+ channel->mmap_ring_buffer = hv_mmap_ring_buffer;
+ channel->ring_sysfs_visible = true;
+ return sysfs_update_group(kobj, &vmbus_chan_group);
+}
+EXPORT_SYMBOL_GPL(hv_create_ring_sysfs);
+
+/*
+ * hv_remove_ring_sysfs - remove ring sysfs entry corresponding to ring
buffers for a channel
+ */
+int hv_remove_ring_sysfs(struct vmbus_channel *channel)
+{
+ struct kobject *kobj = &channel->kobj;
+ int ret;
+
+ channel->ring_sysfs_visible = false;
+ ret = sysfs_update_group(kobj, &vmbus_chan_group);
+ channel->mmap_ring_buffer = NULL;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(hv_remove_ring_sysfs);
+
/*
* vmbus_add_channel_kobj - setup a sub-directory under device/channels
*/
diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c
index 1b19b5647495..a398e93ed382 100644
--- a/drivers/uio/uio_hv_generic.c
+++ b/drivers/uio/uio_hv_generic.c
@@ -131,15 +131,12 @@ static void hv_uio_rescind(struct vmbus_channel
*channel)
vmbus_device_unregister(channel->device_obj);
}
-/* Sysfs API to allow mmap of the ring buffers
+/* Function used for mmap of ring buffer sysfs interface.
* The ring buffer is allocated as contiguous memory by vmbus_open
*/
-static int hv_uio_ring_mmap(struct file *filp, struct kobject *kobj,
- const struct bin_attribute *attr,
- struct vm_area_struct *vma)
+static int
+hv_uio_ring_mmap(struct vmbus_channel *channel, struct vm_area_struct *vma)
{
- struct vmbus_channel *channel
- = container_of(kobj, struct vmbus_channel, kobj);
void *ring_buffer = page_address(channel->ringbuffer_page);
if (channel->state != CHANNEL_OPENED_STATE)
@@ -149,15 +146,6 @@ static int hv_uio_ring_mmap(struct file *filp,
struct kobject *kobj,
channel->ringbuffer_pagecount << PAGE_SHIFT);
}
-static const struct bin_attribute ring_buffer_bin_attr = {
- .attr = {
- .name = "ring",
- .mode = 0600,
- },
- .size = 2 * SZ_2M,
- .mmap = hv_uio_ring_mmap,
-};
-
/* Callback from VMBUS subsystem when new channel created. */
static void
hv_uio_new_channel(struct vmbus_channel *new_sc)
@@ -178,8 +166,7 @@ hv_uio_new_channel(struct vmbus_channel *new_sc)
/* Disable interrupts on sub channel */
new_sc->inbound.ring_buffer->interrupt_mask = 1;
set_channel_read_mode(new_sc, HV_CALL_ISR);
-
- ret = sysfs_create_bin_file(&new_sc->kobj, &ring_buffer_bin_attr);
+ ret = hv_create_ring_sysfs(new_sc, hv_uio_ring_mmap);
if (ret) {
dev_err(device, "sysfs create ring bin file failed; %d\n", ret);
vmbus_close(new_sc);
@@ -350,10 +337,12 @@ hv_uio_probe(struct hv_device *dev,
goto fail_close;
}
- ret = sysfs_create_bin_file(&channel->kobj, &ring_buffer_bin_attr);
- if (ret)
- dev_notice(&dev->device,
- "sysfs create ring bin file failed; %d\n", ret);
+ /*
+ * This internally calls sysfs_update_group, which returns a non-zero
value if it executes
+ * before sysfs_create_group. This is a false alarm from this use case
point of view and
+ * thus, no need to check the return value and print warning.
+ */
+ hv_create_ring_sysfs(channel, hv_uio_ring_mmap);
hv_set_drvdata(dev, pdata);
@@ -375,7 +364,7 @@ hv_uio_remove(struct hv_device *dev)
if (!pdata)
return;
- sysfs_remove_bin_file(&dev->channel->kobj, &ring_buffer_bin_attr);
+ hv_remove_ring_sysfs(dev->channel);
uio_unregister_device(&pdata->info);
hv_uio_cleanup(dev, pdata);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 7f4f8d8bdf43..f93e489eca62 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1058,6 +1058,12 @@ struct vmbus_channel {
/* The max size of a packet on this channel */
u32 max_pkt_size;
+
+ /* function to mmap ring buffer memory to the channel's sysfs ring
attribute */
+ int (*mmap_ring_buffer)(struct vmbus_channel *channel, struct
vm_area_struct *vma);
+
+ /* boolean to control visibility of sysfs for ring buffer */
+ bool ring_sysfs_visible;
};
#define lock_requestor(channel, flags) \
--
Regards,
Naman
Powered by blists - more mailing lists