Message-ID: <MWHPR21MB159321330ED9242231D30F93D7159@MWHPR21MB1593.namprd21.prod.outlook.com>
Date:   Mon, 12 Jul 2021 03:58:32 +0000
From:   Michael Kelley <mikelley@...rosoft.com>
To:     Long Li <longli@...rosoft.com>,
        "longli@...uxonhyperv.com" <longli@...uxonhyperv.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>
CC:     Jonathan Corbet <corbet@....net>,
        KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Wei Liu <wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Hans de Goede <hdegoede@...hat.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>,
        Maximilian Luz <luzmaximilian@...il.com>,
        Mike Rapoport <rppt@...nel.org>,
        Ben Widawsky <ben.widawsky@...el.com>,
        Jiri Slaby <jirislaby@...nel.org>,
        Andra Paraschiv <andraprs@...zon.com>,
        Siddharth Gupta <sidgup@...eaurora.org>,
        Hannes Reinecke <hare@...e.de>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>
Subject: RE: [Patch v2 2/3] Drivers: hv: add Azure Blob driver

From: Long Li <longli@...rosoft.com> Sent: Friday, July 9, 2021 12:44 PM
> >
> > From: Long Li <longli@...rosoft.com> Sent: Friday, July 2, 2021 4:59 PM
> > > >
> > > > [snip]
> > > >
> > > > > > > +static void az_blob_remove_device(struct az_blob_device *dev)
> > > > > > > +{
> > > > > > > +	wait_event(dev->file_wait, list_empty(&dev->file_list));
> > > > > > > +	misc_deregister(&az_blob_misc_device);
> > > > > > > +#ifdef CONFIG_DEBUG_FS
> > > > > > > +	debugfs_remove_recursive(az_blob_debugfs_root);
> > > > > > > +#endif
> > > > > > > +	/* At this point, we won't get any requests from user-mode */
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int az_blob_create_device(struct az_blob_device *dev)
> > > > > > > +{
> > > > > > > +	int rc;
> > > > > > > +	struct dentry *d;
> > > > > > > +
> > > > > > > +	rc = misc_register(&az_blob_misc_device);
> > > > > > > +	if (rc) {
> > > > > > > +		az_blob_err("misc_register failed rc %d\n", rc);
> > > > > > > +		return rc;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +#ifdef CONFIG_DEBUG_FS
> > > > > > > +	az_blob_debugfs_root = debugfs_create_dir("az_blob", NULL);
> > > > > > > +	if (!IS_ERR_OR_NULL(az_blob_debugfs_root)) {
> > > > > > > +		d = debugfs_create_file("pending_requests", 0400,
> > > > > > > +			az_blob_debugfs_root, NULL,
> > > > > > > +			&az_blob_debugfs_fops);
> > > > > > > +		if (IS_ERR_OR_NULL(d)) {
> > > > > > > +			az_blob_warn("failed to create debugfs file\n");
> > > > > > > +			debugfs_remove_recursive(az_blob_debugfs_root);
> > > > > > > +			az_blob_debugfs_root = NULL;
> > > > > > > +		}
> > > > > > > +	} else
> > > > > > > +		az_blob_warn("failed to create debugfs root\n");
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int az_blob_connect_to_vsp(struct hv_device *device, u32 ring_size)
> > > > > > > +{
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	spin_lock_init(&az_blob_dev.file_lock);
> > > > > >
> > > > > > I'd argue that the spin lock should not be re-initialized here.
> > > > > > Here's the sequence where things go wrong:
> > > > > >
> > > > > > 1) The driver is unbound, so az_blob_remove() is called.
> > > > > > 2) az_blob_remove() sets the "removing" flag to true, and calls
> > > > > > az_blob_remove_device().
> > > > > > 3) az_blob_remove_device() waits for the file_list to become empty.
> > > > > > 4) After the file_list becomes empty, but before
> > > > > > misc_deregister() is called, a separate thread opens the device again.
> > > > > > 5) In the separate thread, az_blob_fop_open() obtains the
> > > > > > file_lock spin lock.
> > > > > > 6) Before az_blob_fop_open() releases the spin lock,
> > > > > > az_blob_remove_device() completes, along with az_blob_remove().
> > > > > > 7) Then the device gets rebound, and az_blob_connect_to_vsp()
> > > > > > gets called, all while az_blob_fop_open() still holds the spin
> > > > > > lock.  So the spin lock gets re-initialized while it is held.
> > > > > >
> > > > > > This is admittedly a far-fetched scenario, but stranger things
> > > > > > have happened. :-)  The issue is that you are counting on the
> > > > > > az_blob_dev structure to persist and have a valid file_lock,
> > > > > > even while the device is unbound.  So any initialization should
> > > > > > only happen in az_blob_drv_init().
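
For illustration, moving the one-time state initialization into
az_blob_drv_init() might look roughly like this (just a sketch against
the code as posted, not a tested change):

static int __init az_blob_drv_init(void)
{
	/* One-time init of the global device state, so that probe/remove
	 * cycles never re-initialize a lock that an in-flight open()
	 * might still be holding.
	 */
	spin_lock_init(&az_blob_dev.file_lock);
	INIT_LIST_HEAD(&az_blob_dev.file_list);
	init_waitqueue_head(&az_blob_dev.file_wait);

	return vmbus_driver_register(&az_blob_drv);
}
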
> > > > >
> > > > > I'm not sure if az_blob_probe() and az_blob_remove() can be called
> > > > > at the same time, as az_blob_remove_vmbus() is called last in
> > > > > az_blob_remove().
> > > > > Is it possible for vmbus to ask the driver to probe a new channel
> > > > > before the old channel is closed? I expect vmbus to guarantee that
> > > > > those calls are made in sequence.
> > > >
> > > > In my scenario above, az_blob_remove_vmbus() and az_blob_remove()
> > > > run to completion in Step #6, all while some other thread is still
> > > > in the middle of an
> > > > open() call and holding the file_lock spin lock.  Then in Step #7
> > > > az_blob_probe() runs.  So az_blob_remove() and az_blob_probe()
> > > > execute sequentially, not at the same time.
> > > >
> > > > Michael
> > >
> > > I think it's a valid scenario.  The return value of
> > > devtmpfs_delete_node() is not checked in misc_deregister(). It decreases
> > > the refcount on inodes, but there's no guarantee that no one else is
> > > still using it (in the middle of opening a file).
> > >
> > > However, this works fine for "rmmod", which causes the device to be removed.
> > > Before a file is opened, the refcount on the module is increased, so the
> > > module can't be removed while the file is being opened. The scenario you
> > > described can't happen in that case.
> > >
> > > But during VMBUS rescind, it can happen. It's possible that the driver
> > > is using a spinlock that has been re-initialized, if the next VMBUS
> > > offer on the same channel arrives before all the in-progress open file
> > > calls have exited.
> >
> > I was thinking about the rescind scenario.  vmbus_onoffer_rescind() will run
> > on the global workqueue.  If it eventually calls az_blob_remove() and then
> > az_blob_remove_device(), it will wait until the file_list is empty, which
> > essentially means waiting until user space processes decide to close the
> > instances they have open.  This seems like a problem that could block the
> > global workqueue for a long time and
> > thereby hang the kernel.   Is my reasoning valid?  If so, I haven't
> > thought about what the solution might be.  It seems like we do need to wait
> > until any in-progress requests to Hyper-V are complete because Hyper-V has
> > references to guest physical memory.  But waiting for all open instances to
> > be closed seems to be problematic.
> 
> My tests showed that misc_deregister() causes all opened files to be released if there is no pending I/O waiting in the
> driver.
> 
> If there is pending I/O, we must wait, as the VSP owns the memory of the I/O. The correct VSP behavior is to return all the
> pending I/O along with the rescind. This is the same as what storvsc does for rescind.

Yes, we have to wait since the VSP owns the memory.  But you make
a good point that the VSP behavior should be to return all the pending
I/Os at roughly the same time as the rescind.

> 
> It looks to me that waiting for opened files after the call to misc_deregister(), but before removing the vmbus channel, is a
> safe approach.

To me, this would be a great solution.  And it closes the original timing
window I pointed out, where some other thread could open the device
after the wait for the file_list to become empty completes, but before
misc_deregister() is called.  With the order swapped, once the file_list
is empty, there's no way for the device to be opened again.  So it solves
the problem with the spin_lock initialization.
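
For clarity, the reordering we're talking about would look roughly like
this (a sketch based on the current az_blob_remove_device(), not the
actual patch):

static void az_blob_remove_device(struct az_blob_device *dev)
{
	/* Deregister first so no new opens can race in ... */
	misc_deregister(&az_blob_misc_device);
#ifdef CONFIG_DEBUG_FS
	debugfs_remove_recursive(az_blob_debugfs_root);
#endif
	/* ... then wait for the files that are already open to be released */
	wait_event(dev->file_wait, list_empty(&dev->file_list));
	/* At this point, we won't get any requests from user-mode */
}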

> 
> If the VSP is behaving correctly, the rescind process should not block for too long. If we want to deal with a buggy VSP that
> takes forever to release a resource, we would need to create a dedicated work queue for rescind handling.

I don't think we need to deal with a buggy VSP holding memory for
an excessively long time.

Michael

> > >
> > > This is very rare, but I agree such things happen and we should make sure
> > > the driver can handle it. I'll update the driver.
> > >
> > > Long
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > +	INIT_LIST_HEAD(&az_blob_dev.file_list);
> > > > > > > +	init_waitqueue_head(&az_blob_dev.file_wait);
> > > > > > > +
> > > > > > > +	az_blob_dev.removing = false;
> > > > > > > +
> > > > > > > +	az_blob_dev.device = device;
> > > > > > > +	device->channel->rqstor_size = device_queue_depth;
> > > > > > > +
> > > > > > > +	ret = vmbus_open(device->channel, ring_size, ring_size,
> > > > > > > +			 NULL, 0, az_blob_on_channel_callback,
> > > > > > > +			 device->channel);
> > > > > > > +
> > > > > > > +	if (ret) {
> > > > > > > +		az_blob_err("failed to connect to VSP ret %d\n", ret);
> > > > > > > +		return ret;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	hv_set_drvdata(device, &az_blob_dev);
> > > > > > > +
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void az_blob_remove_vmbus(struct hv_device *device)
> > > > > > > +{
> > > > > > > +	/* At this point, no VSC/VSP traffic is possible over vmbus */
> > > > > > > +	hv_set_drvdata(device, NULL);
> > > > > > > +	vmbus_close(device->channel);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int az_blob_probe(struct hv_device *device,
> > > > > > > +			const struct hv_vmbus_device_id *dev_id)
> > > > > > > +{
> > > > > > > +	int rc;
> > > > > > > +
> > > > > > > +	az_blob_dbg("probing device\n");
> > > > > > > +
> > > > > > > +	rc = az_blob_connect_to_vsp(device, az_blob_ringbuffer_size);
> > > > > > > +	if (rc) {
> > > > > > > +		az_blob_err("error connecting to VSP rc %d\n", rc);
> > > > > > > +		return rc;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	// create user-mode client library facing device
> > > > > > > +	rc = az_blob_create_device(&az_blob_dev);
> > > > > > > +	if (rc) {
> > > > > > > +		az_blob_remove_vmbus(device);
> > > > > > > +		return rc;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	az_blob_dbg("successfully probed device\n");
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int az_blob_remove(struct hv_device *dev)
> > > > > > > +{
> > > > > > > +	struct az_blob_device *device = hv_get_drvdata(dev);
> > > > > > > +	unsigned long flags;
> > > > > > > +
> > > > > > > +	spin_lock_irqsave(&device->file_lock, flags);
> > > > > > > +	device->removing = true;
> > > > > > > +	spin_unlock_irqrestore(&device->file_lock, flags);
> > > > > > > +
> > > > > > > +	az_blob_remove_device(device);
> > > > > > > +	az_blob_remove_vmbus(dev);
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static struct hv_driver az_blob_drv = {
> > > > > > > +	.name = KBUILD_MODNAME,
> > > > > > > +	.id_table = id_table,
> > > > > > > +	.probe = az_blob_probe,
> > > > > > > +	.remove = az_blob_remove,
> > > > > > > +	.driver = {
> > > > > > > +		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
> > > > > > > +	},
> > > > > > > +};
> > > > > > > +
> > > > > > > +static int __init az_blob_drv_init(void)
> > > > > > > +{
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	ret = vmbus_driver_register(&az_blob_drv);
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void __exit az_blob_drv_exit(void)
> > > > > > > +{
> > > > > > > +	vmbus_driver_unregister(&az_blob_drv);
> > > > > > > +}
