lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZWsgFViqdJuC7N1d@yilunxu-OptiPlex-7050>
Date:   Sat, 2 Dec 2023 20:16:21 +0800
From:   Xu Yilun <yilun.xu@...ux.intel.com>
To:     Marco Pagani <marpagan@...hat.com>
Cc:     Moritz Fischer <mdf@...nel.org>, Wu Hao <hao.wu@...el.com>,
        Xu Yilun <yilun.xu@...el.com>, Tom Rix <trix@...hat.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        linux-kernel@...r.kernel.org, linux-fpga@...r.kernel.org
Subject: Re: [RFC PATCH v2 1/2] fpga: add a module owner field to
 fpga_manager and fpga_manager_ops

On Thu, Nov 30, 2023 at 11:42:36AM +0100, Marco Pagani wrote:
> 
> 
> On 2023-11-25 10:11, Xu Yilun wrote:
> > On Fri, Nov 24, 2023 at 05:28:06PM +0100, Marco Pagani wrote:
> >> Add a module *owner field to the fpga_manager_ops and fpga_manager
> >> structs to protect the fpga manager against the unloading of the
> >> low-level control module while someone is holding a reference to the
> >> manager device. Low-level control modules should statically set the
> >> owner field of the fpga_manager_ops struct to THIS_MODULE. Then, when
> >> the manager is registered using fpga_mgr_register(), the value is copied
> >> into the owner field of the fpga_manager struct (that contains the
> >> device context). In this way, the manager can later use it in
> >> fpga_mgr_get() to take the low-level module's refcount. To prevent races
> >> while unloading the low-level control module, fpga_mgr_get() and part of
> >> the fpga_mgr_unregister() methods are protected with a mutex.
> >>
> >> Other changes: move put_device() from __fpga_mgr_get() to fpga_mgr_get()
> >> and of_fpga_mgr_get() to improve code clarity.
> >>
> >> Fixes: 654ba4cc0f3e ("fpga manager: ensure lifetime with of_fpga_mgr_get")
> >> Signed-off-by: Marco Pagani <marpagan@...hat.com>
> >> ---
> >>  drivers/fpga/fpga-mgr.c       | 56 +++++++++++++++++++++++++----------
> >>  include/linux/fpga/fpga-mgr.h |  4 +++
> >>  2 files changed, 44 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/drivers/fpga/fpga-mgr.c b/drivers/fpga/fpga-mgr.c
> >> index 06651389c592..608605d59860 100644
> >> --- a/drivers/fpga/fpga-mgr.c
> >> +++ b/drivers/fpga/fpga-mgr.c
> >> @@ -21,6 +21,8 @@
> >>  static DEFINE_IDA(fpga_mgr_ida);
> >>  static const struct class fpga_mgr_class;
> >>  
> >> +static DEFINE_MUTEX(mgr_lock);
> >> +
> >>  struct fpga_mgr_devres {
> >>  	struct fpga_manager *mgr;
> >>  };
> >> @@ -667,17 +669,15 @@ ATTRIBUTE_GROUPS(fpga_mgr);
> >>  static struct fpga_manager *__fpga_mgr_get(struct device *dev)
> >>  {
> >>  	struct fpga_manager *mgr;
> >> +	struct module *owner;
> >>  
> >>  	mgr = to_fpga_manager(dev);
> >> +	owner = mgr->owner;
> >>  
> >> -	if (!try_module_get(dev->parent->driver->owner))
> >> -		goto err_dev;
> >> +	if (owner && !try_module_get(owner))
> > 
> > No need to test owner == NULL, try_module_get() does this.
> 
> You are right. I'll remove it in the next version.
> 
> > 
> >> +		mgr = ERR_PTR(-ENODEV);
> >>  
> >>  	return mgr;
> >> -
> >> -err_dev:
> >> -	put_device(dev);
> >> -	return ERR_PTR(-ENODEV);
> >>  }
> >>  
> >>  static int fpga_mgr_dev_match(struct device *dev, const void *data)
> >> @@ -693,12 +693,22 @@ static int fpga_mgr_dev_match(struct device *dev, const void *data)
> >>   */
> >>  struct fpga_manager *fpga_mgr_get(struct device *dev)
> >>  {
> >> -	struct device *mgr_dev = class_find_device(&fpga_mgr_class, NULL, dev,
> >> -						   fpga_mgr_dev_match);
> >> +	struct fpga_manager *mgr = ERR_PTR(-ENODEV);
> >> +	struct device *mgr_dev;
> >> +
> >> +	mutex_lock(&mgr_lock);
> >> +
> >> +	mgr_dev = class_find_device(&fpga_mgr_class, NULL, dev, fpga_mgr_dev_match);
> >>  	if (!mgr_dev)
> >> -		return ERR_PTR(-ENODEV);
> >> +		goto out;
> >> +
> >> +	mgr = __fpga_mgr_get(mgr_dev);
> >> +	if (IS_ERR(mgr))
> >> +		put_device(mgr_dev);
> >>  
> >> -	return __fpga_mgr_get(mgr_dev);
> >> +out:
> >> +	mutex_unlock(&mgr_lock);
> >> +	return mgr;
> >>  }
> >>  EXPORT_SYMBOL_GPL(fpga_mgr_get);
> >>  
> >> @@ -711,13 +721,22 @@ EXPORT_SYMBOL_GPL(fpga_mgr_get);
> >>   */
> >>  struct fpga_manager *of_fpga_mgr_get(struct device_node *node)
> >>  {
> >> -	struct device *dev;
> >> +	struct fpga_manager *mgr = ERR_PTR(-ENODEV);
> >> +	struct device *mgr_dev;
> >> +
> >> +	mutex_lock(&mgr_lock);
> >> +
> >> +	mgr_dev = class_find_device_by_of_node(&fpga_mgr_class, node);
> >> +	if (!mgr_dev)
> >> +		goto out;
> >>  
> >> -	dev = class_find_device_by_of_node(&fpga_mgr_class, node);
> >> -	if (!dev)
> >> -		return ERR_PTR(-ENODEV);
> >> +	mgr = __fpga_mgr_get(mgr_dev);
> >> +	if (IS_ERR(mgr))
> >> +		put_device(mgr_dev);
> >>  
> >> -	return __fpga_mgr_get(dev);
> >> +out:
> >> +	mutex_unlock(&mgr_lock);
> >> +	return mgr;
> >>  }
> >>  EXPORT_SYMBOL_GPL(of_fpga_mgr_get);
> >>  
> >> @@ -727,7 +746,7 @@ EXPORT_SYMBOL_GPL(of_fpga_mgr_get);
> >>   */
> >>  void fpga_mgr_put(struct fpga_manager *mgr)
> >>  {
> >> -	module_put(mgr->dev.parent->driver->owner);
> >> +	module_put(mgr->owner);
> >>  	put_device(&mgr->dev);
> >>  }
> >>  EXPORT_SYMBOL_GPL(fpga_mgr_put);
> >> @@ -806,6 +825,7 @@ fpga_mgr_register_full(struct device *parent, const struct fpga_manager_info *in
> >>  
> >>  	mgr->name = info->name;
> >>  	mgr->mops = info->mops;
> >> +	mgr->owner = info->mops->owner;
> >>  	mgr->priv = info->priv;
> >>  	mgr->compat_id = info->compat_id;
> >>  
> >> @@ -888,7 +908,11 @@ void fpga_mgr_unregister(struct fpga_manager *mgr)
> >>  	 */
> >>  	fpga_mgr_fpga_remove(mgr);
> >>  
> >> +	mutex_lock(&mgr_lock);
> >> +
> >>  	device_unregister(&mgr->dev);
> >> +
> >> +	mutex_unlock(&mgr_lock);
> > 
> > Why this part should be protected rather than the whole
> > fpga_mgr_unregister()?
> >
> 
> Protecting the fpga_remove() op seems unnecessary to me because it
> does not affect the manager device's lifetime. Moreover, it may hold
> the mutex for a long time since it was intended to interact with the
> hardware to put it in a specific state before removing the driver.
> 
> > I feel the scope of the protection is unclear to me in this patch. What
> > data should be protected from concurrent access by this mutex? From the
> > code seems the racing of mgr dev should be protected but apparently it
> > doesn't have to.
> 
> The mutex is there to ensure the lifetime of the manager device and
> its context (struct fpga_manager) if fpga_mgr_get() happens to run
> concurrently with the removal of the low-level module.
> 
> > 
> > And with this mutex, the get/put/unregister() for one mgr should be
> > exclusive with another mgr, but that also seems not necessary.
> > 
> 
> I decided to use a static mutex because I thought putting the mutex
> in the manager's context would be unsafe since its life would be tied
> to the manager's life. For instance, consider the following sequence:
> 
> - A removes the low-level control module, and delete_module progresses
> up to the point when it calls the module's exit function, which in turn
> calls fpga_mgr_unregister().
> 
> - fpga_mgr_unregister() takes the mutex but gets descheduled before
> completing the unregistering of the manager device.
> 
> - Meanwhile, B wants to get the manager (it is still there) and calls
> fpga_mgr_get(), which tries to take the mutex but gets suspended since
> it is held by A.
> 
> - A resumes and fpga_mgr_unregister() releases the manager device and

The lifecycle of the manager device is not entirely decided by
fpga_mgr_unregister(), this func just puts/decreases the device
refcount.

Remember fpga_mgr_get() gets the device via
class_find_device()->get_device(). I assume if the valid device pointer
could be returned by class_find_device(), it would never be released by
other nice players. So I have no worry about the per manager mutex.

> its context, including the mutex on which B is suspended.
> 
> We could mitigate this specific case using mutex_trylock(). However,
> there will still be other problematic cases, like if fpga_mgr_get()
> gets suspended right before taking the mutex and then delete_module
> proceeds up to when fpga_mgr_unregister() frees the manager device
> and its context.
> 
> > I think the mgr->owner & mgr->ops should be protected from concurrent
> > access of delete_module & fpga_mgr_get/put(), so how about:
> > 
> > struct fpga_manager_ops {
> > 	struct module *owner;
> > 	...
> > };
> > 
> > struct fpga_manager {
> > 	...
> > 	struct mutex mops_lock;
> > 	const struct fpga_manager_ops *mops;
> > 	...
> > };
> > 
> > 
> > static struct fpga_manager *__fpga_mgr_get(struct device *dev)
> > {
> > 	struct fpga_manager *mgr;
> > 
> > 	mgr = to_fpga_manager(dev);
> > 
> > 	mutex_lock(&mgr->mops_lock);
> > 
> > 	if (!mgr->mops || !try_module_get(mgr->mops->owner))
> > 		mgr = ERR_PTR(-ENODEV);
> > 
> > 	mutex_unlock(&mgr->mops_lock);
> > 		
> > 	return mgr;
> > }
> > 
> > void fpga_mgr_unregister(struct fpga_manager *mgr)
> > {
> > 	fpga_mgr_fpga_remove(mgr);	
> > 
> > 	mutex_lock(&mgr->ops_lock);
> > 	mgr->mops = NULL;
> > 	mutex_unlock(&mgr->ops_lock);
> > 
> > 	device_unregister(&mgr->dev);	
> > }
> > 
> > Not actually tested.
> > 
> 
> I think protecting the only the ops is not enough for the same reasons.
> If fpga_mgr_get() gets suspended right after class_find_device(),and
> meanwhile the low-level module is removed, it resumes with a reference
> to a manager device that no longer exists.
> 
> In a certain sense, however, using a mutex seems like a mitigation
> that does not solve the problem at its root. I honestly still think
> that taking the module's refcount right when registering the manager
> is the only way that is both safe and robust against code changes.

I would nak either. As mentioned above, that makes rmmod vendor module
impossible. Introducing another user interface to release the module's
refcount is also a bad idea. Who decides to take the ref, who releases
it. A user has no knowledge of what is happening inside and should not
enforce.

> However, my proposal was rejected.
> 
> So, if you prefer, I can drop the mutex entirely in the next version,
> and we leave the responsibility of keeping all kernel pieces to the

No, please try to fix it. Could you please reconsider my proposal?

Thanks,
Yilun

> user. It will still be an improvement over taking the low-level
> module's refcount through the parent device's driver pointer.
> 
> Thanks,
> Marco
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ