linux-kernel - Re: [PATCH] uio: Replace mutex info_lock with percpu

Message-ID: <YoNPlgoUIzduNXih@kroah.com>
Date:   Tue, 17 May 2022 09:32:38 +0200
From:   Greg KH <gregkh@...uxfoundation.org>
To:     Guixin Liu <kanie@...ux.alibaba.com>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: [PATCH] uio: Replace mutex info_lock with percpu_ref

On Tue, May 10, 2022 at 01:50:31PM +0800, Guixin Liu wrote:
> If the underlying driver works in parallel, the mutex info_lock in uio
> will force the driver to work sequentially, which becomes a performance
> bottleneck. Let's replace it with percpu_ref for better performance.
> 
> Use tcm_loop and tcmu (the backstore is a file, and I did some work to
> make tcmu work in parallel in the uio_write() path) to evaluate performance.
> fio job: fio -filename=/dev/sdb  -direct=1 -size=2G -name=1 -thread
> -runtime=60 -time_based  -rw=randread -numjobs=16 -iodepth=16 -bs=128k
> 
> Without this patch:
> 	READ: bw=2828MiB/s (2965MB/s), 176MiB/s-177MiB/s (185MB/s-186MB/s), 
> io=166GiB (178GB), run=60000-60001msec
> 
> With this patch:
> 	READ: bw=3382MiB/s (3546MB/s), 211MiB/s-212MiB/s (221MB/s-222MB/s), 
> io=198GiB (213GB), run=60001-60001msec
> 
> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@...ux.alibaba.com>
> Signed-off-by: Guixin Liu <kanie@...ux.alibaba.com>

Why is UIO being used for a block device?  Why not use a real block
driver instead that can properly handle the locking issues involved
here?



> ---
>  drivers/uio/uio.c          | 95 ++++++++++++++++++++++++++++++++++------------
>  include/linux/uio_driver.h |  5 ++-
>  2 files changed, 75 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
> index 43afbb7..72c16ba 100644
> --- a/drivers/uio/uio.c
> +++ b/drivers/uio/uio.c
> @@ -24,6 +24,8 @@
>  #include <linux/kobject.h>
>  #include <linux/cdev.h>
>  #include <linux/uio_driver.h>
> +#include <linux/completion.h>
> +#include <linux/percpu-refcount.h>
>  
>  #define UIO_MAX_DEVICES		(1U << MINORBITS)
>  
> @@ -218,7 +220,9 @@ static ssize_t name_show(struct device *dev,
>  	struct uio_device *idev = dev_get_drvdata(dev);
>  	int ret;
>  
> -	mutex_lock(&idev->info_lock);
> +	if (!percpu_ref_tryget_live(&idev->info_ref))
> +		return -EINVAL;
> +

You are now just moving the contention to a per-cpu lock, so any
single-cpu load will have the same issue, right?  And your example above
is a single-cpu load, so how is this any faster?  Is the mutex going
across all cpus to sync such a load, so that moving this to a percpu
thing is that much better?

And as you have now split this into one-lock-per-cpu instead of
one-lock-per-device, you just broke the situation where multiple threads
are accessing the same device at the same time, right?

You have also changed the functionality of the driver to force userspace
to handle the case where the lock cannot be taken, whereas previously it
would always work and just delay until it did happen.  What workflow
does that now affect that always assumed that these code paths would
succeed?
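
To make that behaviour change concrete, here is a minimal userspace
sketch in plain C with pthreads.  It is not kernel code: struct demo_dev,
demo_read_blocking() and demo_read_tryget() are hypothetical names made
up for illustration, and the atomic flag only stands in for the "is the
reference still live" check; it does not model percpu_ref itself.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device; not the kernel's struct uio_device. */
struct demo_dev {
	pthread_mutex_t lock;	/* models the old per-device info_lock */
	atomic_bool live;	/* models "the reference is still live" */
	int value;		/* some attribute a reader wants to see */
};

/* Old behaviour: wait as long as needed for the lock, then succeed. */
static int demo_read_blocking(struct demo_dev *d, int *out)
{
	assert(d && out);
	pthread_mutex_lock(&d->lock);
	*out = d->value;
	pthread_mutex_unlock(&d->lock);
	return 0;
}

/* New behaviour: return an error at once if the reference is not live. */
static int demo_read_tryget(struct demo_dev *d, int *out)
{
	assert(d && out);
	if (!atomic_load(&d->live))
		return -EINVAL;	/* the caller now has to handle this */
	*out = d->value;
	return 0;
}
```

A caller that only ever checked for success from the blocking variant
would now see -EINVAL from the tryget variant whenever the flag is
cleared, which is the kind of userspace-visible difference being asked
about.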

Also the kernel test bot found problems with the patch :(

thanks,

greg k-h