linux-kernel - Re: fuse scalability part 1


Date:	Wed, 23 Sep 2015 18:13:23 -0700
From:	Ashish Samant <ashish.samant@...cle.com>
To:	Miklos Szeredi <miklos@...redi.hu>,
	fuse-devel@...ts.sourceforge.net
CC:	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	Srinivas Eeda <srinivas.eeda@...cle.com>
Subject: Re: fuse scalability part 1


On 05/18/2015 08:13 AM, Miklos Szeredi wrote:
> This part splits out an "input queue" and a "processing queue" from the
> monolithic "fuse connection", each of those having their own spinlock.
>
> The end of the patchset adds the ability to "clone" a fuse connection.  This
> means, that instead of having to read/write requests/answers on a single fuse
> device fd, the fuse daemon can have multiple distinct file descriptors open.
> Each of those can be used to receive requests and send answers, currently the
> only constraint is that a request must be answered on the same fd as it was read
> from.
>
> This can be extended further to allow binding a device clone to a specific CPU
> or NUMA node.
>
> Patchset is available here:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> Libfuse patches adding support for "clone_fd" option:
>
>    git://git.code.sf.net/p/fuse/fuse clone_fd
>
> Thanks,
> Miklos
>
>
We did some performance testing without these patches and with these
patches (with the -o clone_fd option specified). Sorry for the delay in
getting these done. We did 2 types of tests:

1. Throughput test: We ran parallel dd tests reading from and writing to
a FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
performance here is almost equal to that of the per-NUMA patches we
submitted a while back.

1) Writes to single mount

dd processes    throughput            throughput
in parallel     (without patchset)    (with patchset)

4               633 Mb/s              606 Mb/s
8               583.2 Mb/s            561.6 Mb/s
16              436 Mb/s              640.6 Mb/s
32              500.5 Mb/s            718.1 Mb/s
64              440.7 Mb/s            1276.8 Mb/s
128             526.2 Mb/s            2343.4 Mb/s

2) Reading from single mount

dd processes    throughput            throughput
in parallel     (without patchset)    (with patchset)

4               1171 Mb/s             1059 Mb/s
8               1626 Mb/s             677 Mb/s
16              1014 Mb/s             2240.6 Mb/s
32              807.6 Mb/s            2512.9 Mb/s
64              985.8 Mb/s            2870.3 Mb/s
128             1355 Mb/s             2996.5 Mb/s
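To make the scaling easier to read, the relative change per row can be computed directly from the throughput figures reported above. This is only a small sketch: the numbers are copied from the two tables, and the names WRITES, READS and speedups are made up for illustration.

```python
# dd processes -> (throughput without patchset, throughput with patchset),
# in Mb/s as reported in the two tables above.
WRITES = {
    4: (633.0, 606.0),
    8: (583.2, 561.6),
    16: (436.0, 640.6),
    32: (500.5, 718.1),
    64: (440.7, 1276.8),
    128: (526.2, 2343.4),
}
READS = {
    4: (1171.0, 1059.0),
    8: (1626.0, 677.0),
    16: (1014.0, 2240.6),
    32: (807.6, 2512.9),
    64: (985.8, 2870.3),
    128: (1355.0, 2996.5),
}


def speedups(table):
    """Return {dd processes: throughput with patchset / throughput without}."""
    return {n: patched / base for n, (base, patched) in table.items()}


if __name__ == "__main__":
    for name, table in (("write", WRITES), ("read", READS)):
        for n, ratio in speedups(table).items():
            print(f"{name:5s} {n:4d} dd processes: {ratio:.2f}x")
```

A ratio below 1 means the run with the patchset was slower for that row, which is the case at 4 and 8 dd processes for both reads and writes; from 16 processes up the ratio is above 1 in every row.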



2. Spinlock access times test: We also ran some tests within the kernel
to measure the time spent accessing the spinlocks per request in both
cases. The time taken per request to access the spinlocks in the kernel
code over the lifetime of the request is roughly 30X to 100X lower in
the 2nd case (with patchset) at 16 or more parallel dd processes; the
gap is smaller at 4 and 8 processes.


dd processes    Time/req              Time/req
in parallel     (without patchset)    (with patchset)

4               0.025 ms              0.00685 ms
8               0.174 ms              0.0071 ms
16              0.9825 ms             0.0115 ms
32              2.4965 ms             0.0315 ms
64              4.8335 ms             0.071 ms
128             5.972 ms              0.1812 ms
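The per-row improvement in spinlock time can be checked with a few lines of arithmetic on the figures above. This is a sketch only; the name SPINLOCK_MS and the helper improvement are made up for illustration.

```python
# dd processes -> (time/req without patchset, time/req with patchset),
# in milliseconds as reported in the table above.
SPINLOCK_MS = {
    4: (0.025, 0.00685),
    8: (0.174, 0.0071),
    16: (0.9825, 0.0115),
    32: (2.4965, 0.0315),
    64: (4.8335, 0.071),
    128: (5.972, 0.1812),
}


def improvement(table):
    """Return {dd processes: time without patchset / time with patchset}."""
    return {n: base / patched for n, (base, patched) in table.items()}


if __name__ == "__main__":
    for n, ratio in improvement(SPINLOCK_MS).items():
        print(f"{n:4d} dd processes: {ratio:.1f}x less time in spinlocks")
```

The ratio is smallest at 4 dd processes, peaks at 16, and then declines towards 128 as the time with the patchset starts to grow as well.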

In conclusion, splitting of fc->lock into multiple locks and splitting 
the request queues definitely helps performance.
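As a rough illustration of what the split means structurally, here is a toy userspace model, not kernel code. It assumes one shared input queue with its own lock and one processing queue per device fd, each with its own lock, which is one reading of the same-fd constraint in the patchset description. All class and method names are hypothetical.

```python
import threading
from collections import deque


class InputQueue:
    """Requests waiting to be read by the daemon; guarded by its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.pending = deque()

    def enqueue(self, req_id, payload):
        with self.lock:
            self.pending.append((req_id, payload))

    def dequeue(self):
        with self.lock:
            return self.pending.popleft() if self.pending else None


class ProcessingQueue:
    """Requests already read on one device fd and awaiting an answer."""

    def __init__(self):
        self.lock = threading.Lock()
        self.in_flight = {}

    def add(self, req_id, payload):
        with self.lock:
            self.in_flight[req_id] = payload

    def complete(self, req_id):
        with self.lock:
            return self.in_flight.pop(req_id)  # KeyError if not read here


class Device:
    """Stands in for one (possibly cloned) fuse device fd."""

    def __init__(self, input_queue):
        self.iq = input_queue          # shared between all devices
        self.pq = ProcessingQueue()    # private to this device

    def read_request(self):
        item = self.iq.dequeue()
        if item is None:
            return None
        req_id, payload = item
        self.pq.add(req_id, payload)
        return req_id

    def write_answer(self, req_id):
        return self.pq.complete(req_id)
```

In this model, answering a request only takes the lock of the device it was read on, so answers on different devices never contend with each other or with new requests being queued.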

Thanks,
Ashish