linux-kernel - Re: [PATCH 00/19] fs: rework and optimize i

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171214151411.GC9205@fieldses.org>
Date:   Thu, 14 Dec 2017 10:14:11 -0500
From:   "J. Bruce Fields" <bfields@...ldses.org>
To:     Jeff Layton <jlayton@...nel.org>
Cc:     Dave Chinner <david@...morbit.com>, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, hch@....de, neilb@...e.de,
        amir73il@...il.com, jack@...e.de, viro@...iv.linux.org.uk
Subject: Re: [PATCH 00/19] fs: rework and optimize i_version handling in
 filesystems

On Thu, Dec 14, 2017 at 09:14:47AM -0500, Jeff Layton wrote:
> On Wed, 2017-12-13 at 19:02 -0500, Jeff Layton wrote:
> > On Thu, 2017-12-14 at 10:03 +1100, Dave Chinner wrote:
> > > On Wed, Dec 13, 2017 at 03:14:28PM -0500, Jeff Layton wrote:
> > > > On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > > > > This is great, thanks.
> > > > > 
> > > > > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > > > > With this, we reduce inode metadata updates across all 3 filesystems
> > > > > > down to roughly the frequency of the timestamp granularity, particularly
> > > > > > when it's not being queried (the vastly common case).
> > > > > > 
> > > > > > The pessimal workload here is 1 byte writes, and it helps that
> > > > > > significantly. Of course, that's not what we'd consider a real-world
> > > > > > workload.
> > > > > > 
> > > > > > A tiobench-example.fio workload also shows some modest performance
> > > > > > gains, and I've gotten mails from the kernel test robot that show some
> > > > > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > > > > the vm-scalability testsuite to be specific), with an earlier version of
> > > > > > this set.
> > > > > > 
> > > > > > With larger writes, the gains with this patchset mostly vaporize,
> > > > > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > > > > 
> > > > > > I'm happy to run other workloads if anyone can suggest them.
> > > > > > 
> > > > > > At this point, the patchset works and does what it's expected to do in
> > > > > > my own testing. It seems like it's at least a modest performance win
> > > > > > across all 3 major disk-based filesystems. It may also encourage others
> > > > > > to implement i_version as well since it reduces the cost.
> > > > > 
> > > > > Do you have an idea what the remaining cost is?
> > > > > 
> > > > > Especially in the ext4 case, are you still able to measure any
> > > > > difference in performance between the cases where i_version is turned on
> > > > > and off, after these patches?
> > > > 
> > > > Attached is a fio jobfile + the output from 3 different runs using it
> > > > with ext4. This one is using 4k writes. There was no querying of
> > > > i_version during the runs.  I've done several runs with each and these
> > > > are pretty representative of the results:
> > > > 
> > > > old = 4.15-rc3, i_version enabled
> > > > ivers = 4.15-rc3 + these patches, i_version enabled
> > > > noivers = 4.15-rc3 + these patches, i_version disabled
> > > > 
> > > > To snip out the run status lines:
> > > > 
> > > > old:
> > > > WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> > > > 
> > > > ivers:
> > > > WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> > > > 
> > > > noivers:
> > > > WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> > > > 
> > > > So, I see some performance degradation with -o iversion compared to not
> > > > having it enabled (maybe due to the extra atomic fetches?), but this set
> > > > erases most of the difference.
> > > 
> > > So what is the performance difference when something is actively
> > > querying the i_version counter as fast as it can (e.g. file being
> > > constantly stat()d via NFS whilst being modified)? How does the
> > > performance compare to the old code in that case?
> > > 
> > 
> > I haven't benchmarked that with the latest set, but I did with the set
> > that I posted around a year ago. Basically I just ran a similar test to
> > this, and had another shell open doing statx(..., STATX_VERSION, ...);
> > the thing in a tight loop.
> > 
> > I did see some performance hit vs. the case where no one is viewing it,
> > but it was still significantly faster than the unpatched version that
> > was incrementing the counter every time.
> > 
> > That was on a different test rig, and the patchset has some small
> > differences now. I'll see if I can get some hard numbers with such a
> > testcase soon.
> 
> I reran the test on xfs, with another shell running test-statx to query
> the i_version value in a tight loop:
> 
>     WRITE: bw=110MiB/s (115MB/s), 13.6MiB/s-14.0MiB/s (14.2MB/s-14.7MB/s), io=64.5GiB (69.3GB), run=600001-600001msec
> 
> ...contrast that with the run where I was not doing any queries:
> 
>    WRITE: bw=129MiB/s (136MB/s), 15.8MiB/s-16.6MiB/s (16.6MB/s-17.4MB/s), io=75.9GiB (81.5GB), run=600001-600001msec
> 
> ...vs the unpatched kernel:
> 
>    WRITE: bw=86.7MiB/s (90.0MB/s), 9689KiB/s-11.7MiB/s (9921kB/s-12.2MB/s), io=50.8GiB (54.6GB), run=600001-600002msec
> 
> There is some clear peformance impact when you are running frequent
> queries of the i_version.
> 
> My gut feeling is that you could probably make the new code perform
> worse than the old if you were to _really_ hammer the inode with queries
> for the i_version (probably from many threads in parallel) while doing a
> lot of small writes to it.
> 
> That'd be a pretty unusual workload though.

It may be pretty common for NFS itself: if I'm understanding the client
code right (mainly nfs4_write_need_cache_consistency()), our client will
request the change attribute in every WRITE that isn't a pNFS write, an
O_DIRECT write, or associated with a delegation.

The goal of this series isn't to improve NFS performance, it's to save
non-NFS users from paying a performance penalty for something that NFS
requires for correctness.  Probably this series doesn't make much
difference in the NFS write case, and that's fine.  Still, might be
worth confirming that a workload with lots of small NFS writes is mostly
unaffected.

--b.