Message-ID: <alpine.LRH.2.02.2009230445030.1800@file01.intranet.prod.int.rdu2.redhat.com>
Date: Wed, 23 Sep 2020 13:19:42 -0400 (EDT)
From: Mikulas Patocka <mpatocka@...hat.com>
To: Dave Chinner <david@...morbit.com>
cc: Dan Williams <dan.j.williams@...el.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Vishal Verma <vishal.l.verma@...el.com>,
Dave Jiang <dave.jiang@...el.com>,
Ira Weiny <ira.weiny@...el.com>,
Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
Eric Sandeen <esandeen@...hat.com>,
Dave Chinner <dchinner@...hat.com>,
"Kani, Toshi" <toshi.kani@....com>,
"Norton, Scott J" <scott.norton@....com>,
"Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)"
<rajesh.tadakamadla@....com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
linux-nvdimm <linux-nvdimm@...ts.01.org>
Subject: Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols
__copy_user_flushcache and __copy_from_user_flushcache)

On Wed, 23 Sep 2020, Dave Chinner wrote:

> > > dir-test /mnt/test/linux-2.6 63000 1048576
> > > nvfs 6.6s
> > > ext4 dax 8.4s
> > > xfs dax 12.2s
> > >
> > >
> > > dir-test /mnt/test/linux-2.6 63000 1048576 link
> > > nvfs 4.7s
> > > ext4 dax 5.6s
> > > xfs dax 7.8s
> > >
> > > dir-test /mnt/test/linux-2.6 63000 1048576 dir
> > > nvfs 8.2s
> > > ext4 dax 15.1s
> > > xfs dax 11.8s
> > >
> > > Yes, nvfs is faster than both ext4 and XFS on DAX, but it's not a
> > > huge difference - it's not orders of magnitude faster.
> >
> > If I increase the size of the test directory, NVFS is an order of
> > magnitude faster:
> >
> > time dir-test /mnt/test/ 2000000 2000000
> > NVFS: 0m29,395s
> > XFS: 1m59,523s
> > EXT4: 1m14,176s
>
> What happened to NVFS there? The runtime went up by a factor of 5,
> even though the number of ops performed only doubled.

This test is from a different machine (K10 Opteron) than the above test
(Skylake Xeon). I borrowed the Xeon for a short time and I no longer have
access to it.

> > time dir-test /mnt/test/ 8000000 8000000
> > NVFS: 2m13,507s
> > XFS: 14m31,261s
> > EXT4: reports "file 1976882 can't be created: No space left on device",
> > (although there are free blocks and inodes)
> > Is it a bug or expected behavior?
>
> Exponential increase in runtime for a workload like this indicates
> the XFS journal is too small to run large scale operations. I'm
> guessing you're just testing on a small device?

In this test, the pmem device had 64GiB.

I've created a 1TiB ramdisk, formatted it with XFS and ran dir-test 8000000
on it; however, it wasn't much better - it took 14m8,824s.

> In which case, you'd get a 16MB log for XFS, which is tiny and most
> definitely will limit performance of any large scale metadata
> operation. Performance should improve significantly for large scale
> operations with a much larger log, and that should bring the XFS
> runtimes down significantly.

Is there some mkfs.xfs option that can increase log size?

> > If you think that the lack of journaling is a show-stopper, I can
> > implement it.
>
> I did not say that. My comments are about the requirement for
> atomicity of object changes, not journalling. Journalling is an
> -implementation that can provide change atomicity-, it is not a
> design constraint for metadata modification algorithms.
>
> Really, you can choose how to do object updates however you want. What
> I want to review is the design documentation and a correctness proof
> for whatever mechanism you choose to use. Without that information,
> we have absolutely no chance of reviewing the filesystem
> implementation for correctness. We don't need a proof for something
> that uses journalling (because we all know how that works), but for
> something that uses soft updates we most definitely need the proof
> of correctness for the update algorithm before we can determine if
> the implementation is good...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@...morbit.com

I am thinking about this: I can implement lightweight journaling that will
journal just a few writes - I'll allocate some small per-cpu intent log
for that.

For example, in nvfs_rename, we call nvfs_delete_de and nvfs_finish_add -
these functions are very simple, both of them write just one word - so we
can add these two words to the intent log. The same goes for a setattr
requesting a simultaneous uid/gid/mode change - the values are small, so
they'll fit into the intent log well.
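
To make the intent-log idea concrete, here is a minimal sketch in Python. It is a toy model, not NVFS code: pmem is a plain list of words, and the names IntentLog, atomic_update, recover, Crash and crash_after are all hypothetical. It also assumes that each step becomes durable in order, which on real pmem would require a cache flush and a fence between steps.

```python
class Crash(Exception):
    """Raised to simulate power loss in the middle of an update."""


class IntentLog:
    """Toy per-CPU intent log: a few (addr, value) pairs plus a commit flag."""

    def __init__(self):
        self.entries = []
        self.committed = False


def atomic_update(pmem, log, writes, crash_after=None):
    """Apply `writes` (a list of (addr, value)) through the intent log.

    crash_after=N simulates a crash after N durable steps have completed.
    Assumption: every step is durable before the next one starts.
    """
    steps = 0

    def step():
        nonlocal steps
        if crash_after is not None and steps == crash_after:
            raise Crash()
        steps += 1

    step()
    log.entries = list(writes)      # 1. record the intended word writes
    step()
    log.committed = True            # 2. commit point: the update will happen
    for addr, value in writes:      # 3. apply the writes in place
        step()
        pmem[addr] = value
    step()
    log.committed = False           # 4. retire the log record
    log.entries = []


def recover(pmem, log):
    """Run at mount time: replay a committed record, discard anything else."""
    if log.committed:
        for addr, value in log.entries:
            pmem[addr] = value      # replay is idempotent
    log.entries = []
    log.committed = False


# Rename modelled as two one-word writes: fill the new entry, clear the old one.
for crash_after in range(6):
    pmem = [7, 0]                   # word 0: old dirent (inode 7), word 1: new dirent
    log = IntentLog()
    try:
        atomic_update(pmem, log, [(1, 7), (0, 0)], crash_after)
    except Crash:
        pass
    recover(pmem, log)
    print(crash_after, pmem)
```

After recovery, every crash point leaves either the old state or the new state, never a mix of the two. The commit flag is what decides which one.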

Regarding verifiability, I can do this - the writes to pmem are wrapped in
a macro nv_store. So, I can modify this macro so that it logs all
modifications. Then I take the log, cut it at a random point, reorder the
entries (to simulate reordering in the CPU write-combining buffers),
replay it, run nvfsck on it and mount it. This way, we can verify that no
matter where the crash happened, either an old file or a new file is
present in a directory.
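
The log, cut, reorder, replay and check loop could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: LoggedPmem stands in for an instrumented nv_store, check stands in for nvfsck, and the reordering model (stores issued before a fence are durable in program order; unfenced stores may land in any order or not at all) is one plausible simplification, not a statement about real hardware or about NVFS.

```python
import random

FENCE = ("fence",)


class LoggedPmem:
    """Hypothetical stand-in for nv_store instrumented to log every store."""

    def __init__(self):
        self.log = []

    def store(self, addr, value):
        self.log.append(("store", addr, value))

    def fence(self):
        self.log.append(FENCE)


def crash_image(initial, log, rng):
    """Build one possible post-crash image from a logged run."""
    cut = rng.randrange(len(log) + 1)       # crash at a random point in the log
    image = dict(initial)
    pending = []                            # stores issued since the last fence
    for entry in log[:cut]:
        if entry == FENCE:
            for _, addr, value in pending:  # fenced stores are durable
                image[addr] = value
            pending = []
        else:
            pending.append(entry)
    # Unfenced stores: a random subset reaches the media, in random order.
    survivors = [e for e in pending if rng.random() < 0.5]
    rng.shuffle(survivors)
    for _, addr, value in survivors:
        image[addr] = value
    return image


def replace_file(pmem):
    """Toy operation: point a directory entry at a newly written inode."""
    pmem.store("inode2", "new data")  # initialise the new inode
    pmem.fence()                      # it must be durable before it is linked
    pmem.store("dirent", 2)           # one-word switch of the directory entry
    pmem.fence()


def check(image):
    """Stand-in for nvfsck: the entry names the old file or a complete new one."""
    target = image["dirent"]
    return target == 1 or (target == 2 and image["inode2"] == "new data")


initial = {"dirent": 1, "inode1": "old data", "inode2": None}
pmem = LoggedPmem()
replace_file(pmem)
rng = random.Random(0)
ok = all(check(crash_image(initial, pmem.log, rng)) for _ in range(1000))
print("all crash images consistent:", ok)
```

If the fence between the two stores in replace_file is dropped, the same loop can produce an image in which the directory entry already points at inode 2 while the inode is still uninitialised. That is the class of bug such a harness is meant to catch.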

Do you agree with that?

Mikulas