Message-ID: <alpine.LRH.2.02.2009191336380.3478@file01.intranet.prod.int.rdu2.redhat.com>
Date:   Mon, 21 Sep 2020 12:20:42 -0400 (EDT)
From:   Mikulas Patocka <mpatocka@...hat.com>
To:     Dan Williams <dan.j.williams@...el.com>
cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vishal Verma <vishal.l.verma@...el.com>,
        Dave Jiang <dave.jiang@...el.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
        Eric Sandeen <esandeen@...hat.com>,
        Dave Chinner <dchinner@...hat.com>,
        "Kani, Toshi" <toshi.kani@....com>,
        "Norton, Scott J" <scott.norton@....com>,
        "Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)" 
        <rajesh.tadakamadla@....com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>
Subject: NVFS XFS metadata (was: [PATCH] pmem: export the symbols
 __copy_user_flushcache and __copy_from_user_flushcache)

On Wed, 16 Sep 2020, Mikulas Patocka wrote:

> 
> 
> On Wed, 16 Sep 2020, Dan Williams wrote:
> 
> > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@...hat.com> wrote:
> > >
> > > > My first question about nvfs is how it compares to a daxfs with
> > > > executables and other binaries configured to use page cache with the
> > > > new per-file dax facility?
> > >
> > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > because it doesn't have the overhead of the buffer cache and bios. See
> > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> > 
> > ...and that metadata problem is intractable upstream? Christoph poked
> > at bypassing the block layer for xfs metadata operations [1], I just
> > have not had time to carry that further.
> > 
> > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > he's dropped that branch from his xfs.git
> 
> XFS is very big. I wanted to create something small.

Another difference is that XFS metadata are optimized for disks 
and SSDs.

On disks and SSDs, reading one byte is as costly as reading a full block, 
so we must pack as much information into a block as possible. XFS uses 
b+trees for file block mapping and for directories - it is a reasonable 
decision because b+trees minimize the number of disk accesses.

On persistent memory, each access has its own cost, so NVFS uses metadata 
structures that minimize the number of cache lines accessed (rather than 
the number of blocks accessed). For block mapping, NVFS uses the classic 
Unix direct/indirect blocks - if a file block is mapped by a third-level 
indirect block, we do just three memory accesses and we are done. If we 
used b+trees, the number of accesses would be much larger than 3 (we would 
have to do a binary search in the b+tree nodes).
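
A minimal sketch of the direct/indirect addressing described above, to show 
why the lookup cost is bounded by the indirection depth. The constants 
(12 direct pointers, 512 pointers per indirect block) and the names 
blk_path and blk_path_for are hypothetical and are not taken from the NVFS 
sources; the actual on-media layout may differ.

```c
#include <stdint.h>
#include <stddef.h>

#define N_DIRECT      12u   /* hypothetical: direct pointers held in the inode */
#define PTRS_PER_BLK  512u  /* hypothetical: 4096-byte block / 8-byte pointer */

struct blk_path {
    int depth;        /* 0 = direct, 1..3 = number of indirect blocks read */
    uint32_t idx[3];  /* depth 0: direct slot in idx[0];
                         depth d: index at each indirect level, top first */
};

/* Compute which pointers must be followed to reach file block n.
 * Returns 0 on success, -1 if n lies beyond the triple-indirect range. */
int blk_path_for(uint64_t n, struct blk_path *p)
{
    uint64_t span = PTRS_PER_BLK;  /* blocks covered by the current level */
    int depth;

    if (n < N_DIRECT) {
        p->depth = 0;
        p->idx[0] = (uint32_t)n;
        return 0;
    }
    n -= N_DIRECT;

    for (depth = 1; depth <= 3; depth++) {
        if (n < span) {
            p->depth = depth;
            /* Split n into base-PTRS_PER_BLK digits, lowest level last. */
            for (int i = depth - 1; i >= 0; i--) {
                p->idx[i] = (uint32_t)(n % PTRS_PER_BLK);
                n /= PTRS_PER_BLK;
            }
            return 0;
        }
        n -= span;
        span *= PTRS_PER_BLK;
    }
    return -1;
}
```

Each index is obtained by arithmetic alone, so a block mapped at depth 3 
costs exactly three pointer reads from indirect blocks, with no searching 
inside a node.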

The same goes for directories - NVFS hashes the file name and uses a 
radix tree to locate the directory page where the directory entry is 
located. XFS b+trees would result in many more accesses than the radix tree.
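
A minimal sketch of a hash-plus-radix-tree directory lookup of the kind 
described above. Everything here is an assumption for illustration: the 
FNV-1a hash, the 6-bit fan-out, the two-level depth, and the names 
radix_node, name_hash and dir_page_lookup are hypothetical and do not come 
from the NVFS sources.

```c
#include <stdint.h>
#include <stddef.h>

#define RADIX_BITS    6u                   /* hypothetical fan-out: 64 slots */
#define RADIX_SLOTS   (1u << RADIX_BITS)
#define RADIX_LEVELS  2u                   /* hypothetical fixed depth */

struct radix_node {
    void *slot[RADIX_SLOTS];  /* child node, or directory page at last level */
};

/* 32-bit FNV-1a, used here only as a stand-in hash function. */
uint32_t name_hash(const char *name)
{
    uint32_t h = 2166136261u;
    while (*name) {
        h ^= (unsigned char)*name++;
        h *= 16777619u;
    }
    return h;
}

/* Follow one slot per level, indexed by successive groups of hash bits
 * taken from the top. Returns the directory page, or NULL if absent. */
void *dir_page_lookup(const struct radix_node *root, uint32_t hash)
{
    const struct radix_node *node = root;

    for (unsigned level = 0; level < RADIX_LEVELS; level++) {
        unsigned shift = 32u - RADIX_BITS * (level + 1u);
        void *next = node->slot[(hash >> shift) & (RADIX_SLOTS - 1u)];

        if (next == NULL)
            return NULL;
        if (level == RADIX_LEVELS - 1u)
            return next;
        node = next;
    }
    return NULL;
}
```

The walk reads one slot per level, so the number of memory accesses needed 
to find the directory page is fixed by the tree depth rather than by a 
search within each node.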

Regarding journaling - NVFS doesn't do it because persistent memory is so 
fast that we can just check the filesystem after a crash. NVFS has a 
multithreaded fsck that can process 3 million inodes per second. XFS does 
journaling (it was a reasonable decision for disks, where fsck took hours) 
and it adds overhead to all filesystem operations.

Mikulas