linux-kernel - Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user

Message-ID: <alpine.LRH.2.02.2009240853200.3485@file01.intranet.prod.int.rdu2.redhat.com>
Date:   Thu, 24 Sep 2020 11:00:20 -0400 (EDT)
From:   Mikulas Patocka <mpatocka@...hat.com>
To:     Matthew Wilcox <willy@...radead.org>
cc:     Dave Chinner <david@...morbit.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vishal Verma <vishal.l.verma@...el.com>,
        Dave Jiang <dave.jiang@...el.com>,
        Ira Weiny <ira.weiny@...el.com>, Jan Kara <jack@...e.cz>,
        Eric Sandeen <esandeen@...hat.com>,
        Dave Chinner <dchinner@...hat.com>,
        "Kani, Toshi" <toshi.kani@....com>,
        "Norton, Scott J" <scott.norton@....com>,
        "Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)" 
        <rajesh.tadakamadla@....com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>
Subject: Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols
 __copy_user_flushcache and __copy_from_user_flushcache)



On Tue, 22 Sep 2020, Matthew Wilcox wrote:

> > > The NVFS indirect block tree has a fan-out of 16,
> > 
> > No. The top level in the inode contains 16 blocks (11 direct and 5 
> > indirect). And each indirect block can have 512 pointers (4096/8). You can 
> > format the device with a larger block size, which increases the fanout 
> > (the NVFS block size must be greater than or equal to the system page size).
> > 
> > 2 levels can map 1 GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> > map 256 TiB and 5 levels can map 128 PiB.
> 
> But compare to an unfragmented file ... you can map the entire thing with
> a single entry.  Even if you have to use a leaf node, you can get four
> extents in a single cacheline (and that's a fairly naive leaf node layout;
> I don't know exactly what XFS uses)

But the benchmarks show that NVFS performs comparably to extent-based 
filesystems.
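
The capacity figures quoted above can be checked with a few lines of C. 
This is only a minimal sketch, assuming 4096-byte blocks and 8-byte block 
pointers as stated in the thread; the function name is hypothetical and is 
not taken from the NVFS source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes addressable through `levels` levels of indirect blocks.
 * Hypothetical helper for illustration only, not NVFS code.
 * Each indirect block holds block_size / ptr_size pointers, so every
 * extra level multiplies the mapped size by that fanout.
 */
static uint64_t tree_capacity(unsigned levels, uint64_t block_size,
                              uint64_t ptr_size)
{
    assert(ptr_size != 0 && block_size >= ptr_size);

    uint64_t fanout = block_size / ptr_size;   /* 4096 / 8 = 512 */
    uint64_t bytes = block_size;               /* one data block */

    for (unsigned i = 0; i < levels; i++)
        bytes *= fanout;                       /* caller keeps this below 2^64 */

    return bytes;
}
```

With 4096-byte blocks, tree_capacity(2, 4096, 8) is 2^30 bytes (1 GiB) and 
tree_capacity(5, 4096, 8) is 2^57 bytes (128 PiB), matching the numbers 
above. A larger block size raises both the fanout and the leaf size.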

> > > Rename is another operation that has specific "operation has atomic
> > > behaviour" expectations. I haven't looked at how you've
> > > implemented that yet, but I suspect it also is extremely difficult
> > > to implement in an atomic manner using direct pmem updates to the
> > > directory structures.
> > 
> > There is a small window when the renamed inode is neither in the source 
> > nor in the target directory. Fsck will reclaim such an inode and add it 
> > to lost+found - just like on EXT2.
> 
> ... ouch.  If you have to choose, it'd be better to link it to the second
> directory then unlink it from the first one.  Then your fsck can detect
> it has the wrong count and fix up the count (ie link it into both
> directories rather than neither).

I admit that this is lame and I'll fix it. Rename is not so 
performance-critical, so I can add a small journal for this.
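
For reference, the link-then-unlink ordering suggested in the quoted text 
can be modelled with a toy in-memory example. This is a sketch under heavy 
assumptions: every name below is hypothetical, each directory has a single 
slot, and none of it reflects the NVFS on-media format or the journal 
mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical types, not NVFS structures. */
struct toy_inode { int nlink; };
struct toy_dir   { struct toy_inode *entry; };  /* one slot per directory */

enum crash_point { NO_CRASH, CRASH_AFTER_LINK };

/*
 * Rename by linking into dst first and unlinking from src second.
 * Returns false if the simulated crash hits between the two steps;
 * in that case the inode is reachable from both directories.
 */
static bool toy_rename(struct toy_dir *src, struct toy_dir *dst,
                       enum crash_point cp)
{
    dst->entry = src->entry;        /* step 1: link into the target */
    if (cp == CRASH_AFTER_LINK)
        return false;
    src->entry = NULL;              /* step 2: unlink from the source */
    return true;
}

/* fsck-style pass: count directory references and repair nlink. */
static int toy_fsck_fix_nlink(struct toy_inode *ino,
                              struct toy_dir **dirs, size_t ndirs)
{
    int refs = 0;

    for (size_t i = 0; i < ndirs; i++)
        if (dirs[i]->entry == ino)
            refs++;
    ino->nlink = refs;
    return refs;
}
```

In this model a crash between the two steps leaves the inode with two 
references and a stale link count of 1, which the fsck pass corrects to 2. 
The file is never orphaned into lost+found.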

> > If you think that the lack of journaling is a show-stopper, I can 
> > implement it. But then, I'll have something that has the complexity of 
> > EXT4 and the performance of EXT4, so there will no longer be any reason 
> > to use NVFS over EXT4. Without journaling, it will be faster than EXT4 
> > and it may attract some users who want good performance and who don't 
> > care about GID and UID being updated atomically, etc.
> 
> Well, what's your intent with nvfs?  Do you already have customers in mind
> who want to use this in production, or is this somewhere to play with and
> develop concepts that might make it into one of the longer-established
> filesystems?

I develop it just because I thought it might be interesting. So far, it 
doesn't have any serious users (the physical format is still changing). I 
hope that it could be usable as a general-purpose root filesystem when 
Optane DIMMs become common.

Mikulas