linux-ext4 - Re: e2fsprogs concurrency questions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <24A05FA3-F618-4469-BC0A-E19A425A0E36@dilger.ca>
Date:   Sun, 19 Sep 2021 04:39:01 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     Theodore Ts'o <tytso@....edu>
Cc:     Eric Blake <eblake@...hat.com>, linux-ext4@...r.kernel.org,
        libguestfs@...hat.com
Subject: Re: e2fsprogs concurrency questions

What I don't understand here is why not just use a network filesystem that is explicitly designed for this task (eg. NFS or Ganesha on to of ext4)?

Cheers, Andreas

> On Sep 18, 2021, at 21:49, Theodore Ts'o <tytso@....edu> wrote:
> 
> On Fri, Sep 17, 2021 at 04:06:55PM -0500, Eric Blake wrote:
>> TL;DR summary: is there documented best practices for parallel access
>> to the same inode of an ext2 filesystem from multiple threads?
>> 
>> First, a meta-question: is there a publicly archived mailing list for
>> questions on e2fsprogs?  The README merely mentions Ted's email
>> address, and http://e2fsprogs.sourceforge.net/ is silent on contact
>> information, although with some googling, I found at least
>> https://patchwork.ozlabs.org/project/linux-ext4/patch/20201205045856.895342-6-tytso@mit.edu/
>> which suggests linux-ext4@...r.kernel.org as a worthwhile list to
>> mention on the web page.
> 
> Yes, discussions and patches relating to e2fsprogs take place on
> linux-ext4@...r.kernel.org.  (Just as xfsprogs patches and discussions
> are sent to linux-xfs@...r.kernel.org.)
> 
>> Now, on to my real reason for writing.  The nbdkit project is using
>> the ext2fs library to provide an ext2/3/4 filter on top of any data
>> being served over NBD (Network Block Device protocol) in userspace:
>> https://libguestfs.org/nbdkit-ext2-filter.1.html
>> 
>> Searching for the word 'thread' or 'concurrent' in libext2fs.info came
>> up with no hits, so I'm going off of minimal documentation, and mostly
>> what I can ascertain from existing examples (of which I'm not seeing
>> very many).
> 
> Historically, libext2fs and e2fsprogs had no pthreads or concurrency
> access at all.  This is because e2fsprogs predates Linux having
> pthreads support at all.
> 
> This is _starting_ to change, but more on that in a little.
> 
>> Right now, the nbdkit filter forces some rather strict serialization
>> in order to be conservatively safe: for every client that wants to
>> connect, the nbdkit filter calls ext2fs_open(), then eventually
>> ext2fs_file_open2(), then exposes the contents of that one extracted
>> file over NBD one operation at a time, then closes everything back
>> down before accepting a second client.  But we'd LOVE to add some
>> parallelization; the NBD protocol allows multiple clients, as well as
>> out-of-order processing of requests from a single client.
>> 
>> Right away, I already know that calling ext2fs_open() more than once
>> on the same file system is a recipe for disaster (it is equivalent to
>> mounting the same block device at once through more than one bare
>> metal OS, and won't work).  So I've got a proposal for how to rework
>> the nbdkit code to open the file system exactly once and share that
>> handle among multiple NBD clients:
>> https://listman.redhat.com/archives/libguestfs/2021-May/msg00028.html
> 
> So you are apparently calling ext2fs_open() before forking, and then
> you want to use the ext2fs handle from separate processes.  Is that
> correct?
> 
> That's not going to work if you are going to try to modify the file
> system from different processes simultaneously.  That's because the
> libext2fs using a writeback cache.  After the fork, each process has
> its own copy of the wrteiback cache.
> 
> If you are using threads, older versions of libext2fs don't do any
> locking before modifying data structures internal to the ext2_fs file
> handle.  So if two threads simultaneously try to use the "ext2_fs fs"
> handle, they might try to access the block allocation bitmap (for
> example) at the same time, without locking, and so bad things will
> happen.
> 
> You can do your own locking to make sure only one thread is trying to
> use the fs handle at a time, at which point you should be fine.  So
> you can have multiple clients accessing the file system without having
> to open the file system, open a file, and then close the file and
> close the file system before accepting the next client.  But only one
> client can be using the ext2_fs handle at a time, and if you want to
> share any libext2fs data structure across multiple threads,
> appropriate read/write locking would be needed.
> 
>> Is it okay to have two concurrent handles open to the same inode, or
>> do I need to implement a hash map on my end so that two NBD clients
>> requesting access to the same file within the ext2 filesystem share a
>> single inode?  If concurrent handles are supported, what mechanism can
>> I use to ensure that a flush performed on one handle will be visible
>> for reading from the other handle, as ext2fs_file_flush does not seem
>> to be strong enough?
> 
> You could have two threads sharing the same file handle, with locking
> so that only one thread is using a file handle at a time.  Also, since
> we don't have an analogue for pread(2) and pwrite(2), each thread
> would have to assume that the fseek position may have changed by some
> other thread, so after it grabbed the file system lock, and then the
> per-file lock, it would need to call ext2fs_file_llseek() to make sure
> file's position is at a known location before calling
> ext2fs_file_read() or ext2fs_file_write().
> 
> (The reason why ext2fs_file_flush() is not strong enough is because
> that will force writeback, but it doesn't invaludate any cached
> information about the file's inode or extent tree structure.  So if
> the file inode or extent tree is getting modified by some other thread
> out from under it, you're going to have a bad time.)
> 
>> Finally, I see with
>> https://patchwork.ozlabs.org/project/linux-ext4/patch/20201205045856.895342-6-tytso@mit.edu/
>> that you recently added EXT2_FLAG_THREADS, as well as
>> CHANNEL_FLAGS_THREADS.  I think it should be fairly straightforward to
>> tweak my nbdkit custom IO manager to advertise CHANNEL_FLAGS_THREADS
>> (as the NBD protocol really DOES support parallel outstanding IO
>> requests), and then add EXT2_FLAG_THREADS into the flags I pss to
>> ext2fs_file_open2(), to try and get ext2fs to take advantage of
>> parallel access to the underlying storage (regardless of whether the
>> clients are parallel coming into ext2fs).  Are there any concurrency
>> issues I should be aware of on that front when updating my code?
> 
> So this is the _beginning_ of adding threaded support into libext2fs.
> At the moment, we now have locking for the unix_io.c data structures.
> This allows multiple threads to safely do read-only operations in
> parallel.  But this is *all* that it allows.
> 
> This was implemented as part of preparatory work to do parallel
> e2fsck.  The strategy is that we will have several different threads
> reading from disjoint parts of the file system.  So for example, one
> thread might be reading from block groups 0 -- 100.  Another thread
> might be reading from block groups 101 -- 200.  And so on.  Each
> thread will have its own copy of struct e2fsck_struct, and when they
> are done they will merge their data to the global e2fsck_struct.  If
> there are any inconsistencies that need to be fixed, such that the
> file system needs to be modified, this will require waiting until all
> of the threads are done, or other specialized locking inside e2fsck.
> Of course, in the "happy path", where the file system does not need
> any repairs, we won't need to do any special locking or waiting, since
> the workload will be read-only.
> 
> So we do not have any concurrency support for allocating inodes, or
> allocating blocks, or assigning blocks to an inode's extent tree, etc.
> Nor do we currently have any plans to add more concureency support to
> libext2fs.
> 
> To do this would require a huge amount of effort, and it would also
> require making a lot of changes to the underlying data structures.
> For example, even if we added locking to all of the various data
> structures hanging off of the ext2_fs handle, if two threads tried to
> open the same inode using ext2fs_file_open(), the two file handles are
> completely independent, and there is no way for one thread to do any
> kind of cache invalidation of another thread's file handle after it
> has modified the inode.  The same is true if one thread is using a
> directory iterator while another process tries to modify that
> directory.
> 
>> Obviously, when the kernel accesses an ext2/3/4 file system, it DOES
>> support full concurrency (separate user space processes can open
>> independent handles to the same file....
> 
> Yes, and that's because the kernel was designed with that in mind from
> the beginning.  The ext2fs library was originally designed to support
> programs like e2fsck, mke2fs, and debugfs.  None of these tools
> required concurrency, and as I've mentioned, at the time when
> libext2fs was first implemented, Linux didn't even *have* threads
> support.  So concurrency wasn't even possible, even if it had been
> needed at that time.
> 
>> process must be observed from another).  But nbdkit is all about
>> accessing the data of an ext2 filesystem from userspace, without any
>> kernel bio involvement, and is thus reliant on whatever concurrency
>> guarantees the ext2progs library has (or lacks).
> 
> The e2fsprogs library pretty much doesn't have any concurrency
> guarantees, sorry.  I suspect you could create a layer on top of
> libext2fs which actually implemented a global inode cache ala the
> kernel, so that when two threads call something like ext2fs_iget()
> function, it works like the kernel's iget() function and they get the
> same inode structure, which is reference counted.  Things like the
> directory iterator would have to be changed into something more like
> the functions exported by the kernel VFS layer, which would make this
> layer useless for e2fsck, but it would be more useful for a threaded
> client that wanted concurrent read/write access to the filesystem from
> mulitple threads.
> 
> Cheers,
> 
>                        - Ted