linux-ext4 - Re: direct I/O: ext4 seems to not honor RWF


Message-ID: <877ckio5y8.fsf@x1.mail-host-address-is-not-set>
Date: Tue, 09 Jan 2024 15:57:19 +0000
From: Free Ekanayaka <free.ekanayaka@...il.com>
To: Jan Kara <jack@...e.cz>, Dave Chinner <david@...morbit.com>
Cc: linux-ext4@...r.kernel.org
Subject: Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is
 disabled

Jan Kara <jack@...e.cz> writes:

[...]
>> I suspect correct crash recovery behaviour here requires
>> multiple cache flushes to ensure the correct ordering of data vs
>> metadata updates. i.e:
>> 
>> 	....
>> 	data write completes
>> 	fdatasync()
>> 	  cache flush to ensure data is on disk
>> 	  if (dirty metadata) {
>> 		issue metadata write(s) for extent records and inode
>> 		....
>> 		metadata write(s) complete
>> 		cache flush to ensure metadata is on disk
>> 	  }
>> 
>> If we don't flush the cache between the data write and the metadata
>> write(s) that marks the extent as written, we could have a state
>> after a power fail where the metadata writes hit the disk
>> before the data write, and after the system comes back up that file
>> now exposes stale data to the user.
>
> So when we are journalling, we end up doing this (we flush data disk before
> writing and flushing the transaction commit block in jbd2). When we are not
> doing journalling (which is the case here), our crash consistency
> guarantees are pretty weak. We want to guarantee that if fsync(2)
> successfully completed on the file before the crash, user should see the
> data there. But not much more - i.e., stale data exposure in case of crash
> is fully within what sysadmin should expect from a filesystem without a
> journal.

Right, which is exactly the tradeoff I need: weaker guarantees for lower
latency.

All I need is for RWF_DSYNC to hold up the promise that once I see a
successful io_uring completion entry, I can be sure that the data has
made it to disk and would survive a power loss.
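
To make the expectation concrete, here is a minimal sketch of the same
per-write durability request made synchronously from Python, assuming
Linux and Python 3.7 or later. It uses os.pwritev() as a stand-in for
the io_uring submission discussed in this thread; the helper name
durable_pwrite and the fallback to fdatasync() are illustrative
assumptions, not part of the original setup.

```python
import errno
import os


def durable_pwrite(fd, payload, offset):
    """Write payload at offset and return once the kernel reports it durable.

    Hypothetical helper: tries a per-write RWF_DSYNC first and falls back to
    pwrite() + fdatasync() where the flag is unavailable or rejected.
    """
    dsync = getattr(os, "RWF_DSYNC", None)
    written = None
    if dsync is not None:
        try:
            # RWF_DSYNC gives this single write O_DSYNC-like semantics.
            written = os.pwritev(fd, [payload], offset, dsync)
        except OSError as exc:
            if exc.errno not in (errno.EOPNOTSUPP, errno.ENOSYS, errno.EINVAL):
                raise
    if written is None:
        # Fallback path: plain write followed by an explicit data sync.
        written = os.pwrite(fd, payload, offset)
        os.fdatasync(fd)
    if written != len(payload):
        # A real log writer would retry the remainder; the sketch just fails.
        raise OSError(errno.EIO, "short write")
    return written
```

The caller treats a normal return as the equivalent of a successful
completion entry: only then is the entry considered persisted.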

> After all even if we improved fsync(2) as you suggest, we'd still
> have normal page writeback where we'd have to separate data & metadata
> writes with cache flushes and I don't think the performance overhead is
> something people would be willing to pay.
>
> So yes, nojournal mode is unsafe in case of crash. It is there for people
> not caring about the filesystem after the crash, single user filesystems
> doing data verification in userspace and similar special usecases. Still, I
> think we want at least minimum fsync(2) guarantees if nothing else for
> backwards compatibility with ext2.

I'm indeed doing data verification in user space. As said, the file has
been pre-allocated with posix_fallocate() and fsync'ed (along with its
directory), so no metadata changes will occur, just the bare write.
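
The preparation step described above can be sketched roughly as
follows, assuming Linux and Python's os module. The function name
preallocate_log and the file mode are hypothetical; the sketch only
mirrors the sequence named in this message (allocate, fsync the file,
fsync its directory) and does not show what ext4 does internally.

```python
import os


def preallocate_log(path, size):
    """Create a log file of the given size and sync it and its directory.

    Hypothetical helper; returns an open file descriptor for later writes.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve the blocks up front so later writes do not extend the file.
        os.posix_fallocate(fd, 0, size)
        # Persist the file's data and metadata (size, allocation).
        os.fsync(fd)
        # Persist the directory entry that names the file.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    except BaseException:
        os.close(fd)
        raise
    return fd
```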

FWIW the use case is writing the log for an implementation of the Raft
consensus algorithm. So basically a series of sequential writes.
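
As an illustration of what user-space verification of such a
sequential log could look like, here is a minimal sketch. The record
framing (a little-endian length and CRC32 header) and the names
encode_record and scan_log are assumptions made for this example only;
they are not the format used by the implementation discussed here.

```python
import struct
import zlib

# Hypothetical framing: 4-byte payload length, 4-byte CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload):
    """Frame one log entry as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan_log(buf):
    """Return the payloads of all valid records from the start of buf.

    Scanning stops at the first record that is empty, truncated, or fails
    its checksum, so stale or never-written regions are not returned.
    """
    records = []
    pos = 0
    while pos + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        payload = bytes(buf[start:start + length])
        # length == 0 is rejected explicitly: a zero-filled preallocated
        # region would otherwise pass, since the CRC32 of b"" is 0.
        if length == 0 or len(payload) != length or zlib.crc32(payload) != crc:
            break
        records.append(payload)
        pos = start + length
    return records
```

After a crash, a reader following this approach would replay only the
entries that scan_log returns and treat everything after the first
invalid record as not written.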