linux-ext4 - Re: [PATCH] [RFC] iomap: Use FUA for pure data O

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZXJPK0BTSs843bmt@dread.disaster.area>
Date: Fri, 8 Dec 2023 10:03:07 +1100
From: Dave Chinner <david@...morbit.com>
To: Theodore Ts'o <tytso@....edu>
Cc: hch@....de, Christian Brauner <brauner@...nel.org>,
	"Darrick J. Wong" <djwong@...nel.org>, Jan Kara <jack@...e.cz>,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] [RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

On Thu, Dec 07, 2023 at 01:50:46AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@...hat.com>
> > 
> > If we are doing direct IO writes with datasync semantics, we often
> > have to flush metadata changes along with the data write. However,
> > if we are overwriting existing data, there are no metadata changes
> > that we need to flush. In this case, optimising the IO by using
> > FUA write makes sense.
> > 
> > We know from teh IOMAP_F_DIRTY flag as to whether a specific inode
> > requires a metadata flush - this is currently used by DAX to ensure
> > extent modi$fication as stable in page fault operations. For direct
> > IO writes, we can use it to determine if we need to flush metadata
> > or not once the data is on disk.
> 
> Hi,
> 
> I've gotten an inquiry from some engineers at Microsoft who would
> really like it if ext4 could use FUA writes when doing O_DSYNC writes,
> since this is soemthing that SQL Server uses.  In the discussion for
> this patch series back in 2018[1], ext4 hadn't yet converted over to
> iomap for Direct I/O, and so adding this feature for ext4 wasn't
> really practical.
> 
> [1] https://lore.kernel.org/all/20180319160650.mavedzwienzgwgqi@quack2.suse.cz/
> 
> Today, ext4 does use iomap for DIO, but an experiment seems to
> indicate that something hasn't been wired up to enable FUA for O_DSYNC
> writes.  I've looked at fs/iomap/direct-io.c and it wasn't immediately
> obvious what I need to add to enable this feature.

I've actually looked briefly into this in the past for ext4 - IIRC
the problem is that mtime updates during the write are causing the
inode to get marked dirty.

Yeah, typical trace from a RWF_DSYNC direct IO through ext4 is:

xfs_io-6002  [007]  2986.814841: ext4_es_lookup_extent_enter: dev 7,0 ino 12 lblk 0
xfs_io-6002  [007]  2986.814848: ext4_es_lookup_extent_exit: dev 7,0 ino 12 found 1 [0/30720) 34304 W
xfs_io-6002  [007]  2986.814856: ext4_journal_start_inode: dev 7,0 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x39
xfs_io-6002  [007]  2986.814875: ext4_mark_inode_dirty: dev 7,0 ino 12 caller ext4_dirty_inode+0x5c
xfs_io-6002  [007]  2986.814917: iomap_dio_rw_begin:   dev 7:0 ino 0xc size 0x40000000 offset 0x0 length 0x20000 done_before 0x0 flags DIRECT dio_flags  aio 0
xfs_io-6002  [007]  2986.814922: iomap_iter:           dev 7:0 ino 0xc pos 0x0 length 0x0 flags WRITE|DIRECT (0x11) ops ext4_iomap_overwrite_ops caller __iomap_dio_rw+0x1c2

So I think the problem may be through file_modified() from
ext4_dio_write_checks() triggering mtime updates which then comes
back through mark_inode_dirty and that ends up causing
the inode to be marked dirty and tracked by the journal.

Yup, there it is in generic_update_time():

	if (updated & (S_ATIME|S_MTIME|S_CTIME))
                dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ? I_DIRTY_TIME : I_DIRTY_SYNC;

it's setting the inode I_DIRTY_SYNC for mtime updates.

XFS does not do this. It implements the ->update_time() method so
doesn't use generic_update_time(). XFS tracks pure timestamp updates
separately to other dirty inode metadata. We then elide "timestamp
dirty" state from the IOMAP_F_DIRTY checks because O_DSYNC doesn't
require timestamps to be persisted.

FWIW, the patch below allows testing FUA optimisations on loop
devices....

-Dave.
-- 
Dave Chinner
david@...morbit.com

loop: add support for FUA writes

From: Dave Chinner <dchinner@...hat.com>

Any file-backed loop device on a filesysetm that support the ->fsync
operation can do FUA writes. All a FUA write requires from the loop
device is RWF_DSYNC semantics for the specific REQ_FUA request that
is being processed. i.e. that it is on stable storage before the
request completes.

This allows simple testing of things like iomap FUA optimisations,
and should also provide better performance for loop device in
situations where the upper filesystem is issuing FUA writes to try
to avoid needing journal flushes. Using RWF_DSYNC will also allow
the lower filesystem to attempt to use REQ_FUA on loop devices using
direct IO and potentially improve the IO patterns all the way down
the stack as a result.

Prep:

# losetup --direct-io=on /dev/loop0 /mnt/scratch/foo.img 
# mkfs.xfs /dev/loop0
# mount /dev/loop0 /mnt/test
# xfs_io -fdc "pwrite -b 2m 0 1g" -c fsync /mnt/test/foo
# sync

Measure:

# xfs_io -fdc "pwrite -V 1 -D -b 128k 0 1m" -c fsync /mnt/test/foo

Before:

# grep . /sys/block/loop0/queue/*
....
/sys/block/loop0/queue/fua:0
....

# trace-cmd stream -e block* |grep xfs_io
.....
<...>-4654  [007]   809.018173: block_rq_issue:       7,0 WS 131072 () 192 + 256 [xfs_io]
<...>-4654  [007]   809.025910: block_bio_queue:      7,0 WS 448 + 256 [xfs_io]
<...>-4654  [007]   809.025917: block_getrq:          7,0 WS 448 + 256 [xfs_io]
....

After:

# grep . /sys/block/loop0/queue/*
....
/sys/block/loop0/queue/fua:1
....

# trace-cmd stream -e block* |grep xfs_io
.....
<...>-4648  [006]   793.968169: block_rq_issue:       7,0 WFS 131072 () 192 + 256 [xfs_io]
<...>-4648  [006]   793.977521: block_bio_queue:      7,0 WFS 448 + 256 [xfs_io]
<...>-4648  [006]   793.977531: block_getrq:          7,0 WFS 448 + 256 [xfs_io]

The upper XFS filesystem is issuing REQ_FUA writes to the loop
device now.

Signed-off-by: Dave Chinner <dchinner@...hat.com>
---
 drivers/block/loop.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 9f2d412fc560..6f2df09b3179 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -238,7 +238,8 @@ static void loop_set_size(struct loop_device *lo, loff_t size)
 		kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
 }
 
-static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
+static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos,
+		rwf_t rw_flags)
 {
 	struct iov_iter i;
 	ssize_t bw;
@@ -246,7 +247,7 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
 	iov_iter_bvec(&i, ITER_SOURCE, bvec, 1, bvec->bv_len);
 
 	file_start_write(file);
-	bw = vfs_iter_write(file, &i, ppos, 0);
+	bw = vfs_iter_write(file, &i, ppos, rw_flags);
 	file_end_write(file);
 
 	if (likely(bw ==  bvec->bv_len))
@@ -261,14 +262,14 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
 }
 
 static int lo_write_simple(struct loop_device *lo, struct request *rq,
-		loff_t pos)
+		loff_t pos, rwf_t rw_flags)
 {
 	struct bio_vec bvec;
 	struct req_iterator iter;
 	int ret = 0;
 
 	rq_for_each_segment(bvec, rq, iter) {
-		ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos);
+		ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos, rw_flags);
 		if (ret < 0)
 			break;
 		cond_resched();
@@ -392,7 +393,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
 }
 
 static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
-		     loff_t pos, int rw)
+		     loff_t pos, int rw, rwf_t rw_flags)
 {
 	struct iov_iter iter;
 	struct req_iterator rq_iter;
@@ -446,6 +447,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 	cmd->iocb.ki_filp = file;
 	cmd->iocb.ki_complete = lo_rw_aio_complete;
 	cmd->iocb.ki_flags = IOCB_DIRECT;
+	kiocb_set_rw_flags(&cmd->iocb, rw_flags);
 	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
 
 	if (rw == ITER_SOURCE)
@@ -464,6 +466,15 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 	loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
+	rwf_t rw_flags = 0;
+
+	/*
+	 * If we are being asked to do a REQ_FUA write, then we convert that to
+	 * a datasync write so that the contents of the write are on stable
+	 * storage before the write completes.
+	 */
+	if (rq->cmd_flags & REQ_FUA)
+		rw_flags = RWF_DSYNC;
 
 	/*
 	 * lo_write_simple and lo_read_simple should have been covered
@@ -490,12 +501,12 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 		return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
 	case REQ_OP_WRITE:
 		if (cmd->use_aio)
-			return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
+			return lo_rw_aio(lo, cmd, pos, ITER_SOURCE, rw_flags);
 		else
-			return lo_write_simple(lo, rq, pos);
+			return lo_write_simple(lo, rq, pos, rw_flags);
 	case REQ_OP_READ:
 		if (cmd->use_aio)
-			return lo_rw_aio(lo, cmd, pos, ITER_DEST);
+			return lo_rw_aio(lo, cmd, pos, ITER_DEST, 0);
 		else
 			return lo_read_simple(lo, rq, pos);
 	default:
@@ -1077,7 +1088,7 @@ static int loop_configure(struct loop_device *lo, blk_mode_t mode,
 	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
 
 	if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_write_cache(lo->lo_queue, true, false);
+		blk_queue_write_cache(lo->lo_queue, true, true);
 
 	if (config->block_size)
 		bsize = config->block_size;