lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <87il8nhxdm.fsf@x1.mail-host-address-is-not-set>
Date:   Wed, 06 Sep 2023 21:15:01 +0100
From:   Free Ekanayaka <free.ekanayaka@...il.com>
To:     linux-ext4@...r.kernel.org
Subject: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled

(reposting after having subscribed, and from a different address since
the first mail did not show up in the archives and didn't bounce either)

Hello,

I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1
too).

I've created an ext4 filesystem with journalling disabled on an NVMe
drive:

mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6

I have a program that creates and open a new file with O_DIRECT, and
sets its size to 8M with posix_fallocate(), something like:

fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT);
posix_fallocate(fd, 0, 8 * 1024 * 1024);
fsync(fd);
dirfd = open("/dir", O_RDONLY | O_DIRECTORY);
fsync(dirfd);

and then it uses io_uring to perform a single write of 4096 bytes at the
beginning of the file, passing RWF_DSYNC to the submitted
IORING_OP_WRITE_FIXED entry,

I would expect the kernel to tell the NVMe device to actually flush the
write, not only buffer it. However I measured the end-to-end latency of
the io_uring operation and it was very low, as if the write was only
buffered by the NVMe device, but not flushed.

This suspicion seems to be confirmed by tracing the write nvme command
sent to the device:

raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)

notice the "ctrl=0x0" in there. If I run the same program, but I write
directly to the /dev/nvme0n1p6 device instead of using a file, I get:

raft-benchmark-37936   [003] .....  9942.784628: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=29265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498192384, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)

and here notice the "ctrl=0x4000", which I would expect is some flag to
tell to the NVMe device to also flush the write.

If I enable "write through" on the NVMe device:

echo "write through" | sudo tee /sys/block/nvme0n1/queue/write_cache

then also the direct write to the /dev/nvme0n1p6 device avoids flushing
the command, as expected:

raft-benchmark-38106   [003] .....  9971.741308: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=33361, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498192384, len=7, ctrl=0x0, dsmgmt=0, reftag=0)

(notice the "ctrl=0x0" there).

My question is: is my interpretation correct? Is it expected that ext4
does not tell the block layer/device to flush this write when
journalling is disabled?

Below you find the complete event trace of the io_uring call, both in
the ext4 write case and in the raw /dev/nvme0n1p6 write case.

I can provide a C reproducer as well if needed, but wanted to first
check if this rings any bell.

Thanks,

Free

== ext4 ==

  raft-benchmark-37801   [003] .....  9904.830974: io_uring_submit_req: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
  raft-benchmark-37801   [003] .....  9904.830982: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.830983: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
  raft-benchmark-37801   [003] .....  9904.830985: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x38/0x80 [ext4]
  raft-benchmark-37801   [003] .....  9904.830987: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_dirty_inode+0x5b/0x80 [ext4]
  raft-benchmark-37801   [003] .....  9904.830989: block_touch_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.830993: block_dirty_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.831121: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.831122: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
  raft-benchmark-37801   [003] .....  9904.831123: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_iomap_begin+0x1c2/0x2f0 [ext4]
  raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.831124: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
  raft-benchmark-37801   [003] .....  9904.831125: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|PRE_IO
  raft-benchmark-37801   [003] .....  9904.831126: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
  raft-benchmark-37801   [003] .....  9904.831127: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
  raft-benchmark-37801   [003] .....  9904.831128: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|PRE_IO|METADATA_NOFAIL allocated 1 newblock 32887
  raft-benchmark-37801   [003] .....  9904.831129: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
  raft-benchmark-37801   [003] .....  9904.831130: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_split_extent+0xcd/0x190 [ext4]
  raft-benchmark-37801   [003] .....  9904.831131: block_touch_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.831133: block_dirty_buffer: 259,5 sector=135 size=4096
  raft-benchmark-37801   [003] .....  9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1
  raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
  raft-benchmark-37801   [003] .....  9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR
  raft-benchmark-37801   [003] .....  9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
  raft-benchmark-37801   [003] .....  9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096
  raft-benchmark-37801   [003] .....  9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark]
  raft-benchmark-37801   [003] .....  9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark]
  raft-benchmark-37801   [003] .....  9904.831149: block_plug: [raft-benchmark]
  raft-benchmark-37801   [003] .....  9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
  raft-benchmark-37801   [003] .....  9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark]
  raft-benchmark-37801   [003] d.h..  9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783
  raft-benchmark-37801   [003] d.h..  9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0
  raft-benchmark-37801   [003] d.h..  9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0]
     kworker/3:1-30279   [003] .....  9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4]
     kworker/3:1-30279   [003] .....  9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0
     kworker/3:1-30279   [003] .....  9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U
     kworker/3:1-30279   [003] .....  9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT
     kworker/3:1-30279   [003] .....  9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U
     kworker/3:1-30279   [003] .....  9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1
     kworker/3:1-30279   [003] .....  9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887
     kworker/3:1-30279   [003] .....  9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4]
     kworker/3:1-30279   [003] .....  9904.831197: block_touch_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1
     kworker/3:1-30279   [003] .....  9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W
     kworker/3:1-30279   [003] .....  9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4]
     kworker/3:1-30279   [003] .....  9904.831200: block_touch_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096
     kworker/3:1-30279   [003] .....  9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 
     kworker/3:1-30279   [003] .....  9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0
  raft-benchmark-37801   [003] ...1.  9904.831212: io_uring_complete: ring 0000000011cab2e4, req 00000000c7a7d835, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
  raft-benchmark-37801   [003] .....  9904.831213: io_uring_task_work_run: tctx 000000005c8048e9, count 1, loops 1

== /dev/nvme0n1p6 ==

  raft-benchmark-37936   [003] .....  9942.784613: io_uring_submit_req: ring 00000000c4f84b11, req 00000000058196bf, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
  raft-benchmark-37936   [003] .....  9942.784622: block_bio_remap: 259,0 WFS 498192384 + 8 <- (259,5) 0
  raft-benchmark-37936   [003] .....  9942.784622: block_bio_queue: 259,0 WFS 498192384 + 8 [raft-benchmark]
  raft-benchmark-37936   [003] .....  9942.784626: block_getrq: 259,0 WFS 498192384 + 8 [raft-benchmark]
  raft-benchmark-37936   [003] .....  9942.784628: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=29265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498192384, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
  raft-benchmark-37936   [003] .....  9942.784634: block_rq_issue: 259,0 WFS 4096 () 498192384 + 8 [raft-benchmark]
  raft-benchmark-37936   [003] d.h..  9942.794331: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=784, tail=784
  raft-benchmark-37936   [003] d.h..  9942.794334: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=29265, res=0x0, retries=0, flags=0x0, status=0x0
  raft-benchmark-37936   [003] d.h..  9942.794334: block_rq_complete: 259,0 WFS () 498192384 + 8 [0]
  raft-benchmark-37936   [003] ...1.  9942.794338: io_uring_complete: ring 00000000c4f84b11, req 00000000058196bf, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 
  raft-benchmark-37936   [003] .....  9942.794338: io_uring_task_work_run: tctx 000000005fccbc0a, count 1, loops 1
-------------------- End of forwarded message --------------------

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ