lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <20240110115717.sn5yuyjuwaixs5ra@quack3> Date: Wed, 10 Jan 2024 12:57:17 +0100 From: Jan Kara <jack@...e.cz> To: Free Ekanayaka <free.ekanayaka@...il.com> Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org Subject: Re: direct I/O: ext4 seems to not honor RWF_DSYNC when journal is disabled On Wed 10-01-24 11:19:42, Free Ekanayaka wrote: > Jan Kara <jack@...e.cz> writes: > > [...] > > >> I've re-ran the same code with kernel 6.5.0 and indeed the behavior has > >> changed and an actual NVMe flush command seems to be issued (the flags > >> passed to nvme_setup_cmd match the ones that I see in the case I write > >> to the raw block device). So that part seems fixed. I thought I had > >> tried with 6.5.1 when I had posted this issue, and that it was present > >> there too, but maybe I was mistaken. > >> > >> I'm not entirely sure what you mean about flushing the inode buffer > >> properly. As far as I see, the current behavior I see matches what I'd > >> expect. > > > > The thing is: fallocate(2) does preallocate blocks to the file so > > block allocation is not needed when writing to those blocks. However that > > does not mean metadata changes are not needed when writing to those blocks. > > In case of ext4 (and other filesystems such as xfs or btrfs as well) blocks > > allocated using fallocate(2) are tracked as unwritten in inode's metadata (so > > reads from them return 0 instead of random garbage). After block contents > > is written with iouring write, we need to convert the state of blocks from > > unwritten to written and that needs metadata modifications. So the first > > write to the file after fallocate(2) call still needs to do metadata > > modifications and these need to be persisted as part of fdatasync(2). And > > by mistake this did not happen in nojournal mode. > > Thanks for the explanation. > > Since I want to achieve the lowest possible latency for these writes, is > there a way to not onlly preallocate blocks with fallocate() but also to > avoid the extra write for metadata modifications that you mention? > > I mean, I could of course take the brute force approach of performing > dump "pre" writes of those blocks, but I'm wondering if there's a better > way that doesn't require writing those blocks twice (which might end up > defeating the purpose of lowering overall latency). No, if you really want to make sure there will be no metadata modifications while writing to a file, writing 0's (or whatever you wish) to it is the only way. > >> For reference I'm attaching below the trace of the same user code, this > >> time run on kernel 6.5.0, which is the one currently shipping with > >> Debian/testing. Note that there are quite a bit less trace lines emitted > >> by the ext4 sub-system, not sure if it's related/relevant. > >> > >> raft-benchmark-35708 [000] ..... 9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0 > >> raft-benchmark-35708 [000] ..... 9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0 > >> raft-benchmark-35708 [000] ..... 9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR > >> raft-benchmark-35708 [000] ..... 9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4] > >> raft-benchmark-35708 [000] ..... 9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4] > >> raft-benchmark-35708 [000] ..... 9203.271119: block_touch_buffer: 259,5 sector=135 size=4096 > >> raft-benchmark-35708 [000] ..... 9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096 > >> raft-benchmark-35708 [000] ..... 9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0 > >> raft-benchmark-35708 [000] ..... 9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR > >> raft-benchmark-35708 [000] ..... 9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168 > >> raft-benchmark-35708 [000] ..... 9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark] > >> raft-benchmark-35708 [000] ..... 9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark] > >> raft-benchmark-35708 [000] ..... 9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark] > >> raft-benchmark-35708 [000] ..... 9203.271124: block_plug: [raft-benchmark] > >> raft-benchmark-35708 [000] ..... 9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0) > >> raft-benchmark-35708 [000] ..... 9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark] > >> raft-benchmark-35708 [000] d.h.. 9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79 > >> raft-benchmark-35708 [000] d.h.. 9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0 > >> raft-benchmark-35708 [000] d.h.. 9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0] > >> raft-benchmark-35708 [000] dNh.. 9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark] > >> raft-benchmark-35708 [000] ...1. 9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0 > >> raft-benchmark-35708 [000] ..... 9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1 > > > > So in this case the file blocks seem to have been already written. > > Do you mean that they have been already written at some point in the > past after the file system was created? No, I mean this particular file offset has already been written to since the file was created. The line: raft-benchmark-35708 [000] ..... 9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR at the end has 'WR' which is the status of the found extent (contiguous sequence of blocks) and 'WR' means 'written' and 'referenced'. > I guess it doesn't matter if they were written as part of the new file > that is being written (and that was created with open()/fallocate()), or > if they were written before as part of some other file that was deleted > since then. No, the block must have been written when it was already part of this file. The block being written is tracked on logical-file-offset basis and this tracking is there exactly so that you cannot read some stale data (potentially sensitive data of another user) after fallocating some blocks to the file. I understand this protection need not be that interesting for your usecase and in nojournal mode you still may potentially see such stale data but there's a big difference between possibly seeing such data only after a crash and giving user a way to read such data at will. Honza -- Jan Kara <jack@...e.com> SUSE Labs, CR
Powered by blists - more mailing lists