[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8c06c139-f994-442b-925e-e177ef2c5adb@oracle.com>
Date: Mon, 4 Dec 2023 10:36:01 +0000
From: John Garry <john.g.garry@...cle.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-ext4@...r.kernel.org,
Theodore Ts'o <tytso@....edu>
Cc: Ritesh Harjani <ritesh.list@...il.com>,
linux-kernel@...r.kernel.org,
"Darrick J . Wong" <djwong@...nel.org>,
linux-block@...r.kernel.org, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, dchinner@...hat.com
Subject: Re: [RFC 0/7] ext4: Allocator changes for atomic write support with
DIO
On 30/11/2023 13:53, Ojaswin Mujoo wrote:
Thanks for putting this together.
> This patch series builds on top of John Gary's atomic direct write
> patch series [1] and enables this support in ext4. This is a 2 step
> process:
>
> 1. Enable aligned allocation in ext4 mballoc. This allows us to allocate
> power-of-2 aligned physical blocks, which is needed for atomic writes.
>
> 2. Hook the direct IO path in ext4 to use aligned allocation to obtain
> physical blocks at a given alignment, which is needed for atomic IO. If
> for any reason we are not able to obtain blocks at given alignment we
> fail the atomic write.
So are we supposed to be doing atomic writes on unwritten ranges only in
the file to get the aligned allocations?
I actually tried that, and I got a WARN triggered:
# mkfs.ext4 /dev/sda
mke2fs 1.46.5 (30-Dec-2021)
Creating filesystem with 358400 1k blocks and 89760 inodes
Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729, 204801, 221185
Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
[ 12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left
# mount /dev/sda mnt
[ 12.798804] EXT4-fs (sda): mounted filesystem
7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota
mode: none.
# touch mnt/file
#
# /test-statx -a /root/mnt/file
statx(/root/mnt/file) = 0
dump_statx results=5fff
Size: 0 Blocks: 0 IO Block: 1024 regular file
Device: 08:00 Inode: 12 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2023-12-04 10:27:40.002848720+0000
Modify: 2023-12-04 10:27:40.002848720+0000
Change: 2023-12-04 10:27:40.002848720+0000
Birth: 2023-12-04 10:27:40.002848720+0000
stx_attributes_mask=0x703874
STATX_ATTR_WRITE_ATOMIC set
unit min: 1024
uunit max: 524288
Attributes: 0000000000400000 (........ ........ ........ ........
........ .?--.... ..---... .---.-..)
#
looks ok so far, then write 4KB at offset 0:
# /test-pwritev2 -a -d -p 0 -l 4096 /root/mnt/file
file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
[ 46.813720] ------------[ cut here ]------------
[ 46.814934] WARNING: CPU: 1 PID: 158 at fs/ext4/mballoc.c:2991
ext4_mb_regular_allocator+0xeca/0xf20
[ 46.816344] Modules linked in:
[ 46.816831] CPU: 1 PID: 158 Comm: test-pwritev2 Not tainted
6.7.0-rc1-00038-gae3807f27e7d-dirty #968
[ 46.818220] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
[ 46.819886] RIP: 0010:ext4_mb_regular_allocator+0xeca/0xf20
[ 46.820734] Code: fd ff ff f0 41 ff 81 e4 03 00 00 e9 63 fd ff ff
90 0f 0b 90 e9 fe f3 ff ff 90 48 c7 c7 e2 7a b2 86 44 89 ca e8 f7 f1
d2 ff 90 <0f> 0b 90 90 45 8b 44 24 3c e9 d4 f3 ff ff 4d 8b 45 08 4c 89
c2 4d
[ 46.823577] RSP: 0018:ffffb77dc056b7c0 EFLAGS: 00010286
[ 46.824379] RAX: 0000000000000000 RBX: ffff9b2ad77dea80 RCX:
0000000000000000
[ 46.825458] RDX: 0000000000000001 RSI: ffff9b2b3491b5c0 RDI:
ffff9b2b3491b5c0
[ 46.826557] RBP: ffff9b2adc7cd000 R08: 0000000000000000 R09:
c0000000ffffdfff
[ 46.827634] R10: ffff9b2adcb9d780 R11: ffffb77dc056b648 R12:
ffff9b2ac6778000
[ 46.828714] R13: ffff9b2adc7cd000 R14: ffff9b2adc7d0000 R15:
000000000000002a
[ 46.829796] FS: 00007f726dece740(0000) GS:ffff9b2b34900000(0000)
knlGS:0000000000000000
[ 46.830706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.831299] CR2: 0000000001ed72b8 CR3: 000000001c794006 CR4:
0000000000370ef0
[ 46.832041] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 46.832813] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 46.833546] Call Trace:
[ 46.833901] <TASK>
[ 46.834163] ? __warn+0x78/0x130
[ 46.834504] ? ext4_mb_regular_allocator+0xeca/0xf20
[ 46.835037] ? report_bug+0xf8/0x1e0
[ 46.835527] ? console_unlock+0x45/0xd0
[ 46.835963] ? handle_bug+0x40/0x70
[ 46.836419] ? exc_invalid_op+0x13/0x70
[ 46.836865] ? asm_exc_invalid_op+0x16/0x20
[ 46.837329] ? ext4_mb_regular_allocator+0xeca/0xf20
[ 46.837852] ext4_mb_new_blocks+0x7e8/0xe60
[ 46.838382] ? __kmalloc+0x4b/0x130
[ 46.838824] ? __kmalloc+0x4b/0x130
[ 46.839243] ? ext4_find_extent+0x347/0x360
[ 46.839743] ext4_ext_map_blocks+0xc44/0xff0
[ 46.840395] ext4_map_blocks+0x162/0x5b0
[ 46.841010] ? jbd2__journal_start+0x84/0x1f0
[ 46.841694] ext4_map_blocks_aligned+0x20/0xa0
[ 46.842382] ext4_iomap_begin+0x1e9/0x320
[ 46.843006] iomap_iter+0x16d/0x350
[ 46.843554] __iomap_dio_rw+0x3be/0x830
[ 46.844150] iomap_dio_rw+0x9/0x30
[ 46.844680] ext4_file_write_iter+0x597/0x800
[ 46.845346] do_iter_readv_writev+0xe1/0x150
[ 46.846029] do_iter_write+0x86/0x1f0
[ 46.846638] vfs_writev+0x96/0x190
[ 46.847176] ? do_pwritev+0x98/0xd0
[ 46.847721] do_pwritev+0x98/0xd0
[ 46.848230] ? syscall_trace_enter.isra.19+0x130/0x1b0
[ 46.849028] do_syscall_64+0x42/0xf0
[ 46.849590] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 46.850405] RIP: 0033:0x7f726df9666f
[ 46.850964] Code: d5 41 54 49 89 f4 55 89 fd 53 44 89 c3 48 83 ec
18 80 3d bb fd 0b 00 00 74 2a 45 89 c1 49 89 ca 45 31 c0 b8 48 01 00
00 0f 05 <48> 3d 00 f0 ff ff 76 5c 48 8b 15 7a 77 0b 00 f7 d8 64 89 02
48 83
[ 46.854020] RSP: 002b:00007fff28b9bff0 EFLAGS: 00000246 ORIG_RAX:
0000000000000148
[ 46.855178] RAX: ffffffffffffffda RBX: 0000000000000024 RCX:
00007f726df9666f
[ 46.856248] RDX: 0000000000000001 RSI: 00007fff28b9c050 RDI:
0000000000000003
[ 46.857303] RBP: 0000000000000003 R08: 0000000000000000 R09:
0000000000000024
[ 46.858365] R10: 0000000000000000 R11: 0000000000000246 R12:
00007fff28b9c050
[ 46.859407] R13: 0000000000000001 R14: 0000000000000000 R15:
00007f726e08aa60
[ 46.860448] </TASK>
[ 46.860797] ---[ end trace 0000000000000000 ]---
[ 46.861497] EXT4-fs warning (device sda):
ext4_map_blocks_aligned:520: Returned extent couldn't satisfy
alignment requirements
main wrote -1 bytes at offset 0
[ 46.863855] test-pwritev2 (158) used greatest stack depth: 11920
bytes left
#
Please note that I tested on my own dev branch, which contains changes
over [1], but I expect it would not make a difference for this test.
>
> Currently this RFC does not impose any restrictions for atomic and non-atomic
> allocations to any inode, which also leaves policy decisions to user-space
> as much as possible. So, for example, the user space can:
>
> * Do an atomic direct IO at any alignment and size provided it
> satisfies underlying device constraints. The only restriction for now
> is that it should be power of 2 len and atleast of FS block size.
>
> * Do any combination of non atomic and atomic writes on the same file
> in any order. As long as the user space is passing the RWF_ATOMIC flag
> to pwritev2() it is guaranteed to do an atomic IO (or fail if not
> possible).
>
> There are some TODOs on the allocator side which are remaining like...
>
> 1. Fallback to original request size when normalized request size (due to
> preallocation) allocation is not possible.
> 2. Testing some edge cases.
>
> But since all the basic test scenarios were covered, hence we wanted to get
> this RFC out for discussion on atomic write support for DIO in ext4.
>
> Further points for discussion -
>
> 1. We might need an inode flag to identify that the inode has blocks/extents
> atomically allocated. So that other userspace tools do not move the blocks of
> the inode for e.g. during resize/fsck etc.
> a. Should inode be marked as atomic similar to how we have IS_DAX(inode)
> implementation? Any thoughts?
>
> 2. Should there be support for open flags like O_ATOMIC. So that in case if
> user wants to do only atomic writes to an open fd, then all writes can be
> considered atomic.
>
> 3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look
> like since say if there are block allocations done which were done atomically,
> it should not matter to FS w.r.t compatibility.
>
> 4. Mostly aligned allocations are required when we don't have data=journal
> mode. So should we return -EIO with data journalling mode for DIO request?
>
> Script to test using pwritev2() can be found here:
> https://urldefense.com/v3/__https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9__;!!ACWV5N9M2RV99hQ!LbVSb-43597CLDmYnhgOwH6MAcikusRh75-4fUbUrA_8go3B6JL1lWJPmhij8siPJE031qtQb6-bpdLEa1qrVA$
Please note that the posix_memalign() call in the program should PAGE align.
Thanks,
John
Powered by blists - more mailing lists