lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ko3bgsd2wdluordh6phnmou3232yqlqsehxte6bvq34udq5in7@4phfw73mywjo>
Date: Mon, 26 May 2025 18:20:11 +0200
From: Jan Kara <jack@...e.cz>
To: Thadeu Lima de Souza Cascardo <cascardo@...lia.com>
Cc: Theodore Ts'o <tytso@....edu>, Jan Kara <jack@...e.com>, 
	Tao Ma <boyu.mt@...bao.com>, Andreas Dilger <adilger.kernel@...ger.ca>, 
	Eric Biggers <ebiggers@...gle.com>, linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, kernel-dev@...lia.com, 
	syzbot+0c89d865531d053abb2d@...kaller.appspotmail.com, stable@...r.kernel.org
Subject: Re: [PATCH v2] ext4: inline: convert when mmap is called, not when
 page is written

On Mon 26-05-25 11:13:34, Thadeu Lima de Souza Cascardo wrote:
> inline data handling has a race between writing and writing to a memory
> map.
> 
> When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which
> destroys the inline data, but if block allocation fails, restores the
> inline data. In that process, we could have:
> 
> CPU1					CPU2
> destroy_inline_data
> 					write_begin (does not see inline data)
> restory_inline_data
> 					write_end (sees inline data)
> 
> This leads to bugs like the one below, as write_begin did not prepare for
> the case of inline data, which is expected by the write_end side of it.
> 
> ------------[ cut here ]------------
> kernel BUG at fs/ext4/inline.c:235!
> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline]
> RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774
> Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90
> RSP: 0018:ffffc900031c7320 EFLAGS: 00010293
> RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000
> RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b
> RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070
> R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a
> R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b
> FS:  00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0
> Call Trace:
>  <TASK>
>  generic_perform_write+0x6f8/0x990 mm/filemap.c:4070
>  ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299
>  ext4_file_write_iter+0x892/0x1c50
>  iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743
>  do_splice_from fs/splice.c:941 [inline]
>  direct_splice_actor+0x11d/0x220 fs/splice.c:1164
>  splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108
>  do_splice_direct_actor fs/splice.c:1207 [inline]
>  do_splice_direct+0x289/0x3e0 fs/splice.c:1233
>  do_sendfile+0x564/0x8a0 fs/read_write.c:1363
>  __do_sys_sendfile64 fs/read_write.c:1424 [inline]
>  __se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f5c6bb18d09
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
> RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09
> RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004
> RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000
> R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620
> R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192
>  </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> 
> This happens because ext4_page_mkwrite is not protected by the inode_lock.
> The xattr semaphore is not sufficient to protect inline data handling in a
> sane way, so we need to rely on the inode_lock. Adding the inode_lock to
> ext4_page_mkwrite is not an option, otherwise lock-ordering problems with
> mmap_lock may arise.
> 
> The conversion inside ext4_page_mkwrite was introduced at commit
> 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This
> fixes a documented bug in the commit message, which suggests some
> alternative fixes.
> 
> Convert inline data when mmap is called, instead of doing it only when the
> mmapped page is written to. Using the inode_lock there does not lead to
> lock-ordering issues.
> 
> The drawback is that inline conversion will happen when the file is
> mmapped, even though the page will not be written to.
> 
> Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map")
> Reported-by: syzbot+0c89d865531d053abb2d@...kaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@...lia.com>
> Cc: stable@...r.kernel.org
> ---
> Changes in v2:
> - Convert inline data at mmap time, avoiding data loss.
> - Link to v1: https://lore.kernel.org/r/20250519-ext4_inline_page_mkwrite-v1-1-865d9a62b512@igalia.com
> ---
>  fs/ext4/file.c  | 6 ++++++
>  fs/ext4/inode.c | 4 ----
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index beb078ee4811d6092e362e37307e7d87e5276cbc..f2380471df5d99500e49fdc639fa3e56143c328f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -819,6 +819,12 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (!daxdev_mapping_supported(vma, dax_dev))
>  		return -EOPNOTSUPP;
>  
> +	inode_lock(inode);
> +	ret = ext4_convert_inline_data(inode);
> +	inode_unlock(inode);
> +	if (ret)
> +		return ret;
> +
>  	file_accessed(file);
>  	if (IS_DAX(file_inode(file))) {
>  		vma->vm_ops = &ext4_dax_vm_ops;

So I would *love* to do this and was thinking about this as well. But the
trouble is that this causes lock inversion as well because ->mmap callback
is called with mmap_lock held and so we cannot acquire inode_lock here
either.

Recent changes which switch from ->mmap to ->mmap_prepare callback are
actually going in a suitable direction but we'd need a rather larger
rewrite to get from under mmap_lock and I'm not sure that's justified.

Anyway, thanks for having a look!

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ