lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 14 Apr 2022 10:46:36 +0800 From: yebin <yebin10@...wei.com> To: Jan Kara <jack@...e.cz> CC: <tytso@....edu>, <adilger.kernel@...ger.ca>, <linux-ext4@...r.kernel.org>, <linux-kernel@...r.kernel.org>, <lczerner@...hat.com> Subject: Re: [PATCH -next] ext4: fix warning in ext4_handle_inode_extension On 2022/3/30 21:30, Jan Kara wrote: > On Wed 30-03-22 20:08:13, yebin wrote: >> On 2022/3/29 17:28, Jan Kara wrote: >>> On Sat 26-03-22 14:53:51, Ye Bin wrote: >>>> We got issue as follows: >>>> EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory >>>> EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error >>>> EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory >>>> EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc >>>> ------------[ cut here ]------------ >>>> WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 >>>> Modules linked in: >>>> CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 >>>> RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 >>>> RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 >>>> RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 >>>> RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd >>>> RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f >>>> R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a >>>> R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b >>>> FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 >>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>>> Call Trace: >>>> do_iter_readv_writev+0x2e5/0x360 >>>> do_iter_write+0x112/0x4c0 >>>> do_pwritev+0x1e5/0x390 >>>> __x64_sys_pwritev2+0x7e/0xa0 >>>> do_syscall_64+0x37/0x50 >>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>>> >>>> Above issue may happen as follows: >>>> Assume >>>> inode.i_size=4096 >>>> EXT4_I(inode)->i_disksize=4096 >>>> >>>> step 1: set inode->i_isize = 8192 >>>> ext4_setattr >>>> if (attr->ia_size != inode->i_size) >>>> EXT4_I(inode)->i_disksize = attr->ia_size; >>>> rc = ext4_mark_inode_dirty >>>> ext4_reserve_inode_write >>>> ext4_get_inode_loc >>>> __ext4_get_inode_loc >>>> sb_getblk --> return -ENOMEM >>>> ... >>>> if (!error) ->will not update i_size >>>> i_size_write(inode, attr->ia_size); >>>> Now: >>>> inode.i_size=4096 >>>> EXT4_I(inode)->i_disksize=8192 >>>> >>>> step 2: Direct write 4096 bytes >>>> ext4_file_write_iter >>>> ext4_dio_write_iter >>>> iomap_dio_rw ->return error >>>> if (extend) >>>> ext4_handle_inode_extension >>>> WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); >>>> ->Then trigger warning. >>>> >>>> To solve above issue, if mark inode dirty failed in ext4_setattr just >>>> set 'EXT4_I(inode)->i_disksize' with old value. >>>> >>>> Signed-off-by: Ye Bin <yebin10@...wei.com> >>> Thanks for the fix! So I think this deserves a further debate. I have two >>> points here: >>> >>> 1) If ext4_mark_inode_dirty() fails (or basically any metadata writeback) >>> we must abort the journal because metadata is not guaranteed to be >>> consistent anymore. In this particular callsite of ext4_mark_inode_dirty() >>> you were able to undo the changes but there are many more where it is not >>> sanely possible AFAICT. Hence I think that ext4_reserve_inode_write() needs >>> to call ext4_journal_abort_handle() (as already happens inside >>> __ext4_journal_get_write_access()) and not just ext4_std_error(). >>> >>> 2) The assertion in ext4_handle_inode_extension() should be conditioned on >>> !is_journal_aborted() to avoid useless warnings for filesystems we know are >>> inconsistent anyway. >>> >>> Thoughts? >>> >>> Honza >> Do you mean call jbd2_abort in ext4_reserve_inode_write() ? > Yes. > >> If we abort journal when metadata is not guaranteed to be consistent. The >> mode of ‘errors=continue’ is unnecessary. > Well, firstly, errors=continue was always the best effort. There are no > guarantees which failures we are able to withstand and which not. > Generally, I think we try to withstand on-disk filesystem inconsistency but > not inconsistency coming from programming errors or other external factors > like out-of-memory conditions. Secondly, we already do abort the journal > when e.g. jbd2_journal_get_write_access() fails (although that generally > means some internal inconsistency) or when say revoke handling fails to > allocate memory for a revoke record. So it won't be a new thing. Thirdly, > and perhaps most importantly, you have found and fixed just one fairly > innocent problem happening due to in memory inode state getting > inconsistent after we fail to record the inode in the journal. There are > almost 80 callsites of ext4_mark_inode_dirty() and honestly I suspect that > e.g. inconsistent states resulting from extent tree manipulations being > aborted in the middle due to ext4_ext_dirty() failing due to ENOMEM will > also trigger all sorts of "interesting" behavior. So that's why I'd rather > abort the journal than try to continue when we almost certainly now we > cannot. > > One other possibility I could see dealing with this class of problems is > using __GFP_NOFAIL for inode buffer head allocation (through > sb_getblk_gfp()) in __ext4_get_inode_loc(). BTW, how did you trigger NULL > sb_getblk() return in the first place? AFAICS grow_buffers() already uses > __GFP_NOFAIL for all the allocations? > > Honza To be honest, I don't know syzkaller how to inject the NOMEM fault. If syzkaller rely on the memory fault injection mode provided by the kernel, should report null pointer access. Anyway, If inject a single point of IO fault, we still have to face the same situation. > >>>> --- >>>> fs/ext4/inode.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>> index 90fd6f7b6209..8adf1f802f6c 100644 >>>> --- a/fs/ext4/inode.c >>>> +++ b/fs/ext4/inode.c >>>> @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>>> if (attr->ia_valid & ATTR_SIZE) { >>>> handle_t *handle; >>>> loff_t oldsize = inode->i_size; >>>> + loff_t old_disksize; >>>> int shrink = (attr->ia_size < inode->i_size); >>>> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { >>>> @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>>> inode->i_sb->s_blocksize_bits); >>>> down_write(&EXT4_I(inode)->i_data_sem); >>>> + old_disksize = EXT4_I(inode)->i_disksize; >>>> EXT4_I(inode)->i_disksize = attr->ia_size; >>>> rc = ext4_mark_inode_dirty(handle, inode); >>>> if (!error) >>>> @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>>> */ >>>> if (!error) >>>> i_size_write(inode, attr->ia_size); >>>> + else >>>> + EXT4_I(inode)->i_disksize = old_disksize; >>>> up_write(&EXT4_I(inode)->i_data_sem); >>>> ext4_journal_stop(handle); >>>> if (error) >>>> -- >>>> 2.31.1 >>>>
Powered by blists - more mailing lists