linux-ext4 - Re: [PATCH] jbd2: Avoid long hold times of j_state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181106102230.GC25414@quack2.suse.cz>
Date:   Tue, 6 Nov 2018 11:22:30 +0100
From:   Jan Kara <jack@...e.cz>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org,
        Adrian Hunter <adrian.hunter@...el.com>
Subject: Re: [PATCH] jbd2: Avoid long hold times of j_state_lock while
 committing a transaction

On Tue 06-11-18 00:20:51, Theodore Y. Ts'o wrote:
> On Tue, Oct 16, 2018 at 11:42:48AM +0200, Jan Kara wrote:
> > We can hold j_state_lock for writing at the beginning of
> > jbd2_journal_commit_transaction() for a rather long time (reportedly for
> > 30 ms) due cleaning revoke bits of all revoked buffers under it. The
> > handling of revoke tables as well as cleaning of t_reserved_list, and
> > checkpoint lists does not need j_state_lock for anything. Furthermore
> > the transaction is in T_LOCKED state and we waited for all outstanding
> > handles so nobody is going to be adding anything to the transaction.
> > 
> > Just drop the lock for unnecessary operations.
> > 
> > Reported-and-tested-by: Adrian Hunter <adrian.hunter@...el.com>
> > Suggested-by: "Theodore Y. Ts'o" <tytso@....edu>
> > Signed-off-by: Jan Kara <jack@...e.cz>
> 
> Unfortunately, this patch is causing a regression in generic/241 in
> the dioread_nolock configuration.   To reproduce:
> 
> % gce-xftests -c dioread_nolock generic/241
> 
> which will result in the decoded stack trace:
> 
> [   42.926455] run fstests generic/241 at 2018-11-06 00:12:30
> [   51.091678] JBD2: assertion failure: h_type=11 h_line_no=4921 block_no=524337 jlist=3

So the buffer is on BJ_Shadow list while the assertion in
jbd2_journal_dirty_metadata() expects it to be in BJ_Metadata list. This is
really weird as we have also checked that jh->b_transaction ==
handle->h_transaction so the transaction couldn't have passed to commit
phase... Oh, I see, the code in start_this_handle() got racy with the
removal of j_state_lock protection from journal_commit_transaction() so now
transaction can start even though there are handles outstanding! I'll think
about the best solution for this. Thanks for report!

								Honza

> [   51.101179] ------------[ cut here ]------------
> [   51.106191] kernel BUG at fs/jbd2/transaction.c:1372!
> [   51.111716] invalid opcode: 0000 [#1] SMP PTI
> [   51.116198] CPU: 0 PID: 177 Comm: kworker/u4:2 Not tainted 4.19.0-rc6-xfstests-00022-g895dc4ba4f98 #730
> [   51.125809] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [   51.135375] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> [   51.141433] RIP: 0010:jbd2_journal_dirty_metadata.cold.7 (??:?) 
> [ 51.147867] Code: c1 ea 0c 66 c1 ee 04 0f b7 d2 40 0f b6 f6 e8 c4 64 de ff 4d 39 6c 24 28 0f 85 22 f4 ff ff 41 83 7c 24 0c 01 0f 84 16 f4 ff ff <0f> 0b 4d 85 c9 74 2c 41 8b 41 08 44 8b 41 08 49 8d b7 c8 05 00 00
> All code
> ========
>    0:	c1 ea 0c             	shr    $0xc,%edx
>    3:	66 c1 ee 04          	shr    $0x4,%si
>    7:	0f b7 d2             	movzwl %dx,%edx
>    a:	40 0f b6 f6          	movzbl %sil,%esi
>    e:	e8 c4 64 de ff       	callq  0xffffffffffde64d7
>   13:	4d 39 6c 24 28       	cmp    %r13,0x28(%r12)
>   18:	0f 85 22 f4 ff ff    	jne    0xfffffffffffff440
>   1e:	41 83 7c 24 0c 01    	cmpl   $0x1,0xc(%r12)
>   24:	0f 84 16 f4 ff ff    	je     0xfffffffffffff440
>   2a:*	0f 0b                	ud2    		<-- trapping instruction
>   2c:	4d 85 c9             	test   %r9,%r9
>   2f:	74 2c                	je     0x5d
>   31:	41 8b 41 08          	mov    0x8(%r9),%eax
>   35:	44 8b 41 08          	mov    0x8(%rcx),%r8d
>   39:	49 8d b7 c8 05 00 00 	lea    0x5c8(%r15),%rsi
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	0f 0b                	ud2    
>    2:	4d 85 c9             	test   %r9,%r9
>    5:	74 2c                	je     0x33
>    7:	41 8b 41 08          	mov    0x8(%r9),%eax
>    b:	44 8b 41 08          	mov    0x8(%rcx),%r8d
>    f:	49 8d b7 c8 05 00 00 	lea    0x5c8(%r15),%rsi
> [   51.166848] RSP: 0018:ffffb08481573c60 EFLAGS: 00010202
> [   51.172195] RAX: 0000000000000049 RBX: ffff8e72c98e6c98 RCX: 0000000000000000
> [   51.179737] RDX: 0000000000000000 RSI: ffff8e72dfbd54c8 RDI: ffff8e72dfbd54c8
> [   51.187020] RBP: ffff8e72d3247600 R08: 0000001969fa2eee R09: 0000000000000000
> [   51.194639] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e72c21b22a0
> [   51.201985] R13: ffff8e72c224ce00 R14: 0000000000000000 R15: ffff8e72d2dc9000
> [   51.209795] FS:  0000000000000000(0000) GS:ffff8e72dfa00000(0000) knlGS:0000000000000000
> [   51.218109] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.224006] CR2: 00007f67fc9bc420 CR3: 0000000174612001 CR4: 00000000001606f0
> [   51.232026] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   51.240026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   51.247810] Call Trace:
> [   51.250394] __ext4_handle_dirty_metadata (include/linux/jbd2.h:1501 fs/ext4/ext4_jbd2.c:271) 
> [   51.255691] ext4_do_update_inode.isra.11 (fs/ext4/inode.c:5324) 
> [   51.261187] ? __ext4_journal_get_write_access (fs/ext4/ext4_jbd2.c:170) 
> [   51.266943] ? ext4_convert_unwritten_extents (fs/ext4/extents.c:4951) 
> [   51.272984] ext4_mark_iloc_dirty (fs/ext4/inode.c:5849) 
> [   51.277389] ext4_mark_inode_dirty (fs/ext4/inode.c:6028) 
> [   51.282260] ext4_convert_unwritten_extents (fs/ext4/extents.c:4951) 
> [   51.287900] ext4_end_io_rsv_work (fs/ext4/page-io.c:161 fs/ext4/page-io.c:234 fs/ext4/page-io.c:248) 
> [   51.292523] process_one_work (arch/x86/include/asm/jump_label.h:36 include/linux/jump_label.h:142 include/trace/events/workqueue.h:114 kernel/workqueue.c:2158) 
> [   51.296937] worker_thread (include/linux/compiler.h:188 include/linux/list.h:203 kernel/workqueue.c:2297) 
> [   51.301379] ? process_one_work (kernel/workqueue.c:2239) 
> [   51.305700] kthread (kernel/kthread.c:246) 
> [   51.309048] ? kthread_park (kernel/kthread.c:206) 
> [   51.312836] ret_from_fork (arch/x86/entry/entry_64.S:419) 
> [   51.316531] Modules linked in: sg
> [   51.320170] ---[ end trace 14f78710c7a5d90b ]---
> [   51.325121] RIP: 0010:jbd2_journal_dirty_metadata.cold.7 (??:?) 
> [ 51.331516] Code: c1 ea 0c 66 c1 ee 04 0f b7 d2 40 0f b6 f6 e8 c4 64 de ff 4d 39 6c 24 28 0f 85 22 f4 ff ff 41 83 7c 24 0c 01 0f 84 16 f4 ff ff <0f> 0b 4d 85 c9 74 2c 41 8b 41 08 44 8b 41 08 49 8d b7 c8 05 00 00
> All code
> ========
>    0:	c1 ea 0c             	shr    $0xc,%edx
>    3:	66 c1 ee 04          	shr    $0x4,%si
>    7:	0f b7 d2             	movzwl %dx,%edx
>    a:	40 0f b6 f6          	movzbl %sil,%esi
>    e:	e8 c4 64 de ff       	callq  0xffffffffffde64d7
>   13:	4d 39 6c 24 28       	cmp    %r13,0x28(%r12)
>   18:	0f 85 22 f4 ff ff    	jne    0xfffffffffffff440
>   1e:	41 83 7c 24 0c 01    	cmpl   $0x1,0xc(%r12)
>   24:	0f 84 16 f4 ff ff    	je     0xfffffffffffff440
>   2a:*	0f 0b                	ud2    		<-- trapping instruction
>   2c:	4d 85 c9             	test   %r9,%r9
>   2f:	74 2c                	je     0x5d
>   31:	41 8b 41 08          	mov    0x8(%r9),%eax
>   35:	44 8b 41 08          	mov    0x8(%rcx),%r8d
>   39:	49 8d b7 c8 05 00 00 	lea    0x5c8(%r15),%rsi
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	0f 0b                	ud2    
>    2:	4d 85 c9             	test   %r9,%r9
>    5:	74 2c                	je     0x33
>    7:	41 8b 41 08          	mov    0x8(%r9),%eax
>    b:	44 8b 41 08          	mov    0x8(%rcx),%r8d
>    f:	49 8d b7 c8 05 00 00 	lea    0x5c8(%r15),%rsi
> [   51.350792] RSP: 0018:ffffb08481573c60 EFLAGS: 00010202
> [   51.356407] RAX: 0000000000000049 RBX: ffff8e72c98e6c98 RCX: 0000000000000000
> [   51.363778] RDX: 0000000000000000 RSI: ffff8e72dfbd54c8 RDI: ffff8e72dfbd54c8
> [   51.371485] RBP: ffff8e72d3247600 R08: 0000001969fa2eee R09: 0000000000000000
> [   51.378856] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e72c21b22a0
> [   51.386323] R13: ffff8e72c224ce00 R14: 0000000000000000 R15: ffff8e72d2dc9000
> [   51.393775] FS:  0000000000000000(0000) GS:ffff8e72dfa00000(0000) knlGS:0000000000000000
> [   51.402416] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.408605] CR2: 00007f67fc9bc420 CR3: 0000000174612001 CR4: 00000000001606f0
> [   51.416057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   51.423596] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   51.431668] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:34
> [   51.441028] in_atomic(): 1, irqs_disabled(): 0, pid: 177, name: kworker/u4:2
> [   51.448399] INFO: lockdep is turned off.
> [   51.452644] CPU: 0 PID: 177 Comm: kworker/u4:2 Tainted: G      D           4.19.0-rc6-xfstests-00022-g895dc4ba4f98 #730
> [   51.464004] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [   51.473697] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> [   51.479745] Call Trace:
> [   51.482499] dump_stack (lib/dump_stack.c:115) 
> [   51.486326] ___might_sleep.cold.12 (kernel/sched/core.c:6145) 
> [   51.491043] exit_signals (include/linux/percpu-rwsem.h:36 include/linux/percpu-rwsem.h:59 include/linux/cgroup-defs.h:691 kernel/signal.c:2598) 
> [   51.494749] do_exit (kernel/exit.c:818) 
> [   51.498387] ? process_one_work (kernel/workqueue.c:2239) 
> [   51.502698] ? kthread (kernel/kthread.c:246) 
> [   51.506218] rewind_stack_do_exit (??:?) 
> [   51.510639] note: kworker/u4:2[177] exited with preempt_count 1
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR