Date:	Thu,  8 Oct 2015 18:31:47 +0300
From:	Nikolay Borisov <kernel@...p.com>
To:	tytso@....edu, adilger.kernel@...ger.ca, viro@...iv.linux.org.uk,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Cc:	operations@...eground.com, mm@...com
Subject: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

Currently, when bios are finished in ext4_finish_bio, this is done by
first disabling interrupts and then acquiring a bit_spin_lock.
However, the buffer heads might be under async write, so waiting on
the bit_spin_lock can leave the CPU spinning with interrupts disabled
for an arbitrary period of time. If in the meantime there is demand
for memory that cannot otherwise be satisfied, the allocator might
have to resort to draining the per-cpu page lists, like so (a sketch
of the problematic ordering follows the trace below):

PID: 31111  TASK: ffff881cbb2fb870  CPU: 2   COMMAND: "kworker/u96:0"
 #0 [ffff881fffa46dc0] crash_nmi_callback at ffffffff8106f24e
 #1 [ffff881fffa46de0] nmi_handle at ffffffff8104c152
 #2 [ffff881fffa46e70] do_nmi at ffffffff8104c3b4
 #3 [ffff881fffa46ef0] end_repeat_nmi at ffffffff81656e2e
    [exception RIP: smp_call_function_many+577]
    RIP: ffffffff810e7f81  RSP: ffff880d35b815c8  RFLAGS: 00000202
    RAX: 0000000000000017  RBX: ffffffff81142690  RCX: 0000000000000017
    RDX: ffff883fff375478  RSI: 0000000000000040  RDI: 0000000000000040
    RBP: ffff880d35b81628   R8: ffff881fffa51ec8   R9: 0000000000000000
    R10: 0000000000000000  R11: ffffffff812943f3  R12: 0000000000000000
    R13: ffff881fffa51ec0  R14: ffff881fffa51ec8  R15: 0000000000011f00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff880d35b815c8] smp_call_function_many at ffffffff810e7f81
 #5 [ffff880d35b81630] on_each_cpu_mask at ffffffff810e801c
 #6 [ffff880d35b81660] drain_all_pages at ffffffff81140178
 #7 [ffff880d35b81690] __alloc_pages_nodemask at ffffffff8114310b
 #8 [ffff880d35b81810] alloc_pages_current at ffffffff81181c5e
 #9 [ffff880d35b81860] new_slab at ffffffff81188305
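
For reference, this is the ordering in ext4_finish_bio that the
allocation above collides with -- a minimal sketch of the relevant
lines, not verbatim kernel code:

	local_irq_save(flags);		/* interrupts are now off */
	/*
	 * If another CPU holds the bit lock, we spin here for an
	 * arbitrary amount of time -- with interrupts still disabled,
	 * so the page-draining IPI can never be serviced on this CPU.
	 */
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
	/* ... walk the buffer heads of the page ... */
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
	local_irq_restore(flags);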

However, the drain_all_pages call in the trace above will never
return, since on_each_cpu_mask is invoked with its last argument set
to 1, i.e. it waits until the IPI handler has run on every CPU.
Additionally, there may be another thread on whose completion
ext4_finish_bio depends, e.g.:

PID: 34220  TASK: ffff883937660810  CPU: 44  COMMAND: "kworker/u98:39"
 #0 [ffff88209d5b10b8] __schedule at ffffffff81653d5a
 #1 [ffff88209d5b1150] schedule at ffffffff816542f9
 #2 [ffff88209d5b1160] schedule_preempt_disabled at ffffffff81654686
 #3 [ffff88209d5b1180] __mutex_lock_slowpath at ffffffff816521eb
 #4 [ffff88209d5b1200] mutex_lock at ffffffff816522d1
 #5 [ffff88209d5b1220] new_read at ffffffffa0152a7e [dm_bufio]
 #6 [ffff88209d5b1280] dm_bufio_get at ffffffffa0152ba6 [dm_bufio]
 #7 [ffff88209d5b1290] dm_bm_read_try_lock at ffffffffa015c878 [dm_persistent_data]
 #8 [ffff88209d5b12e0] dm_tm_read_lock at ffffffffa015f7ad [dm_persistent_data]
 #9 [ffff88209d5b12f0] bn_read_lock at ffffffffa016281b [dm_persistent_data]

If this second thread in turn depends on the original allocation
succeeding, a hard lockup occurs: ext4_finish_bio waits for
block_write_full_page to complete, which depends on the original
memory allocation succeeding, which depends on the IPI executing on
every core (a sketch of the synchronous IPI call is given after the
trace below). For completeness, here is how the call stack of the
hung ext4_finish_bio looks:

[427160.405277] NMI backtrace for cpu 23
[427160.405279] CPU: 23 PID: 4611 Comm: kworker/u98:7 Tainted: G        W    3.12.47-clouder1 #1
[427160.405281] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[427160.405285] Workqueue: writeback bdi_writeback_workfn (flush-252:148)
[427160.405286] task: ffff8825aa819830 ti: ffff882b19180000 task.ti: ffff882b19180000
[427160.405290] RIP: 0010:[<ffffffff8125be13>]  [<ffffffff8125be13>] ext4_finish_bio+0x273/0x2a0
[427160.405291] RSP: 0000:ffff883fff3639b0  EFLAGS: 00000002
[427160.405292] RAX: ffff882b19180000 RBX: ffff883f67480a80 RCX: 0000000000000110
[427160.405292] RDX: ffff882b19180000 RSI: 0000000000000000 RDI: ffff883f67480a80
[427160.405293] RBP: ffff883fff363a70 R08: 0000000000014b80 R09: ffff881fff454f00
[427160.405294] R10: ffffea00473214c0 R11: ffffffff8113bfd7 R12: ffff880826272138
[427160.405295] R13: 0000000000000000 R14: 0000000000000000 R15: ffffea00aeaea400
[427160.405296] FS:  0000000000000000(0000) GS:ffff883fff360000(0000) knlGS:0000000000000000
[427160.405296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[427160.405297] CR2: 0000003c5b009c24 CR3: 0000000001c0b000 CR4: 00000000001407e0
[427160.405297] Stack:
[427160.405305]  0000000000000000 ffffffff8203f230 ffff883fff363a00 ffff882b19180000
[427160.405312]  ffff882b19180000 ffff882b19180000 00000400018e0af8 ffff882b19180000
[427160.405319]  ffff883f67480a80 0000000000000000 0000000000000202 00000000d219e720
[427160.405320] Call Trace:
[427160.405324]  <IRQ>
[427160.405327]  [<ffffffff8125c2c8>] ext4_end_bio+0xc8/0x120
[427160.405335]  [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405341]  [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405345]  [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405350]  [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405353]  [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405358]  [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405362]  [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405365]  [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405369]  [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405373]  [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405380]  [<ffffffff812fad2b>] blk_update_request+0x21b/0x450
[427160.405385]  [<ffffffff812faf87>] blk_update_bidi_request+0x27/0xb0
[427160.405389]  [<ffffffff812fcc7f>] blk_end_bidi_request+0x2f/0x80
[427160.405392]  [<ffffffff812fcd20>] blk_end_request+0x10/0x20
[427160.405400]  [<ffffffff813fdc1c>] scsi_io_completion+0xbc/0x620
[427160.405404]  [<ffffffff813f57f9>] scsi_finish_command+0xc9/0x130
[427160.405408]  [<ffffffff813fe2e7>] scsi_softirq_done+0x147/0x170
[427160.405413]  [<ffffffff813035ad>] blk_done_softirq+0x7d/0x90
[427160.405418]  [<ffffffff8108ed87>] __do_softirq+0x137/0x2e0
[427160.405422]  [<ffffffff81658a0c>] call_softirq+0x1c/0x30
[427160.405427]  [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
[427160.405428]  [<ffffffff8108e925>] irq_exit+0x95/0xa0
[427160.405431]  [<ffffffff8106f755>] smp_call_function_single_interrupt+0x35/0x40
[427160.405434]  [<ffffffff8165826f>] call_function_single_interrupt+0x6f/0x80
[427160.405436]  <EOI>
[427160.405438]  [<ffffffff813276e6>] ? memcpy+0x6/0x110
[427160.405440]  [<ffffffff811dc6d6>] ? __bio_clone+0x26/0x70
[427160.405442]  [<ffffffff81546db9>] __clone_and_map_data_bio+0x139/0x160
[427160.405445]  [<ffffffff815471cd>] __split_and_process_bio+0x3ed/0x490
[427160.405447]  [<ffffffff815473a6>] dm_request+0x136/0x1e0
[427160.405449]  [<ffffffff812fbe0a>] generic_make_request+0xca/0x100
[427160.405451]  [<ffffffff812fbeb9>] submit_bio+0x79/0x160
[427160.405453]  [<ffffffff81144c3d>] ? account_page_writeback+0x2d/0x40
[427160.405455]  [<ffffffff81144dbd>] ? __test_set_page_writeback+0x16d/0x1f0
[427160.405457]  [<ffffffff8125b7a9>] ext4_io_submit+0x29/0x50
[427160.405459]  [<ffffffff8125b8fb>] ext4_bio_write_page+0x12b/0x2f0
[427160.405461]  [<ffffffff81252fe8>] mpage_submit_page+0x68/0x90
[427160.405463]  [<ffffffff81253100>] mpage_process_page_bufs+0xf0/0x110
[427160.405465]  [<ffffffff81254a80>] mpage_prepare_extent_to_map+0x210/0x310
[427160.405468]  [<ffffffff8125a911>] ? ext4_writepages+0x361/0xc60
[427160.405472]  [<ffffffff81283c09>] ? __ext4_journal_start_sb+0x79/0x110
[427160.405474]  [<ffffffff8125a948>] ext4_writepages+0x398/0xc60
[427160.405477]  [<ffffffff812fd358>] ? blk_finish_plug+0x18/0x50
[427160.405479]  [<ffffffff81146b40>] do_writepages+0x20/0x40
[427160.405483]  [<ffffffff811cec79>] __writeback_single_inode+0x49/0x2b0
[427160.405487]  [<ffffffff810aeeef>] ? wake_up_bit+0x2f/0x40
[427160.405488]  [<ffffffff811cfdee>] writeback_sb_inodes+0x2de/0x540
[427160.405492]  [<ffffffff811a6e65>] ? put_super+0x25/0x50
[427160.405494]  [<ffffffff811d00ee>] __writeback_inodes_wb+0x9e/0xd0
[427160.405495]  [<ffffffff811d035b>] wb_writeback+0x23b/0x340
[427160.405497]  [<ffffffff811d04f9>] wb_do_writeback+0x99/0x230
[427160.405500]  [<ffffffff810a40f1>] ? set_worker_desc+0x81/0x90
[427160.405503]  [<ffffffff810c7a6a>] ? dequeue_task_fair+0x36a/0x4c0
[427160.405505]  [<ffffffff811d0bf8>] bdi_writeback_workfn+0x88/0x260
[427160.405509]  [<ffffffff810bbb3e>] ? finish_task_switch+0x4e/0xe0
[427160.405511]  [<ffffffff81653dac>] ? __schedule+0x2dc/0x760
[427160.405514]  [<ffffffff810a61e5>] process_one_work+0x195/0x550
[427160.405517]  [<ffffffff810a848a>] worker_thread+0x13a/0x430
[427160.405519]  [<ffffffff810a8350>] ? manage_workers+0x2c0/0x2c0
[427160.405521]  [<ffffffff810ae48e>] kthread+0xce/0xe0
[427160.405523]  [<ffffffff810ae3c0>] ? kthread_freezable_should_stop+0x80/0x80
[427160.405525]  [<ffffffff816571c8>] ret_from_fork+0x58/0x90
[427160.405527]  [<ffffffff810ae3c0>] ? kthread_freezable_should_stop+0x80/0x80
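
The synchronous nature of the drain is the key ingredient. Roughly,
drain_all_pages issues the IPI as in the sketch below (simplified
from mm/page_alloc.c of this era; exact helper signatures may
differ):

	void drain_all_pages(void)
	{
		/* ... collect the CPUs that hold per-cpu pages ... */
		on_each_cpu_mask(&cpus_with_pcps, drain_local_pages,
				 NULL, 1 /* wait */);
		/*
		 * wait == 1: do not return until every targeted CPU
		 * has executed drain_local_pages.  A CPU spinning in
		 * ext4_finish_bio with interrupts disabled never
		 * does, closing the deadlock loop.
		 */
	}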

To fix the situation, this patch changes the order in which the
bit_spin_lock is taken and interrupts are disabled. The expected
effect is that even if a core is spinning on the bit lock it will
have its interrupts enabled and can thus respond to IPIs, eventually
allowing a memory allocation that requires draining the per-cpu
pages to succeed.

Signed-off-by: Nikolay Borisov <kernel@...p.com>
---
 fs/ext4/page-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 84ba4d2..095331b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -96,8 +96,8 @@ static void ext4_finish_bio(struct bio *bio)
 		 * We check all buffers in the page under BH_Uptodate_Lock
 		 * to avoid races with other end io clearing async_write flags
 		 */
-		local_irq_save(flags);
 		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+		local_irq_save(flags);
 		do {
 			if (bh_offset(bh) < bio_start ||
 			    bh_offset(bh) + bh->b_size > bio_end) {
@@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
 			if (bio->bi_error)
 				buffer_io_error(bh);
 		} while ((bh = bh->b_this_page) != head);
-		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
 		local_irq_restore(flags);
+		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
 		if (!under_io) {
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 			if (ctx)
-- 
2.5.0
