[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <31b9c76b-6f70-0af4-d854-e02bda25e4c0@kernel.dk>
Date: Mon, 14 Jan 2019 20:41:16 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Ming Lei <ming.lei@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
Clark Williams <williams@...hat.com>,
Bart Van Assche <bvanassche@....org>
Subject: Re: Real deadlock being suppressed in sbitmap
On 1/14/19 8:23 PM, Ming Lei wrote:
> Hi Steven,
>
> On Mon, Jan 14, 2019 at 12:14:14PM -0500, Steven Rostedt wrote:
>> It was brought to my attention (by this creating a splat in the RT tree
>> too) this code:
>>
>> static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
>> {
>> unsigned long mask, val;
>> unsigned long __maybe_unused flags;
>> bool ret = false;
>>
>> /* Silence bogus lockdep warning */
>> #if defined(CONFIG_LOCKDEP)
>> local_irq_save(flags);
>> #endif
>> spin_lock(&sb->map[index].swap_lock);
>>
>> Commit 58ab5e32e6f ("sbitmap: silence bogus lockdep IRQ warning")
>> states the following:
>>
>> For this case, it's a false positive. The swap_lock is used from process
>> context only, when we swap the bits in the word and cleared mask. We
>> also end up doing that when we are getting a driver tag, from the
>> blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
>> IRQs disabled. However, this isn't from an actual IRQ, it's still
>> process context.
>>
>> The thing is, lockdep doesn't define a lock as "irq-safe" based on it
>> being taken under interrupts disabled or not. It detects when locks are
>> used in actual interrupts. Further in that commit we have this:
>>
>> [ 106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>> [ 106.098231] 000000004c43fa71
>> (&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
>> [ 106.099431]
>> [ 106.099431] and this task is already holding:
>> [ 106.100229] 000000007eec8b2f
>> (&(&hctx->dispatch_wait_lock)->rlock){....}, at:
>> blk_mq_dispatch_rq_list+0x4c1/0xd7c
>> [ 106.101630] which would create a new lock dependency:
>> [ 106.102326] (&(&hctx->dispatch_wait_lock)->rlock){....} ->
>> (&(&sb->map[i].swap_lock)->rlock){+.+.}
>>
>> Saying that you are trying to take the swap_lock while holding the
>> dispatch_wait_lock.
>>
>>
>> [ 106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
>> [ 106.104580] (&sbq->ws[i].wait){..-.}
>>
>> Which means that there's already a chain of:
>>
>> sbq->ws[i].wait -> dispatch_wait_lock
>>
>> [ 106.104582]
>> [ 106.104582] ... which became SOFTIRQ-irq-safe at:
>> [ 106.105751] _raw_spin_lock_irqsave+0x4b/0x82
>> [ 106.106284] __wake_up_common_lock+0x119/0x1b9
>> [ 106.106825] sbitmap_queue_wake_up+0x33f/0x383
>> [ 106.107456] sbitmap_queue_clear+0x4c/0x9a
>> [ 106.108046] __blk_mq_free_request+0x188/0x1d3
>> [ 106.108581] blk_mq_free_request+0x23b/0x26b
>> [ 106.109102] scsi_end_request+0x345/0x5d7
>> [ 106.109587] scsi_io_completion+0x4b5/0x8f0
>> [ 106.110099] scsi_finish_command+0x412/0x456
>> [ 106.110615] scsi_softirq_done+0x23f/0x29b
>> [ 106.111115] blk_done_softirq+0x2a7/0x2e6
>> [ 106.111608] __do_softirq+0x360/0x6ad
>> [ 106.112062] run_ksoftirqd+0x2f/0x5b
>> [ 106.112499] smpboot_thread_fn+0x3a5/0x3db
>> [ 106.113000] kthread+0x1d4/0x1e4
>> [ 106.113457] ret_from_fork+0x3a/0x50
>>
>>
>> We see that sbq->ws[i].wait was taken from a softirq context.
>
> Actually sbq->ws[i].wait is taken from a softirq context only in case
> of single-queue, see __blk_mq_complete_request(). For multiple queue,
> sbq->ws[i].wait is taken from hardirq context.
That's a good point, but that's just current implementation, we can't
assume any of those relationsships. Any completion can happen from
softirq or hardirq. So the patch is inadequate.
> Sounds the correct fix may be the following one, and the irqsave cost
> should be fine given sbitmap_deferred_clear is only triggered when one
> word is run out of.
Yes, the _bh() variant isn't going to cut it. Can you send this patch
against Linus's master?
--
Jens Axboe
Powered by blists - more mailing lists