Message-ID: <20131125103522.6a1c1b36@notabene.brown>
Date: Mon, 25 Nov 2013 10:35:22 +1100
From: NeilBrown <neilb@...e.de>
To: fengguang.wu@...el.com
Cc: "Li, Shaohua" <shaohua.li@...el.com>,
LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: kernel BUG at drivers/md/raid5.c:693!
On Sat, 23 Nov 2013 11:22:05 +0800 fengguang.wu@...el.com wrote:
> Shaohua,
>
> FYI, we are still seeing this bug.. dmesg attached.
Thanks for the report. However, the dmesg you attached doesn't mention:
kernel BUG at drivers/md/raid5.c:693!
at all. It is quite different.
The "BUG" it reports is that a spinlock is blocking for too long - over
99 msecs.
As far as I can tell, this is a scheduling issue, possibly specific to qemu,
rather than a bug in raid5.
md/raid5 is calling md_wakeup_thread while holding a spinlock, and that seems
to get lost in the scheduler for long enough that a different thread which
also wants the spinlock starts to complain.
It has always been safe in the past to call 'wake_up' from under a spinlock.
Ingo/Peter: is it considered OK to call wake_up while holding a spinlock?
Could "sleeping spinlocks" affect this at all? (some sample stack traces are
below).
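
For reference, the pattern in question is roughly the following. This is only
a simplified sketch of the shape of the code (demo_lock, demo_wq and
demo_wakeup_under_lock are made-up names), not the actual md/raid5 code:

    #include <linux/spinlock.h>
    #include <linux/wait.h>

    static DEFINE_SPINLOCK(demo_lock);
    static DECLARE_WAIT_QUEUE_HEAD(demo_wq);

    static void demo_wakeup_under_lock(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&demo_lock, flags);
            /* ... update state protected by demo_lock ... */

            /*
             * Wake the waiting thread while still holding the spinlock.
             * This has traditionally been fine; the question is whether
             * the wake-up can now stall long enough that other CPUs
             * spinning on demo_lock trigger the lockup warning.
             */
            wake_up(&demo_wq);

            spin_unlock_irqrestore(&demo_lock, flags);
    }
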
Thanks,
NeilBrown
>
> 566c09c53455d7c4f1130928ef8071da1a24ea65 is the first bad commit
> commit 566c09c53455d7c4f1130928ef8071da1a24ea65
> Author: Shaohua Li <shli@...nel.org>
> Date: Thu Nov 14 15:16:17 2013 +1100
>
> raid5: relieve lock contention in get_active_stripe()
>
> get_active_stripe() is the last place where we have lock contention. It has
> two paths. One is where the stripe isn't found and a new stripe is allocated;
> the other is where the stripe is found.
>
> The first path basically calls __find_stripe and init_stripe. It accesses
> conf->generation, conf->previous_raid_disks, conf->raid_disks,
> conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
> conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
> for stripe_hashtbl and inactive_list, these fields change very rarely.
>
> With this patch, we split inactive_list and add new hash locks. Each free
> stripe belongs to a specific inactive list, which is determined by the
> stripe's lock_hash. Note that even a stripe without a sector assigned has a
> lock_hash assigned. A stripe's inactive list is protected by a hash lock,
> which is also determined by its lock_hash. The lock_hash is derived from the
> current stripe_hashtbl hash, which guarantees that any stripe_hashtbl list
> maps to a specific lock_hash, so we can use the new hash locks to protect the
> stripe_hashtbl lists too. The goal of the new hash locks is that the first
> path of get_active_stripe() only needs to take the new locks. Since we have
> several hash locks, lock contention is relieved significantly.
>
> The first path of get_active_stripe() accesses the other fields too; since
> they change rarely, changing them now requires taking conf->device_lock and
> all the hash locks. For a slow path, this isn't a problem.
>
> If we need to take both device_lock and a hash lock, we always take the hash
> lock first. The tricky part is release_stripe and friends, where we need to
> take device_lock first. Neil's suggestion is to put inactive stripes on a
> temporary list and re-add them to inactive_list after device_lock is
> released. In this way, we add stripes to the temporary list with device_lock
> held and remove stripes from the list with the hash lock held. We don't allow
> concurrent access to the temporary list, which means we need to allocate a
> temporary list for each participant of release_stripe.
>
> One downside is that free stripes are maintained in their own inactive list
> and can't move between the lists. By default, we have 256 stripes in total
> and 8 lists, so each list will have 32 stripes. It's possible for one list to
> have a free stripe while another hasn't. The chance should be rare because
> stripe allocation is evenly distributed. And we can always allocate more
> stripes for the cache; several megabytes of memory isn't a big deal.
>
> This completely removes the lock contention of the first path of
> get_active_stripe(). It slows down the second code path a little bit, because
> we now need to take two locks, but since the hash lock isn't contended the
> overhead should be quite small (several atomic instructions). The second path
> of get_active_stripe() (basically sequential writes or big-request-size
> random writes) still has lock contention.
>
> Signed-off-by: Shaohua Li <shli@...ionio.com>
> Signed-off-by: NeilBrown <neilb@...e.de>
>
> :040000 040000 88fa28d1decc5454cf4d58421fa3eb12bc9ad524 5d1f104188f72b17d10cd569bf2924dab5d789cb M drivers
> bisect run success
>
> # bad: [02ffe4cc90dce5a1bbee5daae98a40a431c29c6d] Merge 'yuanhan/slub-experimental' into devel-hourly-2013112217
> # good: [5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52] Linux 3.12
> git bisect start '02ffe4cc90dce5a1bbee5daae98a40a431c29c6d' '5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52' '--'
> # good: [5cbb3d216e2041700231bcfc383ee5f8b7fc8b74] Merge branch 'akpm' (patches from Andrew Morton)
> git bisect good 5cbb3d216e2041700231bcfc383ee5f8b7fc8b74
> # good: [d5ceede8dc86278d16dcad8f916ef323b5672bd8] drivers/rtc/rtc-hid-sensor-time.c: use dev_get_platdata()
> git bisect good d5ceede8dc86278d16dcad8f916ef323b5672bd8
> # good: [ffd3c0260aeeb1fd4d36378d2e06e6410661dd0f] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
> git bisect good ffd3c0260aeeb1fd4d36378d2e06e6410661dd0f
> # good: [1ee2dcc2245340cf4ac94b99c4d00efbeba61824] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> git bisect good 1ee2dcc2245340cf4ac94b99c4d00efbeba61824
> # bad: [a50f09550a7dd58b4514630fec1676855084ec1f] Merge 'vireshk/cpufreq-next' into devel-hourly-2013112217
> git bisect bad a50f09550a7dd58b4514630fec1676855084ec1f
> # bad: [af2e2f328052082f58f041d574ed50c7f21c598f] Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next
> git bisect bad af2e2f328052082f58f041d574ed50c7f21c598f
> # good: [df12a3178d340319b1955be6b973a4eb84aff754] Merge commit 'dmaengine-3.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine
> git bisect good df12a3178d340319b1955be6b973a4eb84aff754
> # good: [ed6a82546d2e8f6b5902269541733814d4adacc2] Merge branch 'acpi-hotplug'
> git bisect good ed6a82546d2e8f6b5902269541733814d4adacc2
> # bad: [6d6e352c80f22c446d933ca8103e02bac1f09129] Merge tag 'md/3.13' of git://neil.brown.name/md
> git bisect bad 6d6e352c80f22c446d933ca8103e02bac1f09129
> # bad: [30b8feb730f9b9b3c5de02580897da03f59b6b16] md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
> git bisect bad 30b8feb730f9b9b3c5de02580897da03f59b6b16
> # bad: [566c09c53455d7c4f1130928ef8071da1a24ea65] raid5: relieve lock contention in get_active_stripe()
> git bisect bad 566c09c53455d7c4f1130928ef8071da1a24ea65
> # good: [02e5f5c0a0f726e66e3d8506ea1691e344277969] md: fix calculation of stacking limits on level change.
> git bisect good 02e5f5c0a0f726e66e3d8506ea1691e344277969
> # good: [82e06c811163c4d853ed335d56c3378088bc89cc] wait: add wait_event_cmd()
> git bisect good 82e06c811163c4d853ed335d56c3378088bc89cc
> # first bad commit: [566c09c53455d7c4f1130928ef8071da1a24ea65] raid5: relieve lock contention in get_active_stripe()
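
As an aside, here is a rough illustration of the locking scheme the quoted
commit message describes. This is only a sketch with made-up names
(demo_conf, demo_stripe, demo_lock_hash and friends), not the actual
drivers/md/raid5.c code:

    #include <linux/spinlock.h>
    #include <linux/list.h>

    #define NR_HASH_LOCKS 8

    struct demo_conf {
            spinlock_t device_lock;
            spinlock_t hash_locks[NR_HASH_LOCKS];
            struct list_head inactive_list[NR_HASH_LOCKS];
    };

    struct demo_stripe {
            struct list_head lru;
            int hash;                       /* stripe_hashtbl hash */
    };

    /* lock_hash is derived from the stripe_hashtbl hash, so every hash
     * chain (and its inactive_list) maps onto exactly one hash lock. */
    static inline int demo_lock_hash(int hash)
    {
            return hash & (NR_HASH_LOCKS - 1);
    }

    /* Fast path of get_active_stripe(): only the per-hash lock is taken,
     * so requests for different hash values no longer contend on one
     * global lock. */
    static void demo_get_free_stripe(struct demo_conf *conf, int hash)
    {
            int h = demo_lock_hash(hash);

            spin_lock_irq(&conf->hash_locks[h]);
            /* ... take a stripe off conf->inactive_list[h] ... */
            spin_unlock_irq(&conf->hash_locks[h]);
    }

    /* release_stripe(): a freed stripe is parked on a caller-private
     * temporary list under device_lock, and only moved back to the
     * per-hash inactive_list after device_lock has been dropped, so the
     * two locks never need to be nested in the wrong order. */
    static void demo_release_stripe(struct demo_conf *conf,
                                    struct demo_stripe *sh)
    {
            int h = demo_lock_hash(sh->hash);
            LIST_HEAD(tmp);

            spin_lock_irq(&conf->device_lock);
            list_add_tail(&sh->lru, &tmp);
            spin_unlock_irq(&conf->device_lock);

            spin_lock_irq(&conf->hash_locks[h]);
            list_move_tail(&sh->lru, &conf->inactive_list[h]);
            spin_unlock_irq(&conf->hash_locks[h]);
    }
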
This process waited too long for a spinlock:
[ 82.440547] BUG: spinlock lockup suspected on CPU#0, btrfs-submit-1/2938
[ 82.440547] lock: 0xffff88007f0aed20, .magic: dead4ead, .owner: md0_raid5/2906, .owner_cpu: 2
[ 82.440547] CPU: 0 PID: 2938 Comm: btrfs-submit-1 Not tainted 3.12.0-11088-gc710863 #1767
[ 82.440547] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 82.440547] 0000000000000001 ffff8803cd1e3a10 ffffffff81a415f8 ffff8804081e8000
[ 82.440547] ffff8803cd1e3a30 ffffffff81a3a102 ffff88007f0aed20 00000000a08d1ad0
[ 82.440547] ffff8803cd1e3a58 ffffffff811055d4 ffff88007f0aed38 ffff88007f0aed20
[ 82.440547] Call Trace:
[ 82.440547] [<ffffffff81a415f8>] dump_stack+0x4d/0x66
[ 82.440547] [<ffffffff81a3a102>] spin_dump+0x91/0x96
[ 82.440547] [<ffffffff811055d4>] do_raw_spin_lock+0xc9/0x10e
[ 82.440547] [<ffffffff81a4c7b3>] _raw_spin_lock+0x5b/0x6d
[ 82.440547] [<ffffffff8189569b>] ? get_active_stripe+0x605/0x6a9
[ 82.440547] [<ffffffff8189569b>] get_active_stripe+0x605/0x6a9
[ 82.440547] [<ffffffff810fc068>] ? prepare_to_wait+0x6f/0x77
[ 82.440547] [<ffffffff81898fd9>] make_request+0x69a/0xac5
[ 82.440547] [<ffffffff818b58d1>] ? md_make_request+0x18b/0x2ce
[ 82.440547] [<ffffffff810fc321>] ? __wake_up_sync+0x12/0x12
[ 82.440547] [<ffffffff818b58d1>] md_make_request+0x18b/0x2ce
[ 82.440547] [<ffffffff818b57ae>] ? md_make_request+0x68/0x2ce
[ 82.440547] [<ffffffff818906e5>] ? raid5_unplug+0x111/0x120
[ 82.440547] [<ffffffff814e68bf>] generic_make_request+0x9f/0xe0
[ 82.440547] [<ffffffff814e69ea>] submit_bio+0xea/0x12a
[ 82.440547] [<ffffffff8145a449>] run_scheduled_bios+0x26c/0x455
[ 82.440547] [<ffffffff8145a647>] pending_bios_fn+0x15/0x17
[ 82.440547] [<ffffffff81461925>] worker_loop+0x15c/0x4ee
[ 82.440547] [<ffffffff814617c9>] ? btrfs_queue_worker+0x280/0x280
[ 82.440547] [<ffffffff810e328d>] kthread+0xf9/0x101
[ 82.440547] [<ffffffff810f21ae>] ? local_clock+0x41/0x5a
[ 82.440547] [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
[ 82.440547] [<ffffffff81a5487c>] ret_from_fork+0x7c/0xb0
[ 82.440547] [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
This process was holding it:
[ 78.658626] NMI backtrace for cpu 2
[ 78.658626] CPU: 2 PID: 2906 Comm: md0_raid5 Not tainted 3.12.0-11088-gc710863 #1767
[ 78.658626] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 78.658626] task: ffff8804081e8000 ti: ffff8803cd1e6000 task.ti: ffff8803cd1e6000
[ 78.658626] RIP: 0010:[<ffffffff81067501>] [<ffffffff81067501>] pvclock_clocksource_read+0x10/0xa7
[ 78.658626] RSP: 0018:ffff8803cd1e7b90 EFLAGS: 00000082
[ 78.658626] RAX: 0000000000000002 RBX: ffff88040ebd4ac0 RCX: ffff8803cfba57d8
[ 78.658626] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88041f7e9080
[ 78.658626] RBP: ffff8803cd1e7ba8 R08: 0000000000000000 R09: ffff8803cfba57d8
[ 78.658626] R10: 0000000000000008 R11: ffffffff810fbf46 R12: 0000000000000000
[ 78.658626] R13: 0000000000000002 R14: ffff8803cfba57a8 R15: 0000000000000046
[ 78.658626] FS: 0000000000000000(0000) GS:ffff88040ea00000(0000) knlGS:0000000000000000
[ 78.658626] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 78.658626] CR2: 00000000006f86e4 CR3: 000000007f17e000 CR4: 00000000000006e0
[ 78.658626] Stack:
[ 78.658626] ffff8803cfba57d8 ffff88040ebd4ac0 0000000000000000 ffff8803cd1e7bb8
[ 78.658626] ffffffff81066a05 ffff8803cd1e7bc8 ffffffff8103f9ce ffff8803cd1e7be8
[ 78.658626] ffffffff810f1f6d ffff88040ebd4ac0 00000000001d4ac0 ffff8803cd1e7c10
[ 78.658626] Call Trace:
[ 78.658626] [<ffffffff81066a05>] kvm_clock_read+0x27/0x31
[ 78.658626] [<ffffffff8103f9ce>] sched_clock+0x9/0xd
[ 78.658626] [<ffffffff810f1f6d>] sched_clock_local+0x12/0x72
[ 78.658626] [<ffffffff810f20f5>] sched_clock_cpu+0xc1/0xcc
[ 78.658626] [<ffffffff810f2197>] local_clock+0x2a/0x5a
[ 78.658626] [<ffffffff81101202>] lock_release_holdtime.part.7+0xf/0xd9
[ 78.658626] [<ffffffff810fbf68>] ? __wake_up+0x44/0x4c
[ 78.658626] [<ffffffff811037b0>] lock_release+0x185/0x1ed
[ 78.658626] [<ffffffff81a4cb3e>] _raw_spin_unlock_irqrestore+0x24/0x5e
[ 78.658626] [<ffffffff810fbf68>] __wake_up+0x44/0x4c
[ 78.658626] [<ffffffff818b86f8>] md_wakeup_thread+0x55/0x58
[ 78.658626] [<ffffffff8188e8eb>] raid5_wakeup_stripe_thread+0xab/0x169
[ 78.658626] [<ffffffff81897eba>] raid5d+0x1ac/0x4a8
[ 78.658626] [<ffffffff811012bf>] ? lock_release_holdtime.part.7+0xcc/0xd9
[ 78.658626] [<ffffffff818b8c9c>] md_thread+0x10f/0x125
[ 78.658626] [<ffffffff810fc321>] ? __wake_up_sync+0x12/0x12
[ 78.658626] [<ffffffff818b8b8d>] ? md_register_thread+0xd5/0xd5
[ 78.658626] [<ffffffff810e328d>] kthread+0xf9/0x101
[ 78.658626] [<ffffffff810f21ae>] ? local_clock+0x41/0x5a
[ 78.658626] [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
[ 78.658626] [<ffffffff81a5487c>] ret_from_fork+0x7c/0xb0
[ 78.658626] [<ffffffff810e3194>] ? kthread_create_on_node+0x1a4/0x1a4
[ 78.658626] Code: 47 1d 66 66 90 0f ae e8 8b 17 f6 c2 01 75 e6 39 d1 75 e2 22 05 ad b8 2e 01 5d c3 55 48 89 e5 41 54 53 51 8b 37 66 66 90 0f ae e8 <0f> 31 48 c1 e2 20 89 c0 0f be 4f 1c 48 09 c2 48 2b 57 08 44 8b
[ 78.658626] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 99.903 msecs