Message-ID: <20151214153147.GA14957@redhat.com>
Date: Mon, 14 Dec 2015 10:31:47 -0500
From: Mike Snitzer <snitzer@...hat.com>
To: Nikolay Borisov <kernel@...p.com>
Cc: Tejun Heo <tj@...nel.org>,
"Linux-Kernel@...r. Kernel. Org" <linux-kernel@...r.kernel.org>,
SiteGround Operations <operations@...eground.com>,
Alasdair Kergon <agk@...hat.com>, dm-devel@...hat.com
Subject: Re: corruption causing crash in __queue_work
On Mon, Dec 14 2015 at 3:41P -0500,
Nikolay Borisov <kernel@...p.com> wrote:
> Had another poke at the backtrace that is produced and here is what the
> delayed_work looks like:
>
> crash> struct delayed_work ffff88036772c8c0
> struct delayed_work {
>   work = {
>     data = {
>       counter = 1537
>     },
>     entry = {
>       next = 0xffff88036772c8c8,
>       prev = 0xffff88036772c8c8
>     },
>     func = 0xffffffffa0211a30 <do_waker>
>   },
>   timer = {
>     entry = {
>       next = 0x0,
>       prev = 0xdead000000200200
>     },
>     expires = 4349463655,
>     base = 0xffff88047fd2d602,
>     function = 0xffffffff8106da40 <delayed_work_timer_fn>,
>     data = 18446612146934696128,
>     slack = -1,
>     start_pid = -1,
>     start_site = 0x0,
>     start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>   },
>   wq = 0xffff88030cf65400,
>   cpu = 21
> }
>
> From this it seems that the timer has also been cancelled or has expired,
> judging by the values in timer->entry. But then again in dm-thin the pool is
> first suspended, which implies the following functions were called:
>
> cancel_delayed_work(&pool->waker);
> cancel_delayed_work(&pool->no_space_timeout);
> flush_workqueue(pool->wq);
>
> so at that point dm-thin's workqueue should be empty and it shouldn't be
> possible to queue any more delayed work. But the crashdump clearly shows
> that the opposite is happening. So far all of this points to a race
> condition and inserting some sleeps after umount and after vgchange -Kan
> (command to disable volume group and suspend, so the cancel_delayed_work
> is invoked) seems to reduce the frequency of crashes, though it doesn't
> eliminate them.
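
The reading of the dump above can be cross-checked with a few comparisons. The sketch below is userspace C, not kernel code, and it only restates values copied from the crash output. Two things in it are assumptions rather than facts from this thread: that LIST_POISON2 on this machine is 0x200200 plus a 0xdead000000000000 poison delta, and that work.entry sits 8 bytes into the structure, right after the 8-byte data word. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the crash dump quoted above. */
#define DWORK_ADDR  0xffff88036772c8c0ULL    /* address given to crash */
#define ENTRY_NEXT  0xffff88036772c8c8ULL    /* work.entry.next */
#define ENTRY_PREV  0xffff88036772c8c8ULL    /* work.entry.prev */
#define TIMER_NEXT  0x0ULL                   /* timer.entry.next */
#define TIMER_PREV  0xdead000000200200ULL    /* timer.entry.prev */
#define TIMER_DATA  18446612146934696128ULL  /* timer.data, printed in decimal */

/* Assumption: x86_64 LIST_POISON2 (0x200200 + 0xdead000000000000). */
#define ASSUMED_LIST_POISON2  0xdead000000200200ULL
/* Assumption: work.entry follows the 8-byte work.data word. */
#define ASSUMED_ENTRY_OFFSET  8ULL

/* A list head whose next and prev point back at itself is an empty list,
 * i.e. the work item is not linked on any worklist. */
bool work_entry_is_empty(void)
{
	return ENTRY_NEXT == DWORK_ADDR + ASSUMED_ENTRY_OFFSET &&
	       ENTRY_PREV == ENTRY_NEXT;
}

/* next == NULL with prev == poison is the pattern the quoted text reads
 * as "timer cancelled or expired". */
bool timer_looks_detached(void)
{
	return TIMER_NEXT == 0 && TIMER_PREV == ASSUMED_LIST_POISON2;
}

/* timer.data, converted from decimal, is the delayed_work address itself. */
bool timer_data_points_at_dwork(void)
{
	return (uint64_t)TIMER_DATA == (uint64_t)DWORK_ADDR;
}

/* Runs all three checks; aborts if any of them does not hold. */
void check_dump(void)
{
	assert(work_entry_is_empty());
	assert(timer_looks_detached());
	assert(timer_data_points_at_dwork());
}
```

Calling check_dump() from a main() is enough to run it. If the assumptions hold, the dump describes a delayed_work whose timer is no longer armed and whose work item is not on any list, which is consistent with the quoted interpretation but does not by itself say who queued it.
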
'vgchange -Kan' doesn't suspend the pool before it destroys the device.
So the cancel_delayed_work()s you referenced aren't applicable.

Can you try this patch?
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 63903a5..b201d887 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2750,8 +2750,11 @@ static void __pool_destroy(struct pool *pool)
 	dm_bio_prison_destroy(pool->prison);
 	dm_kcopyd_client_destroy(pool->copier);
 
-	if (pool->wq)
+	if (pool->wq) {
+		cancel_delayed_work(&pool->waker);
+		cancel_delayed_work(&pool->no_space_timeout);
 		destroy_workqueue(pool->wq);
+	}
 
 	if (pool->next_mapping)
 		mempool_free(pool->next_mapping, pool->mapping_pool);
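
The ordering the patch is after can be illustrated with a toy model. This is a single-threaded userspace sketch, not the workqueue implementation: every toy_* name is hypothetical, "destroying" the queue only flips a flag, and the timer is fired by hand. It shows only why a delayed work whose timer is still pending at destroy time ends up being queued on a dead workqueue, and why cancelling first avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are kernel types. */
struct toy_wq    { bool alive; };
struct toy_dwork { struct toy_wq *wq; bool timer_pending; };

enum toy_fire_result {
	TOY_NOT_PENDING,        /* timer was cancelled, nothing happens */
	TOY_QUEUED_OK,          /* timer fired, workqueue still alive */
	TOY_QUEUED_ON_DEAD_WQ   /* timer fired after the workqueue was destroyed */
};

/* Arm the delayed work's timer against a workqueue. */
void toy_queue_delayed(struct toy_dwork *dw, struct toy_wq *wq)
{
	dw->wq = wq;
	dw->timer_pending = true;
}

/* Disarm the timer; returns true if it was pending. */
bool toy_cancel_delayed(struct toy_dwork *dw)
{
	bool was_pending = dw->timer_pending;

	dw->timer_pending = false;
	return was_pending;
}

/* Mark the workqueue as gone (the real thing would free it). */
void toy_destroy_wq(struct toy_wq *wq)
{
	wq->alive = false;
}

/* What happens when the timer expires and tries to queue the work. */
enum toy_fire_result toy_timer_fires(struct toy_dwork *dw)
{
	if (!dw->timer_pending)
		return TOY_NOT_PENDING;
	dw->timer_pending = false;
	return dw->wq->alive ? TOY_QUEUED_OK : TOY_QUEUED_ON_DEAD_WQ;
}

/* Tear down with or without cancelling first, then let the timer expire. */
enum toy_fire_result toy_teardown(bool cancel_first)
{
	struct toy_wq wq = { .alive = true };
	struct toy_dwork waker = { 0 };

	toy_queue_delayed(&waker, &wq);
	if (cancel_first)
		toy_cancel_delayed(&waker);
	toy_destroy_wq(&wq);
	return toy_timer_fires(&waker);
}

/* Both orderings side by side. */
void toy_demo(void)
{
	assert(toy_teardown(false) == TOY_QUEUED_ON_DEAD_WQ);
	assert(toy_teardown(true) == TOY_NOT_PENDING);
}
```

The model leaves out concurrency entirely. In particular it says nothing about a work function that is already running, or that re-arms itself, at the moment of the cancel; cancel_delayed_work() does not wait for a running work item to finish, whereas cancel_delayed_work_sync() does.
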