
Message-ID: <CAJFSNy6XgnQJrZCJd-+V9j510P+AWF76btQq3wEg4-VpjAZjFw@mail.gmail.com>
Date:	Fri, 11 Dec 2015 20:00:29 +0200
From:	Nikolay Borisov <n.borisov@...eground.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Nikolay Borisov <kernel@...p.com>,
	"Linux-Kernel@...r. Kernel. Org" <linux-kernel@...r.kernel.org>,
	SiteGround Operations <operations@...eground.com>,
	Alasdair Kergon <agk@...hat.com>,
	Mike Snitzer <snitzer@...hat.com>,
	device-mapper development <dm-devel@...hat.com>
Subject: Re: corruption causing crash in __queue_work

On Fri, Dec 11, 2015 at 7:08 PM, Tejun Heo <tj@...nel.org> wrote:
> Hello, Nikolay.
>
> On Fri, Dec 11, 2015 at 05:57:22PM +0200, Nikolay Borisov wrote:
>> So I had a server with the patch just crash on me:
>>
>> Here is what the queue looks like:
>>  crash> struct workqueue_struct 0xffff8802420a4a00
>> struct workqueue_struct {
>>   pwqs = {
>>     next = 0xffff8802420a4c00,
>>     prev = 0xffff8802420a4a00
>
> Hmmm... pwq list is already corrupt.  ->prev is terminated but ->next
> isn't.
>
>>   },
>>   list = {
>>     next = 0xffff880351f9b210,
>>     prev = 0xdead000000200200
>
> Followed by 0xdead000000200200 which is likely from
> CONFIG_ILLEGAL_POINTER_VALUE.
>
> ...
>>   name =
>> "dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>>   rcu = {
>>     next = 0xffff8802531c4c20,
>>     func = 0xffffffff810692e0 <rcu_free_wq>
>
> and call_rcu_sched() already called.  The workqueue has already been
> destroyed.
>
>>   },
>>   flags = 131082,
>>   cpu_pwqs = 0x0,
>>   numa_pwq_tbl = 0xffff8802420a4b10
>> }
>>
>> crash> rd 0xffff8802420a4b10 2 (the machine has 2 NUMA nodes hence the
>> '2' argument)
>> ffff8802420a4b10:  0000000000000000 0000000000000000   ................
>>
>> At the same time, searching for 0xffff8802420a4a00 in the debug output
>> shows nothing. IOW, it seems that the numa_pwq_tbl is never installed
>> for this workqueue:
>>
>> [root@...llvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
>>
>> Also dumping all the logs from the dmesg contained in the vmcore image I
>> find nothing and when I do the following correlation:
>> [root@...llvault8 ~]# grep \(null\) wq.log | wc -l
>> 1940
>> [root@...llvault8 ~]# wc -l wq.log
>> 1940 wq.log
>>
>> It seems what's happening is really just changing the numa_pwq_tbl on
>> workqueue creation i.e. it is never re-assigned. So at this point I
>> think it seems that there is a situation where the wqattr are not being
>> applied at all.
>
> Hmmm... No idea why it didn't show up in the debug log but the only
> way a workqueue could be in the above state is either it got
> explicitly destroyed or somehow pwq refcnting is messed up, in both
> cases it should have shown up in the log.
>
> cc'ing dm people.  Is there any chance dm-thin could be using
> workqueue after destroying it?

In __pool_destroy() in dm-thin.c I don't see a call to
cancel_delayed_work() before the workqueue is destroyed. Is it possible
that this is the cause?
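
To make the suspected ordering problem concrete, here is a small
userspace model of it. This is only a sketch: every toy_* name is a
hypothetical stand-in, not a kernel type or function, and it models
nothing beyond "a delayed work timer fires after its queue is gone".

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins only; NOT the kernel's types or implementations. */
struct toy_wq {
    bool destroyed;          /* set once toy_destroy_workqueue() has run */
    int queued;              /* work accepted while the queue was alive */
    int use_after_destroy;   /* queue attempts made after destruction */
};

struct toy_delayed_work {
    struct toy_wq *wq;       /* queue the timer will target when it fires */
    bool pending;            /* timer armed and not yet fired */
};

/* Arm the timer; the work is handed to wq only when the timer fires. */
static void toy_queue_delayed_work(struct toy_wq *wq,
                                   struct toy_delayed_work *dw)
{
    dw->wq = wq;
    dw->pending = true;
}

/* Disarm the timer; returns true if it was still pending. */
static bool toy_cancel_delayed_work(struct toy_delayed_work *dw)
{
    bool was_pending = dw->pending;

    dw->pending = false;
    return was_pending;
}

static void toy_destroy_workqueue(struct toy_wq *wq)
{
    wq->destroyed = true;
}

/* Timer expiry: in a real kernel the bad branch would touch freed memory. */
static void toy_timer_fires(struct toy_delayed_work *dw)
{
    if (!dw->pending)
        return;
    dw->pending = false;
    if (dw->wq->destroyed)
        dw->wq->use_after_destroy++;
    else
        dw->wq->queued++;
}

/*
 * Arm a timer, tear the queue down, then let the timer expire late.
 * Returns how many times the destroyed queue was used.
 */
static int toy_teardown(bool cancel_first)
{
    struct toy_wq wq = {0};
    struct toy_delayed_work dw = {0};

    toy_queue_delayed_work(&wq, &dw);
    if (cancel_first)
        toy_cancel_delayed_work(&dw);
    toy_destroy_workqueue(&wq);
    toy_timer_fires(&dw);
    return wq.use_after_destroy;
}
```

toy_teardown(false) reports one use of the destroyed queue and
toy_teardown(true) reports none. In real kernel code,
cancel_delayed_work_sync() would additionally wait for a callback that
is already running, which this model does not cover.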


>
> Thanks.
>
> --
> tejun
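
For reference, the poison pointer and the flags word in the dump quoted
above can be decoded with a small userspace helper. This is a sketch
under assumptions: the constants mirror include/linux/poison.h and
include/linux/workqueue.h as I believe they stood in kernels of that
era (with the x86_64 default for CONFIG_ILLEGAL_POINTER_VALUE), so
check them against the tree actually in use. WQ_ORDERED_BIT,
is_list_poison2() and unknown_flag_bits() are names made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 default of CONFIG_ILLEGAL_POINTER_VALUE. */
#define POISON_POINTER_DELTA 0xdead000000000000ULL

/* List poison values as assumed for that kernel generation. */
#define LIST_POISON1 (0x00100100ULL + POISON_POINTER_DELTA)
#define LIST_POISON2 (0x00200200ULL + POISON_POINTER_DELTA)

/* Workqueue flag bits as assumed from include/linux/workqueue.h. */
#define WQ_UNBOUND      (1u << 1)
#define WQ_MEM_RECLAIM  (1u << 3)
#define WQ_ORDERED_BIT  (1u << 17)  /* the kernel calls this __WQ_ORDERED */

/* True if ptr is the value list_del()/list_del_rcu() leave in ->prev. */
static bool is_list_poison2(uint64_t ptr)
{
    return ptr == LIST_POISON2;
}

/* Bits of flags that the three flags above do not account for. */
static unsigned int unknown_flag_bits(unsigned int flags)
{
    return flags & ~(WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_ORDERED_BIT);
}
```

With these values, list.prev = 0xdead000000200200 matches LIST_POISON2,
and flags = 131082 is exactly WQ_UNBOUND | WQ_MEM_RECLAIM | bit 17,
which is what I would expect from an ordered workqueue allocated with
WQ_MEM_RECLAIM.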