linux-kernel - RE: [PATCH] workqueue: fix a workqueue kernel panic issue.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4737A960563B524DA805CA602BE04B307EEEA98F6B@SC-VEXCH2.marvell.com>
Date:	Mon, 22 Sep 2014 03:40:55 -0700
From:	Yifan Zhang <zhangyf@...vell.com>
To:	Tejun Heo <tj@...nel.org>
CC:	Jing Xiang <jxiang@...vell.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"yifan.zhangm@...il.com" <yifan.zhangm@...il.com>
Subject: RE: [PATCH] workqueue: fix a workqueue kernel panic issue.

Hi Tejun,

This issue is found when we got a kernel panic:

[  943.673177] c3 7843 (Thread-162) pwm-vibrator pwm-vibrator: Vibrator: Set duration: 10ms
[  943.717242] c0 7843 (Thread-162) pwm-vibrator pwm-vibrator: Vibrator: Set duration: 30ms
[  949.571415] c1 779 (Binder_3) B52ISP version: HW 3.36, SWM 1.4, revision 871
[  949.579291] c1 779 (Binder_3) b52_sensor_set_fmt: mbus code not match
[  949.749224] c0 779 (Binder_3) b52_sensor_set_fmt: mbus code not match
[  949.803474] c0 7844 (kworker/0:2) Unable to handle kernel NULL pointer dereference at virtual address 00000008
[  949.813433] c0 7844 (kworker/0:2) pgd = ffffffc00007d000

[  949.818707] c0 7844 (kworker/0:2) [00000008] *pgd=0000000001214003, *pmd=0000000001215003, *pte=00e00000d4200407
[  949.829002] c0 7844 (kworker/0:2) Internal error: Oops: 96000007 [#1] PREEMPT SMP
[  949.836425] Modules linked in: tzdd galcore ssipcmisck(P) audiostub cidatattydev gs_modem ccinetdev cci_datastub citty iml_module seh cploaddev msocketk
[  949.850077] c0 7844 (kworker/0:2) CPU: 0 PID: 7844 Comm: kworker/0:2 Tainted: P             3.10.33 #1
[  949.859318] c0 7844 (kworker/0:2) task: ffffffc015517500 ti: ffffffc027c48000 task.ti: ffffffc027c48000
[  949.868641] c0 7844 (kworker/0:2) PC is at process_one_work+0x38/0x410
[  949.875110] c0 7844 (kworker/0:2) LR is at worker_thread+0x13c/0x3c0
[  949.881407] c0 7844 (kworker/0:2) pc : [<ffffffc0000bbbd4>] lr : [<ffffffc0000bcab8>] pstate: 600001c5
[  949.890634] c0 7844 (kworker/0:2) sp : ffffffc027c4bd60
[  949.895810] R29: ffffffc027c4bd60 R28: 0000000000000009 
[  949.901091] R27: ffffffc0008da108 R26: 0000000000000001 
[  949.906372] R25: ffffffc000aa2902 R24: 0000000000000000 
[  949.911653] R23: ffffffc027c48000 R22: ffffffc02b3b11b0 
[  949.916934] R21: ffffffc034da09c0 R20: ffffffc02b3b1180 
[  949.922215] R19: ffffffc000c05380 R18: 0000000000000000 
[  949.927497] R17: 0000000000000000 R16: ffffffc0001909a4 
[  949.932777] R15: 0000000000000000 R14: 000000000000ca81 
[  949.938059] R13: 00000000f6f99a80 R12: 00000000000003d1 
[  949.943341] R11: 0000000000000004 R10: ffffffc000bef3a8 
[  949.948622] R9 : ffffffc027c4bbe0 R8 : ffffffc015517920 
[  949.953903] R7 : ffffffc000724d98 R6 : ffffffc000724d98 
[  949.959183] R5 : 0000000000000000 R4 : 0000000000000000 
[  949.964465] R3 : ffffffc034da0d40 R2 : ffffffc034da09c0 
[  949.969746] R1 : ffffffc000c05380 R0 : 0000000000000000 
[  949.975019] c0 7844 (kworker/0:2) 
[  949.978390] c0 7844 (kworker/0:2) 
[  949.978390] PC: ffffffc0000bbb54:

You can tell it is a bug when pwq = get_work_pwq() return NULL, and cpu_intensive = pwq->wq->flags use it w/o check. 

Normally get_work_pwq doesn't return NULL, but we had a bug in code which makes INIT_WORK(&work, do_work) is called in multi-thread. In some cases, work_struct is re-init just before get_work_pwq is called, it makes work_struct->data is invalid and thus causes the problem. It is indeed a bug of ourselves, and after fix it there is no such issue. But I wonder we still a NULL check before dereference pwq here anyway, since get_work_pwq may return NULL in some cases.

What do you think :-) ?

BR,
YIfan


-----Original Message-----
From: Tejun Heo [mailto:htejun@...il.com] On Behalf Of Tejun Heo
Sent: 2014年9月22日 11:40
To: Yifan Zhang
Cc: Jing Xiang; linux-kernel@...r.kernel.org
Subject: Re: [PATCH] workqueue: fix a workqueue kernel panic issue.

On Sun, Sep 21, 2014 at 08:30:32PM -0700, Yifan Zhang wrote:
> Hi Tejun,
> 
> What's do you think of this patch ? Any concern ?

Hmmm?  Haven't I already responded to this patch?

> -----Original Message-----
> From: Yifan Zhang [mailto:zhangyf@...vell.com]
> Sent: 2014年9月17日 16:18
> To: Tejun Heo; Jing Xiang; linux-kernel@...r.kernel.org
> Cc: Yifan Zhang
> Subject: [PATCH] workqueue: fix a workqueue kernel panic issue.
> 
> if created workqueue in multi-thread unsynchronized,

Can you please elaborate?

> get_work_pwq() may return NULL, which cause kernel panic. Judge 
> get_work_pwq() return value before use
> pwq->wq->flags.
> 
> Signed-off-by: Yifan Zhang <zhangyf@...vell.com>
> ---
>  kernel/workqueue.c |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 
> 5dbe22a..d3ac87f 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1947,9 +1947,19 @@ __acquires(&pool->lock)  {
>  	struct pool_workqueue *pwq = get_work_pwq(work);
>  	struct worker_pool *pool = worker->pool;
> -	bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
> +	bool cpu_intensive;
>  	int work_color;
>  	struct worker *collision;
> +
> +	if (pwq == NULL) {
> +		pr_err("BUG: invalid struct work_struct.data %lu\n",
> +				atomic_long_read(&work->data));
> +		dump_stack();
> +		return;

I have difficult time seeing how the above piece of code would be acceptable but maybe the situation you're trying to explain is weird enough to justify it.

Thanks.

--
tejun