linux-kernel - Re: [PATCH] writeback: call writeback tracepoints withoud holding list_lock in wb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <56CF8B66.8070108@linaro.org>
Date:	Thu, 25 Feb 2016 15:16:54 -0800
From:	"Shi, Yang" <yang.shi@...aro.org>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	tj@...nel.org, jack@...e.cz, axboe@...com, fengguang.wu@...el.com,
	tglx@...utronix.de, bigeasy@...utronix.de,
	akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, linux-rt-users@...r.kernel.org,
	linaro-kernel@...ts.linaro.org
Subject: Re: [PATCH] writeback: call writeback tracepoints withoud holding
 list_lock in wb_writeback()

On 2/25/2016 11:54 AM, Steven Rostedt wrote:
> On Thu, 25 Feb 2016 11:38:48 -0800
> "Shi, Yang" <yang.shi@...aro.org> wrote:
>
>> On 2/24/2016 6:40 PM, Steven Rostedt wrote:
>>> On Wed, 24 Feb 2016 14:47:23 -0800
>>> Yang Shi <yang.shi@...aro.org> wrote:
>>>
>>>> commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
>>>> tracepoints to report cgroup") made writeback tracepoints report cgroup
>>>> writeback, but it may trigger the below bug on -rt kernel due to the list_lock
>>>> held for the for loop in wb_writeback().
>>>
>>> list_lock is a sleeping mutex, it's not disabling preemption. Moving it
>>> doesn't make a difference.
>>>
>>>>
>>>> BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
>>>> in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
>>>
>>> Something else disabled preemption. And note, nothing in the tracepoint
>>> should have called a sleeping function.
>>
>> Yes, it makes me confused too. It sounds like the preempt_ip address is
>> not that accurate.
>
> Yep, but the change you made doesn't look to be the fix.

Actually, regardless whether this is the right fix for the splat, it 
makes me be wondering if the spin lock which protects the whole for loop 
is really necessary. It sounds feasible to move it into the for loop and 
just protect the necessary area.

>
>>
>>>
>>>
>>>> INFO: lockdep is turned off.
>>>> Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
>
> Can you disassemble the vmlinux file to see exactly where that call is.
> I use gdb to find the right locations.
>
>   gdb> li *0xffffffc000374a5c
>   gdb> disass 0xffffffc000374a5c

I use gdb to get the code too.

It does point to the spin_lock.

(gdb) list *0xffffffc000374a5c
0xffffffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
1616
1617            oldest_jif = jiffies;
1618            work->older_than_this = &oldest_jif;
1619
1620            blk_start_plug(&plug);
1621            spin_lock(&wb->list_lock);
1622            for (;;) {
1623                    /*
1624                     * Stop writeback when nr_pages has been consumed
1625                     */


The disassemble:
    0xffffffc000374a58 <+232>:   bl      0xffffffc0001300b0 
<migrate_disable>
    0xffffffc000374a5c <+236>:   mov     x0, x22
    0xffffffc000374a60 <+240>:   bl      0xffffffc000d5d518 <rt_spin_lock>

>
>>>>
>>>> CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
>>>> Hardware name: Freescale Layerscape 2085a RDB Board (DT)
>>>> Workqueue: writeback wb_workfn (flush-7:0)
>>>> Call trace:
>>>> [<ffffffc00008d708>] dump_backtrace+0x0/0x200
>>>> [<ffffffc00008d92c>] show_stack+0x24/0x30
>>>> [<ffffffc0007b0f40>] dump_stack+0x88/0xa8
>>>> [<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
>>>> [<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
>>>> [<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
>>>> [<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
>>>
>>> How accurate is this trace back? Here's the code that is executed in
>>> this tracepoint:
>>>
>>> 	TP_fast_assign(
>>> 		struct device *dev = bdi->dev;
>>> 		if (!dev)
>>> 			dev = default_backing_dev_info.dev;
>>> 		strncpy(__entry->name, dev_name(dev), 32);
>>> 		__entry->nr_pages = work->nr_pages;
>>> 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
>>> 		__entry->sync_mode = work->sync_mode;
>>> 		__entry->for_kupdate = work->for_kupdate;
>>> 		__entry->range_cyclic = work->range_cyclic;
>>> 		__entry->for_background	= work->for_background;
>>> 		__entry->reason = work->reason;
>>> 	),
>>>
>>> See anything that would sleep?
>>
>> According to the stack backtrace, kernfs_path_len calls slepping lock,
>> which is called by __trace_wb_cgroup_size(wb) in __dynamic_array(char,
>> cgroup, __trace_wb_cgroup_size(wb)).
>>
>> The below is the definition:
>>
>> DECLARE_EVENT_CLASS(writeback_work_class,
>>           TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
>>           TP_ARGS(wb, work),
>>           TP_STRUCT__entry(
>>                   __array(char, name, 32)
>>                   __field(long, nr_pages)
>>                   __field(dev_t, sb_dev)
>>                   __field(int, sync_mode)
>>                   __field(int, for_kupdate)
>>                   __field(int, range_cyclic)
>>                   __field(int, for_background)
>>                   __field(int, reason)
>>                   __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
>>
>
> Ah, thanks for pointing that out. I missed that.

It sounds not correct if tracepoint doesn't allow sleep.

I considered to change sleeping lock to raw lock in kernfs_* functions, 
but it sounds not reasonable since they are used heavily by cgroup.

Thanks,
Yang

>
> -- Steve
>