Message-ID: <20160816130559.GA14022@redhat.com>
Date:	Tue, 16 Aug 2016 15:06:00 +0200
From:	Oleg Nesterov <oleg@...hat.com>
To:	Bart Van Assche <bart.vanassche@...disk.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	"mingo@...nel.org" <mingo@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Neil Brown <neilb@...e.de>,
	Michael Shaver <jmshaver@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched: Avoid that __wait_on_bit_lock() hangs

On 08/15, Bart Van Assche wrote:
>
> On 08/13/2016 09:32 AM, Oleg Nesterov wrote:
>> On 08/12, Bart Van Assche wrote:
>>> before I started testing. It took some time
>>> before I could reproduce the hang in truncate_inode_pages_range().
>>
>> All I can say is that this contradicts the previous testing results
>> with my previous patch or with your change in abort_exclusive_wait().
>
> Hello Oleg,
>
> My opinion is that all this means is that we do not yet have a full
> understanding of what is going on.

Sure.

> BTW, I have improved my page lock owner instrumentation patch such that
> it prints a call stack of the lock owner if lock_page() takes too long.
> The following call stack was reported:
>
> __lock_page / pid 8549 / m 0x2: timeout - continuing to wait for 8549
>   [<ffffffff8102b316>] save_stack_trace+0x26/0x50
>   [<ffffffff81152bee>] add_to_page_cache_lru+0x7e/0x170
>   [<ffffffff8121bfc5>] mpage_readpages+0xc5/0x170
>   [<ffffffff81215548>] blkdev_readpages+0x18/0x20
>   [<ffffffff81163a68>] __do_page_cache_readahead+0x268/0x310
>   [<ffffffff811640a8>] force_page_cache_readahead+0xa8/0x100
>   [<ffffffff81164139>] page_cache_sync_readahead+0x39/0x40
>   [<ffffffff81153967>] generic_file_read_iter+0x707/0x920
>   [<ffffffff81215920>] blkdev_read_iter+0x30/0x40
>   [<ffffffff811d4b4b>] __vfs_read+0xbb/0x130
>   [<ffffffff811d4f31>] vfs_read+0x91/0x130
>   [<ffffffff811d62b4>] SyS_read+0x44/0xa0
>   [<ffffffff816281e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
>
> My understanding of mpage_readpages() is that the page unlock happens
> after readahead I/O completed (see also page_endio()). So this probably
> means that an I/O request submitted because of readahead code did not
> get completed. I will see whether I can find anything that's wrong in
> the block layer.

Perhaps. But then this is another problem! Or you didn't wait long
enough. Or your previous testing was wrong.

Because, once again, your changes in abort_exclusive_wait(), and my
debugging patch which adds a wakeup to ClearPageLocked(), suggest that
the problem is NOT that the page is still locked.
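
To make the distinction concrete, here is a toy userspace model, NOT
kernel code. Every name in it (toy_page, toy_wake_one(), toy_abort(),
toy_hangs()) is made up for illustration, and the scenario is only an
assumption about one way a sleeper could hang while the bit is already
clear: an exclusive wakeup is consumed by a waiter that then gives up
without passing it on. It does not claim that this is what happens in
the reported hang.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, no relation to real kernel APIs. */
enum waiter_state { SLEEPING, WOKEN, ABORTED };

struct toy_page {
	bool locked;               /* stands in for the page lock bit */
	enum waiter_state w[2];    /* two exclusive waiters, in queue order */
};

/* Wake only the first sleeping waiter, like an exclusive wakeup.
 * Returns the index of the woken waiter, or -1 if nobody was sleeping. */
static int toy_wake_one(struct toy_page *p)
{
	for (int i = 0; i < 2; i++) {
		if (p->w[i] == SLEEPING) {
			p->w[i] = WOKEN;
			return i;
		}
	}
	return -1;
}

/* Unlock: clear the bit, then wake a single exclusive waiter. */
static void toy_unlock(struct toy_page *p)
{
	p->locked = false;
	toy_wake_one(p);
}

/* Waiter i gives up (say, it was interrupted). If it had already been
 * woken and pass_on is set, it forwards the wakeup it consumed. */
static void toy_abort(struct toy_page *p, int i, bool pass_on)
{
	bool consumed = (p->w[i] == WOKEN);

	p->w[i] = ABORTED;
	if (consumed && pass_on)
		toy_wake_one(p);
}

/* Returns true if waiter 1 is left sleeping on an UNLOCKED page. */
bool toy_hangs(bool pass_on)
{
	struct toy_page p = { .locked = true, .w = { SLEEPING, SLEEPING } };

	toy_unlock(&p);             /* wakes waiter 0 only */
	toy_abort(&p, 0, pass_on);  /* waiter 0 gives up after being woken */

	return !p.locked && p.w[1] == SLEEPING;
}
```

In this model toy_hangs(false) is true and toy_hangs(true) is false:
the second waiter sleeps forever on a page that is no longer locked
unless the consumed wakeup is forwarded. An extra wakeup at unlock time
would hide such a hang, while it could not help if the page were really
still locked.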


I'd still like to know what happens with the last patch I sent (without
any other changes)... but now I am totally confused.

If only I could reproduce it. Or at least understand what you are doing
to hit this bug ;)

Oleg.