lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160816130559.GA14022@redhat.com>
Date:	Tue, 16 Aug 2016 15:06:00 +0200
From:	Oleg Nesterov <oleg@...hat.com>
To:	Bart Van Assche <bart.vanassche@...disk.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	"mingo@...nel.org" <mingo@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Neil Brown <neilb@...e.de>,
	Michael Shaver <jmshaver@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched: Avoid that __wait_on_bit_lock() hangs

On 08/15, Bart Van Assche wrote:
>
> On 08/13/2016 09:32 AM, Oleg Nesterov wrote:
>> On 08/12, Bart Van Assche wrote:
>>> before I started testing. It took some time
>>> before I could reproduce the hang in truncate_inode_pages_range().
>>
>> all I can say this contradicts with the previous testing results with
>> my previous patch or with your change in abort_exclusive_wait().
>
> Hello Oleg,
>
> My opinion is that all this means is that we do not yet have a full
> understanding of what is going on.

Sure.

> BTW, I have improved my page lock owner instrumentation patch such that
> it prints a call stack of the lock owner if lock_page() takes too long.
> The following call stack was reported:
>
> __lock_page / pid 8549 / m 0x2: timeout - continuing to wait for 8549
>   [<ffffffff8102b316>] save_stack_trace+0x26/0x50
>   [<ffffffff81152bee>] add_to_page_cache_lru+0x7e/0x170
>   [<ffffffff8121bfc5>] mpage_readpages+0xc5/0x170
>   [<ffffffff81215548>] blkdev_readpages+0x18/0x20
>   [<ffffffff81163a68>] __do_page_cache_readahead+0x268/0x310
>   [<ffffffff811640a8>] force_page_cache_readahead+0xa8/0x100
>   [<ffffffff81164139>] page_cache_sync_readahead+0x39/0x40
>   [<ffffffff81153967>] generic_file_read_iter+0x707/0x920
>   [<ffffffff81215920>] blkdev_read_iter+0x30/0x40
>   [<ffffffff811d4b4b>] __vfs_read+0xbb/0x130
>   [<ffffffff811d4f31>] vfs_read+0x91/0x130
>   [<ffffffff811d62b4>] SyS_read+0x44/0xa0
>   [<ffffffff816281e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
>
> My understanding of mpage_readpages() is that the page unlock happens
> after readahead I/O completed (see also page_endio()). So this probably
> means that an I/O request submitted because of readahead code did not
> get completed. I will see whether I can find anything that's wrong in
> the block layer.

Perhaps. But this means another problem! Or you didn't wait enough. Or
your previous testing was wrong.

Because, once again, your changes in abort_exclusive_wait(), and my
debugging patch which adds wakeup into ClearPageLocked() suggest that
the problem is NOT that the page is still locked.


I'd still like to know what happens with the last patch I sent (without
any other changes)... but now I am totally confused.

If only I could reproduce. Or at least understand what are you doing to
hit thi bug ;)

Oleg.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ