Date:	Tue, 09 Feb 2010 16:36:33 +0000
From:	Alan Jenkins <alan-jenkins@...fmail.co.uk>
To:	Alan Jenkins <sourcejedi.lkml@...glemail.com>
CC:	"Rafael J. Wysocki" <rjw@...k.pl>, Mel Gorman <mel@....ul.ie>,
	hugh.dickins@...cali.co.uk, Pavel Machek <pavel@....cz>,
	pm list <linux-pm@...ts.linux-foundation.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Kernel Testers List <kernel-testers@...r.kernel.org>
Subject: Re: s2disk hang update

Alan Jenkins wrote:
> On 2/2/10, Rafael J. Wysocki <rjw@...k.pl> wrote:
>   
>> On Tuesday 02 February 2010, Alan Jenkins wrote:
>>     
>>> On 1/2/10, Rafael J. Wysocki <rjw@...k.pl> wrote:
>>>       
>>>> On Saturday 02 January 2010, Alan Jenkins wrote:
>>>> Hi,
>>>>
>>>>         
>>>>> I've been suffering from s2disk hangs again.  This time, the hangs
>>>>> were always before the hibernation image was written out.
>>>>>
>>>>> They're still frustratingly random.  I just started trying to work out
>>>>> whether doubling PAGES_FOR_IO makes them go away, but they went away
>>>>> on their own again.
>>>>>
>>>>> I did manage to capture a backtrace with debug info though.  Here it
>>>>> is for 2.6.33-rc2.  (It has also happened on rc1).  I was able to get
>>>>> the line numbers (using gdb, e.g.  "info line
>>>>> *stop_machine_create+0x27"), having built the kernel with debug info.
>>>>>
>>>>> [top of trace lost due to screen height]
>>>>> ? sync_page	(filemap.c:183)
>>>>> ? wait_on_page_bit	(filemap.c:506)
>>>>> ? wake_bit_function	(wait.c:174)
>>>>> ? shrink_page_list	(vmscan.c:696)
>>>>> ? __delayacct_blkio_end	(delayacct.c:94)
>>>>> ? finish_wait	(list.h:142)
>>>>> ? congestion_wait	(backing-dev.c:761)
>>>>> ? shrink_inactive_list	(vmscan.c:1193)
>>>>> ? scsi_request_fn	(spinlock.h:306)
>>>>> ? blk_run_queue	(blk-core.c:434)
>>>>> ? shrink_zone	(vmscan.c:1484)
>>>>> ? do_try_to_free_pages	(vmscan.c:1684)
>>>>> ? try_to_free_pages	(vmscan.c:1848)
>>>>> ? isolate_pages_global	(vmscan.c:980)
>>>>> ? __alloc_pages_nodemask	(page_alloc.c:1702)
>>>>> ? __get_free_pages	(page_alloc.c:1990)
>>>>> ? copy_process	(fork.c:237)
>>>>> ? do_fork	(fork.c:1443)
>>>>> ? rb_erase
>>>>> ? __switch_to
>>>>> ? kthread
>>>>> ? kernel_thread
>>>>> ? kthread
>>>>> ? kernel_thread_helper
>>>>> ? kthreadd
>>>>> ? kthreadd
>>>>> ? kernel_thread_helper
>>>>>
>>>>> INFO: task s2disk:2174 blocked for more than 120 seconds
>>>>>           
>>>> This looks like we have run out of memory while creating a new kernel
>>>> thread and we have blocked on I/O while trying to free some space
>>>> (quite obviously, because the I/O doesn't work at this point).
>>>>         
>>> For context, the kernel thread being created here is the stop_machine
>>> thread.  It is created by disable_nonboot_cpus(), called from
>>> hibernation_snapshot().  See e.g. this hung task backtrace -
>>>
>>> http://picasaweb.google.com/lh/photo/BkKUwZCrQ2ceBIM9ZOh7Ow?feat=directlink
>>>
>>>       
>>>> I think it should help if you increase PAGES_FOR_IO, then.
>>>>         
>>> Ok, it's been happening again on 2.6.33-rc6.  Unfortunately increasing
>>> PAGES_FOR_IO doesn't help.
>>>
>>> I've been using a test patch to make PAGES_FOR_IO tunable at run time.
>>>  I get the same hang if I increase it by a factor of 10, to 10240:
>>>
>>> # cd /sys/module/kernel/parameters/
>>> # ls
>>> consoleblank  initcall_debug  PAGES_FOR_IO  panic  pause_on_oops  SPARE_PAGES
>>> # echo 10240 > PAGES_FOR_IO
>>> # echo 2560 > SPARE_PAGES
>>> # cat SPARE_PAGES
>>> 2560
>>> # cat PAGES_FOR_IO
>>> 10240
>>>
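
For reference, a minimal sketch of how such a tunable might be wired up
(illustrative only, not the actual test patch; the default values are
assumptions, and core_param() is one way to make built-in parameters
show up under /sys/module/kernel/parameters/ as in the listing above):

    /* kernel/power/power.h: turn the constants into variables, e.g.
     * replace "#define PAGES_FOR_IO ..." and "#define SPARE_PAGES ..."
     * with declarations:
     */
    extern unsigned int PAGES_FOR_IO;
    extern unsigned int SPARE_PAGES;

    /* kernel/power/main.c (or another built-in file): define them and
     * expose them as writable, unprefixed ("core") module parameters.
     */
    #include <linux/moduleparam.h>

    unsigned int PAGES_FOR_IO = 1024;   /* assumed default */
    unsigned int SPARE_PAGES = 256;     /* assumed default */

    core_param(PAGES_FOR_IO, PAGES_FOR_IO, uint, 0644);
    core_param(SPARE_PAGES, SPARE_PAGES, uint, 0644);

With something like this in place, "echo 10240 > PAGES_FOR_IO" changes
the value used on the next hibernation attempt without rebuilding the
kernel.
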
>>> I also added a debug patch to try to understand the calculations with
>>> PAGES_FOR_IO in hibernate_preallocate_memory().  I still don't really
>>> understand them and there could easily be errors in my debug patch,
>>> but the output is interesting.
>>>
>>> Increasing PAGES_FOR_IO by almost 10000 has the expected effect of
>>> decreasing "max_size" by the same amount.  However, it doesn't appear
>>> to increase the number of free pages at the critical moment.
>>>
>>> PAGES_FOR_IO = 1024:
>>> http://picasaweb.google.com/lh/photo/DYQGvB_4hvCvVuxZf2ibxg?feat=directlink
>>>
>>> PAGES_FOR_IO = 10240:
>>> http://picasaweb.google.com/lh/photo/AIkV_ZBwt22nzN-JdOJCWA?feat=directlink
>>>
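
A debug patch in that spirit could be as small as one printk in
hibernate_preallocate_memory() (hypothetical sketch, not the actual
debug patch; apart from max_size and PAGES_FOR_IO, which are mentioned
above, the variable names are assumptions about the 2.6.33 code):

    /* kernel/power/snapshot.c, hibernate_preallocate_memory(): dump
     * the values involved so the effect of PAGES_FOR_IO on max_size
     * and on the number of free pages can be compared across runs.
     */
    printk(KERN_DEBUG "preallocate: PAGES_FOR_IO=%u SPARE_PAGES=%u "
           "max_size=%lu free=%lu\n",
           (unsigned int)PAGES_FOR_IO, (unsigned int)SPARE_PAGES,
           max_size, nr_free_pages());
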
>>>
>>> You may remember that I was originally able to avoid the hang by
>>> reverting commit 5f8dcc2.  It doesn't revert cleanly any more.
>>> However, I tried applying my test and debug patches on top of 5f8dcc2~1
>>> (just before the commit that triggered the hang).  That kernel
>>> apparently left ~5000 pages free at hibernation time, vs. ~1200 when
>>> testing the same scenario on 2.6.33-rc6.  (As before, the number of
>>> free pages remained the same if I increased PAGES_FOR_IO to 10240.)
>>>       
>> I think the hang may be avoided by using this patch
>> http://patchwork.kernel.org/patch/74740/
>> but the hibernation will fail instead.
>>
>> Can you please repeat your experiments with the patch below applied and
>> report back?
>>
>> Rafael
>>     
>
> It causes hibernation to succeed <grin>.
>   

Perhaps I spoke too soon.  I see the same hang if I run too many 
applications.  The first hibernation fails with "not enough swap" as 
expected, but the second or third attempt hangs (with the same backtrace 
as before).

The patch definitely helps though.  Without the patch, I see a hang the 
first time I try to hibernate with too many applications running.

Regards
Alan
