linux-kernel - Re: [Bug #14141] order 2 page allocation failures in iwlagn

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2f11576a0910270816s3e1b268ah91b5f2d0cc0d562e@mail.gmail.com>
Date:	Wed, 28 Oct 2009 00:16:30 +0900
From:	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To:	Mel Gorman <mel@....ul.ie>
Cc:	Frans Pop <elendil@...net.nl>,
	Chris Mason <chris.mason@...cle.com>,
	David Rientjes <rientjes@...gle.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Kernel Testers List <kernel-testers@...r.kernel.org>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Reinette Chatre <reinette.chatre@...el.com>,
	Bartlomiej Zolnierkiewicz <bzolnier@...il.com>,
	Karol Lewandowski <karol.k.lewandowski@...il.com>,
	Mohamed Abbas <mohamed.abbas@...el.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	"John W. Linville" <linville@...driver.com>, linux-mm@...ck.org
Subject: Re: [Bug #14141] order 2 page allocation failures in iwlagn

2009/10/27 Mel Gorman <mel@....ul.ie>:
> On Mon, Oct 26, 2009 at 10:06:09PM +0100, Frans Pop wrote:
>> On Tuesday 20 October 2009, Mel Gorman wrote:
>> > I've attached a patch below that should allow us to cheat. When it's
>> > applied, it outputs who called congestion_wait(), how long the timeout
>> > was and how long it waited for. By comparing before and after sleep
>> > times, we should be able to see which of the callers has significantly
>> > changed and if it's something easily addressable.
>>
>> The results from this look fairly interesting (although I may be a bad
>> judge as I don't really know what I'm looking at ;-).
>>
>> I've tested with two kernels:
>> 1) 2.6.31.1: 1 test run
>> 2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
>>
>> The 1st kernel had the expected "freeze" while reading commits in gitk;
>> reading commits with the 2nd kernel was more fluent.
>> I did 2 runs with the 2nd kernel as the first run had a fairly long music
>> skip and more SKB errors than expected. The second run was fairly normal
>> with no music skips at all even though it had a few SKB errors.
>>
>> Data for the tests:
>>                               1st kernel      2nd kernel 1    2nd kernel 2
>> end reading commits           1:15            1:00            0:55
>>   "freeze"                    yes             no              no
>> branch data shown             1:55            1:15            1:10
>> system quiet                  2:25            1:50            1:45
>> # SKB allocation errors               10              53              5
>>
>> Note that the test is substantially faster with the 2nd kernel and that the
>> SKB errors don't really affect the duration of the test.
>>
>
> Ok. I think that despite expectations, the writeback changes have
> changed the timing significantly enough to be worth examining closer.
>
>>
>> - without the revert 'background_writeout' is called a lot less frequently,
>>   but when it's called it gets long delays
>> - without the revert you have 'wb_kupdate', which is relatively expensive
>> - with the revert 'shrink_list' is relatively expensive, although not
>>   really in absolute terms
>>
>
> Lets look at the callers that waited in congestion_wait() for at least
> 25 jiffies.
>
> 2.6.31.1-async-sync-congestion-wait i.e. vanilla kernel
> generated with: cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
>     24  background_writeout  congestion_wait sync=0 delay 25 timeout 25
>    203  kswapd               congestion_wait sync=0 delay 25 timeout 25
>      5  shrink_list          congestion_wait sync=0 delay 25 timeout 25
>    155  try_to_free_pages    congestion_wait sync=0 delay 25 timeout 25
>    145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
>      2  kswapd               congestion_wait sync=0 delay 26 timeout 25
>      8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
>      1  try_to_free_pages    congestion_wait sync=0 delay 54 timeout 25
>
> 2.6.31.1-write-congestion-wait i.e. kernel with patch reverted
> generated with: cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
>      2  background_writeout  congestion_wait rw=1 delay 25 timeout 25
>    188  kswapd               congestion_wait rw=1 delay 25 timeout 25
>     14  shrink_list          congestion_wait rw=1 delay 25 timeout 25
>    181  try_to_free_pages    congestion_wait rw=1 delay 25 timeout 25
>      5  kswapd               congestion_wait rw=1 delay 26 timeout 25
>     10  try_to_free_pages    congestion_wait rw=1 delay 26 timeout 25
>      3  try_to_free_pages    congestion_wait rw=1 delay 27 timeout 25
>      1  kswapd               congestion_wait rw=1 delay 29 timeout 25
>      1  __alloc_pages_nodemask congestion_wait rw=1 delay 30 timeout 5
>      1  try_to_free_pages    congestion_wait rw=1 delay 31 timeout 25
>      1  try_to_free_pages    congestion_wait rw=1 delay 35 timeout 25
>      1  kswapd               congestion_wait rw=1 delay 51 timeout 25
>      1  try_to_free_pages    congestion_wait rw=1 delay 56 timeout 25
>
> So, wb_kupdate and background_writeout are the big movers in terms of waiting,
> not the direct reclaimers which is what we were expecting. Of those big
> movers, wb_kupdate is the most interested because compare the following
>
> $ cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
> [ no output ]
> $ $ cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
>      1  wb_kupdate           congestion_wait sync=0 delay 15 timeout 25
>      1  wb_kupdate           congestion_wait sync=0 delay 23 timeout 25
>    145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
>      8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
>
> The vanilla kernel is not waiting in wb_kupdate at all.
>
> Jens, before the congestion_wait() changes, wb_kupdate was waiting on
> congestion and afterwards it's not. Furthermore, look at the number of pages
> that are queued for writeback in the two page allocation failure reports.
>
> without-revert: writeback:65653
> with-revert:    writeback:21713
>
> So, after the move to async/sync, a lot more pages are getting queued
> for writeback - more than three times the number of pages are queued for
> writeback with the vanilla kernel. This amount of congestion might be why
> direct reclaimers and kswapd's timings have changed so much.
>
> Chris Mason hinted at this but I didn't quite "get it" at the time but is it
> possible that writeback_inodes() is converting what is expected to be async
> IO into sync IO? One way of checking this is if Frans could test the patch
> below that makes wb_kupdate wait on sync instead of async.
>
> If this makes a difference, I think the three main areas of trouble we
> are now seeing are
>
>        1. page allocator regressions - mostly fixed hopefully
>        2. page writeback change in timing - theory yet to be confirmed
>        3. drivers using more atomics - iwlagn specific, being dealt with
>
> Of course, the big problem is if the changes are due to major timing
> differences in page writeback, then mainline is a totally different
> shape of problem as pdflush has been replaced there.
>
> ====
> Have wb_kupdate wait on sync IO congestion instead of async
>
> wb_kupdate is expected to only have queued up pages for async IO.
> However, something screwy is happening because it never appears to go to
> sleep. Frans, can you test with this patch instead of the revert please?
> Preferably, keep the verbose-congestion_wait patch applied so we can
> still get an idea who is going to sleep and for how long when calling
> congestion_wait. thanks
>
> Not-signed-off-hacket-job: Mel Gorman <mel@....ul.ie>
> ---
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 81627eb..cb646dd 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
>                writeback_inodes(&wbc);
>                if (wbc.nr_to_write > 0) {
>                        if (wbc.encountered_congestion || wbc.more_io)
> -                               congestion_wait(BLK_RW_ASYNC, HZ/10);
> +                               congestion_wait(BLK_RW_SYNC, HZ/10);
>                        else
>                                break;  /* All the old data is written */
>                }

Hmm, This doesn't looks correct to me.

BLK_RW_ASYNC mean async write.
BLK_RW_SYNC  mean read and sync-write.

wb_kupdate use WB_SYNC_NONE. it's async write.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/