linux-kernel - Re: [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

Message-Id: <20100730181014.4AEA.A69D9226@jp.fujitsu.com>
Date:	Fri, 30 Jul 2010 18:22:18 +0900 (JST)
From:	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	kosaki.motohiro@...fujitsu.com, Dave Chinner <david@...morbit.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Chris Mason <chris.mason@...cle.com>,
	Nick Piggin <npiggin@...e.de>, Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Christoph Hellwig <hch@...radead.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Mel Gorman <mel@....ul.ie>, Minchan Kim <minchan.kim@...il.com>
Subject: Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads

> On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > > 
> > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > there are inevitably lots of details to sort out. I don't know where it will
> > > go after tests and discussions, so the patches are intentionally kept simple.
> > > 
> > > sync livelock avoidance (need more to be complete, but this is minimal required for the last two patches)
> > > 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > > 
> > > let the flusher threads do ASYNC writeback for pageout()
> > > 	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > 	[PATCH 5/5] vmscan: transfer async file writeback to the flusher
> > 
> > I really do not like this - all it does is transfer random page writeback
> > from vmscan to the flusher threads rather than avoiding random page
> > writeback altogether. Random page writeback is nasty - just say no.
> 
> There are cases we have to do pageout().
> 
> - a stressed memcg with lots of dirty pages
> - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages

- a 32-bit highmem system, too

Can you please look at the following commit? It describes the current design.




commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date:   Sun Dec 22 01:07:33 2002 +0000

    [PATCH] Give kswapd writeback higher priority than pdflush

    The `low latency page reclaim' design works by preventing page
    allocators from blocking on request queues (and by preventing them from
    blocking against writeback of individual pages, but that is immaterial
    here).

    This has a problem under some situations.  pdflush (or a write(2)
    caller) could be saturating the queue with highmem pages.  This
    prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
    enormous amounts of scanning.

    A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
    then kill the mmapping applications.  The machine instantly goes from
    0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
    the least-recently-dirtied pages, which are all highmem.  The queue is
    congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
    50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
    efficiency (pages_reclaimed/pages_scanned) falls to 2%.

    So this patch changes the policy for kswapd.  kswapd may use all of a
    request queue, and is prepared to block on request queues.

    What will now happen in the above scenario is:

    1: The page allocator scans some pages, fails to reclaim enough
       memory and takes a nap in blk_congestion_wait().

    2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
       back pages.  (These pages will be rotated to the tail of the
       inactive list at IO-completion interrupt time).

       This writeback will saturate the queue with ZONE_NORMAL pages.
       Conveniently, pdflush will avoid the congested queues.  So we end up
       writing the correct pages.

    In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
    efficiency rises from 2% to 40% and things are generally a lot happier.


    The downside is that kswapd may now do a lot less page reclaim,
    increasing page allocation latency, causing more direct reclaim,
    increasing lock contention in the VM, etc.  But I have not been able to
    demonstrate that in testing.


    The other problem is that there is only one kswapd, and there are lots
    of disks.  That is a generic problem - without being able to co-opt
    user processes we don't have enough threads to keep lots of disks saturated.

    One fix for this would be to add an additional "really congested"
    threshold in the request queues, so kswapd can still perform
    nonblocking writeout.  This gives kswapd priority over pdflush while
    allowing kswapd to feed many disk queues.  I doubt if this will be
    called for.

    BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ
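
As a rough illustration of the policy that commit describes, here is a toy
model, not kernel code. The class, the function names and the threshold
numbers are all made up for the example; only the behaviour (pdflush avoids
a congested queue, kswapd may use the whole queue and blocks when it is
full, efficiency = pages_reclaimed/pages_scanned) comes from the commit
message.

```python
from dataclasses import dataclass


@dataclass
class RequestQueue:
    """Toy request queue; the thresholds are illustrative assumptions."""
    capacity: int = 128      # hypothetical hard limit of the queue
    congested_at: int = 112  # hypothetical "congested" mark
    in_flight: int = 0

    def congested(self) -> bool:
        return self.in_flight >= self.congested_at

    def full(self) -> bool:
        return self.in_flight >= self.capacity


def submit(queue: RequestQueue, writer: str) -> str:
    """Try to queue one page write; return 'queued', 'skipped' or 'blocked'."""
    if writer == "pdflush":
        # Non-blocking writer: it avoids a congested queue.
        if queue.congested():
            return "skipped"
    elif writer == "kswapd":
        # Policy after the patch: kswapd may use all of the queue and is
        # prepared to block once the queue is full.
        if queue.full():
            return "blocked"
    else:
        raise ValueError(f"unknown writer: {writer}")
    queue.in_flight += 1
    return "queued"


def reclaim_efficiency(pages_reclaimed: int, pages_scanned: int) -> float:
    """pages_reclaimed / pages_scanned, as defined in the commit message."""
    return pages_reclaimed / pages_scanned if pages_scanned else 0.0


def demo() -> tuple:
    """pdflush fills the queue to the congested mark, then kswapd continues."""
    q = RequestQueue()
    while submit(q, "pdflush") == "queued":
        pass
    after_pdflush = q.in_flight
    kswapd_queued = 0
    while submit(q, "kswapd") == "queued":
        kswapd_queued += 1
    return after_pdflush, kswapd_queued, q.in_flight
```

In this model pdflush stops submitting at the congested mark, while kswapd
can still queue ZONE_NORMAL pages in the remaining slots before it has to
wait. That is the "kswapd gets priority over pdflush" behaviour the current
design relies on.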


