linux-kernel - Why PAGEOUT_IO_SYNC stalls for a long time

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20100728191322.4A85.A69D9226@jp.fujitsu.com>
Date:	Wed, 28 Jul 2010 20:40:21 +0900 (JST)
From:	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	kosaki.motohiro@...fujitsu.com,
	Andrew Morton <akpm@...ux-foundation.org>, stable@...nel.org,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mel@....ul.ie>,
	Christoph Hellwig <hch@...radead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Dave Chinner <david@...morbit.com>,
	Chris Mason <chris.mason@...cle.com>,
	Nick Piggin <npiggin@...e.de>,
	Johannes Weiner <hannes@...xchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Minchan Kim <minchan.kim@...il.com>,
	Andreas Mohr <andi@...as.de>, Bill Davidsen <davidsen@....com>,
	Ben Gamari <bgamari.foss@...il.com>
Subject: Why PAGEOUT_IO_SYNC stalls for a long time

In this week, I've tested some IO congested workload for a while. and probably
I did reproduced Andreas's issue.

So, I would like to explain current lumpy reclaim how works and why so much sucks.

1. Now isolate_lru_pages() have following pfn neighber grabbing logic.

                for (; pfn < end_pfn; pfn++) {
(snip)
                        if (__isolate_lru_page(cursor_page, mode, file) == 0) {
                                list_move(&cursor_page->lru, dst);
                                mem_cgroup_del_lru(cursor_page);
                                nr_taken++;
                                nr_lumpy_taken++;
                                if (PageDirty(cursor_page))
                                        nr_lumpy_dirty++;
                                scan++;
                        } else {
                                if (mode == ISOLATE_BOTH &&
                                                page_count(cursor_page))
                                        nr_lumpy_failed++;
                        }
                }

Mainly, __isolate_lru_page() failure can be caused following reasons.
  (1) the page have already been freed and is in buddy.
  (2) the page is used for non user process purpose
  (3) the page is unevictable (e.g. mlocked)

(2), (3) have very different characteristic from (1). the lumpy reclaim
mean 'contenious physical memory reclaiming'. that said, if we are trying
order 9 reclaim, 512 pages reclaim success and 511 pages reclaim success
are completely differennt. former mean lumpy reclaim successfull, latter mean
failure. So, if (2) or (3) occur, that pfn have lost a possibility of lumpy
reclaim successfull. then, we should stop pfn neighbor search immediately and
try to get lru next page. (i.e. we should use 'break' statement instead 'continue')

2. synchronous lumpy reclaim condition is insane.

currently, synchrounous lumpy reclaim will be invoked when following
condition.

        if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
                        sc->lumpy_reclaim_mode) {

but "nr_reclaimed < nr_taken" is pretty stupid. if isolated pages have
much dirty pages, pageout() only issue first 113 IOs.
(if io queue have >113 requests, bdi_write_congested() return true and
 may_write_to_queue() return false)

So, we haven't call ->writepage(), congestion_wait() and wait_on_page_writeback()
are surely stupid.

3. pageout() is intended anynchronous api. but doesn't works so.

pageout() call ->writepage with wbc->nonblocking=1. because if the system have
default vm.dirty_ratio (i.e. 20), we have 80% clean memory. so, getting stuck
on one page is stupid, we should scan much pages as soon as possible.

HOWEVER, block layer ignore this argument. if slow usb memory device connect
to the system, ->writepage() will sleep long time. because submit_bio() call
get_request_wait() unconditionally and it doesn't have any PF_MEMALLOC task
bonus.

4. synchronous lumpy reclaim call clear_active_flags(). but it is also silly.

Now, page_check_references() ignore pte young bit when we are processing lumpy reclaim.
Then, In almostly case, PageActive() mean "swap device is full". Therefore,
waiting IO and retry pageout() are just silly.

In andres's case, congestion_wait() and get_request_wait() are root cause.
Other issue is problematic when more higher order lumpy reclaim.

Now, I'm preparing some patches and probably I can send them tommorow.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/