linux-kernel - Re: zone_reclaimable() leads to livelock in __alloc_pages

Date:	Mon, 23 May 2016 09:29:04 +0200
From:	Michal Hocko <mhocko@...nel.org>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Mel Gorman <mgorman@...hsingularity.net>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

Hi,
Tetsuo has already pointed you at my oom detection rework which removes
the zone_reclaimable ugliness (btw. one of the top reasons to rework
this area) and it is likely to fix your problem. I would still like to
understand what happens with your test case because we might want to
prepare a stable patch for older kernels.

On Fri 20-05-16 22:28:17, Oleg Nesterov wrote:
> I don't understand vmscan.c, and in fact I don't even understand NR_PAGES_SCANNED
[...]
> counter... why it has to be atomic/per-cpu? It is always updated under ->lru_lock
> except free_pcppages_bulk/free_one_page try to reset this counter. But note that
> they both do

It doesn't really have to be atomic/per-cpu because it is really updated
under the lock. It just uses the generic vmstat infrastructure...

> 	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> 	if (nr_scanned)
> 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> 
> and this doesn't look exactly right: zone_page_state() ignores the per-cpu
> ->vm_stat_diff[] counters (and we probably do not want for_each_online_cpu()
> loop here). And I do not know if this is really bad or not, but note that if
> I change calculate_normal_threshold() to return 0, the problem goes away too.

You are absolutely right that this is racy. In the worst case we would
end up missing nr_cpus*threshold scanned pages which would stay behind.
But

bool zone_reclaimable(struct zone *zone)
{
	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
		zone_reclaimable_pages(zone) * 6;
}

So the leftover shouldn't cause it to return true all the time. In
fact it could prematurely say false, right? (Note that the _snapshot
variant considers the per-cpu diffs [1].)
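
To make the race concrete, here is a minimal userspace sketch of the
per-cpu counter scheme, not the kernel code itself. NR_CPUS, THRESHOLD,
struct zone_model and all the function names below are made up for
illustration; only the fold-above-threshold rule and the difference
between the plain and the _snapshot read are meant to mirror vmstat.

```c
#define NR_CPUS   4
#define THRESHOLD 8	/* hypothetical per-cpu stat threshold */

/* Toy model of one zone counter: a global value plus per-cpu deltas. */
struct zone_model {
	long vm_stat;			/* global counter */
	long vm_stat_diff[NR_CPUS];	/* per-cpu deltas not yet folded */
};

/* Accumulate per-cpu; fold into the global counter past the threshold. */
static void mod_state(struct zone_model *z, int cpu, long delta)
{
	long x = z->vm_stat_diff[cpu] + delta;

	if (x > THRESHOLD || x < -THRESHOLD) {
		z->vm_stat += x;
		x = 0;
	}
	z->vm_stat_diff[cpu] = x;
}

/* Plain read: global counter only, per-cpu deltas are ignored. */
static long page_state(const struct zone_model *z)
{
	return z->vm_stat < 0 ? 0 : z->vm_stat;
}

/* Snapshot read: global counter plus all per-cpu deltas. */
static long page_state_snapshot(const struct zone_model *z)
{
	long x = z->vm_stat;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		x += z->vm_stat_diff[cpu];
	return x < 0 ? 0 : x;
}

/* The reset pattern quoted above, built on the plain read. */
static void reset_scanned(struct zone_model *z, int cpu)
{
	long nr = page_state(z);

	if (nr)
		mod_state(z, cpu, -nr);
}

/* Same comparison as zone_reclaimable(), reclaimable pages passed in. */
static int reclaimable_model(const struct zone_model *z, long reclaimable)
{
	return page_state_snapshot(z) < reclaimable * 6;
}

/* Every CPU scans exactly THRESHOLD pages, then CPU 0 tries to reset. */
static long leftover_after_reset(struct zone_model *z)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		mod_state(z, cpu, THRESHOLD);	/* stays in the per-cpu diff */
	reset_scanned(z, 0);			/* plain read sees 0, no-op */
	return page_state_snapshot(z);
}
```

With these numbers, leftover_after_reset() on a zeroed zone_model
leaves NR_CPUS * THRESHOLD = 32 scanned pages behind that only the
snapshot read can see, and reclaimable_model() then says false for 5
reclaimable pages (32 < 30 fails) but true for 6 (32 < 36).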

That being said, I am not really sure why the 0 threshold would help
for your test case. Could you add some tracing and see what the numbers
above are? Is it possible that zone_reclaimable_pages is some small
number which actually prevents us from scanning anything? Aka a bug in
get_scan_count or somewhere else?

[1] I am not really sure which kernel version you have tested - your
config says 4.6.0-rc7 but this is true since 0db2cb8da89d ("mm, vmscan:
make zone_reclaimable_pages more precise") which is 4.6-rc1.
-- 
Michal Hocko
SUSE Labs