Message-Id: <1283276257-1793-1-git-send-email-mel@csn.ul.ie>
Date:	Tue, 31 Aug 2010 18:37:34 +0100
From:	Mel Gorman <mel@....ul.ie>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Linux Kernel List <linux-kernel@...r.kernel.org>,
	linux-mm@...ck.org, Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Minchan Kim <minchan.kim@...il.com>,
	Christoph Lameter <cl@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mel@....ul.ie>
Subject: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V3

Changelog since V2
  o Minor clarifications
  o Rebase to 2.6.36-rc3

Changelog since V1
  o Fix for !CONFIG_SMP
  o Correct spelling mistakes
  o Clarify a ChangeLog
  o Only check for counter drift on machines large enough for the counter
    drift to breach the min watermark when NR_FREE_PAGES reports the low
    watermark is fine
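
As an illustration of the last V1 item, the "large enough" test can be
thought of as comparing the worst-case counter drift against the gap between
the low and min watermarks. This is only a sketch: the function name, the
parameters and the assumption that worst-case drift is the number of CPUs
multiplied by the per-cpu update threshold are illustrative and are not taken
from the patches themselves.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a machine is large enough for
 * per-cpu counter drift to matter. Assumes each CPU can hold back at
 * most per_cpu_threshold pages before folding into the global counter.
 */
static bool drift_check_needed(long nr_cpus, long per_cpu_threshold,
                               long min_wmark, long low_wmark)
{
    assert(low_wmark >= min_wmark);

    /* Worst case: every CPU is holding back a full threshold of pages. */
    long max_drift = nr_cpus * per_cpu_threshold;

    /* Drift that can be absorbed before the min watermark is breached. */
    long tolerable_drift = low_wmark - min_wmark;

    return max_drift > tolerable_drift;
}
```

With a 250-page gap between the watermarks, 4 CPUs with a threshold of 32
could drift by at most 128 pages and need no extra check, while 64 CPUs with
a threshold of 125 could drift by 8000 pages and would.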

Internal IBM test teams beta testing distribution kernels have reported
problems on machines with a large number of CPUs whereby page allocator
failure messages show huge differences between the nr_free_pages vmstat
counter and what is available on the buddy lists. In an extreme example,
nr_free_pages was above the min watermark but zero pages were on the buddy
lists, allowing the system to potentially livelock, unable to make forward
progress unless an allocation succeeds. There is no reason why the problems
would not affect mainline, so the following series mitigates the problems
in the page allocator related to per-cpu counter drift and lists.

The first patch ensures that counters are updated after pages are added to
free lists.

The second patch notes that the counter drift between nr_free_pages and what
is on the per-cpu lists can be very high. When memory is low and kswapd
is awake, the per-cpu counters are checked as well as reading the value
of NR_FREE_PAGES. This will slow the page allocator when memory is low and
kswapd is awake but it will be much harder to breach the min watermark and
potentially livelock the system.
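
The idea in the second patch can be modelled in userspace as below. This is
a toy sketch, not the kernel code: struct zone_sketch, NR_CPUS_SKETCH and
the three functions are hypothetical names invented for the example, and the
per-cpu deltas are a plain array rather than real per-cpu data.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4 /* hypothetical CPU count for the model */

/* Toy zone: one global counter plus per-cpu deltas not yet folded in. */
struct zone_sketch {
    long nr_free_pages;                    /* global value, may be stale */
    signed char cpu_delta[NR_CPUS_SKETCH]; /* pending per-cpu updates */
};

/* Cheap read used in the common case: global counter only. */
static long free_pages_fast(const struct zone_sketch *z)
{
    return z->nr_free_pages;
}

/* Slower read: add every CPU's pending delta to the global counter. */
static long free_pages_snapshot(const struct zone_sketch *z)
{
    long sum = z->nr_free_pages;

    for (size_t cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += z->cpu_delta[cpu];

    /* Deltas are read without locking, so clamp a transient negative. */
    return sum < 0 ? 0 : sum;
}

/* Pay for the accurate read only while memory is low and kswapd runs. */
static long free_pages_for_watermark(const struct zone_sketch *z,
                                     int kswapd_awake)
{
    assert(z != NULL);
    return kswapd_awake ? free_pages_snapshot(z) : free_pages_fast(z);
}
```

In this model a zone whose global counter reads 100 while the CPUs hold
deltas of -30, -30, -30 and -5 reports 100 on the fast path but only 5 on
the snapshot path, which is the kind of gap that lets the min watermark be
breached unnoticed.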

The third patch notes that after direct-reclaim an allocation can
fail because the necessary pages are on the per-cpu lists. After a
direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
a second attempt is made.
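
The drain-and-retry behaviour of the third patch can be sketched the same
way. Again this is a userspace model under stated assumptions: struct
pool_sketch, try_alloc(), drain_pcp() and alloc_after_reclaim() are
hypothetical names for the example and do not correspond to functions in
mm/page_alloc.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: free pages are either on the buddy lists or on pcp lists. */
struct pool_sketch {
    long buddy_free; /* pages an allocation can actually take */
    long pcp_free;   /* pages parked on per-cpu lists */
};

/* Allocation only sees the buddy lists. */
static bool try_alloc(struct pool_sketch *p, long nr)
{
    if (p->buddy_free < nr)
        return false;
    p->buddy_free -= nr;
    return true;
}

/* Move everything from the per-cpu lists back to the buddy lists. */
static void drain_pcp(struct pool_sketch *p)
{
    p->buddy_free += p->pcp_free;
    p->pcp_free = 0;
}

/* After reclaim: try once, and on failure drain and try exactly once more. */
static bool alloc_after_reclaim(struct pool_sketch *p, long nr)
{
    bool drained = false;

    assert(nr > 0);
retry:
    if (try_alloc(p, nr))
        return true;
    if (!drained) {
        drain_pcp(p);
        drained = true;
        goto retry;
    }
    return false;
}
```

The single retry keeps the cost of the drain off the common path: it is only
paid when an allocation has already failed after direct reclaim.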

Performance tests against 2.6.36-rc1 did not show up anything interesting. A
version of this series that continually called vmstat_update() when
memory was low was tested internally and found to help the counter drift
problem. I described this during LSF/MM Summit and the potential for IPI
storms was frowned upon. An alternative fix is in patch two which uses
for_each_online_cpu() to read the vmstat deltas while memory is low and
kswapd is awake. This should be functionally similar.

Christoph Lameter made two suggestions that I did not take action on. The
first was to make a generic helper that could be used to get a semi-accurate
reading of any vmstat counter.  However, there is no evidence this is
necessary and it would be better to get a clear understanding of what counter
other than NR_FREE_PAGES would need special treatment by making it obvious
when such a helper is introduced. The second suggestion was to shrink the
threshold at which vmstat gets updated, which would affect all counters. It
was unclear whether this was either sufficient or necessary. Again, only
NR_FREE_PAGES is the problem counter, so why affect every other counter? Also,
shrinking the threshold just shrinks the window the race can occur in. Hence,
I'm reposting the series as-is to see if there are any current objections to
deal with or if we can close up this problem now.

This patch should be merged after the patch "vmstat : update
zone stat threshold at onlining a cpu" which is in mmotm as
vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch . If we can
agree on it, it's a stable candidate.

 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++--------
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 77 insertions(+), 9 deletions(-)
