Message-ID: <20111207135802.GF12673@cmpxchg.org>
Date:	Wed, 7 Dec 2011 14:58:02 +0100
From:	Johannes Weiner <hannes@...xchg.org>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	Minchan Kim <minchan.kim@...il.com>,
	Michal Hocko <mhocko@...e.cz>,
	Christoph Hellwig <hch@...radead.org>,
	Wu Fengguang <fengguang.wu@...el.com>,
	Dave Chinner <david@...morbit.com>, Jan Kara <jack@...e.cz>,
	Shaohua Li <shaohua.li@...el.com>, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory

On Tue, Nov 29, 2011 at 04:20:14PM -0800, Andrew Morton wrote:
> On Wed, 23 Nov 2011 14:34:14 +0100
> Johannes Weiner <hannes@...xchg.org> wrote:
> 
> > From: Johannes Weiner <jweiner@...hat.com>
> > 
> > The amount of dirtyable pages should not include the full number of
> > free pages: there is a number of reserved pages that the page
> > allocator and kswapd always try to keep free.
> > 
> > The closer (reclaimable pages - dirty pages) is to the number of
> > reserved pages, the more likely it becomes for reclaim to run into
> > dirty pages:
> > 
> >        +----------+ ---
> >        |   anon   |  |
> >        +----------+  |
> >        |          |  |
> >        |          |  -- dirty limit new    -- flusher new
> >        |   file   |  |                     |
> >        |          |  |                     |
> >        |          |  -- dirty limit old    -- flusher old
> >        |          |                        |
> >        +----------+                       --- reclaim
> >        | reserved |
> >        +----------+
> >        |  kernel  |
> >        +----------+
> > 
> > This patch introduces a per-zone dirty reserve that takes both the
> > lowmem reserve as well as the high watermark of the zone into account,
> > and a global sum of those per-zone values that is subtracted from the
> > global amount of dirtyable pages.  The lowmem reserve is unavailable
> > to page cache allocations and kswapd tries to keep the high watermark
> > free.  We don't want to end up in a situation where reclaim has to
> > clean pages in order to balance zones.
> > 
> > Not treating reserved pages as dirtyable on a global level is only a
> > conceptual fix.  In reality, dirty pages are not distributed equally
> > across zones and reclaim runs into dirty pages on a regular basis.
> > 
> > But it is important to get this right before tackling the problem on a
> > per-zone level, where the distance between reclaim and the dirty pages
> > is mostly much smaller in absolute numbers.
> > 
> > ...
> >
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -327,7 +327,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >  			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> >  
> >  		x += zone_page_state(z, NR_FREE_PAGES) +
> > -		     zone_reclaimable_pages(z);
> > +		     zone_reclaimable_pages(z) -
> > +		     zone->dirty_balance_reserve;
> 
> Doesn't compile.  s/zone/z/.
> 
> Which makes me suspect it wasn't tested on a highmem box.  This is
> rather worrisome, as highmem machines tend to have acute and unique
> zone balancing issues.

You are right, so I ran fs_mark on an x86 machine with 8GB and a
32-bit kernel.
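
For reference, here is the quoted hunk with Andrew's s/zone/z/ fix
applied, i.e. what the runs below were done with:

		x += zone_page_state(z, NR_FREE_PAGES) +
		     zone_reclaimable_pages(z) -
		     z->dirty_balance_reserve;

The fs_mark invocation: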

fs_mark  -S  0  -d  work-01  -d  work-02  -d  work-03  -d  work-04  -D  128  -N  128  -L  16  -n  512  -s  655360

This translates to 4 threads, each doing 16 iterations over a new set
of 512 files per iteration, with each file 640k in size, which adds up
to 20G of written data per run.  The results are gathered over 5 runs.
The data is written to an ext4 filesystem on a standard consumer
rotational disk.
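
(Sanity check on the volume: 4 threads * 16 iterations * 512 files *
640k per file = 32768 files * 640k = 20G per run.)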

The overall runtime of the load was the same for both kernels:

		seconds
		mean(stddev)
 vanilla:	242.061(0.953)
 patched:	242.726(1.714)

Allocation counts confirm that allocation placement does not change:

		pgalloc_dma		pgalloc_normal				pgalloc_high
		min|median|max	
 vanilla:	0.000|0.000|0.000	3733291.000|3742709.000|4034662.000	5189412.000|5202220.000|5208743.000
 patched:	0.000|0.000|0.000	3716148.000|3733269.000|4032205.000	5212301.000|5216834.000|5227756.000

Kswapd in both kernels did the same amount of work in each zone over
the course of the workload; direct reclaim was never invoked:

		pgscan_kswapd_dma	pgscan_kswapd_normal			pgscan_kswapd_high
		min|median|max
 vanilla:	0.000|0.000|0.000	109919.000|115773.000|117952.000	3235879.000|3246707.000|3255205.000
 patched:	0.000|0.000|0.000	104169.000|114845.000|117657.000	3241327.000|3246835.000|3257843.000

		pgsteal_dma		pgsteal_normal				pgsteal_high
		min|median|max
 vanilla:	0.000|0.000|0.000	109912.000|115766.000|117945.000	3235318.000|3246632.000|3255098.000
 patched:	0.000|0.000|0.000	104163.000|114839.000|117651.000	3240765.000|3246760.000|3257768.000

and the distribution of scans over time was equivalent, with no new
hiccups or scan spikes:

		pgscan_kswapd_dma/s	pgscan_kswapd_normal/s			pgscan_kswapd_high/s
		min|median|max
 vanilla:	0.000|0.000|0.000	0.000|144.000|2100.000			0.000|15582.500|44916.000
 patched:	0.000|0.000|0.000	0.000|152.000|2058.000			0.000|15361.000|44453.000

		pgsteal_dma/s		pgsteal_normal/s			pgsteal_high/s
		min|median|max
 vanilla:	0.000|0.000|0.000	0.000|144.000|2094.000			0.000|15582.500|44916.000
 patched:	0.000|0.000|0.000	0.000|152.000|2058.000			0.000|15361.000|44453.000


				fs_mark 1G

The same fs_mark load was run on the system limited to 1G of memory
(booted with mem=1G), to get a highmem zone that is much smaller than
the rest of the system.

		seconds
		mean(stddev)
 vanilla:	238.428(3.810)	
 patched:	241.392(0.221)	

In this case, allocation placement did shift slightly towards the
lower zones, to protect the tiny highmem zone from becoming
unreclaimable due to dirty pages:

		pgalloc_dma			pgalloc_normal				pgalloc_high
		min|median|max
 vanilla:	20658.000|21863.000|23231.000	4017580.000|4023331.000|4038774.000	1057246.000|1076280.000|1083824.000
 patched:	25403.000|27679.000|28556.000	4163538.000|4172116.000|4179151.000	 917054.000| 922206.000| 933609.000

However, while there were more allocations overall in the DMA and
Normal zones, the utilization peaks of the individual zones were
actually reduced thanks to the smoother distribution:

		DMA min nr_free_pages		Normal min nr_free_pages		HighMem min nr_free_pages
 vanilla:	1244.000			14819.000				432.000
 patched:	1337.000			14850.000				439.000

Keep in mind that the lower zones are only used more often for
allocation because they are providing dirtyable memory in this
scenario, i.e. they have space to spare.

With increasing lowmem usage for things that truly need lowmem, like
dcache and page tables, the amount of memory we consider dirtyable
(free pages + file pages) shrinks, so when highmem is not allowed to
take any more dirty pages, we will not thrash on the lower zones:
either they have space left, or the dirtiers are already being
throttled in balance_dirty_pages().
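
To spell out the bookkeeping, here is a rough sketch of the idea in
code.  It is not the literal patch; the names are only meant to mirror
mm/page-writeback.c:

	/*
	 * dirty_balance_reserve is the sum over all zones of the pages
	 * that the allocator and kswapd keep free anyway: the zone's
	 * biggest lowmem reserve plus its high watermark.  Those pages
	 * are never available to the page cache, so they must not be
	 * counted as dirtyable.
	 */
	static unsigned long determine_dirtyable_memory(void)
	{
		unsigned long x;

		x = global_page_state(NR_FREE_PAGES) +
		    global_reclaimable_pages() -
		    dirty_balance_reserve;

		if (!vm_highmem_is_dirtyable)
			x -= highmem_dirtyable_memory(x);

		return x + 1;	/* make sure it is never zero */
	}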

Reclaim numbers suggest that kswapd can easily keep up with the
increased allocation frequency in the Normal zone.  But for DMA, it
looks like the unpatched kernel flooded the zone with dirty pages
every once in a while, making it ineligible for allocations until
those pages were cleaned.  Through better distribution, the patch
improves reclaim efficiency (reclaimed/scanned) from 32% to 100% for
DMA:

		pgscan_kswapd_dma		pgscan_kswapd_normal			pgscan_kswapd_high
		min|median|max
 vanilla:	39734.000|41248.000|41965.000	3692050.000|3696209.000|3716653.000	970411.000|987483.000|991469.000
 patched:	21204.000|23901.000|25141.000	3874782.000|3879125.000|3888302.000	793141.000|795631.000|803482.000

		pgsteal_dma			pgsteal_normal				pgsteal_high
		min|median|max
 vanilla:	12932.000|14044.000|16957.000	3692025.000|3696183.000|3716626.000	966050.000|987386.000|991405.000
 patched:	21204.000|23901.000|25141.000	3874771.000|3879095.000|3888284.000	792079.000|795572.000|803370.000

And the increased reclaim efficiency in the DMA zone indeed correlates
with the reduced likelihood of reclaim running into dirty pages:

	DMA						Normal				Highmem
	nr_vmscan_write	nr_vmscan_immediate_reclaim	(same two counters per zone; one row per run)

vanilla:
	26.0	19614.0					0.0	0.0			1174.0	0.0
	0.0	21737.0					0.0	1.0			0.0	0.0
	0.0	22101.0					0.0	0.0			0.0	0.0
	0.0	21906.0					0.0	0.0			0.0	0.0
	0.0	21880.0					0.0	0.0			0.0	0.0

patched:
	0.0	0.0					0.0	1.0			502.0	0.0
	0.0	0.0					0.0	0.0			0.0	0.0
	0.0	0.0					0.0	0.0			0.0	0.0
	0.0	0.0					0.0	0.0			0.0	0.0
	0.0	0.0					0.0	1.0			0.0	0.0
