Message-ID: <20100324103749.GB21147@csn.ul.ie>
Date: Wed, 24 Mar 2010 10:37:49 +0000
From: Mel Gorman <mel@....ul.ie>
To: Christoph Lameter <cl@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Adam Litke <agl@...ibm.com>, Avi Kivity <avi@...hat.com>,
David Rientjes <rientjes@...gle.com>,
Minchan Kim <minchan.kim@...il.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Rik van Riel <riel@...hat.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH 11/11] Do not compact within a preferred zone after a
compaction failure
On Tue, Mar 23, 2010 at 02:27:08PM -0500, Christoph Lameter wrote:
> On Tue, 23 Mar 2010, Mel Gorman wrote:
>
> > I was having some sort of fit when I wrote that obviously. Try this on
> > for size
> >
> > The fragmentation index may indicate that a failure is due to external
> > fragmentation but after a compaction run completes, it is still possible
> > for an allocation to fail.
>
> Ok.
>
> > > > fail. There are two obvious reasons as to why
> > > >
> > > > o Page migration cannot move all pages so fragmentation remains
> > > > o A suitable page may exist but watermarks are not met
> > > >
> > > > In the event of compaction and allocation failure, this patch prevents
> > > > compaction happening for a short interval. It's only recorded on the
> > >
> > > compaction is "recorded"? deferred?
> > >
> >
> > deferred makes more sense.
> >
> > What I was thinking at the time was that compact_resume was stored in struct
> > zone - i.e. that is where it is recorded.
>
> Ok adding a dozen or more words here may be useful.
>
In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for a period of time. The zone that
is deferred is the first zone in the zonelist - i.e. the preferred zone.
To defer compaction in the other zones, the information would need to
be stored in the zonelist or implemented similarly to the zonelist_cache.
This would impact the fast paths and is not justified at this time.
How does that read?
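
For illustration, the mechanism boils down to something like the sketch
below. The helper names and the compact_resume field are taken from the
discussion above; treat this as a sketch of the idea, not the patch itself
(time_before() is the usual helper from linux/jiffies.h):

/* New field in struct zone: jiffies value before which compaction is deferred */
	unsigned long		compact_resume;

/* Record in the zone itself when compaction may be considered again */
static inline void defer_compaction(struct zone *zone, unsigned long resume)
{
	zone->compact_resume = resume;
}

/* True while the deferral window for this zone has not yet expired */
static inline int compaction_deferred(struct zone *zone)
{
	return time_before(jiffies, zone->compact_resume);
}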
> > > > preferred zone but that should be enough coverage. This could have been
> > > > implemented similar to the zonelist_cache but the increased size of the
> > > > zonelist did not appear to be justified.
> > >
> > > > @@ -1787,6 +1787,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > > > */
> > > > count_vm_event(COMPACTFAIL);
> > > >
> > > > + /* On failure, avoid compaction for a short time. */
> > > > + defer_compaction(preferred_zone, jiffies + HZ/50);
> > > > +
> > >
> > > 20ms? How was that interval determined?
> > >
> >
> > Matches the time the page allocator would defer to an event like
> > congestion. The choice is somewhat arbitrary. Ideally, there would be
> > some sort of event that would re-enable compaction but there wasn't an
> > obvious candidate so I used time.
>
> There are frequent uses of HZ/10 as well, especially in vmscan.c. A longer
> time may be better? HZ/50 looks like an interval for writeout. But this
> is related to reclaim?
>
HZ/10 is somewhat of an arbitrary choice as well, and there is no data on
which is better or worse. If the zone is full of dirty data, then HZ/10
makes sense for IO. If it happened to be mainly clean cache but under
heavy memory pressure, then reclaim would be a relatively fast event and a
shorter wait of HZ/50 makes sense.
The thing is, if we start with a short timer and it's too short, COMPACTFAIL
will grow steadily. If we choose a long timer and it's too long, there is
no counter to indicate it was a bad choice. Hence, I'd prefer to start with
the short timer and ideally resume compaction after some event in the future
rather than depending on time.

Does that make sense?
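
To make the trade-off concrete, the interaction in the allocation slowpath
amounts to roughly the following sketch. compact_and_allocate() here is a
stand-in for the real compact-then-retry path, not an actual function:

static struct page *try_compact_alloc(struct zone *preferred_zone,
			gfp_t gfp_mask, unsigned int order)
{
	struct page *page;

	/* Skip compaction entirely while the preferred zone is deferred */
	if (compaction_deferred(preferred_zone))
		return NULL;

	/* Stand-in for compacting the zone and retrying the allocation */
	page = compact_and_allocate(preferred_zone, gfp_mask, order);
	if (page)
		return page;

	count_vm_event(COMPACTFAIL);

	/*
	 * On failure, avoid compaction for a short time. HZ/50 (20ms)
	 * matches the allocator's existing congestion waits. Too short
	 * and COMPACTFAIL grows steadily; too long and no counter will
	 * show it was a bad choice.
	 */
	defer_compaction(preferred_zone, jiffies + HZ/50);
	return NULL;
}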
>
> backing-dev.h <global> 283 long congestion_wait(int sync, long timeout);
> 1 backing-dev.c <global> 762 EXPORT_SYMBOL(congestion_wait);
> 2 usercopy_32.c __copy_to_user_ll 754 congestion_wait(BLK_RW_ASYNC, HZ/50);
> 3 pktcdvd.c pkt_make_request 2557 congestion_wait(BLK_RW_ASYNC, HZ);
> 4 dm-crypt.c kcryptd_crypt_write_convert 834 congestion_wait(BLK_RW_ASYNC, HZ/100);
> 5 file.c fat_file_release 137 congestion_wait(BLK_RW_ASYNC, HZ/10);
> 6 journal.c reiserfs_async_progress_wait 990 congestion_wait(BLK_RW_ASYNC, HZ / 10);
> 7 kmem.c kmem_alloc 61 congestion_wait(BLK_RW_ASYNC, HZ/50);
> 8 kmem.c kmem_zone_alloc 117 congestion_wait(BLK_RW_ASYNC, HZ/50);
> 9 xfs_buf.c _xfs_buf_lookup_pages 343 congestion_wait(BLK_RW_ASYNC, HZ/50);
> a backing-dev.c congestion_wait 751 long congestion_wait(int sync, long timeout)
> b memcontrol.c mem_cgroup_force_empty 2858 congestion_wait(BLK_RW_ASYNC, HZ/10);
> c page-writeback.c throttle_vm_writeout 674 congestion_wait(BLK_RW_ASYNC, HZ/10);
> d page_alloc.c __alloc_pages_high_priority 1753 congestion_wait(BLK_RW_ASYNC, HZ/50);
> e page_alloc.c __alloc_pages_slowpath 1924 congestion_wait(BLK_RW_ASYNC, HZ/50);
> f vmscan.c shrink_inactive_list 1136 congestion_wait(BLK_RW_ASYNC, HZ/10);
> g vmscan.c shrink_inactive_list 1220 congestion_wait(BLK_RW_ASYNC, HZ/10);
> h vmscan.c do_try_to_free_pages 1837 congestion_wait(BLK_RW_ASYNC, HZ/10);
> i vmscan.c balance_pgdat 2161 congestion_wait(BLK_RW_ASYNC, HZ/10);
>
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab