linux-kernel - Re: [00/41] Large Blocksize Support V7 (adds memmap support)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070917100350.GC25706@skynet.ie>
Date:	Mon, 17 Sep 2007 11:03:50 +0100
From:	mel@...net.ie (Mel Gorman)
To:	Goswin von Brederlow <brederlo@...ormatik.uni-tuebingen.de>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Joern Engel <joern@...fs.org>,
	Nick Piggin <nickpiggin@...oo.com.au>,
	Christoph Lameter <clameter@....com>, andrea@...e.de,
	torvalds@...ux-foundation.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Christoph Hellwig <hch@....de>,
	William Lee Irwin III <wli@...omorphy.com>,
	David Chinner <dgc@....com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Badari Pulavarty <pbadari@...il.com>,
	Maxim Levitsky <maximlevitsky@...il.com>,
	Fengguang Wu <fengguang.wu@...il.com>,
	swin wang <wangswin@...il.com>, totty.lu@...il.com,
	hugh@...itas.com
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
> mel@...net.ie (Mel Gorman) writes:
> 
> > On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> >> Andrew Morton <akpm@...ux-foundation.org> writes:
> >> 
> >> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <joern@...fs.org> wrote:
> >> >
> >> >> While I agree with your concern, those numbers are quite silly.  The
> >> >> chances of 99.8% of pages being free and the remaining 0.2% being
> >> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
> >> >> creating a collision.
> >> >
> >> > Actually it'd be pretty easy to craft an application which allocates seven
> >> > pages for pagecache, then one for <something>, then seven for pagecache, then
> >> > one for <something>, etc.
> >> >
> >> > I've had test apps which do that sort of thing accidentally.  The result
> >> > wasn't pretty.
> >> 
> >> Except that the applications 7 pages are movable and the <something>
> >> would have to be unmovable. And then they should not share the same
> >> memory region. At least they should never be allowed to interleave in
> >> such a pattern on a larger scale.
> >> 
> >
> > It is actually really easy to force regions to never share. At the
> > moment, there is a fallback list that determines a preference for what
> > block to mix.
> >
> > The reason why this isn't enforced is the cost of moving. On x86 and
> > x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
> > those pages to prevent any mixing would be bad enough. On PowerPC, it's
> > potentially 16MB. On IA64, it's 1GB.
> >
> > As this was fragmentation avoidance, not guarantees, the decision was
> > made to not strictly enforce the types of pages within a block as the
> > cost cannot be made back unless the system was making agressive use of
> > large pages. This is not the case with Linux.
> 
> I don't say the group should never be mixed. The movable objects could
> be moved out on demand. If 64k get allocated then up to 64k get
> moved.

This type of action makes sense in the context of Andrea's approach and
large blocks. I don't think it makes sense today to do it in the general
case, at least not yet.

> That would reduce the impact as the kernel does not hang while
> it moves 2MB or even 1GB. It also allows objects to be freed and the
> space reused in the unmovable and mixed groups. There could also be a
> certain number or percentage of mixed groupd be allowed to further
> increase the chance of movable objects freeing themself from mixed
> groups.
> 
> But when you already have say 10% of the ram in mixed groups then it
> is a sign the external fragmentation happens and some time should be
> spend on moving movable objects.
> 

I'll play around with it on the side and see what sort of results I get.
I won't be pushing anything any time soon in relation to this though.
For now, I don't intend to fiddle more with grouping pages by mobility
for something that may or may not be of benefit to a feature that hasn't
been widely tested with what exists today.

> >> The only way a fragmentation catastroph can be (proovable) avoided is
> >> by having so few unmovable objects that size + max waste << ram
> >> size. The smaller the better. Allowing movable and unmovable objects
> >> to mix means that max waste goes way up. In your example waste would
> >> be 7*size. With 2MB uper order limit it would be 511*size.
> >> 
> >> I keep coming back to the fact that movable objects should be moved
> >> out of the way for unmovable ones. Anything else just allows
> >> fragmentation to build up.
> >> 
> >
> > This is easily achieved, just really really expensive because of the
> > amount of copying that would have to take place. It would also compel
> > that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> > MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> > a lot of free memory to keep around which is why fragmentation avoidance
> > doesn't do it.
> 
> In your sample graphics you had 1152 groups. Reserving a few of those
> doesnt sound too bad.

No, which on those systems, I would suggest setting min_free_kbytes to a
higher value. Doesn't work as well on IA-64.

> And how many migrate types do we talk about.
> So
> far we only had movable and unmovable.

Movable, unmovable, reclaimable and reserve in the current incarnation
of grouping pages by mobility.

> I would split unmovable into
> short term (caches, I/O pages) and long term (task structures,
> dentries).

Mostly done as you suggest already. Dentry are considered reclaimable, not
long-lived though and are grouped with things like inode caches for example.

> Reserving 6 groups for schort term unmovable and long term
> unmovable would be 1% of ram in your situation.
> 

More groups = more cost although very easy to add them. A mixed type
used to exist but was removed again because it couldn't be proved to be
useful at the time.

> Maybe instead of reserving one could say that you can have up to 6
> groups of space

And if the groups are 1GB in size? I tried something like this already.
It didn't work out well at the time although I could revisit.

> not used by unmovable objects before aggressive moving
> starts. I don't quite see why you NEED reserving as long as there is
> enough space free alltogether in case something needs moving.

hence, increase min_free_kbytes.

> 1 group
> worth of space free might be plenty to move stuff too. Note that all
> the virtual pages can be stuffed in every little free space there is
> and reassembled by the MMU. There is no space lost there.
> 

What you suggest sounds similar to having a type MIGRATE_MIXED where you
allocate from when the preferred lists are full. It became a sizing
problem that never really worked out. As I said, I can try again.

> But until one tries one can't say.
> 
> MfG
>         Goswin
> 
> PS: How do allocations pick groups?

Using GFP flags to identify the type.

> Could one use the oldest group
> dedicated to each MIGRATE_TYPE?

Age is difficult to determine so probably not.

> Or lowest address for unmovable and
> highest address for movable? Something to better keep the two out of
> each other way.

We bias the location of unmovable and reclaimable allocations already. It's
not done for movable because it wasn't necessary (as they are easily
reclaimed or moved anyway).

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/