linux-kernel - Re: [00/41] Large Blocksize Support V7 (adds memmap support)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070916213126.GH6708@v2.random>
Date:	Sun, 16 Sep 2007 23:31:26 +0200
From:	Andrea Arcangeli <andrea@...e.de>
To:	Mel Gorman <mel@...net.ie>
Cc:	Goswin von Brederlow <brederlo@...ormatik.uni-tuebingen.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Joern Engel <joern@...fs.org>,
	Nick Piggin <nickpiggin@...oo.com.au>,
	Christoph Lameter <clameter@....com>,
	torvalds@...ux-foundation.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Christoph Hellwig <hch@....de>,
	William Lee Irwin III <wli@...omorphy.com>,
	David Chinner <dgc@....com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Badari Pulavarty <pbadari@...il.com>,
	Maxim Levitsky <maximlevitsky@...il.com>,
	Fengguang Wu <fengguang.wu@...il.com>,
	swin wang <wangswin@...il.com>, totty.lu@...il.com,
	hugh@...itas.com
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> The 16MB is the size of a hugepage, the size of interest as far as I am
> concerned. Your idea makes sense for large block support, but much less
> for huge pages because you are incurring a cost in the general case for
> something that may not be used.

Sorry for the misunderstanding, I totally agree!

> There is nothing to say that both can't be done. Raise the size of
> order-0 for large block support and continue trying to group the block
> to make hugepage allocations likely succeed during the lifetime of the
> system.

Sure, I completely agree.

> At the risk of repeating, your approach will be adding a new and
> significant dimension to the internal fragmentation problem where a
> kernel allocation may fail because the larger order-0 pages are all
> being pinned by userspace pages.

This is exactly correct, some memory will be wasted. It'll reach 0
free memory more quickly depending on which kind of applications are
being run.

> It improves the probabilty of hugepage allocations working because the
> blocks with slab pages can be targetted and cleared if necessary.

Agreed.

> That suggestion was aimed at the large block support more than
> hugepages. It helps large blocks because we'll be allocating and freeing
> as more or less the same size. It certainly is easier to set
> slub_min_order to the same size as what is needed for large blocks in
> the system than introducing the necessary mechanics to allocate
> pagetable pages and userspace pages from slab.

Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
really complex indeed. I'm not planning for that in the short term but
it remains a possibility to make the kernel more generic. Perhaps it
could worth it...

Allocating ptes from slab is fairly simple but I think it would be
better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
nearby ptes in the per-task local pagetable tree, to reduce the number
of locks taken and not to enter the slab at all for that. Infact we
could allocate the 4 levels (or anyway more than one level) in one
single alloc_pages(0) and track the leftovers in the mm (or similar).

> I'm not sure what you are getting at here. I think it would make more
> sense if you said "when you read /proc/buddyinfo, you know the order-0
> pages are really free for use with large blocks" with your approach.

I'm unsure who reads /proc/buddyinfo (that can change a lot and that
is not very significant information if the vm can defrag well inside
the reclaim code), but it's not much different and it's more about
knowing the real meaning of /proc/meminfo, freeable (unmapped) cache,
anon ram, and free memory.

The idea is that to succeed an mmap over a large xfs file with
mlockall being invoked, those largepages must become available or
it'll be oom despite there are still 512M free... I'm quite sure
admins will gets confused if they get oom killer invoked with lots of
ram still free.

The overcommit feature will also break, just to make an example (so
much for overcommit 2 guaranteeing -ENOMEM retvals instead of oom
killage ;).

> All this aside, there is nothing mutually exclusive with what you are proposing
> and what grouping pages by mobility does. Your stuff can exist even if grouping
> pages by mobility is in place. In fact, it'll help us by giving an important
> comparison point as grouping pages by mobility can be trivially disabled with
> a one-liner for the purposes of testing. If your approach is brought to being
> a general solution that also helps hugepage allocation, then we can revisit
> grouping pages by mobility by comparing kernels with it enabled and without.

Yes, I totally agree. It sounds worthwhile to have a good defrag logic
in the VM. Even allocating the kernel stack in today kernels should be
able to benefit from your work. It's just comparing a fork() failure
(something that will happen with ulimit -n too and that apps must be
able to deal with) with an I/O failure that worries me a bit. I'm
quite sure a db failing I/O will not recover too nicely. If fork fails
that's most certainly ok... at worst a new client won't be able to
connect and he can retry later. Plus order 1 isn't really a big deal,
you know the probability to succeeds decreases exponentially with the
order.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/