linux-kernel - Re: [00/17] Large Blocksize Support V3

Message-ID: <20070426084012.GA26926@skynet.ie>
Date:	Thu, 26 Apr 2007 09:40:12 +0100
From:	mel@...net.ie (Mel Gorman)
To:	Nick Piggin <nickpiggin@...oo.com.au>
Cc:	David Chinner <dgc@....com>,
	"Eric W. Biederman" <ebiederm@...ssion.com>, clameter@....com,
	linux-kernel@...r.kernel.org,
	William Lee Irwin III <wli@...omorphy.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Badari Pulavarty <pbadari@...il.com>,
	Maxim Levitsky <maximlevitsky@...il.com>
Subject: Re: [00/17] Large Blocksize Support V3

On (26/04/07 16:50), Nick Piggin didst pronounce:
> David Chinner wrote:
> >On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
> >
> >>I think starting with the assumption that we _want_ to use higher order
> >>allocations, and then creating all this complexity around that is not a
> >>good one, and if we start introducing things that _require_ significant
> >>higher order allocations to function then it is a nasty thing for
> >>robustness.
> >
> >
> >From my POV, we started with the problem of how to provide atomic
> >access to a multi-page block in the page cache. For example, we want
> >to lock the filesystem block and prevent any updates to it, so we
> >need to lock all the pages in it. And then when we write them back,
> >they all need to change state at the same time, and they all need to
> >have their radix tree tags changed at the same time, the problem of
> >mapping them to disk, getting writeback to do block aligned and
> >sized writeback chunks, and so on.
> >
> >And then there's the problem that most hardware is limited to 128
> >s/g entries and that means 128 non-contiguous pages in memory is the
> >maximum I/O size we can issue to these devices. We have RAID arrays
> >that go twice as fast if we can send them 1MB I/Os instead of 512k
> >I/Os and that means we need contiguous pages to be handled to the
> >devices....
> >
> >All of these things require some aggregating structure to
> >co-ordinate. In times gone by on other OSs, this has been done with
> >a buffer cache, but Linux got rid of that long ago and we don't want
> >to reintroduce one. We can't use buffer heads - they can only point
> >to one page. So what do we do?
> 
> Improving the buffer layer would be a good way. Of course, that is
> a long and difficult task, so nobody wants to do it.
> 
> 
> >That's where compound pages are so nice - they solve all of these
> >problems with only a very small amount of code perturbation and they
> >don't change any algorithms or fundamental design of the OS at all.
> >
> >We don't have to rewrite filesystems to support this. We don't have
> >to redesign the VM to support this. We don't have to do very much
> >work to the block layer and drivers to make this work.
> 
> Fragmentation is the problem. The anti-frag patches don't actually
> guarantee anything about fragmentation, and even if they did, then

The grouping-pages-by-mobility patches do not guarantee anything, but the
memory partition (kernelcore= boot parameter) does give hard guarantees about
the amount of memory that is "movable". Of course, the partition requires
configuration at boot time, so it's less than ideal, but it does give hard
guarantees.

> it takes more effort to actually find and reclaim higher order
> pages (even after the lumpy reclaim thing). So we've redesigned
> the page allocator and page reclaim and still don't have something
> that you should rely on without fallbacks.
> 
> 
> >FWIW, if you want 32 bit machines to support larger than 16TB
> >devices, you need high order page indexing in the page cache....
> 
> How about a 64-bit pgoff_t and radix-tree key? Or larger PAGE_SIZE?
> 

A larger page size can result in wastage for small files, and there is no
guarantee that the hardware supports the larger page size, so you now have to
deal with mapping pages through hardware PTEs that are smaller than PAGE_SIZE.
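
To put rough numbers on both sides of that trade-off, here is a small sketch.
It is illustrative arithmetic only, not code from any patch in this series, and
the function names are made up for the example. It assumes the page cache index
is a plain unsigned integer of index_bits bits and that only the last page of a
file is partially used.

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte range a page cache index of index_bits bits can address
 * when each index covers (1 << page_shift) bytes. */
static uint64_t max_indexable_bytes(unsigned index_bits, unsigned page_shift)
{
    assert(index_bits + page_shift < 64);   /* result must fit in 64 bits */
    return (uint64_t)1 << (index_bits + page_shift);
}

/* Bytes left unused in the last page when a file of file_size bytes
 * is cached in pages of page_size bytes. */
static uint64_t tail_waste(uint64_t file_size, uint64_t page_size)
{
    assert(page_size != 0);
    uint64_t rem = file_size % page_size;
    return rem ? page_size - rem : 0;
}
```

With a 32-bit index and 4KiB pages, max_indexable_bytes(32, 12) is 2^44 bytes,
which is the 16TB limit mentioned above. Moving to 64KiB pages raises that to
2^48 bytes, but tail_waste(1000, 65536) is 64536 bytes for a 1000-byte file
against 3096 bytes with 4KiB pages.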

> 
> >>I am working now and again on some code to do this, it is a big job but
> >>I think it is the right way to do it. But it would take a long time to
> >>get stable and supported by filesystems...
> >
> >
> >Compared to a method that requires almost no change to the
> >filesystems that want to support large block sizes?
> 
> Luckily we have filesystem maintainers who want to support larger
> block sizes ;)
> 
> 
> >>>That would make the choice of using larger order pages (essentially
> >>>increasing PAGE_SIZE) something that can be investigated in parallel.
> >>
> >>I agree that hardware inefficiencies should be handled by increasing
> >>PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
> >
> >
> >And what do we do for arches that can't do multiple page sizes, or
> >only have a limited and mostly useless set of page sizes to choose
> >from?
> 
> Well, for those architectures (and this would solve your large block
> size and 16TB pagecache size without any core kernel changes), you
> can manage 1<<order hardware ptes as a single Linux pte. There is
> nothing that says you must implement PAGE_SIZE as a single TLB sized
> page.

Indeed, but then you have to deal with internal fragmentation
for pages-larger-than-TLB-page. I'm not saying it's wrong, but it does
come with its own set of issues.
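
For anyone unfamiliar with the idea of backing one software page with 1<<order
hardware PTEs, a toy sketch follows. It is not kernel code: hw_table is a
hypothetical flat array of physical addresses standing in for a real page
table, set_soft_pte() is a made-up name, and a 4KiB hardware page is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_SHIFT 12u   /* assumed 4 KiB hardware (TLB) page */

/* Fill the (1 << order) hardware entries that back one software page.
 * soft_index selects the software page; phys_base is the physical
 * address of the memory backing it. */
static void set_soft_pte(uint64_t *hw_table, size_t soft_index,
                         unsigned order, uint64_t phys_base)
{
    size_t n = (size_t)1 << order;
    size_t first = soft_index << order;

    /* The backing memory must be aligned to the software page size. */
    assert((phys_base & (((uint64_t)n << HW_PAGE_SHIFT) - 1)) == 0);

    for (size_t i = 0; i < n; i++)
        hw_table[first + i] = phys_base + ((uint64_t)i << HW_PAGE_SHIFT);
}
```

Every software page then consumes 1<<order hardware pages whether or not the
data fills them, which is where the internal fragmentation comes from.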

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab