[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070426063830.GE32602149@melbourne.sgi.com>
Date: Thu, 26 Apr 2007 16:38:30 +1000
From: David Chinner <dgc@....com>
To: Nick Piggin <nickpiggin@...oo.com.au>
Cc: "Eric W. Biederman" <ebiederm@...ssion.com>, clameter@....com,
linux-kernel@...r.kernel.org, Mel Gorman <mel@...net.ie>,
William Lee Irwin III <wli@...omorphy.com>,
David Chinner <dgc@....com>,
Jens Axboe <jens.axboe@...cle.com>,
Badari Pulavarty <pbadari@...il.com>,
Maxim Levitsky <maximlevitsky@...il.com>
Subject: Re: [00/17] Large Blocksize Support V3
On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
> I think starting with the assumption that we _want_ to use higher order
> allocations, and then creating all this complexity around that is not a
> good one, and if we start introducing things that _require_ significant
> higher order allocations to function then it is a nasty thing for
> robustness.
>From my POV, we started with the problem of how to provide atomic
access to a multi-page block in the page cache. For example, we want
to lock the filesystem block and prevent any updates to it, so we
need to lock all the pages in it. And then when we write them back,
they all need to change state at the same time, and they all need to
have their radix tree tags changed at the same time, the problem of
mapping them to disk, getting writeback to do block aligned and
sized writeback chunks, and so on.
And then there's the problem that most hardware is limited to 128
s/g entries and that means 128 non-contiguous pages in memory is the
maximum I/O size we can issue to these devices. We have RAID arrays
that go twice as fast if we can send them 1MB I/Os instead of 512k
I/Os and that means we need contiguous pages to be handled to the
devices....
All of these things require some aggregating structure to
co-ordinate. In times gone by on other OSs, this has been done with
a buffer cache, but Linux got rid of that long ago and we don't want
to reintroduce one. We can't use buffer heads - they can only point
to one page. So what do we do?
That's where compound pages are so nice - they solve all of these
problems with only a very small amount of code perturbation and they
don't change any algorithms or fundamental design of the OS at all.
We don't have to rewrite filesystems to support this. We don't have
to redesign the VM to support this. We don't have to do very much
work to the block layer and drivers to make this work.
FWIW, if you want 32 bit machines to support larger than 16TB
devices, you need high order page indexing in the page cache....
> >Is it common for hardware that supports large block sizes to not
> >support splitting those blocks apart during DMA? Unless it is common
> >the whole premise of this patchset seems broken.
> >
> >I suspect what needs to be fixed is the page cache block device
> >interface so that we have helper functions that know how to stuff
> >a single block into several pages.
>
> I am working now and again on some code to do this, it is a big job but
> I think it is the right way to do it. But it would take a long time to
> get stable and supported by filesystems...
Compared to a method that requires almost no change to the
filesystems that want to support large block sizes?
> >That would make the choice of using larger order pages (essentially
> >increasing PAGE_SIZE) something that can be investigated in parallel.
>
> I agree that hardware inefficiencies should be handled by increasing
> PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.
And what do we do for arches that can't do multiple page sizes, only
only have a limited and mostly useless set of page sizes to choose
from?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists