[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <31B65C6A-BECB-4B95-B2D8-ADF422AA6B77@cam.ac.uk>
Date: Wed, 27 Jun 2007 16:18:32 +0100
From: Anton Altaparmakov <aia21@....ac.uk>
To: Chris Mason <chris.mason@...cle.com>
Cc: Nick Piggin <npiggin@...e.de>, David Chinner <dgc@....com>,
Nick Piggin <nickpiggin@...oo.com.au>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Memory Management List <linux-mm@...ck.org>,
linux-fsdevel@...r.kernel.org
Subject: Re: [RFC] fsblock
On 27 Jun 2007, at 12:50, Chris Mason wrote:
> On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
>> On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
>>> On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
>>>> On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
>>>
>>> [ ... fsblocks vs extent range mapping ]
>>>
>>>> iomaps can double as range locks simply because iomaps are
>>>> expressions of ranges within the file. Seeing as you can only
>>>> access a given range exclusively to modify it, inserting an empty
>>>> mapping into the tree as a range lock gives an effective method of
>>>> allowing safe parallel reads, writes and allocation into the file.
>>>>
>>>> The fsblocks and the vm page cache interface cannot be used to
>>>> facilitate this because a radix tree is the wrong type of tree to
>>>> store this information in. A sparse, range based tree (e.g. btree)
>>>> is the right way to do this and it matches very well with
>>>> a range based API.
>>>
>>> I'm really not against the extent based page cache idea, but I
>>> kind of
>>> assumed it would be too big a change for this kind of generic
>>> setup. At
>>> any rate, if we'd like to do it, it may be best to ditch the idea of
>>> "attach mapping information to a page", and switch to "lookup
>>> mapping
>>> information and range locking for a page".
>>
>> Well the get_block equivalent API is extent based one now, and I'll
>> look at what is required in making map_fsblock a more generic call
>> that could be used for an extent-based scheme.
>>
>> An extent based thing IMO really isn't appropriate as the main
>> generic
>> layer here though. If it is really useful and popular, then it could
>> be turned into generic code and sit along side fsblock or underneath
>> fsblock...
>
> Lets look at a typical example of how IO actually gets done today,
> starting with sys_write():
Yes, this is very inefficient which is one of the reasons I don't use
the generic file write helpers in NTFS. The other reasons are that
supporting larger logical block sizes than PAGE_CACHE_SIZE becomes a
pain if it is not done this way when the write targets a hole as that
requires all pages in the hole to be locked simultaneously which
would mean dropping the page lock to acquire the others that are of
lower page index and to then re-take the page lock which is horrible
- much better to lock all at once from the outset and the other
reason is that in NTFS there is such a thing as the initialized size
of an attribute which basically states "anything past this byte
offset must be returned as 0 on read, i.e. it does not have to be
read from disk at all, and on write beyond the initialized_size you
have to zero on disk everything between the old initialized size and
the start of the write before you begin writing and certainly before
you update the initalized_size otherwise a concurrent read would see
random old data from the disk.
For NTFS this effectively becomes:
> sys_write(file, buffer, 1MB)
allocate space for the entire 1MB write
if write offset past the initialized_size zero out on disk starting
at initialized_size up to the start offset for the write and update
the initialized size to be equal to the start offset of the write
do {
if (current position is in a hole and the NTFS logical block size is
> PAGE_CACHE_SIZE) {
work on (NTFS logical block size / PAGE_CACHE_SIZE) pages in one go;
do_pages = vol->cluster_size / PAGE_CACHE_SIZE;
} else {
work on only one page;
do_pages = 1;
}
fault in for read (do_pages*PAGE_CACHE_SIZE) bytes worth of source
pages
grab do_pages worth of pages
prepare_write - attach buffers to grabbed pages
copy data from source to grabbed&prepared pages
commit_write the copied pages by dirtying their buffers
} while (data left to write);
The allocation in advance is a huge win both in terms of avoiding
fragmentation (NTFS still uses a very simple/stupid allocator so you
get a lot of fragmentation if two processes write to different files
simultaneously and do so in small chunks) and in terms of performance.
I have wondered whether I should perhaps turn on the "multi page"
stuff on for all writes rather than just for ones that go into a hole
and the logical size is greater than the PAGE_CACHE_SIZE as that
might improve performance even further but I haven't had the time/
inclination to experiment...
And I have also wondered whether to go direct to bio/wholes pages at
once instead of bothering with dirtying each buffer but the buffers
(which are always 512 bytes on NTFS) allow me to easily support
dirtying smaller parts of the page which is desired at least on
volumes with a logical block size < PAGE_CACHE_SIZE as different bits
of the page could then reside on completely different locations on
disk so writing out unneeded bits of the page could result in a lot
of wasted disk head seek times.
Best regards,
Anton
> for each page:
> prepare_write()
> allocate contiguous chunks of disk
> attach buffers
> copy_from_user()
> commit_write()
> dirty buffers
>
> pdflush:
> writepages()
> find pages with contiguous chunks of disk
> build and submit large bios
>
> So, we replace prepare_write and commit_write with an extent based
> api,
> but we keep the dirty each buffer part. writepages has to turn that
> back into extents (bio sized), and the result is completely full of
> dark
> dark corner cases.
>
> I do think fsblocks is a nice cleanup on its own, but Dave has a good
> point that it makes sense to look for ways generalize things even
> more.
>
> -chris
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists