Message-Id: <F9375C81-741A-4E2E-A441-8CD6A08F620F@cam.ac.uk>
Date: Fri, 13 Apr 2007 08:46:18 +0100
From: Anton Altaparmakov <aia21@....ac.uk>
To: Andreas Dilger <adilger@...sterfs.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
xfs@....sgi.com, hch@...radead.org
Subject: Re: [RFC] add FIEMAP ioctl to efficiently map file allocation
Hi Andreas,
On 13 Apr 2007, at 05:01, Andreas Dilger wrote:
> On Apr 12, 2007 12:22 +0100, Anton Altaparmakov wrote:
>> On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
>>> I'm interested in getting input for implementing an ioctl to
>>> efficiently map file extents & holes (FIEMAP) instead of looping
>>> over FIBMAP a billion times. We already have customers with single
>>> files in the 10TB range and we additionally need to get the mapping
>>> over the network so it needs to be efficient in terms of how data
>>> is passed, and how easily it can be extracted from the filesystem.
>>>
>>> struct fibmap_extent {
>>> __u64 fe_start; /* starting offset in bytes */
>>> __u64 fe_len; /* length in bytes */
>>> }
>>>
>>> struct fibmap {
>>> struct fibmap_extent fm_start; /* offset, length of desired mapping */
>>> __u32 fm_extent_count; /* number of extents in array */
>>> __u32 fm_flags; /* flags for input request */
>>> XFS_IOC_GETBMAP) */
>>> __u64 unused;
>>> struct fibmap_extent fm_extents[0];
>>> }
>>>
>>> #define FIEMAP_LEN_MASK 0xff000000000000
>>> #define FIEMAP_LEN_HOLE 0x01000000000000
>>> #define FIEMAP_LEN_UNWRITTEN 0x02000000000000
>>
>> Sounds good but I would add:
>>
>> #define FIEMAP_LEN_NO_DIRECT_ACCESS
>>
>> This would say that the offset on disk can move at any time or that
>> the data is compressed or encrypted on disk thus the data is not
>> useful for direct disk access.
>
> This makes sense. Even for Reiserfs the same is true with packed tails,
> and I believe if FIBMAP is called on a tail it will migrate the tail into
> a block because this might be a sign that the file is a kernel that
> LILO wants to boot.
>
> I'd rather not have any such feature in FIEMAP, and just return the
> on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
> My main reason for FIEMAP is being able to investigate allocation
> patterns of files.
>
> By no means is my flag list exhaustive, just the ones that I thought
> would be needed to implement this for ext4 and Lustre.
Sure, hence why I made my comment for NTFS. (-: And yes, ReiserFS
and even ext* could use such a flag. I believe there is a compression
patch for ext somewhere, isn't there? (Or at least there was one at
some point I think...)
>> Also why are you not using 0xff00000000000000, i.e. two more zeroes
>> at the end? Seems unnecessary to drop an extra 8 bits of
>> significance from the byte size...
>
> It was actually just a typo (this was the first time I'd written the
> structs and flags down, it is just at the discussion stage). I'd meant
> for it to be 2^56 bytes for the file size as I wrote later in the email.
Ok. (-:
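To make the corrected layout concrete, here is a minimal sketch of how the flags and the byte count could share fe_len once the mask is 0xff00000000000000 (top 8 bits for flags, low 56 bits for the length). The HOLE and UNWRITTEN values below are an assumption: they are simply the values posted above shifted left by 8 bits to line up with the corrected mask. The helper names are hypothetical and not part of any proposed interface.

```c
#include <assert.h>
#include <stdint.h>

/* Corrected mask from the discussion: flags live in the top 8 bits. */
#define FIEMAP_LEN_MASK      0xff00000000000000ULL
/* Assumed values: the posted flags shifted to match the corrected mask. */
#define FIEMAP_LEN_HOLE      0x0100000000000000ULL
#define FIEMAP_LEN_UNWRITTEN 0x0200000000000000ULL

/* Hypothetical helper: extract only the flag bits of an fe_len value. */
static inline uint64_t fiemap_len_flags(uint64_t fe_len)
{
    return fe_len & FIEMAP_LEN_MASK;
}

/* Hypothetical helper: extract only the byte count (low 56 bits). */
static inline uint64_t fiemap_len_bytes(uint64_t fe_len)
{
    return fe_len & ~FIEMAP_LEN_MASK;
}

/* Hypothetical helper: combine a byte count and flags into one fe_len. */
static inline uint64_t fiemap_len_pack(uint64_t bytes, uint64_t flags)
{
    assert((bytes & FIEMAP_LEN_MASK) == 0);  /* length must fit in 56 bits */
    assert((flags & ~FIEMAP_LEN_MASK) == 0); /* flags must stay in the top byte */
    return bytes | flags;
}
```

With this layout a caller has to mask fe_len before using it as a length; using the raw value would add the flag bits to the byte count.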
> That said, I think that 2^48 bytes is probably sufficient for most uses,
> so that we get 16 bits for flags. As it is this email already discusses
> 5 flags, and that would give little room for expansion in the future.
>
> Remember, this is the mapping for a single file (which can't practically
> be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem
> to return a few separate extents which are actually contiguous (assuming
> that there will actually be files in filesystems with > 2^48 bytes of
> contiguous space). Since the API is that it will return the extent that
> contains the requested "start" byte, the kernel will be able to detect
> this case also, since it won't be able to specify a length for the
> extent that contains the start byte.
Valid point. As long as the "on-disk location" is maintained as full
64 bits then you are right, we could just return multiple extents if
the length does not fit. A bit of a kludge but it would certainly
work. An alternative would be to have the flags in a separate field
but that would add 8 bytes to the structure size if you want to
maintain 8-byte alignment so that would not be great...
> At most we'd have to call the ioctl() 65536 times for a completely
> contiguous 2^64 byte file if the buffer was only large enough for a
> single extent. In reality, I expect any file to have some
> discontinuities and the buffer to be large enough for a thousand or
> more entries so the corner case is not very bad.
>
>> Finally please make sure that the file system can return in one way
>> or another errors for example when it fails to determine the extents
>> because the system ran out of memory, there was an i/o error,
>> whatever... It may even be useful to be able to say "here is an
>> extent of size X bytes but we do not know where it is on disk because
>> there was an error determining this particular extent's on-disk
>> location for some reason or other"...
>
> Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
> FIEMAP_LEN_ERROR. Consider FIEMAP on a file that was migrated
> to tape and currently has no blocks allocated in the filesystem. We
> want to return some indication that there is actual file data and not
> just a hole, but at the same time we don't want this to actually return
> the file from tape just to generate block mappings for it.
Yes, NTFS also has off line storage (DFS - the Distributed File
System I think it is called) but we don't support any of that.
Perhaps one day...
> This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ,
> but this needs to be specified on input to prevent the file being mapped
> and I'd rather the opposite (not getting file from tape) be the default,
> by principle of least surprise.
>
>>> block-aligned/sized allocations (e.g. tail packing). The fm_extents
>>> array returned contains the packed list of allocation extents for the
>>> file, including entries for holes (which have fe_start == 0, and a
>>> flag).
>>
>> Why the fe_start == 0? Surely just the flag is sufficient... On
>> NTFS it is perfectly valid to have fe_start == 0 and to have that not
>> be sparse (normally the $Boot system file is stored in the first 8
>> sectors of the volume)...
>
> I thought fe_start = 0 was pretty standard for a hole. It should be
> something and I'd rather 0 than anything else. The _HOLE flag is enough
> as you say though.
It is standard on Unix. I am trying to fight this standard because
of NTFS... On NTFS a hole is -1 not 0 and zero is a valid block.
But on NTFS device locations are "s64" not "u64" so the -1 is logical
to use...
As long as it is made clear that people MUST check the flag when
fe_start == 0 rather than assume that fe_start == 0 means a hole I am
happy with that. Hopefully not too many programmers will be lazy
gits who will ignore this and just check fe_start == 0 or they will
fail on NTFS and assume $Boot is sparse when it is not...
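As a minimal sketch of what "check the flag" means for a caller, the test below looks only at the hole flag and never at fe_start. The flag value assumes the corrected 0xff00000000000000 mask (the posted HOLE value shifted left by 8 bits), and extent_is_hole() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIEMAP_LEN_MASK 0xff00000000000000ULL
#define FIEMAP_LEN_HOLE 0x0100000000000000ULL /* assumed value, see above */

struct fibmap_extent {
    uint64_t fe_start; /* starting offset in bytes */
    uint64_t fe_len;   /* length in bytes, flags in the top byte */
};

/*
 * Hypothetical helper: an extent is a hole if and only if the HOLE flag
 * is set. fe_start is deliberately not consulted, because offset 0 is a
 * valid on-disk location (e.g. $Boot on NTFS).
 */
static bool extent_is_hole(const struct fibmap_extent *fe)
{
    return (fe->fe_len & FIEMAP_LEN_HOLE) != 0;
}
```

An extent with fe_start == 0 and no flag set, such as one describing the first 8 sectors of an NTFS volume, is then correctly treated as allocated data.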
> PS - I'd thought about adding you to the CC list for this, because I
> know you've had opinions on FIBMAP in the past, but I didn't have
> your email handy and it was late, and I know you saw the NTFS kmap
> patch on fsdevel so I figured you would see this too...
Thanks. Yes, I try to follow fsdevel closely and LKML not so closely
(I often read it with "select all new, delete")...
> Thanks for your input.
You are welcome.
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/