Date:	Thu, 14 Apr 2011 23:01:04 -0600
From:	Andreas Dilger <adilger@...ger.ca>
To:	Dave Chinner <david@...morbit.com>
Cc:	Pádraig Brady <P@...igBrady.com>,
	Eric Sandeen <sandeen@...deen.net>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"coreutils@....org" <coreutils@....org>,
	Markus Trippelsdorf <markus@...ppelsdorf.de>,
	xfs-oss <xfs@....sgi.com>
Subject: Re: Files full of zeros with coreutils-8.11 and xfs (FIEMAP related?)

On 2011-04-14, at 6:09 PM, Dave Chinner <david@...morbit.com> wrote:
> On Fri, Apr 15, 2011 at 12:29:46AM +0100, Pádraig Brady wrote:
>> On 14/04/11 23:59, Dave Chinner wrote:
>>> On Thu, Apr 14, 2011 at 10:50:10AM -0500, Eric Sandeen wrote:
>>>> On 4/14/11 9:59 AM, Pádraig Brady wrote:
>>>>> On 14/04/11 15:02, Markus Trippelsdorf wrote:
>>>>>>>> Hi Pádraig,
>>>>>>>> 
>>>>>>>> here you go:
>>>>>>>> + filefrag -v unwritten.withdata
>>>>>>>> Filesystem type is: ef53
>>>>>>>> File size of unwritten.withdata is 5120 (2 blocks, blocksize 4096)
>>>>>>>> ext logical physical expected length flags
>>>>>>>>   0       0   274432            2560 unwritten,eof
>>>>>>>> unwritten.withdata: 1 extent found
>>>>>>>> 
>>>>>>>> Please notice that this also happens with ext4 on the same kernel. 
>>>>>>>> Btrfs is fine.
>>>>>>> 
>>>>>> `filefrag -vs` fixes the issue on both xfs and ext4.
>>>>> 
>>>>> So in summary, currently on (2.6.39-rc3), the following
>>>>> will (usually?) report a single unwritten extent,
>>>>> on both ext4 and xfs
>>>>> 
>>>>>  fallocate -l 10MiB -n k
>>>>>  dd count=10 if=/dev/urandom conv=notrunc iflag=fullblock of=k
>>>>>  filefrag -v k # grep for an extent without unwritten || fail
>>>> 
>>>> right, that's what I see too in testing.
>>>> 
>>>> But would the coreutils install have done a preallocation of the destination file?
>>>> 
>>>> Otherwise this looks like a different bug...
>>>> 
>>>>> This particular issue has been discussed so far at:
>>>>> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8411
>>>>> Note it was stated there that ext4 had this
>>>>> fixed as of 2.6.39-rc1, so maybe there is something lurking?
>>>> 
>>>> ext4 got a fix, but not xfs, I guess.  My poor brain can't remember, I think I started looking into it, but it's clearly still broken.
>>>> 
>>>> Still, I don't know for sure what happened to Markus - did something preallocate, in his case?
>>> 
>>> Unwritten extent mapping behaves in an unexpected way due to
>>> buffered writeback not occurring immediately. Extent conversion
>>> doesn't occur until the data is on disk, and for buffered IO you
>>> need an fdatasync to ensure that has occurred.
>>> 
>>> That is: 
>>> 
>>> $ xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c "bmap -vp" /mnt/test/foo
>>> wrote 5120/5120 bytes at offset 0
>>> 5 KiB, 2 ops; 0.0000 sec (62.600 MiB/sec and 25641.0256 ops/sec)
>>> /mnt/test/foo:
>>> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>>   0: [0..20479]:      268984..289463    0 (268984..289463) 20480 10000
>>> 
>>> Data has not been written yet, so it is still unwritten. The same
>>> test with a fsync shows:
>>> 
>>> $ sudo xfs_io -f -c "resvsp 0 10m" -c "pwrite 0 5120" -c fsync -c "bmap -vp" /mnt/test/foo
>>> wrote 5120/5120 bytes at offset 0
>>> 5 KiB, 2 ops; 0.0000 sec (87.193 MiB/sec and 35714.2857 ops/sec)
>>> /mnt/test/foo:
>>> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>>>   0: [0..15]:         268984..268999    0 (268984..268999)    16 00000
>>>   1: [16..20479]:     269000..289463    0 (269000..289463) 20464 10000
>>> 
>>> Everything is fine.
>>> 
>>> So this seems like an application error to me. If you are going to
>>> use fiemap to determine what ranges to copy, then you have to
>>> fdatasync the source file first to guarantee that preallocated
>>> extents have been converted to written state before mapping the
>>> file....
>> 
>> Well IMHO there should be a difference between
>> knowing where you are going to write, and actually writing to disk.
>> I.E. one shouldn't need to write the whole way to the device
>> before returning a valid fiemap.  If a particular file system
>> implementation needs to sync to return a valid fiemap,
>> then it should be implicit.
> 
> No, this was explicitly laid out in the fiemap interface discussions
> - it's up to the application to decide if it needs to do a sync
> first. That's what the FIEMAP_FLAG_SYNC control flag is for.
> This forces the fiemap call to do a fsync _before_ getting the
> mapping. If you want to know what the exact layout of the file is,
> then you must use this flag.
> 
> Even so, it is recognised that this is racy - any use of the block
> map has a time-of-read-to-time-of-use race condition that means you
> have to _verify_ the copy after it completes. FYI, that's what
> xfs_fsr does when copying based on extent maps - if the inode has
> changed in _any way_ during the copy, it aborts the copy of that
> file.
> 
> i.e. using fiemap for copying is at best a *hint* about the regions
> that need copying, and it is in no way a guarantee that you'll get
> all the information you need to make accurate copy even if you do
> use the synchronous variant.

I would tend to agree with Pádraig. If there is data in the mapping (regardless of whether it is on disk or not), the FIEMAP should return this to the caller. The SYNC flag is only intended to flush the data to disk for tools that are doing direct-to-disk operations on the data. 
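
For readers following along, here is a rough Python sketch of what setting the SYNC flag means at the ioctl level. It is only an illustration, not code from coreutils or filefrag. The constants and struct layouts are taken from <linux/fiemap.h> and <linux/fs.h>; the helper names (build_request, parse_reply, fiemap) are made up for this example, and the fiemap() wrapper assumes a Linux filesystem that supports FS_IOC_FIEMAP.

```python
import fcntl
import struct

# Values from <linux/fs.h> and <linux/fiemap.h>.
FS_IOC_FIEMAP = 0xC020660B           # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x0001            # flush the file before mapping it
FIEMAP_EXTENT_LAST = 0x0001
FIEMAP_EXTENT_UNWRITTEN = 0x0800

FIEMAP_HDR = struct.Struct("=QQLLLL")      # struct fiemap header, 32 bytes
FIEMAP_EXT = struct.Struct("=QQQQQLLLL")   # struct fiemap_extent, 56 bytes


def build_request(start, length, sync, max_extents):
    """Hypothetical helper: pack a FIEMAP request with room for extents."""
    flags = FIEMAP_FLAG_SYNC if sync else 0
    header = FIEMAP_HDR.pack(start, length, flags, 0, max_extents, 0)
    return bytearray(header + bytes(FIEMAP_EXT.size * max_extents))


def parse_reply(buf):
    """Hypothetical helper: return (logical, physical, length, flags) tuples."""
    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]   # fm_mapped_extents
    extents = []
    for i in range(mapped):
        off = FIEMAP_HDR.size + i * FIEMAP_EXT.size
        fields = FIEMAP_EXT.unpack_from(buf, off)
        logical, physical, ext_len, flags = fields[0], fields[1], fields[2], fields[5]
        extents.append((logical, physical, ext_len, flags))
    return extents


def fiemap(fd, sync=False, max_extents=32):
    """Map the whole file; the kernel fills the buffer in place."""
    buf = build_request(0, 0xFFFFFFFFFFFFFFFF, sync, max_extents)
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)
    return parse_reply(buf)
```

Calling fiemap(fd, sync=False) right after a buffered write is the case discussed in this thread, where the extent may still carry FIEMAP_EXTENT_UNWRITTEN; fiemap(fd, sync=True) asks the kernel to flush first.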

Otherwise the UNWRITTEN flag is useless, since even with "check, copy, check" there is no guarantee that the inode is changed _during_ the copy operation. The data could have been written into the cache _before_ the FIEMAP and remain unchanged, and in your case there would be no way to know any data was ever written to the file without SYNC on every single file before FIEMAP.
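
To make the consequence for a copying tool concrete, here is a small Python sketch of a conservative copy planner. It is an assumption-laden illustration, not what cp does: the function name ranges_to_read and the (logical, length, flags) tuple shape are invented here, and only the FIEMAP_EXTENT_UNWRITTEN value comes from <linux/fiemap.h>. The idea is that an unwritten extent may be skipped as a hole only when the map was taken after a sync; otherwise it has to be read, because dirty data may still be sitting in the page cache.

```python
FIEMAP_EXTENT_UNWRITTEN = 0x0800  # from <linux/fiemap.h>


def ranges_to_read(extents, file_size, synced):
    """Return (offset, length) byte ranges a copier should read.

    extents: iterable of (logical, length, flags) tuples, in bytes.
    synced:  True if the map was obtained after flushing the file.
    """
    ranges = []
    for logical, length, flags in extents:
        unwritten = bool(flags & FIEMAP_EXTENT_UNWRITTEN)
        if unwritten and synced:
            # Flushed and still unwritten: safe to treat as zeros.
            continue
        # Never read past EOF (preallocation may extend beyond it).
        end = min(logical + length, file_size)
        if end > logical:
            ranges.append((logical, end - logical))
    return ranges
```

With the single 10 MiB unwritten extent and 5120-byte file size from Markus's filefrag output, ranges_to_read([(0, 10485760, 0x801)], 5120, synced=False) yields [(0, 5120)], so the written data is copied rather than skipped. Even the synced case remains racy, as Dave notes, so this is at best a hint.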

Cheers, Andreas