Message-ID: <20221031112237.kgr64levqo3dxoj5@quack3>
Date: Mon, 31 Oct 2022 12:22:37 +0100
From: Jan Kara <jack@...e.cz>
To: Matt Bobrowski <repnop@...gle.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org
Subject: Re: General Filesystem Question - Interesting Unexplainable Observation

Hi Matthew!

[added ext4 mailing list to CC, maybe others have more ideas]

On Fri 28-10-22 23:23:14, Matt Bobrowski wrote:
> Just had a general question in regards to some recent filesystem (ext4)
> behaviour I've recently observed, which kind of made my eyebrows raise a
> little and I wanted to understand why this was happening.
>
> We have an application (single threaded process) that basically performs
> the following sequence of filesystem operations using buffered I/O:
>
> ---
> fd = open("dir/tmp/filename.new", O_WRONLY | O_CREAT | O_TRUNC, 0400);
> ...
> write(fd, buf, sizeof(buf));
> ...
> rename("dir/tmp/filename.new", "dir/new/filename");
> ---
>
> At times, I see the "dir/new/filename" file size reporting 0 bytes, despite
> sizeof(buf) written to "dir/tmp/filename.new" always guaranteed to be > 0
> and the result of the write reported as being successful. This is the part
> I cannot come up with a valid explanation for (yet).

So by "file size reporting 0 bytes", do you mean that
stat("dir/new/filename") from a concurrent process sometimes returns a file
size of 0? Or are you referring to the situation after an unclean filesystem
shutdown?

> Understandably,
> there's no fsync being currently performed post calling write, which I
> think needs to be corrected, but I also can't see how not using fsync post
> write would result in the file size for "dir/new/filename" being reported
> as 0 bytes? One of the things that crossed my mind was that the rename
> operation was possibly being committed prior to the dirty pages from the
> pagecache being flushed, but regardless I don't see how a rename would
> result in the data blocks associated to the write not ever being committed
> for the same underlying inode?
>
> What are your thoughts? Any plausible explanation why I might be seeing
> this odd behaviour?

Ext4 uses delayed allocation. That means write(2) just stores the data in
the page cache; no blocks are allocated yet. So rename(2) can indeed be
fully committed in the journal before any of the data reaches persistent
storage. That said, ext4 has a workaround for buggy applications (it can be
disabled with the "noauto_da_alloc" mount option) that starts data writeback
before the rename is done, so at least in data=ordered mode you should not
see 0-length files after a crash with the above scheme.
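
For illustration, here is a minimal sketch of the sequence from the original
report with the missing fsync(2) added before the rename, so the application
does not have to rely on the ext4 workaround. The function name
replace_file_durably() and its parameters are hypothetical; the extra fsync
on the destination directory is a common convention for persisting the
rename itself, not something stated above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to tmp_path, flush them to stable
 * storage, then rename tmp_path to final_path. final_dir is the directory
 * that contains final_path. Returns 0 on success, -1 on failure.
 */
int replace_file_durably(const char *tmp_path, const char *final_path,
                         const char *final_dir, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    int fd, dirfd;

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0400);
    if (fd < 0)
        return -1;

    /* write(2) may be short or interrupted, so loop until all is written. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Force the data (and the inode size) to disk before the rename. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    if (rename(tmp_path, final_path) < 0)
        return -1;

    /* Assumption: also sync the directory so the new name is persisted. */
    dirfd = open(final_dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) < 0) {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}
```

With the paths from the report, a call would look like
replace_file_durably("dir/tmp/filename.new", "dir/new/filename", "dir/new",
buf, sizeof(buf)).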

As for a concurrent process seeing a file length of 0, I don't have a good
explanation for that, because once the data is written to the inode,
inode->i_size is set to the final inode size, which is what stat(2) reports.
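
One way to check this from userspace is a small test like the sketch below,
which writes with buffered I/O and calls stat(2) straight away, without any
fsync. The helper name size_after_write() is hypothetical, and the sketch
only covers the no-crash case in a single process.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path using buffered I/O and
 * return the size that stat(2) reports immediately afterwards, with no
 * fsync in between. Returns -1 on any error; a short write is treated
 * as an error to keep the sketch simple.
 */
off_t size_after_write(const char *path, const void *buf, size_t len)
{
    struct stat st;
    ssize_t n;
    int fd, rc;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    n = write(fd, buf, len);

    /* stat by path, the way a concurrent observer would. */
    rc = stat(path, &st);
    close(fd);

    if (n < 0 || (size_t)n != len || rc < 0)
        return -1;
    return st.st_size;
}
```

If the size returned here matches len, a 0-byte result seen by another
process would have to come from somewhere other than the write itself.
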
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR