linux-ext4 - Re: [BUG] ext2/3/4: dio reads stale data when we do some append dio writes

Message-ID: <20131119120112.GN11434@dastard>
Date:	Tue, 19 Nov 2013 23:01:12 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Christoph Hellwig <hch@...radead.org>
Cc:	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	xfs@....sgi.com
Subject: Re: [BUG] ext2/3/4: dio reads stale data when we do some append dio
 writes

On Tue, Nov 19, 2013 at 03:18:26AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 19, 2013 at 07:19:47PM +0800, Zheng Liu wrote:
> > Yes, I know that XFS has a shared/exclusive lock.  I guess that is why
> > it can pass the test.  But another question is why xfs fails when we do
> > some append dio writes with doing buffered read.
> 
> Can you provide a test case for that issue?

For XFS, appending direct IO writes only hold the IOLOCK exclusive
for as long as it takes to guarantee that the region between the
old EOF and the new EOF is full of zeros before it is demoted.  i.e.
once the region is guaranteed not to expose stale data, the
exclusive IO lock is demoted to a shared lock and a buffered read
is then allowed to proceed concurrently with the DIO write.

Hence even appending writes occur concurrently with buffered reads,
and if the read overlaps the block at the old EOF then the page
brought into the page cache will have zeros in it.
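
To make the ordering above concrete, here is a rough user-space sketch of
the sequence: take the lock exclusive, zero the range from the old EOF to
the new EOF, then demote to shared so readers can get in. This is only a
single-threaded model, not XFS code. Every name in it (model_iolock,
model_append_dio_begin and so on) is hypothetical, and the "disk" is just
a byte array standing in for the on-disk blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of an inode IO lock; not kernel code. */
struct model_iolock {
	bool excl;	/* true while one holder has the lock exclusively */
	int shared;	/* number of shared holders */
};

/* Succeeds only when nobody holds the lock in any mode. */
static bool model_trylock_excl(struct model_iolock *l)
{
	if (l->excl || l->shared > 0)
		return false;
	l->excl = true;
	return true;
}

/* Succeeds unless an exclusive holder is present. */
static bool model_trylock_shared(struct model_iolock *l)
{
	if (l->excl)
		return false;
	l->shared++;
	return true;
}

/* Turn an exclusive hold into a shared hold without dropping the lock. */
static void model_demote(struct model_iolock *l)
{
	assert(l->excl);
	l->excl = false;
	l->shared = 1;
}

static void model_unlock_shared(struct model_iolock *l)
{
	assert(l->shared > 0);
	l->shared--;
}

/*
 * Model of the start of an appending direct IO write: hold the lock
 * exclusive only while [old_eof, new_eof) is zeroed in the backing
 * store, then demote. On success the caller holds the lock shared and
 * would go on to issue the actual write. Returns false if the lock is
 * busy.
 */
static bool model_append_dio_begin(struct model_iolock *l,
				   unsigned char *disk,
				   size_t old_eof, size_t new_eof)
{
	if (!model_trylock_excl(l))
		return false;
	memset(disk + old_eof, 0, new_eof - old_eof);
	model_demote(l);
	return true;
}
```

In this model a buffered reader that takes the lock shared after
model_append_dio_begin() returns runs alongside the writer, and if it
reads the range past the old EOF before the write data lands it sees
zeros rather than stale bytes.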

FWIW, there's a wonderful comment in generic_file_direct_write()
that pretty much covers this case:

        /*
         * Finally, try again to invalidate clean pages which might have been
         * cached by non-direct readahead, or faulted in by get_user_pages()
         * if the source of the write was an mmap'ed region of the file
         * we're writing.  Either one is a pretty crazy thing to do,
         * so we don't support it 100%.  If this invalidation
         * fails, tough, the write still worked...
         */

The kernel code simply does not have the exclusion mechanisms to
make concurrent buffered and direct IO robust. This is one of the
problems (amongst many) that we've been looking to solve with a VFS
level IO range lock of some kind....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com