lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 07 Apr 2022 16:21:42 -0400
From:   Jeff Layton <jlayton@...nel.org>
To:     Xiubo Li <xiubli@...hat.com>,
        Luís Henriques <lhenriques@...e.de>,
        Ilya Dryomov <idryomov@...il.com>
Cc:     ceph-devel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] ceph: invalidate pages when doing direct/sync writes

On Fri, 2022-04-08 at 03:24 +0800, Xiubo Li wrote:
> On 4/8/22 3:16 AM, Jeff Layton wrote:
> > On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
> > > On 4/7/22 11:15 PM, Luís Henriques wrote:
> > > > When doing a direct/sync write, we need to invalidate the page cache in
> > > > the range being written to.  If we don't do this, the cache will include
> > > > invalid data as we just did a write that avoided the page cache.
> > > > 
> > > > Signed-off-by: Luís Henriques <lhenriques@...e.de>
> > > > ---
> > > >    fs/ceph/file.c | 19 ++++++++++++++-----
> > > >    1 file changed, 14 insertions(+), 5 deletions(-)
> > > > 
> > > > Changes since v3:
> > > > - Dropped initial call to invalidate_inode_pages2_range()
> > > > - Added extra comment to document invalidation
> > > > 
> > > > Changes since v2:
> > > > - Invalidation needs to be done after a write
> > > > 
> > > > Changes since v1:
> > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > > 
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 5072570c2203..97f764b2fbdd 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > >    		return ret;
> > > >    
> > > >    	ceph_fscache_invalidate(inode, false);
> > > > -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > -					    pos >> PAGE_SHIFT,
> > > > -					    (pos + count - 1) >> PAGE_SHIFT);
> > > > -	if (ret < 0)
> > > > -		dout("invalidate_inode_pages2_range returned %d\n", ret);
> > > >    
> > > >    	while ((len = iov_iter_count(from)) > 0) {
> > > >    		size_t left;
> > > > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > >    			break;
> > > >    		}
> > > >    		ceph_clear_error_write(ci);
> > > > +
> > > > +		/*
> > > > +		 * we need to invalidate the page cache here, otherwise the
> > > > +		 * cache will include invalid data in direct/sync writes.
> > > > +		 */
> > > > +		ret = invalidate_inode_pages2_range(
> > > IMO we'd better use truncate_inode_pages_range() after write. The above
> > > means it's possibly will write the dirty pagecache back, which will
> > > overwrite and corrupt the disk data just wrote.
> > > 
> > I disagree. We call filemap_write_and_wait_range at the start of this,
> > so any data that was dirty when we called write() will be written back
> > before the sync write.
> > 
> > If we truncate the range, then we'll potentially lose writes that came
> > in after write was issued but before truncate_inode_pages_range. I think
> > we'd rather let what we just wrote be clobbered in this situation than
> > lose a write altogether.
> > 
> > All of this is somewhat academic though. If you're mixing buffered and
> > direct writes like this without some sort of locking, then you're just
> > asking for trouble. The aim here is "sane behavior to the best of our
> > ability", but we can't expect it to always be sane when people do insane
> > things. ;)
> 
> Just in the case Luis hit. Before writing the new data the mapping 
> happen when reading the src in copy_from_usr(). So once the writing done 
> the pagecache is caching the stale contents.
> 

Not just in that case.

You could have 2 unrelated processes, one doing DIO writes and one doing
mmap writes. You're likely to end up with a mess unless you're very
careful with what you're doing, but there should be some expectation
that it will work if you serialize things correctly and/or have them
writing to their own areas of the file, etc.

In any case, we'll never get perfect cache coherency, and I figure that
until the write returns, what's in the pagecache ought to be considered
valid.

> > > Though it seems impossible that these pagecaches will be marked dirty,
> > > but this call is misleading ?
> > > 
> > Not impossible at all. You can open a file O_DIRECT and then mmap the fd
> > for PROT_WRITE (or just open the file a second time and do it).
> > 
> > We definitely recommend against mixing buffered and direct I/O, but
> > nothing really prevents someone from doing it. If the user is properly
> > using file locking, then there's really no reason it shouldn't work.
> > 
> > > > +				inode->i_mapping,
> > > > +				pos >> PAGE_SHIFT,
> > > > +				(pos + len - 1) >> PAGE_SHIFT);
> > > > +		if (ret < 0) {
> > > > +			dout("invalidate_inode_pages2_range returned %d\n",
> > > > +			     ret);
> > > > +			ret = 0;
> > > > +		}
> > > >    		pos += len;
> > > >    		written += len;
> > > >    		dout("sync_write written %d\n", written);
> > > > 
> 

-- 
Jeff Layton <jlayton@...nel.org>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ