linux-kernel - Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O

Date:	Tue, 8 Apr 2014 16:21:02 -0400
From:	Matthew Wilcox <willy@...ux.intel.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Matthew Wilcox <matthew.r.wilcox@...el.com>,
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7 06/22] Replace XIP read and write with DAX I/O

On Tue, Apr 08, 2014 at 07:56:00PM +0200, Jan Kara wrote:
> > +static void dax_new_buf(void *addr, unsigned size, unsigned first,
> > +					loff_t offset, loff_t end, int rw)
> > +{
> > +	loff_t final = end - offset + first; /* The final byte of the buffer */
> > +	if (rw != WRITE) {
> > +		memset(addr, 0, size);
> > +		return;
> > +	}
>   It seems counterintuitive to zero out "on-disk" blocks (it seems you'd do
> this for unwritten blocks) when reading from them. Presumably it could also
> have undesired effects on endurance of persistent memory. Instead I'd expect
> that you simply zero out user provided buffer the same way as you do it for
> holes.

I think we have to zero it here, because the second time we call
get_block() for a given block, it won't be BH_New any more, so we won't
know that it's supposed to be zeroed.
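
A toy userspace model of that ordering problem, not the kernel code. The names toy_block, toy_get_block and toy_read are hypothetical. The only behaviour assumed is that the "new" indication comes from the call that allocates the block and from no later call, so the block has to be zeroed while that information is still available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Hypothetical stand-in for one block of persistent memory. */
struct toy_block {
	bool allocated;                     /* has the toy get_block allocated it? */
	unsigned char data[TOY_BLOCK_SIZE]; /* may hold stale bytes before first use */
};

/*
 * Toy get_block(): allocates on first use and returns true ("new") exactly
 * once.  Every later call returns false, which models the BH_New flag not
 * being set on a second lookup of the same block.
 */
static bool toy_get_block(struct toy_block *blk)
{
	assert(blk != NULL);
	if (blk->allocated)
		return false;
	blk->allocated = true;
	return true;
}

/*
 * Read path: zero the block itself while we still know it is new, then copy
 * it out.  If we only zeroed the caller's buffer, a second read would see
 * "not new" and hand back whatever stale bytes the block contained.
 */
static void toy_read(struct toy_block *blk, unsigned char *out)
{
	if (toy_get_block(blk))
		memset(blk->data, 0, TOY_BLOCK_SIZE);
	memcpy(out, blk->data, TOY_BLOCK_SIZE);
}
```

Calling toy_read() twice on a block pre-filled with junk returns zeroes both times, even though only the first call sees the block as new.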

> > +/*
> > + * When ext4 encounters a hole, it likes to return without modifying the
> > + * buffer_head which means that we can't trust b_size.  To cope with this,
> > + * we set b_state to 0 before calling get_block and, if any bit is set, we
> > + * know we can trust b_size.  Unfortunate, really, since ext4 does know
> > + * precisely how long a hole is and would save us time calling get_block
> > + * repeatedly.
>   Well, this is really a problem of get_blocks() returning the result in
> struct buffer_head which is used for input as well. I don't think it is
> actually ext4 specific.

Of course it's ext4 specific!  It's the ext4_get_block() implementation
which is choosing not to return the length of the hole.  XFS does return
the length of the hole.  I think something like this would fix it:

+++ b/fs/ext4/inode.c
@@ -727,14 +727,14 @@ static int _ext4_get_block(struct inode *inode, sector_t i
        }
 
        ret = ext4_map_blocks(handle, inode, &map, flags);
+       map_bh(bh, inode->i_sb, map.m_pblk);
+       bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+       bh->b_size = inode->i_sb->s_blocksize * map.m_len;
        if (ret > 0) {
                ext4_io_end_t *io_end = ext4_inode_aio(inode);
 
-               map_bh(bh, inode->i_sb, map.m_pblk);
-               bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
                if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
                        set_buffer_defer_completion(bh);
-               bh->b_size = inode->i_sb->s_blocksize * map.m_len;
                ret = 0;
        }
        if (started)

(completely untested).
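
To put a number on "calling get_block repeatedly", here is a small userspace sketch. It is not ext4 or XFS code: toy_map, toy_get_block and count_lookups are hypothetical, and the file layout (a hole followed by one mapped extent) is made up for illustration. It only counts how many lookups a caller needs when the hole length is or is not reported.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lookup result, loosely modelled on what a buffer_head carries. */
struct toy_map {
	bool mapped;      /* false means the answer describes a hole */
	unsigned nblocks; /* number of blocks this answer covers */
};

/*
 * Toy layout: blocks [0, hole_len) are a hole, blocks [hole_len, total) are
 * one mapped extent.  With report_hole_len == false a hole lookup only
 * vouches for a single block, which is what the caller has to assume when
 * it cannot trust b_size.
 */
static struct toy_map toy_get_block(unsigned block, unsigned hole_len,
				    unsigned total, bool report_hole_len)
{
	struct toy_map m;

	assert(hole_len <= total && block < total);
	if (block < hole_len) {
		m.mapped = false;
		m.nblocks = report_hole_len ? hole_len - block : 1;
	} else {
		m.mapped = true;
		m.nblocks = total - block;
	}
	return m;
}

/* Walk the whole file and count how many lookups were needed. */
static unsigned count_lookups(unsigned hole_len, unsigned total,
			      bool report_hole_len)
{
	unsigned block = 0, calls = 0;

	while (block < total) {
		struct toy_map m = toy_get_block(block, hole_len, total,
						 report_hole_len);
		calls++;
		block += m.nblocks;
	}
	return calls;
}
```

For an 8-block hole followed by 2 mapped blocks, the walk takes 9 lookups without the hole length and 2 with it.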

> > +	while (offset < end) {
> > +		void __user *buf = iov[seg].iov_base + copied;
> > +
> > +		if (offset == max) {
> > +			sector_t block = offset >> inode->i_blkbits;
> > +			unsigned first = offset - (block << inode->i_blkbits);
> > +			long size;
> > +
> > +			if (offset == bh_max) {
> > +				bh->b_size = PAGE_ALIGN(end - offset);
> > +				bh->b_state = 0;
> > +				retval = get_block(inode, block, bh,
> > +								rw == WRITE);
> > +				if (retval)
> > +					break;
> > +				if (!buffer_size_valid(bh))
> > +					bh->b_size = 1 << inode->i_blkbits;
> > +				bh_max = offset - first + bh->b_size;
> > +			} else {
> > +				unsigned done = bh->b_size - (bh_max -
> > +							(offset - first));
> > +				bh->b_blocknr += done >> inode->i_blkbits;
> > +				bh->b_size -= done;
>   It took me quite some time to figure out what this does and whether it is
> correct :). Why isn't this at the place where we advance all other
> iterators like offset, addr, etc.?

It'll be kind of tricky to move it because 'len' is not necessarily
a multiple of the block size (1 << i_blkbits), so we can't necessarily
maintain b_blocknr accurately.
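
A worked version of the arithmetic in the quoted else branch, as a userspace sketch. toy_bh, block_of, first_of and toy_advance_bh are hypothetical names. The formulas for block, first and done are copied from the quoted hunk, but the numbers are invented. The point is that a copy can stop in the middle of a block, so the bh is re-synced from the block-aligned position (offset - first) rather than bumped by each copy length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the buffer_head fields used here. */
struct toy_bh {
	uint64_t b_blocknr; /* first block the mapping currently describes */
	uint64_t b_size;    /* bytes the mapping currently describes */
};

/* Index of the block containing a byte offset. */
static uint64_t block_of(uint64_t offset, unsigned blkbits)
{
	return offset >> blkbits;
}

/* Byte position of offset inside its block ("first" in the quoted code). */
static uint64_t first_of(uint64_t offset, unsigned blkbits)
{
	return offset - (block_of(offset, blkbits) << blkbits);
}

/*
 * Re-sync bh so that it starts at the block containing offset:
 *   done = b_size - (bh_max - (offset - first))
 * bh_max is the file offset one past the end of the mapping.
 */
static void toy_advance_bh(struct toy_bh *bh, uint64_t bh_max,
			   uint64_t offset, unsigned blkbits)
{
	uint64_t first = first_of(offset, blkbits);
	uint64_t done;

	assert(bh != NULL && offset < bh_max);
	done = bh->b_size - (bh_max - (offset - first));
	bh->b_blocknr += done >> blkbits;
	bh->b_size -= done;
}
```

With 4096-byte blocks, a mapping of 16384 bytes that starts at file offset 4096 has bh_max = 20480. Reaching offset 12388 (100 bytes into its block) gives done = 8192, so b_blocknr advances by 2 and b_size drops to 8192.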

> > +			if (rw == WRITE) {
> > +				if (!buffer_mapped(bh)) {
> > +					retval = -EIO;
> > +					break;
>   -EIO looks like a wrong error here. Or maybe it is the right one and it
> only needs some explanation? The thing is that for direct IO some
> filesystems choose not to fill holes for direct IO and fall back to
> buffered IO instead (to avoid exposure of uninitialized blocks if the
> system crashes after blocks have been added to a file but before they were
> written out). For DAX you are pretty much free to define what you ask from
> the get_blocks() (and this fallback behavior is somewhat disputed behavior
> in direct IO case so you might want to differ here) but you should document
> it somewhere.

Hmm ... I thought that calling get_block() with the create argument would
force the return of a bh with the Mapped bit set.  Did I misunderstand that
aspect of the undocumented get_block() API too?

> > +	if ((flags & DIO_LOCKING) && (rw == READ)) {
> > +		struct address_space *mapping = inode->i_mapping;
> > +		mutex_lock(&inode->i_mutex);
> > +		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> > +		if (retval) {
> > +			mutex_unlock(&inode->i_mutex);
> > +			goto out;
> > +		}
>   Is there a reason for this? I'd assume DAX has no pages in pagecache...

There will be pages in the page cache for holes that we page faulted on.
They must go!  :-)

> > @@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> >  	struct inode *inode = mapping->host;
> >  	ssize_t ret;
> >  
> > -	ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> > +	if (IS_DAX(inode))
> > +		ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
> > +				ext2_get_block, NULL, DIO_LOCKING);
> > +	else
> > +		ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
> >  				 ext2_get_block);
>   I'd somewhat prefer to have a ext2_direct_IO() as is and have
> ext2_dax_IO() call only dax_do_io() (and use that as .direct_io in
> ext2_aops_xip). Then there's no need to check IS_DAX() and the code would
> look more obvious to me. But I don't feel strongly about it.

I can look at that ... but I was hoping to not have separate aops for
XIP and non-XIP files.
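
The two shapes under discussion, reduced to a userspace sketch. Everything here is hypothetical (toy_inode, toy_aops and the toy_* functions); it models neither the real address_space_operations nor ext2. It only contrasts one method that checks the inode on every call with two method tables chosen once per inode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	bool is_dax; /* stands in for IS_DAX(inode) */
};

enum toy_path { TOY_PATH_DAX, TOY_PATH_BLOCKDEV };

typedef enum toy_path (*toy_direct_io_fn)(const struct toy_inode *);

struct toy_aops {
	toy_direct_io_fn direct_IO;
};

static enum toy_path toy_dax_io(const struct toy_inode *inode)
{
	(void)inode;
	return TOY_PATH_DAX;
}

static enum toy_path toy_blockdev_io(const struct toy_inode *inode)
{
	(void)inode;
	return TOY_PATH_BLOCKDEV;
}

/* Shape A (as in the quoted hunk): one method, one check per call. */
static enum toy_path toy_direct_io_checked(const struct toy_inode *inode)
{
	assert(inode != NULL);
	return inode->is_dax ? toy_dax_io(inode) : toy_blockdev_io(inode);
}

static const struct toy_aops toy_aops_single = {
	.direct_IO = toy_direct_io_checked,
};

/* Shape B (the reviewer's suggestion): two tables, picked at inode setup. */
static const struct toy_aops toy_aops_plain = { .direct_IO = toy_blockdev_io };
static const struct toy_aops toy_aops_dax = { .direct_IO = toy_dax_io };

static const struct toy_aops *toy_pick_aops(const struct toy_inode *inode)
{
	assert(inode != NULL);
	return inode->is_dax ? &toy_aops_dax : &toy_aops_plain;
}
```

Both shapes route a given inode to the same I/O path. The difference is where the check lives and how many method tables have to be kept in step.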

> > @@ -2681,6 +2686,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
> >  extern void save_mount_options(struct super_block *sb, char *options);
> >  extern void replace_mount_options(struct super_block *sb, char *options);
> >  
> > +static inline bool io_is_direct(struct file *filp)
> > +{
> > +	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
> > +}
> > +
>   BTW: It seems fs/open.c: open_check_o_direct() can be simplified to not
> check for get_xip_mem(), cannot it?

That's in a later patch.
