linux-ext4 - Re: [PATCH 1/2] direct-io: Implement generic deferred AIO completions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130716210027.GA9595@quack.suse.cz>
Date:	Tue, 16 Jul 2013 23:00:27 +0200
From:	Jan Kara <jack@...e.cz>
To:	Dave Chinner <david@...morbit.com>
Cc:	Jan Kara <jack@...e.cz>, linux-fsdevel@...r.kernel.org,
	xfs@....sgi.com, hch@...radead.org, Jeff Moyer <jmoyer@...hat.com>,
	Al Viro <viro@...IV.linux.org.uk>, linux-ext4@...r.kernel.org,
	Christoph Hellwig <hch@....de>
Subject: Re: [PATCH 1/2] direct-io: Implement generic deferred AIO completions

On Fri 12-07-13 10:44:21, Dave Chinner wrote:
> On Thu, Jul 11, 2013 at 12:02:18AM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <hch@...radead.org>
> > 
> > Add support to the core direct-io code to defer AIO completions to user
> > context using a workqueue.  This replaces opencoded and less efficient
> > code in XFS and ext4 and will be needed to properly support O_(D)SYNC
> > for AIO.
> 
> I don't see how this is any more efficient than the deferral
> that XFS already does for direct IO completion - it just changes
> what queues IO for completions.
  I didn't change the changelog so it's upto Christoph to tell. But I can
remove the 'less efficient' part from the changelog if you want.

> And on that topic:
> 
> > Currently this creates a per-superblock unbound workqueue for these
> > completions, which is taken from an earlier patch by Jan Kara.  I'm
> > not really convinced about this use and would prefer a "normal" global
> > workqueue with a high concurrency limit, but this needs further discussion.
> 
> Unbound workqueues sacrifice locality for immediate dispatch.
> AIO-DIO performance at the high end is determined by IO submission
> and completion locality - we want submission and completion to occur
> on the same CPU as much as possible so cachelines are not bounced
> aroundi needlessly. An unbound workqueue would seem to be the wrong
> choice on this basis alone.
> 
> Further the use of @max_active = 1 means that it is a global, single
> threaded workqueue. While ext4 might require single threaded
> execution of unwritten extent completion, it would be introducing a
> massive bottleneck into XFS which currently uses the default
> concurrency depth of 256 worker threads per CPU per superblock just
> for unwritten extent conversion.
> 
> Hmmmm. I notice that the next patch then does generic_write_sync()
> is the IO completion handler, too. In XFS that causes log forces and
> will block, so that's yet more concurrency required that is required
> for this workqueue. Doing this in a single threaded workqueue and
> marshalling all sync AIO through it is highly unfriendly to
> concurrent IO completion.
> 
> FWIW, in XFS we queue unwritten extent conversion completions on a
> different workqueue to EOF size update completions because the
> latter are small, fast and rarely require IO or get blocked. The
> data IO completion workqueue for EOF updates has the same
> concurrency and depth as the unwritten extent work queue (i.e. 256
> workers per cpu per superblock). So pushing all of this DIO and EOF
> completion work into a single threaded global workqueue that can
> block in every IO completion doesn't seem like a very smart idea to
> me...
  OK, so you'd rather have the workqueue created like:

alloc_workqueue("dio-sync", WQ_MEM_RECLAIM, 0)

  I can certainly try that. ext4 should handle that fine. And I agree with
your arguments for high end devices. I'm just wondering whether it won't do
some harm to a simple SATA drive. Having more threads trying to do extent
conversion in parallel might cause the drive to seek more.

> > -static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
> > +static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
> > +		bool is_async)
> >  {
> >  	ssize_t transferred = 0;
> >  
> > @@ -258,19 +262,26 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
> >  	if (ret == 0)
> >  		ret = transferred;
> >  
> > -	if (dio->end_io && dio->result) {
> > -		dio->end_io(dio->iocb, offset, transferred,
> > -			    dio->private, ret, is_async);
> > -	} else {
> > -		inode_dio_done(dio->inode);
> > -		if (is_async)
> > -			aio_complete(dio->iocb, ret, 0);
> > -	}
> > +	if (dio->end_io && dio->result)
> > +		dio->end_io(dio->iocb, offset, transferred, dio->private);
> 
> Ok, so by here we are assuming all filesystem IO completion
> processing is completed.
  Yes.

> > +
> > +	inode_dio_done(dio->inode);
> > +	if (is_async)
> > +		aio_complete(dio->iocb, ret, 0);
> >  
> > +	kmem_cache_free(dio_cache, dio);
> >  	return ret;
> 
> Hmmm. I started wonder if this is valid, because XFS is supposed to
> be doing workqueue based IO completion for appending writes and they
> are supposed to be run in a workqueue.
> 
> But, looking at the XFS code, there's actually a bug in the direct
> IO completion code and we are not defering completion to a workqueue
> like we should be for the append case.  This code in
> xfs_finish_ioend() demonstrates the intent:
> 
>                 if (ioend->io_type == XFS_IO_UNWRITTEN)
>                         queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>                 else if (ioend->io_append_trans ||
> >>>>>>                   (ioend->io_isdirect && xfs_ioend_is_append(ioend)))
>                         queue_work(mp->m_data_workqueue, &ioend->io_work);
> 
> The problem is that we only ever call xfs_finish_ioend() if is_async
> is true, and that will never be true for direct IO beyond the current
> EOF. That's a bug, and is Bad(tm).
> 
> [ Interestingly, it turns out that dio->is_async is never set for
> writes beyond EOF because of filesystems that update file sizes
> before data IO completion occurs (stale data exposure issues). For
> XFS, that can't happen and so dio->is_async could always be set.... ]
> 
> What this means is that there's a condition for work queue deferral
> of DIO IO completion that this patch doesn't handle. Setting deferral
> only on unwritten extents like this:
> 
> > @@ -1292,8 +1281,10 @@ __xfs_get_blocks(
> >  		if (create || !ISUNWRITTEN(&imap))
> >  			xfs_map_buffer(inode, bh_result, &imap, offset);
> >  		if (create && ISUNWRITTEN(&imap)) {
> > -			if (direct)
> > +			if (direct) {
> >  				bh_result->b_private = inode;
> > +				set_buffer_defer_completion(bh_result);
> > +			}
> >  			set_buffer_unwritten(bh_result);
> >  		}
> >  	}
> 
> misses that case. I suspect Christoph's original patch predated the
> changes to XFS that added transactional file size updates to IO
> completion and so didn't take it into account.
  OK, thanks for catching this. Doing the i_size check in
_xfs_get_blocks() would be somewhat cumbersome I presume so I guess we can
handle that case by adding __blockdev_direct_IO() flag to defer dio
completion to a workqueue. XFS can then set the flag when it sees it needs
and i_size update. OK?

> > @@ -1390,9 +1381,7 @@ xfs_end_io_direct_write(
> >  	struct kiocb		*iocb,
> >  	loff_t			offset,
> >  	ssize_t			size,
> > -	void			*private,
> > -	int			ret,
> > -	bool			is_async)
> > +	void			*private)
> >  {
> >  	struct xfs_ioend	*ioend = iocb->private;
> >  
> > @@ -1414,17 +1403,10 @@ xfs_end_io_direct_write(
> >  
> >  	ioend->io_offset = offset;
> >  	ioend->io_size = size;
> > -	ioend->io_iocb = iocb;
> > -	ioend->io_result = ret;
> >  	if (private && size > 0)
> >  		ioend->io_type = XFS_IO_UNWRITTEN;
> >  
> > -	if (is_async) {
> > -		ioend->io_isasync = 1;
> > -		xfs_finish_ioend(ioend);
> > -	} else {
> > -		xfs_finish_ioend_sync(ioend);
> > -	}
> > +	xfs_finish_ioend_sync(ioend);
> >  }
> 
> As i mentioned, the existing code here is buggy. What we should be
> doing here is:
> 
> 	if (is_async) {
> 		ioend->io_isasync = 1;
> 		xfs_finish_ioend(ioend);
> 	} else if (xfs_ioend_is_append(ioend))) {
> 		xfs_finish_ioend(ioend);
> 	} else {
> 		xfs_finish_ioend_sync(ioend);
> 	}
  Umm, I started to wonder why do you actually need to defer the completion
for appending ioend. Because if DIO isn't async, dio_complete() won't be
called from interrupt context but from __blockdev_direct_IO(). So it seems
you can do everything you need directly there without deferring to a
workqueue. But maybe there's some locking subtlety I'm missing.

> Which means that the new code boils down to:
> 
> 	if (xfs_ioend_is_append(ioend)))
> 		xfs_finish_ioend(ioend);
> 	else
> 		xfs_finish_ioend_sync(ioend);
> 
> And now we see the problem with the only defering unwritten IO -
> the new direct IO code will think IO is completed when all we've
> done is queued it to another workqueue.
  I agree. New direct IO completion handlers have to really finish
everything without deferring to a separate workqueue. ext4 had a same bug
but there I caught it and fixed.

> I'm not sure we can handle this in get_blocks - we don't have the
> context to know how to treat allocation beyond the current EOF.
> Indeed, the current block being mapped might not be beyond EOF, but
> some of the IO might be (e.g. lies across an extent boundary),
> so setting the deferred completion in get_blocks doesn't allow the
> entire IO to be treated the same.
> 
> So I think there's a bit of re-thinking needed to aspects of this
> patch to be done.
  Additional flag to __blockdev_direct_IO() should solve this as I wrote
above (if you really need it).

> > +		 * avoids problems with pseudo filesystems which get initialized
> > +		 * before workqueues can be created
> > +		 */
> > +		if (type->fs_flags & FS_REQUIRES_DEV) {
> > +			s->s_dio_done_wq =
> > +				alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
> 
> The workqueue name needs to be combined with the fs_name so we know
> what filesystem is having problems in sysrq-w output.
  Yes, that's reasonable. However that means we'll have to initialize the
workqueue later. Hum.. I think I will return to the on-demand creation of
the workqueue as I originally had in my patch set.

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html