Date:	Wed, 15 Apr 2015 13:26:51 -0600
From:	Jens Axboe <axboe@...com>
To:	Andrew Morton <akpm@...ux-foundation.org>
CC:	<linux-fsdevel@...r.kernel.org>, <hch@...radead.org>,
	<viro@...iv.linux.org.uk>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] direct-io: only inc/dec inode->i_dio_count for file systems

On 04/15/2015 12:56 PM, Andrew Morton wrote:
> On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@...com> wrote:
>
>> Hi,
>>
>> This is a reposting of a patch that was originally in the blk-mq series.
>> It has huge upside on shared access to a multiqueue device doing
>> O_DIRECT, it's basically the scaling block that ends up killing
>> performance. A quick test here reveals that we spend 30% of all system
>> time just incrementing and decrementing inode->i_dio_count. For block
>> devices this isn't useful at all, as we don't need protection against
>> truncate. For that test case, performance increases about 3.6x (!!) by
>> getting rid of this inc/dec per IO.
>>
>> I've cleaned it up a bit since last time, integrating the checks in
>> inode_dio_done() and adding an inode_dio_begin() so that callers don't
>> need to know about this.
>>
>> We've been running a variant of this patch in the FB kernel for a while.
>> I'd like to finally get this upstream.
>
> 30% overhead for one atomic_inc+atomic_dec+wake_up_bit() per IO?  That
> seems very high!  Is there something else going on?

Nope, that is it. But for this simple test case, that is essentially the 
only shared state that exists and is dirtied between all CPUs that are 
running an application that does O_DIRECT to the device. The test case 
ran at ~9.6M IOPS after the patch, and at ~2.5M IOPS before. On a simple 
2 socket box, no less, nothing fancy there.

And it's _just_ the atomic inc/dec. I ran with the exact patch posted, 
and that still does (needlessly, I think) the wake_up_bit() for 
inode_dio_done().
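
For reference, the helper is (roughly) just this:

	void inode_dio_done(struct inode *inode)
	{
		if (atomic_dec_and_test(&inode->i_dio_count))
			wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
	}

so the 30% really is two atomics on a shared cacheline per IO, plus 
that wakeup.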

> Is there similar impact to direct-io-to-file?  It would be nice to fix
> that up also.  Many filesystems do something along the lines of
>
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	atomic_dec(i_dio_count);
> 	__blockdev_direct_IO(...);
>
> and with your patch I think we could change them to
>
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
> 	atomic_dec(i_dio_count);
>
> which would halve the atomic op load.

I haven't checked pure file, but without extending, I suspect we should 
see similar benefits there. In any case, it'd make sense to do; doing 
the atomic inc/dec twice per IO is just a bad idea in general, if we 
can get rid of it.

A quick grep doesn't show this use case, or I'm just blind. Where do you 
see that?

> But that's piling hack on top of hack.  Can we change the

I'd more view it as reducing the hack, the real hack is the way that we 
manually do atomic_inc() on i_dio_count, and then call a magic 
inode_dio_done() that decrements it again. It's not very pretty, I'm 
just reducing the scope of the hack :-)
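
With the patch, the pairing at least becomes symmetric at the call 
sites (sketch, in the same spirit as the wibble() pseudocode above):

	inode_dio_begin(inode);		/* was: open-coded atomic_inc() */
	wibble();			/* submit the IO */
	inode_dio_done(inode);		/* dec, possibly wake_up_bit() */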

> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> increment i_dio_count"?  ie: exclusion against truncate is wholly the
> caller's responsibility.  That way, this awkward sharing of
> responsibility between caller and callee gets cleaned up and
> DIO_IGNORE_TRUNCATE goes away.

That would clean it up, at the expense of more churn. I'd be fine with 
doing it that way.

> inode_dio_begin() would be a good place to assert that i_mutex is held,
> btw.

How would you check it? If i_dio_count has been bumped, then the caller 
doesn't need to hold i_mutex.
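
E.g. the obvious assert would misfire (hypothetical sketch):

	/* fires even for callers that legitimately rely on an
	 * elevated i_dio_count instead of holding i_mutex */
	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));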

> This whole i_dio_count thing is pretty nasty, really.  If you stand
> back and squint, it's basically an rwsem.  I wonder if we can use an
> rwsem...

We could, but that's a bit orthogonal to this patch, since we'd want 
the same avoidance for blkdevs either way.
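
To spell out the analogy: each direct IO is a reader, truncate is the 
writer. Conceptual sketch only (i_dio_rwsem is made up):

	/* IO submission path */
	down_read(&inode->i_dio_rwsem);		/* today: atomic_inc(i_dio_count) */
	wibble();				/* do the IO */
	up_read(&inode->i_dio_rwsem);		/* today: inode_dio_done() */

	/* truncate path */
	down_write(&inode->i_dio_rwsem);	/* today: inode_dio_wait() */
	wibble();				/* do the truncate */
	up_write(&inode->i_dio_rwsem);

One wrinkle: for async IO the read side would be released at completion 
time, potentially from a different task, which plain rwsems don't 
really support.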

> What's the reason for DIO_IGNORE_TRUNCATE rather than boring old
> !S_ISBLK?

Didn't want to tie it to block devices, but rather just keep the "need 
lock or not" as a separate flag in case there were more use cases.

-- 
Jens Axboe
