linux-ext4 - Re: [PATCH 0/3] Revert parallel dio reads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20190910141041.134674C072@d06av22.portsmouth.uk.ibm.com>
Date:   Tue, 10 Sep 2019 19:40:40 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Jan Kara <jack@...e.cz>, Dave Chinner <david@...morbit.com>,
        "Theodore Ts'o" <tytso@....edu>, hch@...radead.org
Cc:     Joseph Qi <joseph.qi@...ux.alibaba.com>,
        Andreas Dilger <adilger@...ger.ca>, linux-ext4@...r.kernel.org
Subject: Re: [PATCH 0/3] Revert parallel dio reads

Hello,

>> Before doing this, you might want to have a chat and co-ordinate
>> with the folks that are currently trying to port the ext4 direct IO
>> code to use the iomap infrastructure:
>>
>> https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
>>
>> That is going to need the shared locking on read and will work just
>> fine with shared locking on write, too (it's the code that XFS uses
>> for direct IO). So it might be best here if you work towards shared
>> locking on the write side rather than just revert the shared locking
>> on the read side....
> 
> Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
> inode lock for all aligned non-extending DIO writes will be easy so I'd
> prefer if we didn't have to redo the iomap conversion patches due to these
> reverts.

I was looking into this to see what is required to convert this into 
shared locking for DIO write side.
Why do you say shared inode lock only for *aligned non-extending DIO 
writes*?
non-extending means overwrite case only, which still in today's code is
not under any exclusive inode_lock (except the orphan handling and 
truncate handling in case of error after IO is completed). But even with
current code the perf problem is reported right?

If we see in today's code (including in iomap new code for overwrite 
case where we downgrade the lock).
We call for inode_unlock in case of overwrite and then do the IO, since 
we don't have to allocate any blocks in this case.

	/*
	 * Make all waiters for direct IO properly wait also for extent
	 * conversion. This also disallows race between truncate() and
	 * overwrite DIO as i_dio_count needs to be incremented under
  	 * i_mutex.
	 */
	inode_dio_begin(inode);
	/* If we do a overwrite dio, i_mutex locking can be released */
	overwrite = *((int *)iocb->private);
	if (overwrite)
		inode_unlock(inode);

	write data (via __blockdev_direct_IO)

	inode_dio_end(inode);
	/* take i_mutex locking again if we do a ovewrite dio */
	if (overwrite)
		inode_lock(inode);

What can we do next:-

Earlier the DIO reads was not having any inode_locking.
IIUC, the inode_lock_shared() in the DIO reads case was added to
protect the race against reading back uninitialized/stale data for e.g.
in case of truncate or writeback etc.

So as Dave suggested, shouldn't we add the complete shared locking in 
DIO writes cases as well (except for unaligned writes, since it may 
required zeroing of partial blocks).

Could you please help me understand how we can go about it?
I was thinking if we can create uninitialized blocks beyond EOF, so that
any parallel read should only read 0 from that area and we may not 
require exclusive inode_lock. The block creation is anyway protected
with i_data_sem in ext4_map_blocks.

I do see that in case of bufferedIO writes, we use 
ext4_get_block_unwritten for dioread_nolock case. Which should
be true for even writes extending beyond EOF. This will
create uninitialized and mapped blocks only (even beyond EOF).
But all of this happen under exclusive inode_lock() in ext4_file_write_iter.

Whereas in case of DIO writes extending beyond EOF, we pass
EXT4_GET_BLOCKS_CREATE in ext4_map_blocks which may allocate
initialized & mapped blocks.
Do you know why so?

Do you think we can create uninit. blocks in dioread_nolock AIO/non-AIO 
DIO writes cases as well with only shared inode lock mode?
Do you see any problems with this?
Or do you have any other suggestion?

In case of XFS -
I do see it promotes the demotes the shared lock (IOLOCK_XXX) in 
xfs_file_aio_write_checks. But I don't know if after that, whether
it only holds the shared lock while doing the DIO IO?
And how does it protects against the parallel DIO reads which can 
potentially expose any stale data?

Appreciate any help & inputs from FS developers.

-ritesh