linux-kernel - Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1cd75bf6-875f-400a-9158-d747e422ad78@163.com>
Date: Mon, 10 Feb 2025 09:44:43 +0800
From: Chi Zhiling <chizhiling@....com>
To: Dave Chinner <david@...morbit.com>
Cc: Christoph Hellwig <hch@...radead.org>, Brian Foster <bfoster@...hat.com>,
 "Darrick J. Wong" <djwong@...nel.org>, Amir Goldstein <amir73il@...il.com>,
 cem@...nel.org, linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
 Chi Zhiling <chizhiling@...inos.cn>, John Garry <john.g.garry@...cle.com>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On 2025/1/28 04:49, Dave Chinner wrote:
> On Fri, Jan 24, 2025 at 03:57:43PM +0800, Chi Zhiling wrote:
>> On 2025/1/16 05:41, Dave Chinner wrote:
>>> IOWs, a two-state shared lock provides the mechanism to allow DIO
>>> to be done without holding the i_rwsem at all, as well as being able
>>> to elide two atomic operations per DIO to track in-flight DIOs.
>>>
>>> We'd get this whilst maintaining buffered/DIO coherency without
>>> adding any new overhead to the DIO path, and allow concurrent
>>> buffered reads and writes that have their atomicity defined by the
>>> batched folio locking strategy that Brian is working on...
>>>
>>> This only leaves DIO coherency issues with mmap() based IO as an
>>> issue, but that's a problem for a different day...
>>
>> When I try to implement those features, I think we could use exclusive
>> locks for RWF_APPEND writes, and shared locks for other writes.
>>
>> The reason is that concurrent operations are also possible for extended
>> writes if there is no overlap in the regions being written.
>> So there is no need to determine whether it is an extended write in
>> advance.
>>
>> As for why an exclusive lock is needed for append writes, it's because
>> we don't want the EOF to be modified during the append write.
> 
> We don't care if the EOF moves during the append write at the
> filesystem level. We set kiocb->ki_pos = i_size_read() from
> generic_write_checks() under shared locking, and if we then race
> with another extending append write there are two cases:
> 
> 	1. the other task has already extended i_size; or
> 	2. we have two IOs at the same offset (i.e. at i_size).
> 
> In either case, we don't need exclusive locking for the IO because
> the worst thing that happens is that two IOs hit the same file
> offset. IOWs, it has always been left up to the application
> serialise RWF_APPEND writes on XFS, not the filesystem.
> 
> There is good reason for not doing exclusive locking for extending
> writes. When extending the file naturally (think sequential writes),
> we need those IOs be able to run concurrently with other writes
> in progress. i.e. it is common for
> applications to submit multiple sequential extending writes in a
> batch, and as long as we submit them all in order, they all hit the
> (moving) EOF exactly and hence get run concurrently.
> 
> This also works with batch-submitted RWF_APPEND AIO-DIO - they just
> turn into concurrent in-flight extending writes...
> 
> IOWs, forcing exclusive locking for writes at exactly EOF is
> definitely not desirable, and existing behaviour is that they use
> shared locking.
> 
> The only time we use exclusive locking for extending writes is when
> they -start- beyond EOF (i.e. leave a hole between EOF and
> kiocb->ki_pos) and so we have to guarantee that range does not have
> stale data exposed from it while the write is in progress. i.e. we
> have to zero the hole that moving EOF exposes if it contains
> written extents, and we cannot allow reads or writes to that hole
> before we are done.

Sorry for the late reply.

I'm not sure why stale data would be exposed. If we zero this hole
first, then write data, and finally update the i_size, before we move
the EOF, other operations won't be able to read the data in this hole,
So I think the stale data won't be exposed.

And I think an exclusive lock is indeed needed here. The main reason is
that we need to zero this hole, and the offset of this hole is based
on EOF, so an exclusive lock is required to ensure that the EOF is not
moved.

Thanks,
Chi Zhiling

> 
>> The code is like that:
>> +       if (iocb->ki_flags & IOCB_APPEND)
>> +               iolock = XFS_IOLOCK_EXCL;
>> +       else
>> +               iolock = XFS_IOLOCK_SHARED;
>>
>>
>> If we use exclusive locks for all extended writes, we need to check if
>> it is an extended write before acquiring the lock, but this value could
>> become outdated if we do not hold the lock.
>>
>> So if we do an extended write,
>> we might need to follow this process:
>>
>> 1. Get read lock.
>> 2. Check if it is an extended write.
>> 3. Release read lock.
>> 4. Get write lock.
>> 5. Do the write operation.
> 
> Please see xfs_file_write_checks() - it should already have all the
> shared/exclusive relocking logic we need for temporarily excluding
> buffered reads whilst submitting concurrent buffered writes (i.e. it
> is the same as what we already do for concurrent DIO writes).
> 
> -Dave.