Message-ID: <20250125084327.154410-1-alexjlzheng@tencent.com>
Date: Sat, 25 Jan 2025 16:43:25 +0800
From: Jinliang Zheng <alexjlzheng@...il.com>
To: amir73il@...il.com
Cc: cem@...nel.org,
	chizhiling@....com,
	chizhiling@...inos.cn,
	david@...morbit.com,
	djwong@...nel.org,
	john.g.garry@...cle.com,
	linux-kernel@...r.kernel.org,
	linux-xfs@...r.kernel.org
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@...il.com> wrote:
> On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@....com> wrote:
> >
> >
> >
> > On 2024/12/29 06:17, Dave Chinner wrote:
> > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > >>
> > >>
> > >> On 2024/12/27 05:50, Dave Chinner wrote:
> > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > >>>> From: Chi Zhiling <chizhiling@...inos.cn>
> > >>>>
> > >>>> Using an rwsem to protect file data ensures that a read always
> > >>>> observes a completed modification. But because of the lock, a read
> > >>>> must wait for the writer to release the rwsem, even when it is
> > >>>> reading a different region of the file. This can cost a lot of time
> > >>>> when many processes need to write to and read from the same file.
> > >>>>
> > >>>> On the other hand, the ext4 filesystem and others do not hold the
> > >>>> lock during buffered reads, which gives ext4 better performance in
> > >>>> that case. Therefore, I think it would be fine to remove the lock in
> > >>>> xfs, as most applications can handle this situation.
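[For reference, the difference is visible in the two buffered read paths.
A rough paraphrase of fs/xfs/xfs_file.c and fs/ext4/file.c, not verbatim
kernel code:

/* xfs: buffered read takes the iolock (backed by i_rwsem) shared */
static ssize_t
xfs_file_buffered_read(struct kiocb *iocb, struct iov_iter *to)
{
	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
	ssize_t			ret;

	/* blocks behind any writer holding XFS_IOLOCK_EXCL */
	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
	if (ret)
		return ret;
	ret = generic_file_read_iter(iocb, to);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	return ret;
}

/* ext4: buffered read enters the page cache with no inode lock at all */
static ssize_t
ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	...
	return generic_file_read_iter(iocb, to);
}
]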
> > >>>
> > >>> Nope.
> > >>>
> > >>> This means that XFS loses high-level serialisation of incoming IO
> > >>> against operations like truncate, fallocate, pNFS operations, etc.
> > >>>
> > >>> We've been through this multiple times before; the solution lies in
> > >>> doing the work to make buffered writes use shared locking, not
> > >>> removing shared locking from buffered reads.
> > >>
> > >> You mean using shared locking for buffered reads and writes, right?
> > >>
> > >> I think it's a great idea. In theory, write operations can be performed
> > >> simultaneously if they write to different ranges.
> > >
> > > Even if they overlap, the folio locks will prevent concurrent writes
> > > to the same range.
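[The shape of that loop, paraphrased from generic_perform_write()/iomap;
write_begin/write_end below are stand-ins for the aops hooks, not real
function names:

	while (iov_iter_count(i)) {
		/* write_begin returns a *locked* folio */
		folio = write_begin(file, mapping, pos, bytes);
		/* user data is copied while the folio lock is held */
		copied = copy_folio_from_iter_atomic(folio,
				offset_in_folio(folio, pos), bytes, i);
		/* write_end dirties and unlocks the folio */
		write_end(file, mapping, pos, bytes, copied, folio);
		pos += copied;
	}

So two writers that touch the same folio serialize on the folio lock even
without i_rwsem.]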
> > >
> > > Now that we have atomic write support as native functionality (i.e.
> > > RWF_ATOMIC), we really should not have to care that much about
> > > normal buffered IO being atomic. That is, if the application wants
> > > atomic writes, it can now say so explicitly, and so we can relax the
> > > constraints we have on existing IO...
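[A hypothetical example of that opt-in from userspace, using pwritev2(2)
with RWF_ATOMIC (Linux 6.11+; the filesystem and device must support the
size/alignment, and older libc headers may not define the flag yet):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <fcntl.h>

static ssize_t write_untorn(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	/* all-or-nothing: fails (e.g. EINVAL) if len or alignment is
	 * outside the range the fs advertises via statx() */
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}
]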
> >
> > Yes, I'm not particularly concerned about whether buffered I/O is
> > atomic. I'm more concerned about the multithreading performance of
> > buffered I/O.
> 
> Hi Chi,
> 
> Sorry for joining late, I was on vacation.
> I am very happy that you have taken on this task and your timing is good,
> because John Garry just posted his patches for large atomic writes [1].
> 
> I need to explain the relation to atomic buffered I/O, because it is not
> easy to follow from the old discussions, and some of the discussion about
> the solution was held in person during LSFMM 2024.
> 
> Naturally, your interest is in improving the multithreading performance of
> buffered I/O, as was mine when I first posted this question [2].
> 
> The issue with atomicity of buffered I/O is that xfs has traditionally
> provided atomicity of write vs. read (a.k.a. no torn writes), which is
> not required by the POSIX standard (because POSIX was not written with
> threads in mind) and is not respected by any other in-tree filesystem.
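[A hypothetical test that makes the difference observable: one thread
keeps rewriting a 64K record while another rereads it. Under xfs's
exclusive i_rwsem the reader sees all-'a' or all-'b'; without read-side
locking the read may be torn. Sketch only; run writer/reader in two
threads with fd open on a test file:

#include <string.h>
#include <unistd.h>

#define RECSZ	(64 * 1024)
static int fd;				/* assumed already open */

static void *writer(void *arg)
{
	static char buf[RECSZ];
	int i;

	for (i = 0; ; i++) {
		memset(buf, (i & 1) ? 'a' : 'b', RECSZ);
		pwrite(fd, buf, RECSZ, 0);	/* rewrite the whole record */
	}
	return NULL;
}

static void *reader(void *arg)
{
	static char buf[RECSZ];

	for (;;) {
		if (pread(fd, buf, RECSZ, 0) != RECSZ)
			continue;
		/* a mix of 'a' and 'b' within one read is a torn write */
		if (memchr(buf, buf[0] == 'a' ? 'b' : 'a', RECSZ))
			return buf;
	}
}
]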
> 
> It is obvious that the exclusive rw lock for buffered writes hurts
> performance in the case of mixed read/write on xfs, so the question was:
> if xfs provides atomic semantics that portable applications cannot rely
> on, why bother providing these atomic semantics at all?
> 
> Dave's answer to this question was that there are some legacy applications
> (database applications IIRC) on production systems that do rely on the fact
> that xfs provides these semantics, and on the prerequisite that they run
> on xfs.
> 
> However, it was noted that:
> 1. Those applications do not require atomicity for arbitrary I/O sizes;
>     they typically work with an I/O size larger than the block size (e.g.
>     16K or 64K) and only require untorn writes at that I/O size
> 2. Large folios and iomap can usually provide these semantics via the
>     folio lock, but applications currently have no way of knowing whether
>     the semantics are provided or not
> 3. The upcoming ability for applications to opt in to atomic writes larger
>     than the fs block size [1] can be used to serve the applications that
>     want the legacy xfs semantics, and avoids the need to enforce those
>     semantics all the time for no good reason (see the statx sketch below)
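[A sketch of how an application could probe for that via statx(2) with
STATX_WRITE_ATOMIC (Linux 6.11+; the mask, attribute and field names below
need recent kernel uapi headers and may be absent from older libc):

#include <fcntl.h>
#include <sys/stat.h>
#include <stdio.h>

static void query_atomic_limits(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0)
		return;
	if (stx.stx_attributes & STX_ATTR_WRITE_ATOMIC)
		printf("untorn writes: min %u max %u bytes\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max);
}
]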
> 
> Disclaimer: this is not a standard way to deal with potentially breaking
> legacy semantics, because old applications will not usually be rebuilt
> to opt in to the old semantics. But the use case at hand, a specific
> application that relies on xfs-specific semantics, is probably so narrow
> that there are currently no objections to trying this solution.
> 
> [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> 
> >
> > Last week, it was mentioned that removing i_rwsem would have some
> > impact on truncate, fallocate, and pNFS operations.
> >
> > (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> 
> You are not wrong. pNFS uses a "layout lease", which requires that the
> blockmap of the file not be modified while the lease is held.
> But I think that blocks that are not mapped (i.e. holes) can be mapped
> while the lease is held.
> 
> >
> > My understanding is that the current i_rwsem is used to protect both
> > the file's data and its size. Operations like truncate, fallocate,
> > and pNFS use i_rwsem because they modify both the file's data and its
> > size. So I'm wondering whether it's possible to use i_rwsem to protect
> > only the file's size, without protecting the file's data.
> >
> 
> It also protects the file's blockmap. For example, punch hole
> does not change the size, but it unmaps blocks from the blockmap,
> leading to races that could end up reading stale data from disk
> if the lock were not taken.
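[For concreteness, the operation in question: punching a hole leaves
i_size untouched but modifies the block map, which is why the API makes
you pair it with the keep-size flag:

#define _GNU_SOURCE
#include <fcntl.h>

int punch_hole(int fd, off_t off, off_t len)
{
	/* i_size is unchanged, but blocks in [off, off+len) are unmapped;
	 * a buffered read racing with this, without the iolock held,
	 * could read stale data from disk */
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}
]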
> 
> > So operations that modify the file's size need to be executed
> > sequentially. For example, buffered writes that extend past EOF,
> > fallocate operations without the "keep size" flag, truncate operations,
> > etc., all need to hold an exclusive lock.
> >
> > Other operations would require only a shared lock, because they need
> > to read the file's size but not modify it.
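[A hypothetical sketch of that policy, paraphrasing the proposal above
(not actual xfs code):

static unsigned int
iolock_mode_for_buffered_write(struct kiocb *iocb, size_t count,
			       loff_t isize)
{
	/* extending writes change i_size, so they stay exclusive */
	if (iocb->ki_pos + count > isize)
		return XFS_IOLOCK_EXCL;
	/* non-extending writes could take the iolock shared and rely
	 * on folio locks to serialize overlapping ranges */
	return XFS_IOLOCK_SHARED;
}
]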
> >
> 
> As far as I understand, an exclusive lock is not needed for non-extending
> writes, because it is ok to map new blocks.
> I guess the need for an exclusive lock for extending writes is related to
> the update of the file size, but I am not 100% sure.


> Anyway, an exclusive lock on extending writes is widely common in other
> filesystems, while an exclusive lock for non-extending writes is unique
> to xfs.

I am sorry, but I don't understand this, because I found that
ext4_buffered_write_iter() takes the exclusive lock unconditionally.
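Roughly, from fs/ext4/file.c (paraphrased, not verbatim):

static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_NOWAIT)
		return -EOPNOTSUPP;

	inode_lock(inode);	/* exclusive, even for non-extending writes */
	ret = ext4_write_checks(iocb, from);
	if (ret > 0)
		ret = generic_perform_write(iocb, from);
	inode_unlock(inode);
	...
}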

Did I miss some context of the discussion?

Thank you very much.
Jinliang Zheng

> 
> > >
> > >> So we should track all the ranges we are reading or writing,
> > >> and check whether a new read or write operation can be performed
> > >> concurrently with the operations already in flight.
> > >
> > > That is all discussed in detail in the discussions I linked.
> >
> > Sorry, I overlooked some details from the old discussion last time.
> > It seems that you are not satisfied with the effectiveness of
> > range locks.
> >
> 
> Correct. This solution was taken off the table.
> 
> I hope my explanation was correct and clear.
> If anything else is not clear, please feel free to ask.
> 
> Thanks,
> Amir.
