[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250125084327.154410-1-alexjlzheng@tencent.com>
Date: Sat, 25 Jan 2025 16:43:25 +0800
From: Jinliang Zheng <alexjlzheng@...il.com>
To: amir73il@...il.com
Cc: cem@...nel.org,
chizhiling@....com,
chizhiling@...inos.cn,
david@...morbit.com,
djwong@...nel.org,
john.g.garry@...cle.com,
linux-kernel@...r.kernel.org,
linux-xfs@...r.kernel.org
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read
On Tue, 7 Jan 2025 13:13:17 +0100, Amir Goldstein <amir73il@...il.com> wrote:
> On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@....com> wrote:
> >
> >
> >
> > On 2024/12/29 06:17, Dave Chinner wrote:
> > > On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > >>
> > >>
> > >> On 2024/12/27 05:50, Dave Chinner wrote:
> > >>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > >>>> From: Chi Zhiling <chizhiling@...inos.cn>
> > >>>>
> > >>>> Using an rwsem to protect file data ensures that we can always obtain a
> > >>>> completed modification. But due to the lock, we need to wait for the
> > >>>> write process to release the rwsem before we can read it, even if we are
> > >>>> reading a different region of the file. This could take a lot of time
> > >>>> when many processes need to write and read this file.
> > >>>>
> > >>>> On the other hand, The ext4 filesystem and others do not hold the lock
> > >>>> during buffered reading, which make the ext4 have better performance in
> > >>>> that case. Therefore, I think it will be fine if we remove the lock in
> > >>>> xfs, as most applications can handle this situation.
> > >>>
> > >>> Nope.
> > >>>
> > >>> This means that XFS loses high level serialisation of incoming IO
> > >>> against operations like truncate, fallocate, pnfs operations, etc.
> > >>>
> > >>> We've been through this multiple times before; the solution lies in
> > >>> doing the work to make buffered writes use shared locking, not
> > >>> removing shared locking from buffered reads.
> > >>
> > >> You mean using shared locking for buffered reads and writes, right?
> > >>
> > >> I think it's a great idea. In theory, write operations can be performed
> > >> simultaneously if they write to different ranges.
> > >
> > > Even if they overlap, the folio locks will prevent concurrent writes
> > > to the same range.
> > >
> > > Now that we have atomic write support as native functionality (i.e.
> > > RWF_ATOMIC), we really should not have to care that much about
> > > normal buffered IO being atomic. i.e. if the application wants
> > > atomic writes, it can now specify that it wants atomic writes and so
> > > we can relax the constraints we have on existing IO...
> >
> > Yes, I'm not particularly concerned about whether buffered I/O is
> > atomic. I'm more concerned about the multithreading performance of
> > buffered I/O.
>
> Hi Chi,
>
> Sorry for joining late, I was on vacation.
> I am very happy that you have taken on this task and your timing is good,
> because John Garry just posted his patches for large atomic writes [1].
>
> I need to explain the relation to atomic buffered I/O, because it is not
> easy to follow it from the old discussions and also some of the discussions
> about the solution were held in-person during LSFMM2024.
>
> Naturally, your interest is improving multithreading performance of
> buffered I/O, so was mine when I first posted this question [2].
>
> The issue with atomicity of buffered I/O is the xfs has traditionally
> provided atomicity of write vs. read (a.k.a no torn writes), which is
> not required by POSIX standard (because POSIX was not written with
> threads in mind) and is not respected by any other in-tree filesystem.
>
> It is obvious that the exclusive rw lock for buffered write hurts performance
> in the case of mixed read/write on xfs, so the question was - if xfs provides
> atomic semantics that portable applications cannot rely on, why bother
> providing these atomic semantics at all?
>
> Dave's answer to this question was that there are some legacy applications
> (database applications IIRC) on production systems that do rely on the fact
> that xfs provides this semantics and on the prerequisite that they run on xfs.
>
> However, it was noted that:
> 1. Those application do not require atomicity for any size of IO, they
> typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> and they only require no torn writes for this I/O size
> 2. Large folios and iomap can usually provide this semantics via folio lock,
> but application has currently no way of knowing if the semantics are
> provided or not
> 3. The upcoming ability of application to opt-in for atomic writes larger
> than fs block size [1] can be used to facilitate the applications that
> want to legacy xfs semantics and avoid the need to enforce the legacy
> semantics all the time for no good reason
>
> Disclaimer: this is not a standard way to deal with potentially breaking
> legacy semantics, because old applications will not usually be rebuilt
> to opt-in for the old semantics, but the use case in hand is probably
> so specific, of a specific application that relies on xfs specific semantics
> that there are currently no objections for trying this solution.
>
> [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
>
> >
> > Last week, it was mentioned that removing i_rwsem would have some
> > impacts on truncate, fallocate, and PNFS operations.
> >
> > (I'm not familiar with pNFS, so please correct me if I'm wrong.)
>
> You are not wrong. pNFS uses a "layout lease", which requires
> that the blockmap of the file will not be modified while the lease is held.
> but I think that block that are not mapped (i.e. holes) can be mapped
> while the lease is held.
>
> >
> > My understanding is that the current i_rwsem is used to protect both
> > the file's data and its size. Operations like truncate, fallocate,
> > and PNFS use i_rwsem because they modify both the file's data and its
> > size. So, I'm thinking whether it's possible to use i_rwsem to protect
> > only the file's size, without protecting the file's data.
> >
>
> It also protects the file's blockmap, for example, punch hole
> does not change the size, but it unmaps blocks from the blockmap,
> leading to races that could end up reading stale data from disk
> if the lock wasn't taken.
>
> > So operations that modify the file's size need to be executed
> > sequentially. For example, buffered writes to the EOF, fallocate
> > operations without the "keep size" requirement, and truncate operations,
> > etc, all need to hold an exclusive lock.
> >
> > Other operations require a shared lock because they only need to access
> > the file's size without modifying it.
> >
>
> As far as I understand, exclusive lock is not needed for non-extending
> writes, because it is ok to map new blocks.
> I guess the need for exclusive lock for extending writes is related to
> update of file size, but not 100% sure.
> Anyway, exclusive lock on extending write is widely common in other fs,
> while exclusive lock for non-extending write is unique to xfs.
I am sorry but I can't understand, because I found ext4_buffered_write_iter()
take exclusive lock unconditionally.
Did I miss some context of the discussion?
Thank you very much.
Jinliang Zheng
>
> > >
> > >> So we should track all the ranges we are reading or writing,
> > >> and check whether the new read or write operations can be performed
> > >> concurrently with the current operations.
> > >
> > > That is all discussed in detail in the discussions I linked.
> >
> > Sorry, I overlooked some details from old discussion last time.
> > It seems that you are not satisfied with the effectiveness of
> > range locks.
> >
>
> Correct. This solution was taken off the table.
>
> I hope my explanation was correct and clear.
> If anything else is not clear, please feel free to ask.
>
> Thanks,
> Amir.
Powered by blists - more mailing lists