linux-kernel - Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxgurXXQ23wkk_oB4Vx6Ej3ga5vv_94OVi28arOhaTWOKA@mail.gmail.com>
Date: Wed, 8 Jan 2025 12:45:55 +0100
From: Amir Goldstein <amir73il@...il.com>
To: Chi Zhiling <chizhiling@....com>, John Garry <john.g.garry@...cle.com>
Cc: Dave Chinner <david@...morbit.com>, djwong@...nel.org, cem@...nel.org, 
	linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Chi Zhiling <chizhiling@...inos.cn>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

On Wed, Jan 8, 2025 at 12:33 PM Amir Goldstein <amir73il@...il.com> wrote:
>
> On Wed, Jan 8, 2025 at 8:43 AM Chi Zhiling <chizhiling@....com> wrote:
> >
> > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > On Mon, Dec 30, 2024 at 3:43 AM Chi Zhiling <chizhiling@....com> wrote:
> > >> On 2024/12/29 06:17, Dave Chinner wrote:
> > >>> On Sat, Dec 28, 2024 at 03:37:41PM +0800, Chi Zhiling wrote:
> > >>>> On 2024/12/27 05:50, Dave Chinner wrote:
> > >>>>> On Thu, Dec 26, 2024 at 02:16:02PM +0800, Chi Zhiling wrote:
> > >>>>>> From: Chi Zhiling <chizhiling@...inos.cn>
> > >>>>>>
> > >>>>>> Using an rwsem to protect file data ensures that we can always obtain a
> > >>>>>> completed modification. But due to the lock, we need to wait for the
> > >>>>>> write process to release the rwsem before we can read it, even if we are
> > >>>>>> reading a different region of the file. This could take a lot of time
> > >>>>>> when many processes need to write and read this file.
> > >>>>>>
> > >>>>>> On the other hand, The ext4 filesystem and others do not hold the lock
> > >>>>>> during buffered reading, which make the ext4 have better performance in
> > >>>>>> that case. Therefore, I think it will be fine if we remove the lock in
> > >>>>>> xfs, as most applications can handle this situation.
> > >>>>>
> > >>>>> Nope.
> > >>>>>
> > >>>>> This means that XFS loses high level serialisation of incoming IO
> > >>>>> against operations like truncate, fallocate, pnfs operations, etc.
> > >>>>>
> > >>>>> We've been through this multiple times before; the solution lies in
> > >>>>> doing the work to make buffered writes use shared locking, not
> > >>>>> removing shared locking from buffered reads.
> > >>>>
> > >>>> You mean using shared locking for buffered reads and writes, right?
> > >>>>
> > >>>> I think it's a great idea. In theory, write operations can be performed
> > >>>> simultaneously if they write to different ranges.
> > >>>
> > >>> Even if they overlap, the folio locks will prevent concurrent writes
> > >>> to the same range.
> > >>>
> > >>> Now that we have atomic write support as native functionality (i.e.
> > >>> RWF_ATOMIC), we really should not have to care that much about
> > >>> normal buffered IO being atomic. i.e. if the application wants
> > >>> atomic writes, it can now specify that it wants atomic writes and so
> > >>> we can relax the constraints we have on existing IO...
> > >>
> > >> Yes, I'm not particularly concerned about whether buffered I/O is
> > >> atomic. I'm more concerned about the multithreading performance of
> > >> buffered I/O.
> > >
> > > Hi Chi,
> > >
> > > Sorry for joining late, I was on vacation.
> > > I am very happy that you have taken on this task and your timing is good,
> > > because  John Garry just posted his patches for large atomic writes [1].
> >
> > I'm glad to have you on board. :)
> >
> > I'm not sure if my understanding is correct, but it seems that our
> > discussion is about multithreading safety, while John's patch is about
> > providing power-failure safety, even though both mention atomicity.
> >
>
> You are right about the *motivation* of John's work.
> But as a by-product of his work, we get an API to opt-in for
> atomic writes and we can piggy back this opt-in API to say
> that whoever wants the legacy behavior can use the new API
> to get both power failure safety and multithreading safety.
>
> > >
> > > I need to explain the relation to atomic buffered I/O, because it is not
> > > easy to follow it from the old discussions and also some of the discussions
> > > about the solution were held in-person during LSFMM2024.
> > >
> > > Naturally, your interest is improving multithreading performance of
> > > buffered I/O, so was mine when I first posted this question [2].
> > >
> > > The issue with atomicity of buffered I/O is the xfs has traditionally
> > > provided atomicity of write vs. read (a.k.a no torn writes), which is
> > > not required by POSIX standard (because POSIX was not written with
> > > threads in mind) and is not respected by any other in-tree filesystem.
> > >
> > > It is obvious that the exclusive rw lock for buffered write hurts performance
> > > in the case of mixed read/write on xfs, so the question was - if xfs provides
> > > atomic semantics that portable applications cannot rely on, why bother
> > > providing these atomic semantics at all?
> >
> > Perhaps we can add an option that allows distributions or users to
> > decide whether they need this semantics. I would not hesitate to
> > turn off this semantics on my system when the time comes.
> >
>
> Yes, we can, but we do not want to do it unless we have to.
> 99% of the users do not want the legacy semantics, so it would
> be better if they get the best performance without having to opt-in to get it.
>
> > >
> > > Dave's answer to this question was that there are some legacy applications
> > > (database applications IIRC) on production systems that do rely on the fact
> > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > >
> > > However, it was noted that:
> > > 1. Those application do not require atomicity for any size of IO, they
> > >      typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > >      and they only require no torn writes for this I/O size
> > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > >      but application has currently no way of knowing if the semantics are
> > >      provided or not
> >
> > To be honest, it would be best if the folio lock could provide such
> > semantics, as it would not cause any potential problems for the
> > application, and we have hope to achieve concurrent writes.
> >
> > However, I am not sure if this is easy to implement and will not cause
> > other problems.
> >
> > > 3. The upcoming ability of application to opt-in for atomic writes larger
> > >      than fs block size [1] can be used to facilitate the applications that
> > >      want to legacy xfs semantics and avoid the need to enforce the legacy
> > >      semantics all the time for no good reason
> > >
> > > Disclaimer: this is not a standard way to deal with potentially breaking
> > > legacy semantics, because old applications will not usually be rebuilt
> > > to opt-in for the old semantics, but the use case in hand is probably
> > > so specific, of a specific application that relies on xfs specific semantics
> > > that there are currently no objections for trying this solution.
> > >
> > > [1] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/
> > > [2] https://lore.kernel.org/linux-xfs/CAOQ4uxi0pGczXBX7GRAFs88Uw0n1ERJZno3JSeZR71S1dXg+2w@mail.gmail.com/
> > >
> > >>
> > >> Last week, it was mentioned that removing i_rwsem would have some
> > >> impacts on truncate, fallocate, and PNFS operations.
> > >>
> > >> (I'm not familiar with pNFS, so please correct me if I'm wrong.)
> > >
> > > You are not wrong. pNFS uses a "layout lease", which requires
> > > that the blockmap of the file will not be modified while the lease is held.
> > > but I think that block that are not mapped (i.e. holes) can be mapped
> > > while the lease is held.
> > >
> > >>
> > >> My understanding is that the current i_rwsem is used to protect both
> > >> the file's data and its size. Operations like truncate, fallocate,
> > >> and PNFS use i_rwsem because they modify both the file's data and its
> > >> size. So, I'm thinking whether it's possible to use i_rwsem to protect
> > >> only the file's size, without protecting the file's data.
> > >>
> > >
> > > It also protects the file's blockmap, for example, punch hole
> > > does not change the size, but it unmaps blocks from the blockmap,
> > > leading to races that could end up reading stale data from disk
> > > if the lock wasn't taken.
> > >
> > >> So operations that modify the file's size need to be executed
> > >> sequentially. For example, buffered writes to the EOF, fallocate
> > >> operations without the "keep size" requirement, and truncate operations,
> > >> etc, all need to hold an exclusive lock.
> > >>
> > >> Other operations require a shared lock because they only need to access
> > >> the file's size without modifying it.
> > >>
> > >
> > > As far as I understand, exclusive lock is not needed for non-extending
> > > writes, because it is ok to map new blocks.
> > > I guess the need for exclusive lock for extending writes is related to
> > > update of file size, but not 100% sure.
> > > Anyway, exclusive lock on extending write is widely common in other fs,
> > > while exclusive lock for non-extending write is unique to xfs.
> > >
> > >>>
> > >>>> So we should track all the ranges we are reading or writing,
> > >>>> and check whether the new read or write operations can be performed
> > >>>> concurrently with the current operations.
> > >>>
> > >>> That is all discussed in detail in the discussions I linked.
> > >>
> > >> Sorry, I overlooked some details from old discussion last time.
> > >> It seems that you are not satisfied with the effectiveness of
> > >> range locks.
> > >>
> > >
> > > Correct. This solution was taken off the table.
> > >
> > > I hope my explanation was correct and clear.
> > > If anything else is not clear, please feel free to ask.
> > >
> >
> > I think your explanation is very clear.
>
> One more thing I should mention.
> You do not need to wait for atomic large writes patches to land.
> There is nothing stopping you from implementing the suggested
> solution based on the xfs code already in master (v6.13-rc1),
> which has support for the RWF_ATOMIC flag for writes.
>
> It just means that the API will not be usable for applications that
> want to do IO larger than block size, but concurrent read/write
> performance of 4K IO could be improved already.
>
> It's possible that all you need to do is:
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c488ae26b23d0..2542f15496488 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -777,9 +777,10 @@ xfs_file_buffered_write(
>         ssize_t                 ret;
>         bool                    cleared_space = false;
>         unsigned int            iolock;
> +       bool                    atomic_write = iocb->ki_flags & IOCB_ATOMIC;
>
>  write_retry:
> -       iolock = XFS_IOLOCK_EXCL;
> +       iolock = atomic_write ? XFS_IOLOCK_SHARED : XFS_IOLOCK_EXCL;

It's the other way around of course :)
and did not state the obvious - this mock patch is NOT TESTED!

Thanks,
Amir.

>         ret = xfs_ilock_iocb(iocb, iolock);
> --
>
> xfs_file_write_checks() afterwards already takes care of promoting
> XFS_IOLOCK_SHARED to XFS_IOLOCK_EXCL for extending writes.
>
> It is possible that XFS_IOLOCK_EXCL could be immediately demoted
> back to XFS_IOLOCK_SHARED for atomic_writes as done in
> xfs_file_dio_write_aligned().
>
> TBH, I am not sure which blockdevs support 4K atomic writes that could
> be used to test this.
>
> John, can you share your test setup instructions for atomic writes?
>
> Thanks,
> Amir.