Message-ID: <395042eb-3986-45c1-88c1-482b750303c7@163.com>
Date: Fri, 2 May 2025 07:57:59 +0800
From: Chi Zhiling <chizhiling@....com>
To: Dave Chinner <david@...morbit.com>
Cc: cem@...nel.org, linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
Chi Zhiling <chizhiling@...inos.cn>
Subject: Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio lock
On 2025/5/1 07:45, Dave Chinner wrote:
> On Wed, Apr 30, 2025 at 05:03:51PM +0800, Chi Zhiling wrote:
>> On 2025/4/30 10:05, Dave Chinner wrote:
>>> On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
>>>> From: Chi Zhiling <chizhiling@...inos.cn>
>>>>
>>>> This is a patch attempting to implement concurrent buffered writes.
>>>> The main idea is to use the folio lock to ensure the atomicity of the
>>>> write when writing to a single folio, instead of using the i_rwsem.
>>>>
>>>> I tried the "folio batch" solution, which is a great idea, but during
>>>> testing, I encountered an OOM issue because the locked folios couldn't
>>>> be reclaimed.
>>>>
>>>> So for now, I can only allow concurrent writes within a single block.
>>>> The good news is that since we already support BS > PS, we can use a
>>>> larger block size to enable higher granularity concurrency.
>>>
>>> I'm not going to say no to this, but I think it's a short term and
>>> niche solution to the general problem of enabling shared buffered
>>> writes. i.e. I expect that it will not exist for long, whilst
>>
>> Hi, Dave,
>>
>> Yes, it's a short-term solution, but it's enough for some scenarios.
>> I would also like to see a better idea.
>>
>>> experience tells me that adding special cases to the IO path locking
>>> has a fairly high risk of unexpected regressions and/or data
>>> corruption....
>>
>> I can't say there is definitely no data corruption, but I haven't seen
>> any new errors in xfstests.
>
> Yeah, that's why they are "unexpected regressions" - testing looks
> fine, but once it gets out into complex production workloads....
>
>> We might need to add some assertions in the code to check for the risk
>> of data corruption, not specifically for this patch, but for the current
>> XFS system in general. This would help developers avoid introducing new
>> bugs, similar to the lockdep tool.
>
> I'm not sure what you invisage here or what problems you think we
> might be able to catch - can you describe what you are thinking
> about here?
I was just saying that casually.
I mean, is there a way to check for data corruption risks in advance,
rather than waiting for corruption to happen and then reporting an
error? Something like how lockdep detects deadlock risks ahead of time.
I guess not.
>
>>>> These ideas come from previous discussions:
>>>> https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/
>>>
>>> In my spare time I've been looking at using the two state lock from
>>> bcachefs for this because it looks to provide a general solution to
>>> the issue of concurrent buffered writes.
>>
>> In fact, I have tried the two state lock, and it does work quite well.
>> However, I noticed some performance degradation in single-threaded
>> scenarios in UnixBench (I'm not sure if it's caused by the memory
>> barrier).
>
> Please share the patch - I'd like to see how you implemented it and
> how you solved the various lock ordering and wider IO serialisation
> issues. It may be that I've overlooked something and your
> implementation makes me aware of it. OTOH, I might see something in
> your patch that could be improved that mitigates the regression.
I don't think I have solved those problems.
The lock order I envisioned is IOLOCK -> TWO_STATE_LOCK -> MMAP_LOCK ->
ILOCK, which means that when releasing IOLOCK, TWO_STATE_LOCK should be
released first as well, including when upgrading IOLOCK_SHARED to
IOLOCK_EXCL. However, I didn't do this.
I missed this part, and although I didn't hit any issues in xfstests,
it could indeed lead to a deadlock.
Besides this, is there anything else I have missed?
The patch is as follows, though it may not be that helpful:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3e7448c2a969..573e31bfef3f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -36,6 +36,17 @@
static const struct vm_operations_struct xfs_file_vm_ops;
+#define TWO_STATE_LOCK(ip, state) \
+ xfs_two_state_lock(&ip->i_write_lock, state)
+
+#define TWO_STATE_UNLOCK(ip, state) \
+ xfs_two_state_unlock(&ip->i_write_lock, state)
+
+#define buffered_lock(inode) TWO_STATE_LOCK(inode, 0)
+#define buffered_unlock(inode) TWO_STATE_UNLOCK(inode, 0)
+#define direct_lock(inode) TWO_STATE_LOCK(inode, 1)
+#define direct_unlock(inode) TWO_STATE_UNLOCK(inode, 1)
+
/*
* Decide if the given file range is aligned to the size of the fundamental
* allocation unit for the file.
@@ -263,7 +274,10 @@ xfs_file_dio_read(
ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
if (ret)
return ret;
+ direct_lock(ip);
ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+ direct_unlock(ip);
+
xfs_iunlock(ip, XFS_IOLOCK_SHARED);
return ret;
@@ -598,9 +612,13 @@ xfs_file_dio_write_aligned(
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
+
+ direct_lock(ip);
trace_xfs_file_direct_write(iocb, from);
ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
&xfs_dio_write_ops, 0, NULL, 0);
+ direct_unlock(ip);
+
out_unlock:
if (iolock)
xfs_iunlock(ip, iolock);
@@ -676,9 +694,11 @@ xfs_file_dio_write_unaligned(
if (flags & IOMAP_DIO_FORCE_WAIT)
inode_dio_wait(VFS_I(ip));
+ direct_lock(ip);
trace_xfs_file_direct_write(iocb, from);
ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
&xfs_dio_write_ops, flags, NULL, 0);
+ direct_unlock(ip);
/*
* Retry unaligned I/O with exclusive blocking semantics if the DIO
@@ -776,9 +796,11 @@ xfs_file_buffered_write(
if (ret)
goto out;
+ buffered_lock(ip);
trace_xfs_file_buffered_write(iocb, from);
ret = iomap_file_buffered_write(iocb, from,
&xfs_buffered_write_iomap_ops);
+ buffered_unlock(ip);
/*
* If we hit a space limit, try to free up some lingering preallocated
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 52210a54fe7e..a8bc8d9737c4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,6 +114,7 @@ xfs_inode_alloc(
spin_lock_init(&ip->i_ioend_lock);
ip->i_next_unlinked = NULLAGINO;
ip->i_prev_unlinked = 0;
+ two_state_lock_init(&ip->i_write_lock);
return ip;
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b91aaa23ea1e..9a8c75feda16 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -8,6 +8,7 @@
#include "xfs_inode_buf.h"
#include "xfs_inode_fork.h"
+#include "xfs_lock.h"
/*
* Kernel only inode definitions
@@ -92,6 +93,8 @@ typedef struct xfs_inode {
spinlock_t i_ioend_lock;
struct work_struct i_ioend_work;
struct list_head i_ioend_list;
+
+ two_state_lock_t i_write_lock;
} xfs_inode_t;
static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
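For anyone following along without the xfs_lock.h part of the patch: the
idea behind the two-state lock (borrowed from bcachefs's
two_state_shared_lock) is that any number of holders of the *same* state
may run concurrently, while the two states exclude each other. Below is
a minimal userspace sketch of those semantics using a mutex and condvar;
the names and details are hypothetical and this is not the kernel
implementation, just an illustration of the locking discipline the patch
relies on:

```c
/*
 * Userspace sketch of a two-state shared lock (hypothetical; modelled
 * on the idea behind bcachefs's two_state_shared_lock, not its code).
 * Holders of the same state run concurrently; the states exclude each
 * other, so e.g. buffered writers (state 0) and direct writers
 * (state 1) never overlap.
 */
#include <assert.h>
#include <pthread.h>

struct two_state_lock {
	pthread_mutex_t	mtx;
	pthread_cond_t	cond;
	int		state;	/* state currently holding the lock */
	int		count;	/* number of holders of that state */
};

void two_state_lock_init(struct two_state_lock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->cond, NULL);
	l->state = 0;
	l->count = 0;
}

void two_state_lock(struct two_state_lock *l, int state)
{
	pthread_mutex_lock(&l->mtx);
	/* Wait until the lock is free or already held in our state. */
	while (l->count > 0 && l->state != state)
		pthread_cond_wait(&l->cond, &l->mtx);
	l->state = state;
	l->count++;
	pthread_mutex_unlock(&l->mtx);
}

void two_state_unlock(struct two_state_lock *l, int state)
{
	pthread_mutex_lock(&l->mtx);
	assert(l->state == state && l->count > 0);
	/* Last holder out wakes waiters for the other state. */
	if (--l->count == 0)
		pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}
```

The uncontended acquire/release still pays for the synchronisation on
every call even with a single thread, which is consistent with the
single-threaded regression mentioned above (the kernel version uses
atomics rather than a mutex, but the barrier cost concern is the same).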
Thanks
>
>> Since single-threaded bufferedio is still the primary read-write mode,
>> I don't want to introduce too much impact in single-threaded scenarios.
>
> I mostly don't care that much about small single threaded
> performance regressions anywhere in XFS if there is some upside for
> scalability or performance. We've always traded off single threaded
> performance for better concurrency and/or scalability in XFS (right
> from the initial design choices way back in the early 1990s), so I
> don't see why we'd treat a significant improvement in buffered IO
> concurrency any differently.
>
> -Dave.