Message-ID: <395042eb-3986-45c1-88c1-482b750303c7@163.com>
Date: Fri, 2 May 2025 07:57:59 +0800
From: Chi Zhiling <chizhiling@....com>
To: Dave Chinner <david@...morbit.com>
Cc: cem@...nel.org, linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
 Chi Zhiling <chizhiling@...inos.cn>
Subject: Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio
 lock

On 2025/5/1 07:45, Dave Chinner wrote:
> On Wed, Apr 30, 2025 at 05:03:51PM +0800, Chi Zhiling wrote:
>> On 2025/4/30 10:05, Dave Chinner wrote:
>>> On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
>>>> From: Chi Zhiling <chizhiling@...inos.cn>
>>>>
>>>> This is a patch attempting to implement concurrent buffered writes.
>>>> The main idea is to use the folio lock to ensure the atomicity of the
>>>> write when writing to a single folio, instead of using the i_rwsem.
>>>>
>>>> I tried the "folio batch" solution, which is a great idea, but during
>>>> testing, I encountered an OOM issue because the locked folios couldn't
>>>> be reclaimed.
>>>>
>>>> So for now, I can only allow concurrent writes within a single block.
>>>> The good news is that since we already support BS > PS, we can use a
>>>> larger block size to enable higher granularity concurrency.
>>>
>>> I'm not going to say no to this, but I think it's a short term and
>>> niche solution to the general problem of enabling shared buffered
>>> writes. i.e. I expect that it will not exist for long, whilst
>>
>> Hi, Dave,
>>
>> Yes, it's a short-term solution, but it's enough for some scenarios.
>> I would also like to see a better idea.
>>
>>> experience tells me that adding special cases to the IO path locking
>>> has a fairly high risk of unexpected regressions and/or data
>>> corruption....
>>
>> I can't say there is definitely no data corruption, but I haven't seen
>> any new errors in xfstests.
> 
> Yeah, that's why they are "unexpected regressions" - testing looks
> fine, but once it gets out into complex production workloads....
> 
>> We might need to add some assertions in the code to check for the risk
>> of data corruption, not specifically for this patch, but for the current
>> XFS system in general. This would help developers avoid introducing new
>> bugs, similar to the lockdep tool.
> 
> I'm not sure what you envisage here or what problems you think we
> might be able to catch - can you describe what you are thinking
> about here?

I was just saying that casually.

I mean, is there a way to check for data corruption risks in advance,
rather than waiting for corruption to happen and then reporting an
error? Much like how lockdep detects deadlock risks before they occur.

I guess not.
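
That said, if we did want to attempt something in that direction, I
imagine it would have to track in-flight write ranges and complain when
two writers overlap without exclusion. A purely hypothetical sketch,
where every name is invented for illustration (nothing like this exists
in XFS today):

struct write_range {
	struct list_head	list;
	loff_t			start;
	loff_t			end;
};

static LIST_HEAD(inflight_writes);
static DEFINE_SPINLOCK(inflight_lock);

/*
 * Register a write range before the write starts; warn if it overlaps
 * another in-flight write, since two writers touching overlapping
 * ranges concurrently is exactly the corruption risk in question.
 */
static void debug_write_begin(struct write_range *wr, loff_t pos, size_t len)
{
	struct write_range *cur;

	wr->start = pos;
	wr->end = pos + len - 1;

	spin_lock(&inflight_lock);
	list_for_each_entry(cur, &inflight_writes, list)
		WARN_ON(cur->start <= wr->end && wr->start <= cur->end);
	list_add(&wr->list, &inflight_writes);
	spin_unlock(&inflight_lock);
}

/* Deregister the range once the write completes. */
static void debug_write_end(struct write_range *wr)
{
	spin_lock(&inflight_lock);
	list_del(&wr->list);
	spin_unlock(&inflight_lock);
}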

> 
>>>> These ideas come from previous discussions:
>>>> https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@163.com/
>>>
>>> In my spare time I've been looking at using the two state lock from
>>> bcachefs for this because it looks to provide a general solution to
>>> the issue of concurrent buffered writes.
>>
>> In fact, I have tried the two state lock, and it does work quite well.
>> However, I noticed some performance degradation in single-threaded
>> scenarios in UnixBench (I'm not sure if it's caused by the memory
>> barrier).
> 
> Please share the patch - I'd like to see how you implemented it and
> how you solved the various lock ordering and wider IO serialisation
> issues. It may be that I've overlooked something and your
> implementation makes me aware of it. OTOH, I might see something in
> your patch that could be improved that mitigates the regression.

I don't think I have solved those problems.

The lock order I envisioned is IOLOCK -> TWO_STATE_LOCK -> MMAP_LOCK ->
ILOCK, which means that before releasing the IOLOCK, the TWO_STATE_LOCK
should be released first, including when upgrading IOLOCK_SHARED to
IOLOCK_EXCL. However, I didn't do this.

I missed this part, and although I didn't encounter any issues in
xfstests, it could indeed lead to a deadlock.
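
For example, the upgrade path should probably have looked something like
this (a sketch only, using the buffered_lock()/buffered_unlock() helpers
from the patch below):

	/*
	 * Honour the IOLOCK -> TWO_STATE_LOCK ordering across an IOLOCK
	 * upgrade: drop the two-state lock before releasing the shared
	 * IOLOCK, and only retake it once the exclusive IOLOCK is held,
	 * so the two locks are never seen held out of order.
	 */
	buffered_unlock(ip);
	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	buffered_lock(ip);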


Besides this, is there anything else I have missed?



The patch is as follows, though it may not be all that helpful:

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3e7448c2a969..573e31bfef3f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -36,6 +36,17 @@

  static const struct vm_operations_struct xfs_file_vm_ops;

+#define TWO_STATE_LOCK(ip, state)		\
+	xfs_two_state_lock(&ip->i_write_lock, state)
+
+#define TWO_STATE_UNLOCK(ip, state)		\
+	xfs_two_state_unlock(&ip->i_write_lock, state)
+
+#define buffered_lock(inode)	TWO_STATE_LOCK(inode, 0)
+#define buffered_unlock(inode)	TWO_STATE_UNLOCK(inode, 0)
+#define direct_lock(inode)	TWO_STATE_LOCK(inode, 1)
+#define direct_unlock(inode)	TWO_STATE_UNLOCK(inode, 1)
+
  /*
   * Decide if the given file range is aligned to the size of the fundamental
   * allocation unit for the file.
@@ -263,7 +274,10 @@ xfs_file_dio_read(
  	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
  	if (ret)
  		return ret;
+	direct_lock(ip);
  	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+	direct_unlock(ip);
+
  	xfs_iunlock(ip, XFS_IOLOCK_SHARED);

  	return ret;
@@ -598,9 +612,13 @@ xfs_file_dio_write_aligned(
  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
  		iolock = XFS_IOLOCK_SHARED;
  	}
+
+	direct_lock(ip);
  	trace_xfs_file_direct_write(iocb, from);
  	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
  			   &xfs_dio_write_ops, 0, NULL, 0);
+	direct_unlock(ip);
+
  out_unlock:
  	if (iolock)
  		xfs_iunlock(ip, iolock);
@@ -676,9 +694,11 @@ xfs_file_dio_write_unaligned(
  	if (flags & IOMAP_DIO_FORCE_WAIT)
  		inode_dio_wait(VFS_I(ip));

+	direct_lock(ip);
  	trace_xfs_file_direct_write(iocb, from);
  	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
  			   &xfs_dio_write_ops, flags, NULL, 0);
+	direct_unlock(ip);

  	/*
  	 * Retry unaligned I/O with exclusive blocking semantics if the DIO
@@ -776,9 +796,11 @@ xfs_file_buffered_write(
  	if (ret)
  		goto out;

+	buffered_lock(ip);
  	trace_xfs_file_buffered_write(iocb, from);
  	ret = iomap_file_buffered_write(iocb, from,
  			&xfs_buffered_write_iomap_ops);
+	buffered_unlock(ip);

  	/*
  	 * If we hit a space limit, try to free up some lingering preallocated
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 52210a54fe7e..a8bc8d9737c4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,6 +114,7 @@ xfs_inode_alloc(
  	spin_lock_init(&ip->i_ioend_lock);
  	ip->i_next_unlinked = NULLAGINO;
  	ip->i_prev_unlinked = 0;
+	two_state_lock_init(&ip->i_write_lock);

  	return ip;
  }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b91aaa23ea1e..9a8c75feda16 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -8,6 +8,7 @@

  #include "xfs_inode_buf.h"
  #include "xfs_inode_fork.h"
+#include "xfs_lock.h"

  /*
   * Kernel only inode definitions
@@ -92,6 +93,8 @@ typedef struct xfs_inode {
  	spinlock_t		i_ioend_lock;
  	struct work_struct	i_ioend_work;
  	struct list_head	i_ioend_list;
+
+	two_state_lock_t	i_write_lock;
  } xfs_inode_t;

  static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
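
(The xfs_lock.h above is not included in this diff. The two-state lock
it declares is essentially bcachefs's two_state_shared_lock; a minimal
sketch of the API the patch assumes, modelled on that code rather than
copied from my tree, looks like this:)

typedef struct {
	atomic_long_t		v;	/* > 0: state 1 holders, < 0: state 0 */
	wait_queue_head_t	wait;
} two_state_lock_t;

static inline void two_state_lock_init(two_state_lock_t *lock)
{
	atomic_long_set(&lock->v, 0);
	init_waitqueue_head(&lock->wait);
}

/*
 * Any number of holders may share the lock in one state (0 = buffered,
 * 1 = direct); taking it in the other state waits for all current
 * holders to drain.
 */
static inline bool xfs_two_state_trylock(two_state_lock_t *lock, int s)
{
	long i = s ? 1 : -1;
	long v = atomic_long_read(&lock->v), old;

	do {
		old = v;
		if (i > 0 ? v < 0 : v > 0)
			return false;	/* held in the other state */
	} while ((v = atomic_long_cmpxchg_acquire(&lock->v, old, old + i)) != old);
	return true;
}

static inline void xfs_two_state_lock(two_state_lock_t *lock, int s)
{
	wait_event(lock->wait, xfs_two_state_trylock(lock, s));
}

static inline void xfs_two_state_unlock(two_state_lock_t *lock, int s)
{
	long i = s ? 1 : -1;

	/* on the last unlock of this state, wake waiters for the other */
	if (atomic_long_sub_return_release(i, &lock->v) == 0)
		wake_up_all(&lock->wait);
}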



Thanks

> 
>> Since single-threaded buffered IO is still the primary read/write mode,
>> I don't want to introduce too much of an impact on single-threaded scenarios.
> 
> I mostly don't care that much about small single threaded
> performance regressions anywhere in XFS if there is some upside for
> scalability or performance. We've always traded off single threaded
> performance for better concurrency and/or scalability in XFS (right
> from the initial design choices way back in the early 1990s), so I
> don't see why we'd treat a significant improvement in buffered IO
> concurrency any differently.
> 
> -Dave.

