[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJnrk1aZ4==a3-uoRhH=qDKA36-FE6GoaKDZB7HX3o9pKdibYA@mail.gmail.com>
Date: Thu, 9 Oct 2025 15:11:13 -0700
From: Joanne Koong <joannelkoong@...il.com>
To: "guangming.zhao" <giveme.gulu@...il.com>
Cc: miklos@...redi.hu, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 5.15] fuse: Fix race condition in writethrough path A race
On Thu, Oct 9, 2025 at 4:09 AM guangming.zhao <giveme.gulu@...il.com> wrote:
>
Hi Guangming,
> The race occurs as follows:
> 1. A write operation locks a page, fills it with new data, marks it
> Uptodate, and then immediately unlocks it within fuse_fill_write_pages().
> 2. This opens a window before the new data is sent to the userspace daemon.
> 3. A concurrent read operation for the same page may decide to re-validate
> its cache from the daemon. The fuse_wait_on_page_writeback()
> mechanism does not protect this synchronous writethrough path.
> 4. The read request can be processed by the multi-threaded daemon *before*
> the write request, causing it to reply with stale data from its backend.
> 5. The read syscall returns this stale data to userspace, causing data
> verification to fail.
I don't think the issue is that the read returns stale data (the
client is responsible for synchronizing reads and writes, so if the
read is issued before the write has completed then it should be fine
that the read returned back stale data) but that the read will
populate the page cache with stale data (overwriting the write data in
the page cache), which makes later subsequent reads that are issued
after the write has completed return back stale data.
>
> This can be reliably reproduced on a mainline kernel (e.g., 6.1.x)
> using iogen and a standard multi-threaded libfuse passthrough filesystem.
>
> Steps to Reproduce:
> 1. Mount a libfuse passthrough filesystem (must be multi-threaded):
> $ ./passthrough /path/to/mount_point
>
> 2. Run the iogen/doio test from LTP (Linux Test Project) with mixed
> read/write operations (example):
> $ /path/to/ltp/iogen -N iogen01 -i 120s -s read,write 500k:/path/to/mount_point/file1 | \
> /path/to/ltp/doio -N iogen01 -a -v -n 2 -k
>
> 3. A data comparison error similar to the following will be reported:
> *** DATA COMPARISON ERROR ***
> check_file(/path/to/mount_point/file1, ...) failed
> expected bytes: X:3091346:gm-arco:doio*X:3091346
> actual bytes: 91346:gm-arco:doio*C:3091346:gm-
>
> The fix is to delay unlocking the page until after the data has been
> successfully sent to the daemon. This is achieved by moving the unlock
> logic from fuse_fill_write_pages() to the completion path of
> fuse_send_write_pages(), ensuring the page lock is held for the entire
> critical section and serializing the operations correctly.
>
> [Note for maintainers]
> This patch is created and tested against the 5.15 kernel. I have observed
> that recent kernels have migrated to using folios, and I am not confident
> in porting this fix to the new folio-based code myself.
>
> I am submitting this patch to clearly document the race condition and a
> proven fix on an older kernel, in the hope that a developer more
> familiar with the folio conversion can adapt it for the mainline tree.
>
> Signed-off-by: guangming.zhao <giveme.gulu@...il.com>
> ---
> [root@...arco example]# uname -a
> Linux gm-arco 6.16.8-arch3-1 #1 SMP PREEMPT_DYNAMIC Mon, 22 Sep 2025 22:08:35 +0000 x86_64 GNU/Linux
> [root@...arco example]# ./passthrough /tmp/test/
> [root@...arco example]# mkdir /tmp/test/yy
> [root@...arco example]# /home/gm/code/ltp/testcases/kernel/fs/doio/iogen -N iogen01 -i 120s -s read,write 500b:/tmp/test/yy/kk1 1000b:/tmp/test/yy/kk2 | /home/gm/code/ltp/testcases/kernel/fs/doio/doio -N iogen01 -a -v -n 2 -k
>
> iogen(iogen01) starting up with the following:
>
> Out-pipe: stdout
> Iterations: 120 seconds
> Seed: 3091343
> Offset-Mode: sequential
> Overlap Flag: off
> Mintrans: 1 (1 blocks)
> Maxtrans: 131072 (256 blocks)
> O_RAW/O_SSD Multiple: (Determined by device)
> Syscalls: read write
> Aio completion types: none
> Flags: buffered sync
>
> Test Files:
>
> Path Length iou raw iou file
> (bytes) (bytes) (bytes) type
> -----------------------------------------------------------------------------
> /tmp/test/yy/kk1 256000 1 512 regular
> /tmp/test/yy/kk2 512000 1 512 regular
>
> doio(iogen01) (3091346) 17:43:50
> ---------------------
> *** DATA COMPARISON ERROR ***
> check_file(/tmp/test/yy/kk2, 116844, 106653, X:3091346:gm-arco:doio*, 23, 0) failed
>
> Comparison fd is 3, with open flags 0
> Corrupt regions follow - unprintable chars are represented as '.'
> -----------------------------------------------------------------
> corrupt bytes starting at file offset 116844
> 1st 32 expected bytes: X:3091346:gm-arco:doio*X:3091346
> 1st 32 actual bytes: 91346:gm-arco:doio*C:3091346:gm-
> Request number 13873
> syscall: write(4, 02540107176414100, 106653)
> fd 4 is file /tmp/test/yy/kk2 - open flags are 04010001
> write done at file offset 116844 - pattern is X:3091346:gm-arco:doio*
>
> doio(iogen01) (3091344) 17:43:50
> ---------------------
> (parent) pid 3091346 exited because of data compare errors
>
> fs/fuse/file.c | 36 ++++++++++--------------------------
> 1 file changed, 10 insertions(+), 26 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 5c5ed58d9..a832c3122 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1098,7 +1098,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> struct fuse_file *ff = file->private_data;
> struct fuse_mount *fm = ff->fm;
> unsigned int offset, i;
> - bool short_write;
> int err;
>
> for (i = 0; i < ap->num_pages; i++)
> @@ -1113,26 +1112,21 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> if (!err && ia->write.out.size > count)
> err = -EIO;
>
> - short_write = ia->write.out.size < count;
> offset = ap->descs[0].offset;
> count = ia->write.out.size;
> for (i = 0; i < ap->num_pages; i++) {
> struct page *page = ap->pages[i];
>
> - if (err) {
> - ClearPageUptodate(page);
> - } else {
> - if (count >= PAGE_SIZE - offset)
> - count -= PAGE_SIZE - offset;
> - else {
> - if (short_write)
> - ClearPageUptodate(page);
> - count = 0;
> - }
> - offset = 0;
> - }
> - if (ia->write.page_locked && (i == ap->num_pages - 1))
> - unlock_page(page);
> + if (!err && !offset && count >= PAGE_SIZE)
> + SetPageUptodate(page);
> +
> + if (count > PAGE_SIZE - offset)
> + count -= PAGE_SIZE - offset;
> + else
> + count = 0;
> + offset = 0;
> +
> + unlock_page(page);
> put_page(page);
> }
>
> @@ -1195,16 +1189,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
> if (offset == PAGE_SIZE)
> offset = 0;
>
> - /* If we copied full page, mark it uptodate */
> - if (tmp == PAGE_SIZE)
> - SetPageUptodate(page);
> -
> - if (PageUptodate(page)) {
> - unlock_page(page);
> - } else {
> - ia->write.page_locked = true;
> - break;
> - }
I think this will run into the deadlock described here
https://lore.kernel.org/linux-fsdevel/CAHk-=wh9Eu-gNHzqgfvUAAiO=vJ+pWnzxkv+tX55xhGPFy+cOw@mail.gmail.com/,
so I think we would need a different solution. Maybe one idea is doing
something similar to what the fi->writectr bias does - that at least
seems simpler to me than having to unlock all the pages in the array
if we have to fault in the iov iter and then having to relock the
pages while making sure everything is all consistent.
Thanks,
Joanne
> if (!fc->big_writes)
> break;
> } while (iov_iter_count(ii) && count < fc->max_write &&
> --
> 2.51.0
>
>
Powered by blists - more mailing lists