[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aT-SSeAqz3bF5iHw@eldamar.lan>
Date: Mon, 15 Dec 2025 05:44:57 +0100
From: Salvatore Bonaccorso <carnil@...ian.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: J. Neuschäfer <j.neuschaefer@....net>,
1120058@...s.debian.org, Jingbo Xu <jefflexu@...ux.alibaba.com>,
Jeff Layton <jlayton@...nel.org>,
Miklos Szeredi <mszeredi@...hat.com>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, regressions@...ts.linux.dev
Subject: Re: [regression] 0c58a97f919c ("fuse: remove tmp folio for
writebacks and internal rb tree") results in suspend-to-RAM hang on AMD
Ryzen 5 5625U on test scenario involving podman containers, x2go and openjdk
workload
Hi Joanne,
On Mon, Dec 15, 2025 at 06:37:42AM +0800, Joanne Koong wrote:
> On Sun, Dec 14, 2025 at 10:27 PM Salvatore Bonaccorso <carnil@...ian.org> wrote:
> >
> > Hi Joanne,
> >
> > In Debian J. Neuschäfer reported an issue where after 0c58a97f919c
> > ("fuse: remove tmp folio for writebacks and internal rb tree") a
> > specific, but admittely not very minimal workload, involving podman
> > contains, x2goserver and a openjdk application restults in
> > suspend-to-ram hang.
> >
> > The report is at https://bugs.debian.org/1120058 and information on
> > bisection and the test setup follows:
> >
> > On Sun, Nov 30, 2025 at 02:11:13PM +0100, J. Neuschäfer wrote:
> > > On Fri, Nov 28, 2025 at 09:10:25PM +0100, Salvatore Bonaccorso wrote:
> > > > Control: found -1 6.17.8-1
> > > >
> > > > Hi,
> > > >
> > > > On Fri, Nov 28, 2025 at 11:50:48AM +0100, J. Neuschäfer wrote:
> > > > > On Wed, Nov 05, 2025 at 06:09:43AM +0100, Salvatore Bonaccorso wrote:
> > > [...]
> > > > > I can reproduce the bug fairly reliably on 6.16/17 by running a specific
> > > > > podman container plus x2go (not entirely sure which parts of this is
> > > > > necessary).
> > > >
> > > > Okay if you have a very reliable way to reproduce it, would you be
> > > > open to make "your hands bit dirty" and do some bisecting on the
> > > > issue?
> > >
> > > Thank you for your detailed instructions! I've already started and completed
> > > the git bisect run in the meantime. I had to restart a few times due to
> > > mistakes, but I was able to identify the following upstream commit as the
> > > commit that introduced the issue:
> > >
> > > https://git.kernel.org/linus/0c58a97f919c24fe4245015f4375a39ff05665b6
> > >
> > > fuse: remove tmp folio for writebacks and internal rb tree
> > >
> > > The relevant commit history is as follows:
> > >
> > > * 2619a6d413f4c3 Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse <-- bad
> > > |\
> > > | * dabb9039102879 fuse: increase readdir buffer size
> > > | * 467e245d47e666 readdir: supply dir_context.count as readdir buffer size hint
> > > | * c31f91c6af96a5 fuse: don't allow signals to interrupt getdents copying
> > > | * f3cb8bd908c72e fuse: support large folios for writeback
> > > | * 906354c87f4917 fuse: support large folios for readahead
> > > | * ff7c3ee4842d87 fuse: support large folios for queued writes
> > > | * c91440c89fbd9d fuse: support large folios for stores
> > > | * cacc0645bcad3e fuse: support large folios for symlinks
> > > | * 351a24eb48209b fuse: support large folios for folio reads
> > > | * d60a6015e1a284 fuse: support large folios for writethrough writes
> > > | * 63c69ad3d18a80 fuse: refactor fuse_fill_write_pages()
> > > | * 3568a956932621 fuse: support large folios for retrieves
> > > | * 394244b24fdd09 fuse: support copying large folios
> > > | * f09222980d7751 fs: fuse: add dev id to /dev/fuse fdinfo
> > > | * 18ee43c398af0b docs: filesystems: add fuse-passthrough.rst
> > > | * 767c4b82715ad3 MAINTAINERS: update filter of FUSE documentation
> > > | * 69efbff69f89c9 fuse: fix race between concurrent setattrs from multiple nodes
> > > | * 0c58a97f919c24 fuse: remove tmp folio for writebacks and internal rb tree <-- first bad commit
> > > | * 0c4f8ed498cea1 mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
> > > | * 4fea593e625cd5 fuse: optimize over-io-uring request expiration check
> > > | * 03a3617f92c2a7 fuse: use boolean bit-fields in struct fuse_copy_state
> > > | * a5c4983bb90759 fuse: Convert 'write' to a bit-field in struct fuse_copy_state
> > > | * 2396356a945bb0 fuse: add more control over cache invalidation behaviour
> > > | * faa794dd2e17e7 fuse: Move prefaulting out of hot write path
> > > | * 0486b1832dc386 fuse: change 'unsigned' to 'unsigned int'
> > > * 0fb34422b5c223 Merge tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs <-- good
> > >
> > > The first and last commits shown are merge commits done by Linus Torvalds. The
> > > fuse-update branch was based on v6.15-rc1, under which I can't run my test due
> > > to an unrelated bug, so I ended up merging in 0fb34422b5c223 to test the
> > > commits within the fuse-update branch. e.g.:
> > >
> > > git reset --hard 394244b24fdd09 && git merge 0fb34422b5c223 && make clean && make
> > >
> > >
> > > I have also verified that the issue still happens on v6.18-rc7 but I wasn't
> > > able to revert 0c58a97f919 on top of this release, because a trivial revert
> > > is not possible.
> > >
> > > My test case consists of a few parts:
> > >
> > > - A podman container based on the "debian:13" image (which points to
> > > docker.io/library/debian via /etc/containers/registries.conf.d/shortnames.conf),
> > > where I installed x2goserver and a openjdk-21-based application; It runs the
> > > OpenSSH server and port 22 is exposed as localhost:2001
> > > - x2goclient to start a desktop session in the container
> > >
> > > Source code: https://codeberg.org/neuschaefer/re-workspace
> > >
> > > I suspect, but haven't verified, that the X server in the container somehow
> > > uses the FUSE-emulated filesystem in the container to create a file that is
> > > used with mmap (perhaps to create shared pages as frame buffers).
> > >
> > >
> > > Raw bisect notes:
> > >
> > > good:
> > > - v6.12.48+deb13-amd64
> > > - v6.12.59
> > > - v6.12
> > > - v6.14
> > > - v6.15-1304-g14418ddcc2c205
> > > - v6.15-10380-gec71f661a572
> > > - v6.15-10888-gb509c16e1d7cba
> > > - v6.15-rc7-357-g8e86e73626527e
> > > - v6.15-10933-g4c3b7df7844340
> > > - v6.15-10954-gd00a83477e7a8f
> > > - v6.15-rc7-366-g438e22801b1958 (CONFIG_X86_5LEVEL=y)
> > > - v6.15-rc4-126-g07212d16adc7a0
> > > - v6.15-10958-gdf7b9b4f6bfeb1 <-- first parent, 5LEVEL doesn't exist
> > > - v6.15-rc4-00127-g4d62121ce9b5
> > > - v6.15-rc7-375-g61374cc145f4a5 <-- second parent, `X86_5LEVEL=y`
> > > - v6.15-rc7-375-g61374cc145f4a5 <-- second parent, `X86_5LEVEL=n`
> > > - v6.15-11061-g7f9039c524a351: "first bad", actually good. merge of df7b9b4f6bfeb1 61374cc145f4a5
> > > - v6.15-11093-g0fb34422b5c223
> > > - v6.15-rc1-7-g0c4f8ed498cea1 + merge = v6.15-11101-gaec20ffad33068
> > >
> > > testing:
> > > - v6.18-rc7 + revert: doesn't apply
> > >
> > > weird (ssh doesn't work):
> > > - v6.15-rc1-1-g0486b1832dc386
> > > - v6.15-rc1-10-g767c4b82715ad3
> > > - v6.15-rc1-13-g394244b24fdd09: folio stuff
> > > - v6.15-rc1-22-gf3cb8bd908c72e
> > > - v6.15-rc1-23-gc31f91c6af96a5
> > > - next-20251128
> > >
> > > bad:
> > > - v6.15-rc1-8-g0c58a97f919c24 + merge = v6.15-11102-gdfc4869c8ef1f0 first bad commit
> > > - v6.15-rc1-9-g69efbff69f89c9 + merge = v6.15-11103-ga7b103c57680ce
> > > - v6.15-rc1-11-g18ee43c398af0b + merge = v6.15-11105-g4ad0d4fa61974c
> > > - v6.15-rc1-13-g394244b24fdd09 + merge = v6.15-11107-g37da056b3b873b
> > > - v6.15-11119-g2619a6d413f4c3: merge of 0fb34422b5c223 (last good) dabb9039102879 (fuse branch)
> > > - v6.15-11165-gfd1f8473503e5b: confirmed bad
> > > - v6.15-11401-g69352bd52b2667
> > > - v6.15-12422-g2c7e4a2663a1ab
> > > - regulator-fix-v6.16-rc2-372-g5c00eca95a9a20
> > > - v6.16.12
> > > - v6.16.12 again
> > > - v6.16.12+deb14+1-amd64
> > > - v6.18-rc7
> >
> > Would that ring some bells to you which make this tackable?
>
> Hi Salvatore,
>
> This looks like the same issue reported in this thread [1]. The lockup
> occurs when there's a faulty fuse server on the system that doesn't
> complete a write request. Prior to commit 0c58a97f919c24 ("fuse:
> remove tmp folio for writebacks and internal rb tree"), syncs on fuse
> filesystems were effectively no-ops. This patch upstream [2] reverts
> the behavior back to that. I'll send v2 of that patch and work on
> getting it merged as soon as possible.
Thank you and apologies I did miss the already reported issue.
J. Neuschäfer might it be possible you test the patch and report back
if that fixes your issue?
Regards,
Salvatore
Powered by blists - more mailing lists