Message-ID: <ZvtdA2A8Ub9v5v3a@dread.disaster.area>
Date: Tue, 1 Oct 2024 12:22:59 +1000
From: Dave Chinner <david@...morbit.com>
To: Christian Theune <ct@...ingcircus.io>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Chris Mason <clm@...a.com>,
Jens Axboe <axboe@...nel.dk>, linux-mm@...ck.org,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
folios since Dec 2021 (any kernel from 6.1 upwards)

On Mon, Sep 30, 2024 at 07:34:39PM +0200, Christian Theune wrote:
> Hi,
>
> we’ve been running a number of VMs since last week on 6.11. We’ve
> encountered one hung task situation multiple times now that seems
> to be resolving itself after a bit of time, though. I do not see
> spinning CPU during this time.
>
> The situation seems to be related to cgroups-based IO throttling /
> weighting so far:
.....
> Sep 28 03:39:19 <redactedhostname>10 kernel: INFO: task nix-build:94696 blocked for more than 122 seconds.
> Sep 28 03:39:19 <redactedhostname>10 kernel: Not tainted 6.11.0 #1-NixOS
> Sep 28 03:39:19 <redactedhostname>10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 28 03:39:19 <redactedhostname>10 kernel: task:nix-build state:D stack:0 pid:94696 tgid:94696 ppid:94695 flags:0x00000002
> Sep 28 03:39:19 <redactedhostname>10 kernel: Call Trace:
> Sep 28 03:39:19 <redactedhostname>10 kernel: <TASK>
> Sep 28 03:39:19 <redactedhostname>10 kernel: __schedule+0x3a3/0x1300
> Sep 28 03:39:19 <redactedhostname>10 kernel: schedule+0x27/0xf0
> Sep 28 03:39:19 <redactedhostname>10 kernel: io_schedule+0x46/0x70
> Sep 28 03:39:19 <redactedhostname>10 kernel: folio_wait_bit_common+0x13f/0x340
> Sep 28 03:39:19 <redactedhostname>10 kernel: folio_wait_writeback+0x2b/0x80
> Sep 28 03:39:19 <redactedhostname>10 kernel: truncate_inode_partial_folio+0x5e/0x1b0
> Sep 28 03:39:19 <redactedhostname>10 kernel: truncate_inode_pages_range+0x1de/0x400
> Sep 28 03:39:19 <redactedhostname>10 kernel: evict+0x29f/0x2c0
> Sep 28 03:39:19 <redactedhostname>10 kernel: do_unlinkat+0x2de/0x330

That's not what I'd call expected behaviour.

By the time we are that far through eviction of a newly unlinked
inode, we've already removed the inode from the writeback lists and
we've supposedly waited for all writeback to complete.

IOWs, there shouldn't be a cached folio in writeback state at this
point in time - we're supposed to have guaranteed all writeback has
already completed before we call truncate_inode_pages_final()....

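For reference, the ordering in fs/inode.c:evict() goes roughly like
this (paraphrased from memory rather than copied from any particular
tree, so the details will vary between kernel versions):

static void evict(struct inode *inode)
{
	const struct super_operations *op = inode->i_sb->s_op;

	/* take the inode off the flusher's writeback list */
	if (!list_empty(&inode->i_io_list))
		inode_io_list_del(inode);

	inode_sb_list_del(inode);

	/*
	 * I_FREEING is already set, so the flusher won't start new
	 * writeback on this inode - we only have to wait for the
	 * writeback it is already doing to finish.
	 */
	inode_wait_for_writeback(inode);

	if (op->evict_inode) {
		op->evict_inode(inode);
	} else {
		truncate_inode_pages_final(&inode->i_data);
		clear_inode(inode);
	}
	....
}

i.e. whether we reach truncate_inode_pages_final() directly or via
the filesystem's ->evict_inode method, the inode is already off the
writeback lists and inode_wait_for_writeback() has returned by the
time the page cache truncate in the trace above runs.
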
So how are we getting a partial folio that is still under writeback
at this point in time?

-Dave.
--
Dave Chinner
david@...morbit.com