Message-ID: <f8232f8b-06e0-4d1a-bee4-cfc2ac23194e@meta.com>
Date: Mon, 30 Sep 2024 20:56:03 -0400
From: Chris Mason <clm@...a.com>
To: Christian Theune <ct@...ingcircus.io>,
Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Dave Chinner <david@...morbit.com>, Matthew Wilcox <willy@...radead.org>,
Jens Axboe <axboe@...nel.dk>, linux-mm@...ck.org,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
folios since Dec 2021 (any kernel from 6.1 upwards)
On 9/30/24 7:34 PM, Christian Theune wrote:
> Hi,
>
> we’ve been running a number of VMs since last week on 6.11. We’ve encountered one hung task situation multiple times now that seems to be resolving itself after a bit of time, though. I do not see spinning CPU during this time.
>
> The situation seems to be related to cgroups-based IO throttling / weighting so far:
>
> Here are three examples of similar tracebacks from jobs that perform a certain amount of IO and are either given a weight or given an explicit limit like this:
>
> IOWeight=10
> IOReadIOPSMax=/dev/vda 188
> IOWriteIOPSMax=/dev/vda 188
>
> Telemetry for the affected VM does not show that it actually reaches 188 IOPS (the load is mostly writing) but creates a kind of gaussian curve …
>
> The underlying storage and network were completely inconspicuous during the whole time.
Not disagreeing with Linus at all, but given that you've got IO
throttling too, we might really just be waiting. It's hard to tell
because the hung task timeouts only give you information about one process.
I've attached a minimal version of a script we use here to show all the
D state processes; it might help explain things. The only problem is
that you have to actually ssh to the box and run it while you're stuck.
The idea is to print the stack trace of every D state process, and then
also print out how often each unique stack trace shows up. When we're
deadlocked on something, there are normally a bunch of the same stack
(say waiting on writeback) and then one jerk sitting around in a
different stack who is causing all the trouble.
(I made some quick changes to make this smaller, so apologies if you get
silly errors)
Example output:
sudo ./walker.py
15 rcu_tasks_trace_kthread D
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30
1440504 dd D
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
-----
stack summary
1 hit:
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30
-----
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
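For anyone who can't get at the attachment, the idea described above (dump the stack of every D state task, then count how often each unique stack shows up) can be sketched roughly as below. This is not the attached walker.py, just an illustration of the same approach. The function names (task_state, normalize, summarize, walk) are made up for this sketch, and it assumes a Linux /proc where /proc/<pid>/task/<tid>/stack is readable, which normally means running as root on a kernel built with stack trace support.

```python
#!/usr/bin/env python3
"""Sketch: print stacks of D state tasks and count identical stacks.

Illustrative only; not the attached walker.py. Helper names are hypothetical.
"""
import collections
import os


def task_state(stat_text):
    """Return (comm, state) from the contents of a /proc/.../stat file."""
    # comm is wrapped in parentheses and may itself contain ')' or spaces,
    # so locate the first '(' and the last ')'.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def normalize(stack_text):
    """Turn a stack dump into a tuple of non-empty lines so it can be counted."""
    return tuple(line.strip() for line in stack_text.splitlines() if line.strip())


def summarize(stacks):
    """Return [(stack, hits), ...], most frequent stack first."""
    return collections.Counter(stacks).most_common()


def walk(proc="/proc"):
    """Collect (tid, comm, stack) for every task currently in D state."""
    found = []
    try:
        pids = os.listdir(proc)
    except OSError:
        return found
    for pid in pids:
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were looking
        for tid in tids:
            base = os.path.join(task_dir, tid)
            try:
                with open(os.path.join(base, "stat"), errors="replace") as f:
                    comm, state = task_state(f.read())
                if state != "D":
                    continue
                with open(os.path.join(base, "stack"), errors="replace") as f:
                    stack = normalize(f.read())
            except (OSError, ValueError):
                continue  # task went away, or we lack permission
            found.append((tid, comm, stack))
    return found


def main():
    found = walk()
    for tid, comm, stack in found:
        print(tid, comm, "D")
        print("\n".join(stack))
        print()
    print("-----")
    print("stack summary")
    for stack, hits in summarize(s for _, _, s in found):
        print()
        print(f"{hits} hit{'s' if hits != 1 else ''}:")
        print("\n".join(stack))


if __name__ == "__main__":
    main()
```

The stacks are compared as exact text, offsets included, so two tasks only count as the same hit when they are blocked at the same point in the same call chain. That is usually what you want when looking for the one odd stack among many identical waiters.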
View attachment "walker.py.txt" of type "text/plain" (3020 bytes)