Message-ID: <f8232f8b-06e0-4d1a-bee4-cfc2ac23194e@meta.com>
Date: Mon, 30 Sep 2024 20:56:03 -0400
From: Chris Mason <clm@...a.com>
To: Christian Theune <ct@...ingcircus.io>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Dave Chinner <david@...morbit.com>, Matthew Wilcox <willy@...radead.org>,
        Jens Axboe <axboe@...nel.dk>, linux-mm@...ck.org,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
        regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On 9/30/24 7:34 PM, Christian Theune wrote:
> Hi,
> 
> we’ve been running a number of VMs on 6.11 since last week. We’ve now hit the same hung task situation multiple times; it seems to resolve itself after a bit of time, though. I do not see a spinning CPU during this time.
> 
> The situation seems to be related to cgroups-based IO throttling / weighting so far:
> 
> Here are three examples of similar tracebacks from jobs that perform a certain amount of IO (either given a weight or an explicit limit like this):
> 
> IOWeight=10
> IOReadIOPSMax=/dev/vda 188
> IOWriteIOPSMax=/dev/vda 188
> 	
> Telemetry for the affected VM does not show it actually reaching 188 IOPS (the load is mostly writes); instead it traces a kind of Gaussian curve … 
> 
> The underlying storage and network was completely inconspicuous during the whole time.

Not disagreeing with Linus at all, but given that you've got IO
throttling too, we might really just be waiting.  It's hard to tell
because the hung task timeouts only give you information about one process.

I've attached a minimal version of a script we use here to show all the
D state processes; it might help explain things.  The only problem is
that you have to actually ssh to the box and run it while you're stuck.

The idea is to print the stack trace of every D state process, and then
also print how often each unique stack trace shows up.  When we're
deadlocked on something, there are normally a bunch of tasks stuck in
the same stack (say, waiting on writeback) and then one jerk sitting
around in a different stack who is causing all the trouble.

(I made some quick changes to make this smaller, so apologies if you get
silly errors)
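
In case the attachment gets mangled, here's a rough sketch of the same
idea (a minimal stand-in, not the attached walker.py; it assumes
/proc/<pid>/stack is readable, which needs root, and it only looks at
the main thread of each process):

    #!/usr/bin/env python3
    # Sketch: dump the kernel stack of every D-state task via /proc and
    # summarize how often each unique stack appears.  Run as root.
    import os
    from collections import Counter

    def state_and_comm(pid):
        # /proc/<pid>/stat looks like "<pid> (<comm>) <state> ..."
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 2:].split()[0]
        return state, comm

    def main():
        counts = Counter()
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                state, comm = state_and_comm(pid)
                if state != "D":
                    continue
                with open(f"/proc/{pid}/stack") as f:
                    stack = f.read().strip()
            except OSError:
                continue  # task exited or we lack permission; skip it
            counts[stack] += 1
            print(f"{pid} {comm} D\n{stack}\n")

        print("-----\nstack summary\n")
        for stack, n in counts.most_common():
            print(f"{n} hit{'s' if n != 1 else ''}:\n{stack}\n\n-----")

    if __name__ == "__main__":
        main()

Same usage as the attached script: ssh in while things are stuck and
run it as root.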

Example output:

 sudo ./walker.py
15 rcu_tasks_trace_kthread D
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30

1440504 dd D
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

-----
stack summary

1 hit:
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30

-----
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

[Attachment: walker.py.txt (text/plain, 3020 bytes)]
