Message-ID: <ZvtdA2A8Ub9v5v3a@dread.disaster.area>
Date: Tue, 1 Oct 2024 12:22:59 +1000
From: Dave Chinner <david@...morbit.com>
To: Christian Theune <ct@...ingcircus.io>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Chris Mason <clm@...a.com>,
Jens Axboe <axboe@...nel.dk>, linux-mm@...ck.org,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
folios since Dec 2021 (any kernel from 6.1 upwards)

On Mon, Sep 30, 2024 at 07:34:39PM +0200, Christian Theune wrote:
> Hi,
>
> we’ve been running a number of VMs since last week on 6.11. We’ve
> encountered one hung task situation multiple times now that seems
> to be resolving itself after a bit of time, though. I do not see
> spinning CPU during this time.
>
> The situation seems to be related to cgroups-based IO throttling /
> weighting so far:
.....
> Sep 28 03:39:19 <redactedhostname>10 kernel: INFO: task nix-build:94696 blocked for more than 122 seconds.
> Sep 28 03:39:19 <redactedhostname>10 kernel: Not tainted 6.11.0 #1-NixOS
> Sep 28 03:39:19 <redactedhostname>10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 28 03:39:19 <redactedhostname>10 kernel: task:nix-build state:D stack:0 pid:94696 tgid:94696 ppid:94695 flags:0x00000002
> Sep 28 03:39:19 <redactedhostname>10 kernel: Call Trace:
> Sep 28 03:39:19 <redactedhostname>10 kernel: <TASK>
> Sep 28 03:39:19 <redactedhostname>10 kernel: __schedule+0x3a3/0x1300
> Sep 28 03:39:19 <redactedhostname>10 kernel: schedule+0x27/0xf0
> Sep 28 03:39:19 <redactedhostname>10 kernel: io_schedule+0x46/0x70
> Sep 28 03:39:19 <redactedhostname>10 kernel: folio_wait_bit_common+0x13f/0x340
> Sep 28 03:39:19 <redactedhostname>10 kernel: folio_wait_writeback+0x2b/0x80
> Sep 28 03:39:19 <redactedhostname>10 kernel: truncate_inode_partial_folio+0x5e/0x1b0
> Sep 28 03:39:19 <redactedhostname>10 kernel: truncate_inode_pages_range+0x1de/0x400
> Sep 28 03:39:19 <redactedhostname>10 kernel: evict+0x29f/0x2c0
> Sep 28 03:39:19 <redactedhostname>10 kernel: do_unlinkat+0x2de/0x330

That's not what I'd call expected behaviour.

By the time we are that far through eviction of a newly unlinked
inode, we've already removed the inode from the writeback lists and
we've supposedly waited for all writeback to complete.

IOWs, there shouldn't be a cached folio in writeback state at this
point in time - we're supposed to have guaranteed all writeback has
already completed before we call truncate_inode_pages_final()....

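For reference, the ordering in fs/inode.c:evict() goes roughly like
this (paraphrased from memory rather than copied from any particular
tree, so the details will vary between kernel versions):

static void evict(struct inode *inode)
{
	const struct super_operations *op = inode->i_sb->s_op;

	/* take the inode off the flusher's writeback list */
	if (!list_empty(&inode->i_io_list))
		inode_io_list_del(inode);

	inode_sb_list_del(inode);

	/*
	 * I_FREEING is already set, so the flusher won't start new
	 * writeback on this inode - we only have to wait for the
	 * writeback it is already doing to finish.
	 */
	inode_wait_for_writeback(inode);

	if (op->evict_inode) {
		op->evict_inode(inode);
	} else {
		truncate_inode_pages_final(&inode->i_data);
		clear_inode(inode);
	}
	....
}

i.e. whether we reach truncate_inode_pages_final() directly or via
the filesystem's ->evict_inode method, the inode is already off the
writeback lists and inode_wait_for_writeback() has returned by the
time the page cache truncate in the trace above runs.
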
So how are we getting a partial folio that is still under writeback
at this point in time?

-Dave.
--
Dave Chinner
david@...morbit.com