linux-kernel - Re: [PATCH] xfs: don't do inodgc work if task is exiting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230512225414.GE3223426@dread.disaster.area>
Date:   Sat, 13 May 2023 08:54:14 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Tycho Andersen <tycho@...ho.pizza>
Cc:     "Darrick J. Wong" <djwong@...nel.org>, linux-xfs@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        Tycho Andersen <tandersen@...flix.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>
Subject: Re: [PATCH] xfs: don't do inodgc work if task is exiting

On Fri, May 12, 2023 at 09:15:11AM -0600, Tycho Andersen wrote:
> INFO: task java:3546393 blocked for more than 1912 seconds.
>       Tainted: G           OE     5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:skyframe-evalua state:D stack:    0 pid:3546393 ppid:3532734 flags:0x00000220
> Call Trace:
>  <TASK>
>  __schedule+0x2c5/0x8d0
>  schedule+0x3a/0xa0
>  schedule_timeout+0x115/0x280
>  ? xfs_buf_read_map+0x52/0x2e0 [xfs]
>  ? xfs_da_read_buf+0xcf/0x120 [xfs]
>  __down+0x90/0xe0
>  ? down+0x43/0x60
>  down+0x43/0x60
>  xfs_buf_lock+0x29/0xa0 [xfs]
>  xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
>  xfs_buf_get_map+0x4c/0x400 [xfs]
>  xfs_buf_read_map+0x52/0x2e0 [xfs]
>  ? xfs_read_agi+0x8c/0x120 [xfs]
>  xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
>  ? xfs_read_agi+0x8c/0x120 [xfs]
>  xfs_read_agi+0x8c/0x120 [xfs]
>  xfs_ialloc_read_agi+0x2a/0x90 [xfs]
>  xfs_dialloc+0x1ab/0x680 [xfs]
>  ? xfs_trans_reserve_quota_icreate+0x32/0x40 [xfs]
>  ? xfs_trans_alloc_icreate+0x75/0xf0 [xfs]
>  xfs_create+0x39e/0x5b0 [xfs]
>  xfs_generic_create+0x129/0x380 [xfs]
>  ? generic_permission+0x27/0x200
>  vfs_create+0x12b/0x1c0
>  ovl_create_real+0xd7/0x220 [overlay]
>  ovl_create_or_link+0x166/0x670 [overlay]

Lots of stuff blocked on the AGI lock doing inode allocation. Kinda
what I expected to see...

> INFO: task kworker/66:1:3548191 blocked for more than 1912 seconds.
>       Tainted: G           OE     5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/66:1    state:D stack:    0 pid:3548191 ppid:     2 flags:0x00004000
> Workqueue: xfs-inodegc/nvme1n1 xfs_inodegc_worker [xfs]
> Call Trace:
>  <TASK>
>  __schedule+0x2c5/0x8d0
>  schedule+0x3a/0xa0
>  schedule_timeout+0x115/0x280
>  __down+0x90/0xe0
>  ? down+0x43/0x60
>  down+0x43/0x60
>  xfs_buf_lock+0x29/0xa0 [xfs]
>  xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
>  xfs_buf_get_map+0x4c/0x400 [xfs]
>  xfs_buf_read_map+0x52/0x2e0 [xfs]
>  ? xfs_read_agi+0x8c/0x120 [xfs]
>  xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
>  ? xfs_read_agi+0x8c/0x120 [xfs]
>  xfs_read_agi+0x8c/0x120 [xfs]
>  xfs_iunlink_remove+0x5a/0x220 [xfs]
>  xfs_ifree+0x7a/0x4c0 [xfs]
>  ? xfs_trans_alloc+0xec/0x1e0 [xfs]
>  xfs_inactive_ifree+0xa1/0x1a0 [xfs]
>  xfs_inactive+0xf1/0x170 [xfs]
>  xfs_inodegc_worker+0x76/0x100 [xfs]
>  process_one_work+0x200/0x3d0
>  worker_thread+0x2d/0x3e0
>  ? process_one_work+0x3d0/0x3d0
>  kthread+0x11a/0x140
>  ? set_kthread_struct+0x40/0x40
>  ret_from_fork+0x22/0x30
>  </TASK>

Yup, there's the stuck inodegc worker - also trying to get the AGI
lock doing inode unlink. I'm not surprised at all by this - working
out why this worker is stuck here is the goal because that is what
is causing all your flush problems.

It also gives me a better idea of what the likely issue is, too.

> INFO: task bazelenv:3548683 blocked for more than 1912 seconds.
>       Tainted: G           OE     5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:bazelenv        state:D stack:    0 pid:3548683 ppid:3435211 flags:0x00000220
> Call Trace:
>  <TASK>
>  __schedule+0x2c5/0x8d0
>  schedule+0x3a/0xa0
>  schedule_timeout+0x115/0x280
>  __down+0x90/0xe0
>  ? down+0x43/0x60
>  down+0x43/0x60
>  xfs_buf_lock+0x29/0xa0 [xfs]
>  xfs_buf_find.isra.34+0x1fd/0x610 [xfs]
>  xfs_buf_get_map+0x4c/0x400 [xfs]
>  xfs_buf_read_map+0x52/0x2e0 [xfs]
>  ? xfs_read_agf+0x84/0x100 [xfs]
>  xfs_trans_read_buf_map+0x223/0x2d0 [xfs]
>  ? xfs_read_agf+0x84/0x100 [xfs]
>  xfs_read_agf+0x84/0x100 [xfs]
>  xfs_alloc_read_agf+0x3a/0x1b0 [xfs]
>  xfs_alloc_fix_freelist+0x434/0x500 [xfs]
>  ? __kmalloc+0x3c0/0x400
>  ? xfs_dquot_to_disk+0xfe/0x110 [xfs]
>  ? xfs_qm_dquot_logitem_format+0x89/0x150 [xfs]
>  xfs_free_extent_fix_freelist+0x61/0xa0 [xfs]
>  __xfs_free_extent+0x6a/0x1c0 [xfs]
>  xfs_trans_free_extent+0x3b/0xd0 [xfs]
>  xfs_extent_free_finish_item+0x23/0x40 [xfs]
>  xfs_defer_finish_noroll+0x24b/0x5a0 [xfs]
>  xfs_defer_finish+0x13/0x80 [xfs]
>  xfs_itruncate_extents_flags+0x13b/0x220 [xfs]
>  xfs_setattr_size+0x35c/0x3d0 [xfs]
>  ? aa_compute_fperms+0x16f/0x190
>  xfs_vn_setattr+0xf7/0x110 [xfs]

And that confirms the suspicions I have - another process blocked on
an AGF lock. That might be IO it's blocking on here, but combined
with the inode unlink being stuck in inodegc, I have a suspicion
that this is an AGF - AGI - inode buffer lock ordering issue.

There's been a heap of work done in this space since 5.15, and I'd
suggest that almost none of it is trivially backportable as it's
deeply intertwined with relatively complex algorithm changes.

I also recently found evidence of another AGI -> inode buffer -> AGF
-> inode buffer deadlock issue debugging another hang from a machine
making heavy use of O_TMPFILEs via OVL.  Without a kernel code to
examine in intricate detail, I can't say if this is what you are
seeing.

I also don't have a fix for that yet because, well, it's complex and
I can really only accomodate one complex high priority bug at a time
in my head and it hasn't been my highiest priority. As it is, I
don't think the fix for it will be trivially backportable to 5.15,
either....

> INFO: task journalctl:3549038 blocked for more than 1912 seconds.
>       Tainted: G           OE     5.15.35netflix-g54efd87a8576 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:journalctl      state:D stack:    0 pid:3549038 ppid:469330 flags:0x00000224
> Call Trace:
>  <TASK>
>  __schedule+0x2c5/0x8d0
>  schedule+0x3a/0xa0
>  schedule_timeout+0x115/0x280
>  ? __traceiter_sched_wakeup+0x29/0x40
>  ? ttwu_do_wakeup+0x106/0x170
>  wait_for_completion+0x9f/0x100
>  __flush_work+0x161/0x1f0
>  ? worker_detach_from_pool+0xb0/0xb0
>  xfs_inodegc_flush.part.25+0x4f/0xa0 [xfs]
>  xfs_fs_statfs+0x35/0x1e0 [xfs]
>  statfs_by_dentry+0x67/0x90
>  vfs_statfs+0x16/0xd0
>  ovl_statfs+0x40/0x70 [overlay]

Fixed by commit 5e672cd69f0a ("xfs: introduce xfs_inodegc_push()")
which also means that the commits Darrick suggested you try aren't
relevant. The mod_delayed_work() issues they address weren't
introduced until commit 7cf2b0f9611b ("xfs: bound maximum wait time
for inodegc work") which has a direct dependency on 5e672cd69f0a
being present....

-Dave.
-- 
Dave Chinner
david@...morbit.com