Message-ID: <c71f1cb7-14d6-45e4-9df1-dc9bc82deda8@oracle.com>
Date: Tue, 14 Nov 2023 12:09:19 -0800
From: junxiao.bi@...cle.com
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
gregkh@...uxfoundation.org
Subject: Re: [PATCH] kernfs: support kernfs notify in memory reclaim context
On 11/14/23 11:06 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
>> kernfs notify is used in the write path of md (md_write_start) to wake up
>> a userspace daemon, like "mdmon", for updating the md superblock of an
>> imsm raid. The md write will wait for that update to finish before issuing
>> the write; if this
> How is forward progress guaranteed for that userspace daemon? This sounds
> like a really fragile setup.
For imsm raid, the userspace daemon "mdmon" is responsible for updating the
raid metadata; the kernel uses kernfs_notify to wake up the daemon wherever
a metadata update is required. If the daemon can't make forward progress,
writes may hang, but that would be a bug in the daemon?
>
>> write is issued in the memory reclaim path, the system may hang because
>> the kernfs notify can't be executed. That's because kernfs notify is
>> executed by "system_wq", which doesn't have a rescuer thread, and a
>> kworker thread may not be created under memory pressure; then the
>> userspace daemon can't be woken up and the md write will hang.
>>
>> According to Tejun, this can't be fixed by adding WQ_MEM_RECLAIM to
>> "system_wq", because that workqueue is shared and someone else might
>> occupy its rescuer thread. Fixing this from the md side would require
>> replacing kernfs notify with some other way to communicate with the
>> userspace daemon, which would break the userspace interface, so use a
>> separate workqueue for kernfs notify to allow it to be used in memory
>> reclaim context.
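
For reference, the approach the patch describes boils down to something like
the following sketch (function and variable names here are illustrative, not
necessarily the literal patch): allocate a dedicated workqueue with
WQ_MEM_RECLAIM so a rescuer thread is guaranteed, and queue the notify work
there instead of on "system_wq":

```c
/*
 * Sketch of the approach, not the literal patch: give kernfs notify its
 * own workqueue with a rescuer thread so it can make forward progress
 * even when no new kworkers can be created under memory pressure.
 */
static struct workqueue_struct *kernfs_notify_wq;

static int __init kernfs_notify_wq_init(void)
{
	/* WQ_MEM_RECLAIM guarantees a rescuer thread for this workqueue. */
	kernfs_notify_wq = alloc_workqueue("kernfs_notify", WQ_MEM_RECLAIM, 0);
	if (!kernfs_notify_wq)
		return -ENOMEM;
	return 0;
}

void kernfs_notify(struct kernfs_node *kn)
{
	...
	/* was: schedule_work(&kernfs_notify_work), which runs on system_wq */
	queue_work(kernfs_notify_wq, &kernfs_notify_work);
}
```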
> I'm not necessarily against the change but please go into a bit more detail
> on how and why it's structured this way, and add a comment explaining who's
> depending on kernfs notify for reclaim forward progress.
"kthreadd" was doing memory reclaim and stuck by md flush work, md flush
work was stuck by md_write_start, where it was waiting
"MD_SB_CHANGE_PENDING" flag to be cleared, before waiting, it invoked
kernefs_notify to wake up userspace daemon which should update the
meatadata and clear the flag.
PID: 2 TASK: ffff8df829539e40 CPU: 103 COMMAND: "kthreadd"
#0 [ffffaf14800f3220] __schedule at ffffffff9488cbac
#1 [ffffaf14800f32c0] schedule at ffffffff9488d1c6
#2 [ffffaf14800f32d8] schedule_timeout at ffffffff948916e6
#3 [ffffaf14800f3360] wait_for_completion at ffffffff9488ddeb
#4 [ffffaf14800f33c8] flush_work at ffffffff940b5103
#5 [ffffaf14800f3448] xlog_cil_force_lsn at ffffffffc0571791 [xfs]
#6 [ffffaf14800f34e8] _xfs_log_force_lsn at ffffffffc056f79f [xfs]
#7 [ffffaf14800f3570] xfs_log_force_lsn at ffffffffc056fa8c [xfs]
#8 [ffffaf14800f35a8] __dta___xfs_iunpin_wait_3444 at ffffffffc05595c4 [xfs]
#9 [ffffaf14800f3620] xfs_iunpin_wait at ffffffffc055c229 [xfs]
#10 [ffffaf14800f3630] __dta_xfs_reclaim_inode_3358 at ffffffffc054f8cc [xfs]
#11 [ffffaf14800f3680] xfs_reclaim_inodes_ag at ffffffffc054fd56 [xfs]
#12 [ffffaf14800f3818] xfs_reclaim_inodes_nr at ffffffffc0551013 [xfs]
#13 [ffffaf14800f3838] xfs_fs_free_cached_objects at ffffffffc0565469 [xfs]
#14 [ffffaf14800f3848] super_cache_scan at ffffffff942959a7
#15 [ffffaf14800f38a0] shrink_slab at ffffffff941fa935
#16 [ffffaf14800f3988] shrink_node at ffffffff942005d8
#17 [ffffaf14800f3a10] do_try_to_free_pages at ffffffff94200ae2
#18 [ffffaf14800f3a78] try_to_free_pages at ffffffff94200e89
#19 [ffffaf14800f3b00] __alloc_pages_slowpath at ffffffff941ed82c
#20 [ffffaf14800f3c20] __alloc_pages_nodemask at ffffffff941ee191
#21 [ffffaf14800f3c90] __vmalloc_node_range at ffffffff9423a8e7
#22 [ffffaf14800f3d00] copy_process at ffffffff94096670
#23 [ffffaf14800f3de8] _do_fork at ffffffff94097f30
#24 [ffffaf14800f3e68] kernel_thread at ffffffff94098219
#25 [ffffaf14800f3e78] kthreadd at ffffffff940bd4e5
#26 [ffffaf14800f3f50] ret_from_fork at ffffffff94a00354

PID: 852 TASK: ffff8e351fc51e40 CPU: 77 COMMAND: "md"
#0 [ffffaf148e983c68] __schedule at ffffffff9488cbac
#1 [ffffaf148e983d08] schedule at ffffffff9488d1c6
#2 [ffffaf148e983d20] md_write_start at ffffffff9469fc75
#3 [ffffaf148e983d80] raid1_make_request at ffffffffc038d8bd [raid1]
#4 [ffffaf148e983da8] md_handle_request at ffffffff9469cc24
#5 [ffffaf148e983e18] md_submit_flush_data at ffffffff9469cce1
#6 [ffffaf148e983e38] process_one_work at ffffffff940b5bd9
#7 [ffffaf148e983e80] rescuer_thread at ffffffff940b6334
#8 [ffffaf148e983f08] kthread at ffffffff940bc245
#9 [ffffaf148e983f50] ret_from_fork at ffffffff94a00354

bool md_write_start(struct mddev *mddev, struct bio *bi)
{
...
>>> process 852 enters the "if" branch and sets "MD_SB_CHANGE_PENDING"
	if (mddev->in_sync || mddev->sync_checkers) {
		spin_lock(&mddev->lock);
		if (mddev->in_sync) {
			mddev->in_sync = 0;
			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
			set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
			md_wakeup_thread(mddev->thread);
			did_change = 1;
		}
		spin_unlock(&mddev->lock);
	}
	rcu_read_unlock();
>>> invokes kernfs_notify to wake up the userspace daemon
	if (did_change)
		sysfs_notify_dirent_safe(mddev->sysfs_state);
	if (!mddev->has_superblocks)
		return true;
>>> hangs here waiting for the userspace daemon to clear the flag
	wait_event(mddev->sb_wait,
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
		   is_md_suspended(mddev));
	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
		percpu_ref_put(&mddev->writes_pending);
		return false;
	}
	return true;
}
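
On the userspace side, kernfs_notify surfaces as a poll() wakeup
(POLLPRI/POLLERR) on the sysfs attribute, which is how a daemon like mdmon
is expected to notice the state change. A minimal sketch of that wait, with
the path and error handling simplified for illustration (this is not mdmon's
actual code):

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a sysfs_notify()/kernfs_notify() event on a sysfs attribute.
 * sysfs requires an initial read before poll() will report a change.
 * Returns poll()'s result: >0 if the attribute changed, 0 on timeout,
 * -1 on error. Illustrative sketch only; mdmon's real loop differs.
 */
int wait_for_sysfs_event(const char *path, int timeout_ms)
{
	char buf[64];
	struct pollfd pfd;
	int ret;

	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Consume current contents so poll() waits for the next change. */
	(void)read(fd, buf, sizeof(buf));

	pfd.fd = fd;
	pfd.events = POLLPRI; /* sysfs signals changes as POLLPRI|POLLERR */
	ret = poll(&pfd, 1, timeout_ms);

	close(fd);
	return ret;
}
```

A daemon would poll an attribute such as /sys/block/md0/md/array_state with
an infinite timeout, re-read it on wakeup, write out the metadata, and clear
the pending state so md_write_start's wait_event can proceed.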
Thanks,
Junxiao.
>
> Thanks.
>