linux-kernel - Re: [PATCH] kernfs: support kernfs notify in memory recliam context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c71f1cb7-14d6-45e4-9df1-dc9bc82deda8@oracle.com>
Date:   Tue, 14 Nov 2023 12:09:19 -0800
From:   junxiao.bi@...cle.com
To:     Tejun Heo <tj@...nel.org>
Cc:     linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
        gregkh@...uxfoundation.org
Subject: Re: [PATCH] kernfs: support kernfs notify in memory recliam context

On 11/14/23 11:06 AM, Tejun Heo wrote:

> Hello,
>
> On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
>> kernfs notify is used in write path of md (md_write_start) to wake up
>> userspace daemon, like "mdmon" for updating md superblock of imsm raid,
>> md write will wait for that update done before issuing the write, if this
> How is forward progress guarnateed for that userspace daemon? This sounds
> like a really fragile setup.

For imsm raid, userspace daemon "mdmon" is responsible for updating raid 
metadata, kernel will use kernfs_notify to wake up the daemon anywhere 
metadata update is required. If the daemon can't move forward, write may 
hung, but that will be a bug in the daemon?

>
>> write is used for memory reclaim, the system may hung due to kernel notify
>> can't be executed, that's because kernel notify is executed by "system_wq"
>> which doesn't have a rescuer thread and kworker thread may not be created
>> due to memory pressure, then userspace daemon can't be woke up and md write
>> will hung.
>>
>> According Tejun, this can't be fixed by add RECLAIM to "system_wq" because
>> that workqueue is shared and someone else might occupy that rescuer thread,
>> to fix this from md side, have to replace kernfs notify with other way to
>> communite with userspace daemon, that will break userspace interface,
>> so use a separated workqueue for kernefs notify to allow it be used in
>> memory reclaim context.
> I'm not necessarily against the change but please go into a bit more details
> on how and why it's structured this way and add a comment explaining
> explaining who's depending on kernfs notify for reclaim forward progress.

"kthreadd" was doing memory reclaim and stuck by md flush work, md flush 
work was stuck by md_write_start, where it was waiting 
"MD_SB_CHANGE_PENDING" flag to be cleared, before waiting, it invoked 
kernefs_notify to wake up userspace daemon which should update the 
meatadata and clear the flag.



PID: 2        TASK: ffff8df829539e40  CPU: 103  COMMAND: "kthreadd"
  #0 [ffffaf14800f3220] __schedule at ffffffff9488cbac
  #1 [ffffaf14800f32c0] schedule at ffffffff9488d1c6
  #2 [ffffaf14800f32d8] schedule_timeout at ffffffff948916e6
  #3 [ffffaf14800f3360] wait_for_completion at ffffffff9488ddeb
  #4 [ffffaf14800f33c8] flush_work at ffffffff940b5103
  #5 [ffffaf14800f3448] xlog_cil_force_lsn at ffffffffc0571791 [xfs]
  #6 [ffffaf14800f34e8] _xfs_log_force_lsn at ffffffffc056f79f [xfs]
  #7 [ffffaf14800f3570] xfs_log_force_lsn at ffffffffc056fa8c [xfs]
  #8 [ffffaf14800f35a8] __dta___xfs_iunpin_wait_3444 at ffffffffc05595c4 
[xfs]
  #9 [ffffaf14800f3620] xfs_iunpin_wait at ffffffffc055c229 [xfs]
#10 [ffffaf14800f3630] __dta_xfs_reclaim_inode_3358 at ffffffffc054f8cc 
[xfs]
#11 [ffffaf14800f3680] xfs_reclaim_inodes_ag at ffffffffc054fd56 [xfs]
#12 [ffffaf14800f3818] xfs_reclaim_inodes_nr at ffffffffc0551013 [xfs]
#13 [ffffaf14800f3838] xfs_fs_free_cached_objects at ffffffffc0565469 [xfs]
#14 [ffffaf14800f3848] super_cache_scan at ffffffff942959a7
#15 [ffffaf14800f38a0] shrink_slab at ffffffff941fa935
#16 [ffffaf14800f3988] shrink_node at ffffffff942005d8
#17 [ffffaf14800f3a10] do_try_to_free_pages at ffffffff94200ae2
#18 [ffffaf14800f3a78] try_to_free_pages at ffffffff94200e89
#19 [ffffaf14800f3b00] __alloc_pages_slowpath at ffffffff941ed82c
#20 [ffffaf14800f3c20] __alloc_pages_nodemask at ffffffff941ee191
#21 [ffffaf14800f3c90] __vmalloc_node_range at ffffffff9423a8e7
#22 [ffffaf14800f3d00] copy_process at ffffffff94096670
#23 [ffffaf14800f3de8] _do_fork at ffffffff94097f30
#24 [ffffaf14800f3e68] kernel_thread at ffffffff94098219
#25 [ffffaf14800f3e78] kthreadd at ffffffff940bd4e5
#26 [ffffaf14800f3f50] ret_from_fork at ffffffff94a00354


PID: 852      TASK: ffff8e351fc51e40  CPU: 77   COMMAND: "md"
  #0 [ffffaf148e983c68] __schedule at ffffffff9488cbac
  #1 [ffffaf148e983d08] schedule at ffffffff9488d1c6
  #2 [ffffaf148e983d20] md_write_start at ffffffff9469fc75
  #3 [ffffaf148e983d80] raid1_make_request at ffffffffc038d8bd [raid1]
  #4 [ffffaf148e983da8] md_handle_request at ffffffff9469cc24
  #5 [ffffaf148e983e18] md_submit_flush_data at ffffffff9469cce1
  #6 [ffffaf148e983e38] process_one_work at ffffffff940b5bd9
  #7 [ffffaf148e983e80] rescuer_thread at ffffffff940b6334
  #8 [ffffaf148e983f08] kthread at ffffffff940bc245
  #9 [ffffaf148e983f50] ret_from_fork at ffffffff94a00354


bool md_write_start(struct mddev *mddev, struct bio *bi)
{
     ...

    >>> process 852 go into the "if" and set "MD_SB_CHANGE_PENDING"

     if (mddev->in_sync || mddev->sync_checkers) {
         spin_lock(&mddev->lock);
         if (mddev->in_sync) {
             mddev->in_sync = 0;
             set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
             set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
             md_wakeup_thread(mddev->thread);
             did_change = 1;
         }
         spin_unlock(&mddev->lock);
     }
     rcu_read_unlock();

     >>> invoke kernfs_notify to wake up userspace daemon

     if (did_change)
         sysfs_notify_dirent_safe(mddev->sysfs_state);
     if (!mddev->has_superblocks)
         return true;

    >>>> hung here waiting userspace daemon clear that flag.

     wait_event(mddev->sb_wait,
            !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
            is_md_suspended(mddev));
     if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
         percpu_ref_put(&mddev->writes_pending);
         return false;
     }
     return true;
}

Thanks,

Junxiao.

>
> Thanks.
>