linux-kernel - RE: [PATCH] [RFC] rt: kernel/sched/core: fix kthread_park() pending too long when CPU un-plugged

Message-ID: <AM6PR04MB54138E0D36C60ECA1E44242BF1AF0@AM6PR04MB5413.eurprd04.prod.outlook.com>
Date:   Thu, 7 Jan 2021 09:13:32 +0000
From:   Ran Wang <ran.wang_1@....com>
To:     Ran Wang <ran.wang_1@....com>,
        Sebastian Siewior <bigeasy@...utronix.de>,
        Thomas Gleixner <tglx@...utronix.de>
CC:     Jiafei Pan <jiafei.pan@....com>,
        "linux-rt-users@...r.kernel.org" <linux-rt-users@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] [RFC] rt: kernel/sched/core: fix kthread_park() pending
 too long when CPU un-plugged

Hi,

On Thursday, January 7, 2021 5:19 PM, Ran Wang wrote:
> 
> When doing a CPU un-plug stress test, smpboot_park_threads() gets called to park kernel threads (including ksoftirqd) on
> that CPU core, and wait_task_inactive() yields for those queued
> task(s) by calling schedule_hrtimeout() with mode HRTIMER_MODE_REL.
> 
> stack trace:
> ...
> smpboot_thread_fn
>     cpuhp_thread_fun
>         cpuhp_invoke_callback
>             smpboot_park_threads
>               smpboot_park_thread: ksoftirqd/1
>                 kthread_park
>                   wait_task_inactive
>                      schedule_hrtimeout
> 
> However, when PREEMPT_RT is set, this can leave kthread_park() pending, because
> schedule_hrtimeout() depends on the ksoftirqd thread to complete the related work when it uses HRTIMER_MODE_SOFT. So force
> HRTIMER_MODE_HARD in that case.

This issue was observed on an LX2160ARDB (arm64, 16 A72 cores) with PREEMPT_RT selected;
the non-RT kernel works fine. I could verify the fix on both linux-5.6.y-rt and linux-5.4.y-rt.
On linux-5.9.y-rt and linux-5.10.y-rt, however, there appear to be other issues that currently block
verification. Below are the steps to reproduce the issue:

1. Kernel menuconfig:
CONFIG_QORIQ_CPUFREQ=y

CONFIG_HAVE_PREEMPT_LAZY=y
CONFIG_PREEMPT_LAZY=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_RT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
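
To double-check that a built kernel really carries the options above, something like the following could be used. This is only a sketch: check_rt_config is a hypothetical helper name, and the path to the config file is whatever your build produced.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): report which of the options
# listed above are not set to "y" in a given kernel config file.
# Returns 0 when all are present, 1 otherwise.
check_rt_config() {
    config_file="$1"
    missing=0
    for opt in CONFIG_QORIQ_CPUFREQ CONFIG_HAVE_PREEMPT_LAZY \
               CONFIG_PREEMPT_LAZY CONFIG_PREEMPT_RT \
               CONFIG_PREEMPT_COUNT CONFIG_PREEMPTION; do
        # Match the exact "OPTION=y" line; "# OPTION is not set" does not match.
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: ${opt}"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_rt_config .config` in the build directory prints one line per missing option and prints nothing when all of them are set.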

2. Shell commands (the issue occurs within roughly 400 rounds of the loop below)
echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu12/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu13/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu14/cpufreq/scaling_governor
echo ondemand > /sys/devices/system/cpu/cpu15/cpufreq/scaling_governor

count=1
while [ $? -eq 0 ]
do
        echo "$count th test"
        sleep 3
        let "count=count+1"

        echo 0 > /sys/devices/system/cpu/cpu0/online
        echo 0 > /sys/devices/system/cpu/cpu1/online
        echo 0 > /sys/devices/system/cpu/cpu2/online
        echo 0 > /sys/devices/system/cpu/cpu3/online
        echo 0 > /sys/devices/system/cpu/cpu4/online
        echo 0 > /sys/devices/system/cpu/cpu5/online
        echo 0 > /sys/devices/system/cpu/cpu6/online
        echo 0 > /sys/devices/system/cpu/cpu7/online
        echo 0 > /sys/devices/system/cpu/cpu8/online
        echo 0 > /sys/devices/system/cpu/cpu9/online
        echo 0 > /sys/devices/system/cpu/cpu10/online
        echo 0 > /sys/devices/system/cpu/cpu11/online
        echo 0 > /sys/devices/system/cpu/cpu12/online
        echo 0 > /sys/devices/system/cpu/cpu13/online
        echo 0 > /sys/devices/system/cpu/cpu14/online

        echo 1 > /sys/devices/system/cpu/cpu0/online
        echo 1 > /sys/devices/system/cpu/cpu1/online
        echo 1 > /sys/devices/system/cpu/cpu2/online
        echo 1 > /sys/devices/system/cpu/cpu3/online
        echo 1 > /sys/devices/system/cpu/cpu4/online
        echo 1 > /sys/devices/system/cpu/cpu5/online
        echo 1 > /sys/devices/system/cpu/cpu6/online
        echo 1 > /sys/devices/system/cpu/cpu7/online
        echo 1 > /sys/devices/system/cpu/cpu8/online
        echo 1 > /sys/devices/system/cpu/cpu9/online
        echo 1 > /sys/devices/system/cpu/cpu10/online
        echo 1 > /sys/devices/system/cpu/cpu11/online
        echo 1 > /sys/devices/system/cpu/cpu12/online
        echo 1 > /sys/devices/system/cpu/cpu13/online
        echo 1 > /sys/devices/system/cpu/cpu14/online
done
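
The same steps can also be written with loops instead of one line per CPU. The following is a sketch under a few assumptions: CPU_SYSFS and ROUND_DELAY are hypothetical override variables (added so the loop can be dry-run against a scratch directory), the function names are made up for illustration, and the round count is bounded by an argument, whereas the loop above runs until a write fails.

```shell
#!/bin/sh
# CPU_SYSFS is a hypothetical override; by default it is the real sysfs tree.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Write the same value to a per-CPU file for CPUs first..last.
set_cpus() {
    value="$1"; file="$2"; first="$3"; last="$4"
    cpu="$first"
    while [ "$cpu" -le "$last" ]; do
        echo "$value" > "$CPU_SYSFS/cpu$cpu/$file" || return 1
        cpu=$((cpu + 1))
    done
}

# One round: take CPUs 0-14 offline, then bring them back online.
# CPU 15 stays online throughout, as in the script above.
hotplug_round() {
    set_cpus 0 online 0 14 && set_cpus 1 online 0 14
}

# Run up to max_rounds rounds, stopping early if a write fails.
stress() {
    max_rounds="$1"
    count=1
    while [ "$count" -le "$max_rounds" ]; do
        echo "$count th test"
        sleep "${ROUND_DELAY:-3}"
        hotplug_round || return 1
        count=$((count + 1))
    done
}
```

On the board this would be run as root, for example `set_cpus ondemand cpufreq/scaling_governor 0 15` followed by `stress 400`.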

To be honest, I am not sure how the non-RT kernel avoids this issue. Could anybody give some input or suggestions on this?
Thank you.

Regards,
Ran
 
> Suggested-by: Jiafei Pan <jiafei.pan@....com>
> Signed-off-by: Ran Wang <ran.wang_1@....com>
> ---
>  kernel/sched/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 792da55..4cc742a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2054,10 +2054,15 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
>  			ktime_t to = NSEC_PER_SEC / HZ;
> 
>  			set_current_state(TASK_UNINTERRUPTIBLE);
> -			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
> +
> +			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +			    !strncmp(p->comm, "ksoftirqd/", 10))
> +				schedule_hrtimeout(&to,
> +					HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> +			else
> +				schedule_hrtimeout(&to, HRTIMER_MODE_REL);
>  			continue;
>  		}
> -
>  		/*
>  		 * Ahh, all good. It wasn't running, and it wasn't
>  		 * runnable, which means that it will never become
> --
> 2.7.4