Message-ID: <155adb21-be6e-533c-02f8-600a1e9138f8@huawei.com>
Date: Thu, 29 Jun 2023 09:41:59 +0800
From: Xiongfeng Wang <wangxiongfeng2@...wei.com>
To: Thomas Gleixner <tglx@...utronix.de>,
Vincent Guittot <vincent.guittot@...aro.org>
CC: <vschneid@...hat.com>, Phil Auld <pauld@...hat.com>,
<vdonnefort@...gle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Wei Li <liwei391@...wei.com>,
"liaoyu (E)" <liaoyu15@...wei.com>, <zhangqiao22@...wei.com>,
Peter Zijlstra <peterz@...radead.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...nel.org>
Subject: Re: [Question] report a race condition between CPU hotplug state
machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On 2023/6/29 6:01, Thomas Gleixner wrote:
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
>> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <tglx@...utronix.de> wrote:
>>> No, because this is fundamentally wrong.
>>>
>>> If the CPU is on the way out, then the scheduler hotplug machinery
>>> has to handle the period timer so that the problem Xiongfeng analyzed
>>> does not happen in the first place.
>>
>> But the hrtimer was enqueued before the cpu started to go offline
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
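Side note on "for whatever reason": the cfs period timer is armed pinned,
so it stays on whichever cpu happened to run start_cfs_bandwidth(), which
is how it can end up queued on the outgoing cpu. Quoting
kernel/sched/fair.c from memory, so modulo version drift:

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	lockdep_assert_held(&cfs_b->lock);

	if (cfs_b->period_active)
		return;

	cfs_b->period_active = 1;
	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}
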
>> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
>> of the outgoing cpu, but:
>> - it must run on another target cpu to migrate the hrtimer.
>> - it runs in the context of the caller, which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
>>> sched_cpu_wait_empty() would be the obvious place to clean up armed CFS
>>> timers, but let me look into whether we can migrate hrtimers early in
>>> general.
>>
>> but for that we must check whether the timer is enqueued on the outgoing
>> cpu, and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.
>
> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider altogether.
>
> Something like the untested below should just work.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
>  	return err;
>  }
> 
> +static long __cpu_device_down(void *arg)
> +{
> +	struct device *dev = arg;
> +
> +	return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
>  /**
>   * cpu_device_down - Bring down a cpu device
>   * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
>   */
>  int cpu_device_down(struct device *dev)
>  {
> -	return cpu_down(dev->id, CPUHP_OFFLINE);
> +	unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> +	if (cpu >= nr_cpu_ids)
> +		return -EBUSY;
> +
> +	return work_on_cpu(cpu, __cpu_device_down, dev);
>  }
> 
>  int remove_cpu(unsigned int cpu)
>
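If I read the patch right, the actual cpu_down() now always runs out of a
kworker on some other online cpu via work_on_cpu(), so a throttled user
space task can no longer stall the hotplug machinery with its own context.
A minimal standalone sketch of the same offloading pattern, as a
hypothetical demo module (the offload_demo_* names are made up here; this
is not the kernel change itself):

#include <linux/module.h>
#include <linux/cpumask.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

/* Runs in a kworker bound to the chosen cpu, not in the (possibly
 * cfs-throttled) task that kicked it off. */
static long offload_demo_fn(void *arg)
{
	unsigned int excluded = *(unsigned int *)arg;

	pr_info("running on cpu %u, away from cpu %u\n",
		smp_processor_id(), excluded);
	return 0;
}

static int __init offload_demo_init(void)
{
	/* As in the patch: pick any online cpu other than the one we
	 * want to keep out of the picture (here simply the current one). */
	unsigned int excluded = raw_smp_processor_id();
	unsigned int cpu = cpumask_any_but(cpu_online_mask, excluded);

	if (cpu >= nr_cpu_ids)
		return -EBUSY;	/* no other online cpu left */

	/* work_on_cpu() is synchronous, so a stack pointer is safe here. */
	return (int)work_on_cpu(cpu, offload_demo_fn, &excluded);
}

static void __exit offload_demo_exit(void)
{
}

module_init(offload_demo_init);
module_exit(offload_demo_exit);
MODULE_LICENSE("GPL");
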
I tested with the following kernel modification, which helps reproduce the
issue. The hung task does not happen any more. Thanks a lot.
Thanks,
Xiongfeng
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -110,6 +110,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#include <linux/delay.h>
+
 /*
  * Minimum number of threads to boot the kernel
  */
@@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
 	struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
 	int i;
 
+	mdelay(2000);
+	cond_resched();
+
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
 		struct vm_struct *vm_stack = cached_vm_stacks[i];
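
(For context on the repro: free_vm_stack_cache() is a hotplug teardown
callback that runs in the offlining task's context, so the mdelay(2000)
burns through that task's cfs runtime and the cond_resched() then gives
bandwidth throttling a chance to throttle it right in the middle of the
hotplug sequence. This presumes the offlining task runs with a tight cfs
quota, e.g. inside a cpu cgroup.)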