linux-kernel - Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period


Message-ID: <CAKfTPtC6eVhnQh+SeiPLqVCWDLW5PnXmT+7LkZ8iPDh8_QvUeA@mail.gmail.com>
Date:   Thu, 29 Jun 2023 10:33:49 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Xiongfeng Wang <wangxiongfeng2@...wei.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>, vschneid@...hat.com,
        Phil Auld <pauld@...hat.com>, vdonnefort@...gle.com,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Wei Li <liwei391@...wei.com>,
        "liaoyu (E)" <liaoyu15@...wei.com>, zhangqiao22@...wei.com,
        Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...nel.org>
Subject: Re: [Question] report a race condition between CPU hotplug state
 machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Thu, 29 Jun 2023 at 03:26, Xiongfeng Wang <wangxiongfeng2@...wei.com> wrote:
>
>
>
> On 2023/6/28 0:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <wangxiongfeng2@...wei.com> wrote:
> >>
> >> Hi,
> >>
> >> Kindly ping~
> >> Could you please take a look at this issue and the below temporary fix ?
> >>
> >> Thanks,
> >> Xiongfeng
> >>
> >> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >>>
> >>>
> >>> On 2023/6/9 22:55, Thomas Gleixner wrote:
> >>>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>>>
> >>>> Cc+ scheduler people, leave context intact
> >>>>
> >>>>> Hello,
> >>>>>  When I do some low power tests, the following hung task is printed.

[...]

> >>> diff --cc kernel/sched/fair.c
> >>> index d9d6519fae01,bd6624353608..000000000000
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >>>   {
> >>>         lockdep_assert_held(&cfs_b->lock);
> >>>
> >>> -       if (cfs_b->period_active)
> >>> +       if (cfs_b->period_active) {
> >>> +               struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >>> +               int cpu = clock_base->cpu_base->cpu;
> >>> +               if (!cpu_active(cpu) && cpu != smp_processor_id())
> >>> +                       hrtimer_start_expires(&cfs_b->period_timer,
> >>> +                                             HRTIMER_MODE_ABS_PINNED);
> >>>                 return;
> >>> +       }
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
>
> Sorry, I forgot to provide the kernel modification to help reproduce the issue.
> At first, the issue could only be reproduced in the product environment with a
> product stress testcase. After figuring out the reason, I added the following
> modification. It makes sure the process runs out of cfs quota and can be
> scheduled out in free_vm_stack_cache(). Although the real schedule point is in
> __vunmap(), this also shows that the issue exists.

I have been able to reproduce the problem (or at least something
similar) without your change below, using a shorter cfs_quota_us and
other tasks always running in the cgroup.

>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0fb86b65ae60..3b2d83fb407a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -110,6 +110,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/task.h>
>
> +#include <linux/delay.h>
> +
>  /*
>   * Minimum number of threads to boot the kernel
>   */
> @@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
>         struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
>         int i;
>
> +       mdelay(2000);
> +       cond_resched();
> +
>         for (i = 0; i < NR_CACHED_STACKS; i++) {
>                 struct vm_struct *vm_stack = cached_vm_stacks[i];
>
> Thanks,
> Xiongfeng
>
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
> >
> >
> >>>
> >>>         cfs_b->period_active = 1;
> >>>  -
> >>>         hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >>>         hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >>>   }
> >>>
> >>> Thanks,
> >>> Xiongfeng
> >>>
> >>> .
> >>>
> > .
> >
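
To make the helper question above concrete, here is a minimal user-space
sketch of what such an accessor could look like. The struct definitions are
simplified stand-ins, not the kernel's real ones, and both function names
(hrtimer_base_cpu, period_timer_needs_rearm) are hypothetical; nothing like
this is claimed to exist in the hrtimer API. The second function only mirrors
the condition used in the quoted temporary fix.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-ins for the kernel structures touched by the quoted
 * patch. Only the fields on the timer -> clock base -> cpu base path are
 * modeled; the real definitions contain much more.
 */
struct hrtimer_cpu_base {
	unsigned int cpu;
};

struct hrtimer_clock_base {
	struct hrtimer_cpu_base *cpu_base;
};

struct hrtimer {
	struct hrtimer_clock_base *base;
};

/*
 * Hypothetical helper (the name is an assumption): return the CPU whose
 * hrtimer base the timer currently belongs to, so callers do not have to
 * dereference timer->base->cpu_base themselves.
 */
static inline unsigned int hrtimer_base_cpu(const struct hrtimer *timer)
{
	return timer->base->cpu_base->cpu;
}

/*
 * Mirrors the check in the quoted temporary fix: the period timer is
 * restarted only when the CPU holding it is no longer active and is not
 * the CPU we are running on. base_cpu_active stands in for
 * cpu_active(cpu), this_cpu for smp_processor_id().
 */
static inline bool period_timer_needs_rearm(const struct hrtimer *timer,
					    bool base_cpu_active,
					    unsigned int this_cpu)
{
	return !base_cpu_active && hrtimer_base_cpu(timer) != this_cpu;
}
```

With such a helper, the open-coded clock_base->cpu_base->cpu lookup in
start_cfs_bandwidth() would reduce to a single call, keeping hrtimer
internals out of kernel/sched/fair.c.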