linux-kernel - Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period


Message-ID: <CAKfTPtAzTy4KPrBNRA4cMeTonxn5EKLEAg0b9iH5ecJkAMEStw@mail.gmail.com>
Date:   Wed, 28 Jun 2023 14:35:50 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Xiongfeng Wang <wangxiongfeng2@...wei.com>, vschneid@...hat.com,
        Phil Auld <pauld@...hat.com>, vdonnefort@...gle.com,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Wei Li <liwei391@...wei.com>,
        "liaoyu (E)" <liaoyu15@...wei.com>, zhangqiao22@...wei.com,
        Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...nel.org>
Subject: Re: [Question] report a race condition between CPU hotplug state
 machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Tue, Jun 27 2023 at 18:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <wangxiongfeng2@...wei.com> wrote:
> >> > diff --cc kernel/sched/fair.c
> >> > index d9d6519fae01,bd6624353608..000000000000
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >> >   {
> >> >         lockdep_assert_held(&cfs_b->lock);
> >> >
> >> > -       if (cfs_b->period_active)
> >> > +       if (cfs_b->period_active) {
> >> > +               struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >> > +               int cpu = clock_base->cpu_base->cpu;
> >> > +               if (!cpu_active(cpu) && cpu != smp_processor_id())
> >> > +                       hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >> >                 return;
> >> > +       }
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
> No, because this is fundamentally wrong.
>
> If the CPU is on the way out, then the scheduler hotplug machinery
> has to handle the period timer so that the problem Xiongfeng analyzed
> does not happen in the first place.

But the hrtimer was enqueued before the CPU started going offline.
hrtimers_dead_cpu() should then take care of migrating the hrtimer off
the outgoing CPU, but:
- it must run on another (target) CPU to migrate the hrtimer;
- it runs in the context of the caller, which can be throttled.
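
To make the first constraint concrete, here is a minimal userspace sketch,
not kernel code. The names toy_cpu_base, toy_enqueue and toy_migrate_timers,
and the fixed sizes, are assumptions made up for illustration. It only models
the rule that the migration must be executed from a CPU other than the dead
one; it does not model the second constraint (the caller being throttled
before it gets to run the migration).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS    4   /* assumption: small fixed CPU count for the model */
#define TOY_MAX_TIMERS 8   /* assumption: per-CPU queue capacity */

/* Toy stand-in for a per-CPU timer base; not the kernel structure. */
struct toy_cpu_base {
    int timers[TOY_MAX_TIMERS]; /* ids of timers queued on this CPU */
    size_t nr;                  /* number of queued timers */
};

static struct toy_cpu_base toy_bases[TOY_NR_CPUS];

/* Queue a timer id on the given CPU's base. */
static void toy_enqueue(int cpu, int timer_id)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    assert(toy_bases[cpu].nr < TOY_MAX_TIMERS);
    toy_bases[cpu].timers[toy_bases[cpu].nr++] = timer_id;
}

/*
 * Move every timer from dead_cpu to the CPU that executes the migration.
 * Returns false, and moves nothing, when asked to run on the dead CPU itself.
 */
static bool toy_migrate_timers(int running_cpu, int dead_cpu)
{
    assert(running_cpu >= 0 && running_cpu < TOY_NR_CPUS);
    assert(dead_cpu >= 0 && dead_cpu < TOY_NR_CPUS);

    if (running_cpu == dead_cpu)
        return false;

    for (size_t i = 0; i < toy_bases[dead_cpu].nr; i++)
        toy_enqueue(running_cpu, toy_bases[dead_cpu].timers[i]);
    toy_bases[dead_cpu].nr = 0;
    return true;
}
```

In this model a timer queued on CPU 3 stays there until some other CPU
actually gets to call toy_migrate_timers(); if that caller never runs, the
timer is never moved.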

>
> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
> timers, but let me look into whether we can migrate hrtimers early in
> general.

But for that we must check whether the timer is enqueued on the outgoing
CPU, and we then need to choose a target CPU.
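
The two steps named above (detect that the timer sits on the outgoing CPU,
then pick a target) could look roughly like the following userspace sketch.
Everything here is hypothetical: sketch_timer, sketch_needs_migration and
sketch_pick_target are illustration-only names, the online CPUs are modelled
as a plain bitmask, and "first online CPU that is not the outgoing one" is
just one possible selection policy, not what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 8   /* assumption: CPUs 0..7, one bit each in the mask */

/* Hypothetical, simplified view of a timer: is it queued, and on which CPU. */
struct sketch_timer {
    bool enqueued;
    int base_cpu;
};

/* Step 1: does this timer sit on the CPU that is going away? */
static bool sketch_needs_migration(const struct sketch_timer *t, int outgoing_cpu)
{
    return t->enqueued && t->base_cpu == outgoing_cpu;
}

/*
 * Step 2: pick a target. Returns the lowest-numbered online CPU other than
 * outgoing_cpu, or -1 if there is none.
 */
static int sketch_pick_target(unsigned int online_mask, int outgoing_cpu)
{
    assert(outgoing_cpu >= 0 && outgoing_cpu < SKETCH_NR_CPUS);

    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
        if (cpu != outgoing_cpu && (online_mask & (1u << cpu)))
            return cpu;
    }
    return -1;
}
```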

>
> Aside of that the above is wrong by itself.
>
>         if (cfs_b->period_active)
>                 hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>
> This only ends up on the outgoing CPU if either:
>
>    1) The code runs on the outgoing CPU
>
> or
>
>    2) The hrtimer is concurrently executing the hrtimer callback on the
>       outgoing CPU.
>
> So this:
>
>         if (cfs_b->period_active) {
>                 struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>                 int cpu = clock_base->cpu_base->cpu;
>
>                 if (!cpu_active(cpu) && cpu != smp_processor_id())
>                         hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>                 return;
>         }
>
> only works, if
>
>   1) The code runs _not_ on the outgoing CPU
>
> and
>
>   2) The hrtimer is _not_ concurrently executing the hrtimer callback on
>      the outgoing CPU.
>
>      If the callback is executing (it spins on cfs_b->lock), then the
>      timer is requeued on the outgoing CPU. Not what you want, right?
>
> Plus accessing hrtimer->clock_base->cpu_base->cpu lockless is fragile at
> best.
>
> Thanks,
>
>         tglx
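
The two conditions in the quoted analysis can be written down as a small
truth-table model. This is a userspace sketch under stated assumptions, not
kernel code: model_pinned_target encodes only the placement rule as described
in the quoted mail (a pinned start lands on the calling CPU unless the
callback is currently running, in which case the timer stays on its current
CPU), and all three function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Placement rule as described in the quoted mail (assumption of this model):
 * a pinned start queues the timer on the calling CPU, unless the callback is
 * executing right now, in which case it is requeued on the timer's own CPU.
 */
static int model_pinned_target(int this_cpu, int timer_cpu, bool callback_running)
{
    assert(this_cpu >= 0 && timer_cpu >= 0);
    return callback_running ? timer_cpu : this_cpu;
}

/* Guard from the proposed patch: re-arm only if the timer's CPU is inactive
 * and is not the CPU we are running on. */
static bool model_patch_rearms(int this_cpu, int timer_cpu, bool timer_cpu_active)
{
    return !timer_cpu_active && timer_cpu != this_cpu;
}

/* True only if the patch really moves the timer off the outgoing CPU. */
static bool model_patch_fixes(int this_cpu, int outgoing_cpu, bool callback_running)
{
    if (!model_patch_rearms(this_cpu, outgoing_cpu, false))
        return false;
    return model_pinned_target(this_cpu, outgoing_cpu, callback_running)
           != outgoing_cpu;
}
```

In this model the patch helps only when the caller is on a different CPU and
the callback is not running; the other two combinations leave the timer on
the outgoing CPU, which matches the two "only works, if" conditions above.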