linux-kernel - Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtBoe_jRn-EMsQxssQ4BcveT+Qcd+GmsRbQEXQDGfzFOMg@mail.gmail.com>
Date:   Tue, 27 Jun 2023 18:46:54 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Xiongfeng Wang <wangxiongfeng2@...wei.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>, vschneid@...hat.com,
        Phil Auld <pauld@...hat.com>, vdonnefort@...gle.com,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Wei Li <liwei391@...wei.com>,
        "liaoyu (E)" <liaoyu15@...wei.com>, zhangqiao22@...wei.com,
        Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...nel.org>
Subject: Re: [Question] report a race condition between CPU hotplug state
 machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <wangxiongfeng2@...wei.com> wrote:
>
> Hi,
>
> Kindly ping~
> Could you please take a look at this issue and the below temporary fix ?
>
> Thanks,
> Xiongfeng
>
> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >
> >
> > On 2023/6/9 22:55, Thomas Gleixner wrote:
> >> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>
> >> Cc+ scheduler people, leave context intact
> >>
> >>> Hello,
> >>>  When I do some low power tests, the following hung task is printed.
> >>>
> >>>   Call trace:
> >>>    __switch_to+0xd4/0x160
> >>>    __schedule+0x38c/0x8c4
> >>>    __cond_resched+0x24/0x50
> >>>    unmap_kernel_range_noflush+0x210/0x240
> >>>    kretprobe_trampoline+0x0/0xc8
> >>>    __vunmap+0x70/0x31c
> >>>    __vfree+0x34/0x8c
> >>>    vfree+0x40/0x58
> >>>    free_vm_stack_cache+0x44/0x74
> >>>    cpuhp_invoke_callback+0xc4/0x71c
> >>>    _cpu_down+0x108/0x284
> >>>    kretprobe_trampoline+0x0/0xc8
> >>>    suspend_enter+0xd8/0x8ec
> >>>    suspend_devices_and_enter+0x1f0/0x360
> >>>    pm_suspend.part.1+0x428/0x53c
> >>>    pm_suspend+0x3c/0xa0
> >>>    devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> >>>    devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> >>>    devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> >>>    drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> >>>    __arm64_sys_ioctl+0xb4/0xf0
> >>>    el0_svc_common.constprop.0+0x140/0x374
> >>>    do_el0_svc+0x80/0xa0
> >>>    el0_svc+0x1c/0x28
> >>>    el0_sync_handler+0x90/0xf0
> >>>    el0_sync+0x168/0x180
> >>>
> >>> After some analysis, I found it is caused by the following race condition.
> >>>
> >>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> >>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> >>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> >>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> >>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> >>> steps, CPU2 starts the rest CPUHP step.
> >>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> >>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> >>> hrtimer because 'cfs_b->period_active' is set.
> >>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> >>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> >>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> >>> hrtimers_dead_cpu() step. So the task hungs.
> >>>
> >>>     CPU1                                                     CPU2
> >>> Task set cfs_quota
> >>> start hrtimer cfs_bandwidth 'period_timer'
> >>>                                             start to offline CPU1
> >>> CPU1 start CPUHP AP step
> >>> ...
> >>> 'period_timer' expired and re-enqueued on CPU1
> >>> ...
> >>> disable irq in take_cpu_down()
> >>> ...
> >>>                                             CPU2 start the rest CPUHP steps
> >>>                                             ...
> >>>                                           sched out in free_vm_stack_cache()
> >>>                                             wait for 'period_timer' expires
> >>>
> >>>
> >>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
> >>>
> >>> Thanks,
> >>> Xiongfeng
> >> .
> >>
> >
> > Test script:
> > taskset -cp 1 $$
> > mkdir /sys/fs/cgroup/cpu/test
> > echo $$ > /sys/fs/cgroup/cpu/test/tasks
> > echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> > echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> > taskset -cp 2 $$
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> >
> >
> > Tests show that the following modification can solve the problem of above test
> > scripts. But I am not sure if there exists some other issues.
> >
> > diff --cc kernel/sched/fair.c
> > index d9d6519fae01,bd6624353608..000000000000
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >   {
> >         lockdep_assert_held(&cfs_b->lock);
> >
> > -       if (cfs_b->period_active)
> > +       if (cfs_b->period_active) {
> > +               struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> > +               int cpu = clock_base->cpu_base->cpu;
> > +               if (!cpu_active(cpu) && cpu != smp_processor_id())
> > +                       hrtimer_start_expires(&cfs_b->period_timer,
> > HRTIMER_MODE_ABS_PINNED);
> >                 return;
> > +       }

I have been able to reproduce your problem and run your fix on top. I
still wonder if there is a
Could we have a helper from hrtimer to get the cpu of the clock_base ?


> >
> >         cfs_b->period_active = 1;
> >  -
> >         hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >         hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >   }
> >
> > Thanks,
> > Xiongfeng
> >
> > .
> >