linux-kernel - Re: [PATCH] sched/fair: Fix detection of per-CPU kthreads waking a task


Message-ID: <87zgpsb6de.mognet@arm.com>
Date:   Thu, 25 Nov 2021 15:30:05 +0000
From:   Valentin Schneider <Valentin.Schneider@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Vincent Donnefort <Vincent.Donnefort@....com>,
        peterz@...radead.org, mingo@...hat.com,
        linux-kernel@...r.kernel.org, mgorman@...hsingularity.net,
        dietmar.eggemann@....com
Subject: Re: [PATCH] sched/fair: Fix detection of per-CPU kthreads waking a task

On 25/11/21 14:23, Vincent Guittot wrote:
> On Thu, 25 Nov 2021 at 12:16, Valentin Schneider
> <Valentin.Schneider@....com> wrote:
>> I think you can still hit this on a symmetric system; let me try to
>> reformulate my other email.
>>
>> If this (non-patched) condition evaluates to true, it means the previous
>> condition
>>
>>   (available_idle_cpu(target) || sched_idle_cpu(target)) &&
>>    asym_fits_capacity(task_util, target)
>>
>> evaluated to false, so for a symmetric system target sure isn't idle.
>>
>> prev == smp_processor_id() implies prev == target, IOW prev isn't
>> idle. Now, consider:
>>
>>   p0.prev = CPU1
>>   p1.prev = CPU1
>>
>>   CPU0                     CPU1
>>   current = don't care     current = swapper/1
>>
>>   ttwu(p1)
>>     ttwu_queue(p1, CPU1)
>>     // or
>>     ttwu_queue_wakelist(p1, CPU1)
>>
>>                           hrtimer_wakeup()
>>                             wake_up_process()
>>                               ttwu()
>>                                 idle_cpu(CPU1)? no
>>
>>                                 is_per_cpu_kthread(current)? yes
>>                                 prev == smp_processor_id()? yes
>>                                 this_rq()->nr_running <= 1? yes
>>                                 => self enqueue
>>
>>                           ...
>>                           schedule_idle()
>>
>> This works if CPU0 does either a full enqueue (rq->nr_running == 1) or just
>> a wakelist enqueue (rq->ttwu_pending > 0). If there was an idle CPU3
>> around, we'd still be stacking p0 and p1 onto CPU1.
>>
>> IOW this opens a window between a remote ttwu() and the idle task invoking
>> schedule_idle() where the idle task can stack more tasks onto its CPU.
>
> Your use case above is out of the scope of this patch and has always
> been there, even for other per cpu kthreads. In such case, the wake up
> is not triggered by current (idle or another per cpu kthread) but by
> an interrupt (hrtimer in your case).

Technically the idle task didn't pass is_per_cpu_kthread(p) when that
condition was added, so this is somewhat of a "new development", but you're
right on the hardirq side of things.

> If we want to filter wakeup
> generated by interrupt context while a per cpu kthread is running, it
> would be better to fix all cases and test the running context like
> this
>

I think that could make sense - though can the idle task issue wakeups in
process context? If so that won't be sufficient. A quick audit tells me:

o rcu_nocb_flush_deferred_wakeup() happens before calling into cpuidle
o I didn't see any wakeup issued from the cpu_pm_notifier call chain
o I'm not entirely sure about flush_smp_call_function_from_idle(). I found
  this thing in RCU:

  smp_call_function_single(cpu, rcu_exp_handler)

    rcu_exp_handler()
      rcu_report_exp_rdp()
        rcu_report_exp_cpu_mult()
          __rcu_report_exp_rnp()
            swake_up_one()

IIUC if set_nr_if_polling() succeeds, the smp_call won't send an IPI and the
callback should instead be handled in that flush_smp_call_function_from_idle()
call.
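
To make the polling case concrete, here is a rough userspace model of that
decision. It is only a sketch using C11 atomics: the flag bit positions, the
model_* names and the bare flags word are stand-ins I made up for
illustration, not the kernel's thread_info layout or its actual helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit positions; the real values are arch-specific. */
#define TIF_NEED_RESCHED   (1u << 0)
#define TIF_POLLING_NRFLAG (1u << 1)

/*
 * Model of the "set need-resched if the idle task is polling" step.
 * Returns true when the target is polling (so need-resched is, or now
 * gets, set and no IPI is required), false when it is not polling.
 */
static bool model_set_nr_if_polling(atomic_uint *flags)
{
	unsigned int val = atomic_load(flags);

	for (;;) {
		if (!(val & TIF_POLLING_NRFLAG))
			return false;	/* not polling: caller must send an IPI */
		if (val & TIF_NEED_RESCHED)
			return true;	/* already flagged, nothing to do */
		/* On failure, val is reloaded with the current flags. */
		if (atomic_compare_exchange_weak(flags, &val,
						 val | TIF_NEED_RESCHED))
			return true;
	}
}

/* Hypothetical helper: would the remote call need a real IPI? */
static bool model_needs_ipi(atomic_uint *idle_flags)
{
	return !model_set_nr_if_polling(idle_flags);
}
```

With a polling target, model_needs_ipi() returns false and leaves the
need-resched bit set, which is the path where the queued callback would only
run once the idle loop flushes it.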

I'd be tempted to stick your and VincentD's conditions together, just to be
safe...
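
As a rough userspace model of what combining them would mean (a sketch only,
not kernel code): the struct fields below stand in for in_interrupt(),
is_per_cpu_kthread(current), smp_processor_id() and this_rq()->nr_running.
The cur_is_idle field and the keep_prev_* names are hypothetical, and I'm
assuming here that VincentD's condition amounts to excluding the idle task,
which this message doesn't spell out.

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel state the condition looks at. */
struct wake_ctx {
	bool in_interrupt;		/* models in_interrupt() */
	bool cur_is_pcpu_kthread;	/* models is_per_cpu_kthread(current) */
	bool cur_is_idle;		/* hypothetical: current is the idle task */
	int this_cpu;			/* models smp_processor_id() */
	unsigned int nr_running;	/* models this_rq()->nr_running */
};

/* The condition as it reads before the quoted diff. */
static bool keep_prev_unpatched(const struct wake_ctx *c, int prev)
{
	return c->cur_is_pcpu_kthread &&
	       prev == c->this_cpu &&
	       c->nr_running <= 1;
}

/* With the !in_interrupt() check from the quoted diff. */
static bool keep_prev_no_irq(const struct wake_ctx *c, int prev)
{
	return !c->in_interrupt && keep_prev_unpatched(c, prev);
}

/* Both checks; the idle-task exclusion is an assumption (see above). */
static bool keep_prev_combined(const struct wake_ctx *c, int prev)
{
	return !c->cur_is_idle && keep_prev_no_irq(c, prev);
}
```

In this model, the hrtimer-on-idle case (in_interrupt set, current being the
idle task) is rejected by both patched variants, while an idle task issuing a
wakeup in process context is only rejected by the combined one.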

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6397,7 +6397,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>          * essentially a sync wakeup. An obvious example of this
>          * pattern is IO completions.
>          */
> -       if (is_per_cpu_kthread(current) &&
> +       if (!in_interrupt() &&
> +           is_per_cpu_kthread(current) &&
>             prev == smp_processor_id() &&
>             this_rq()->nr_running <= 1) {
>                 return prev;
>