linux-kernel - Re: pick_next_task() picking the wrong task [v4.9.163]


Message-ID: <20190323101540.GC6058@hirez.programming.kicks-ass.net>
Date:   Sat, 23 Mar 2019 11:15:40 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Radu Rendec <radu.rendec@...il.com>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>
Subject: Re: pick_next_task() picking the wrong task [v4.9.163]

On Fri, Mar 22, 2019 at 05:57:59PM -0400, Radu Rendec wrote:
> Hi Everyone,
> 
> I believe I'm seeing a weird behavior of pick_next_task() where it
> chooses a lower priority task over a higher priority one. The scheduling
> class of the two tasks is also different ('fair' vs. 'rt'). The culprit
> seems to be the optimization at the beginning of the function, where
> fair_sched_class.pick_next_task() is called directly.  I'm running
> v4.9.163, but that piece of code is very similar in recent kernels.
> 
> My use case is quite simple: I have a real-time thread that is woken up
> by a GPIO hardware interrupt. The thread sleeps most of the time in
> poll(), waiting for gpio_sysfs_irq() to wake it. The latency between the
> interrupt and the thread being woken up/scheduled is very important for
> the application. Note that I backported my own commit 03c0a9208bb1, so
> the thread is always woken up synchronously from HW interrupt context.
> 
> Most of the time things work as expected, but sometimes the scheduler
> picks kworker and even the idle task before my real-time thread. I used
> the trace infrastructure to figure out what happens and I'm including a
> snippet below (I apologize for the wide lines).

If only they were wide :/ I had to unwrap them myself..

>      <idle>-0     [000] d.h2   161.202970: gpio_sysfs_irq  <-__handle_irq_event_percpu
>      <idle>-0     [000] d.h2   161.202981: kernfs_notify <-gpio_sysfs_irq
>      <idle>-0     [000] d.h4   161.202998: sched_waking: comm=irqWorker pid=1141 prio=9 target_cpu=000
>      <idle>-0     [000] d.h5   161.203025: sched_wakeup: comm=irqWorker pid=1141 prio=9 target_cpu=000

Weird how the next line doesn't have 'n/N' (need-resched) set:

>      <idle>-0     [000] d.h3   161.203047: workqueue_queue_work: work struct=806506b8 function=kernfs_notify_workfn workqueue=8f5dae60 req_cpu=1 cpu=0
>      <idle>-0     [000] d.h3   161.203049: workqueue_activate_work: work struct 806506b8
>      <idle>-0     [000] d.h4   161.203061: sched_waking: comm=kworker/0:1 pid=134 prio=120 target_cpu=000
>      <idle>-0     [000] d.h5   161.203083: sched_wakeup: comm=kworker/0:1 pid=134 prio=120 target_cpu=000

There's that kworker wakeup.

>      <idle>-0     [000] d..2   161.203201: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=134 next_prio=120

And I agree that that is weird.

> kworker/0:1-134   [000] ....   161.203222: workqueue_execute_start: work struct 806506b8: function kernfs_notify_workfn
> kworker/0:1-134   [000] ...1   161.203286: schedule <-worker_thread
> kworker/0:1-134   [000] d..2   161.203329: sched_switch: prev_comm=kworker/0:1 prev_pid=134 prev_prio=120 prev_state=S ==> next_comm=swapper next_pid=0 next_prio=120
>      <idle>-0     [000] .n.1   161.230287: schedule <-schedule_preempt_disabled

Only here do I see 'n'.

>      <idle>-0     [000] d..2   161.230310: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R+ ==> next_comm=irqWorker next_pid=1141 next_prio=9
>   irqWorker-1141  [000] d..3   161.230316: finish_task_switch <-schedule
> 
> The system is Freescale MPC8378 (PowerPC, single processor).
> 
> I instrumented pick_next_task() with trace_printk() and I am sure that
> every time the wrong task is picked, flow goes through the optimization

That's weird, because when you wake a RT task, the:

  rq->nr_running == rq->cfs.h_nr_running

condition should not be true. Maybe try adding trace_printk() to every
site that modifies rq->nr_running to see what goes wobbly?
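
To illustrate why that condition should fail once an RT task is runnable,
here is a minimal user-space sketch. It is only a model under stated
assumptions: struct model_rq and the model_* helpers are hypothetical names
made up for this example, not the kernel's struct rq or its enqueue paths.
The one assumption it encodes is that a fair wakeup bumps both counters
while an RT wakeup bumps only the total.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; NOT the kernel's struct rq. */
struct model_rq {
	unsigned int nr_running;       /* all runnable tasks on this CPU */
	unsigned int cfs_h_nr_running; /* runnable fair (CFS) tasks only */
};

/* Assumption: waking a fair task increments both counters. */
static void model_enqueue_fair(struct model_rq *rq)
{
	rq->nr_running++;
	rq->cfs_h_nr_running++;
}

/* Assumption: waking an RT task increments only the total. */
static void model_enqueue_rt(struct model_rq *rq)
{
	rq->nr_running++;
}

/* Counter half of the fast-path test: true only if every runnable task is fair. */
static bool model_only_fair_runnable(const struct model_rq *rq)
{
	return rq->nr_running == rq->cfs_h_nr_running;
}
```

In this model, replaying the two wakeups from the trace (irqWorker as RT,
then kworker/0:1 as fair) leaves nr_running at 2 and cfs_h_nr_running at 1,
so model_only_fair_runnable() returns false. If the real counters ever
compare equal at that point, one of them was updated somewhere unexpected.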

> path and idle_sched_class.pick_next_task() is called directly. When the
> right task is eventually picked, flow goes through the bottom block that
> iterates over all scheduling classes. This probably makes sense: when
> the scheduler runs in the context of the idle task, prev->sched_class is
> no longer fair_sched_class, so the bottom block with the full iteration
> is used. Note that in v4.9.163 the optimization path is taken only when
> prev->sched_class is fair_sched_class, whereas in recent kernels it is
> taken for both fair_sched_class and idle_sched_class.
> 
> Any help or feedback would be much appreciated. In the meantime, I will
> experiment with commenting out the optimization (at the expense of a
> slower scheduler, of course).
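
For reference, the difference in the entry condition described above can be
sketched like this. Again a rough user-space model and not kernel code: the
enum and both function names are invented for illustration, and the only
facts taken from this thread are that v4.9.163 requires prev to be in the
fair class while recent kernels accept fair or idle, both combined with the
nr_running check.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduling class of the previous task. */
enum model_class { MODEL_IDLE, MODEL_FAIR, MODEL_RT };

/* v4.9.163 as described in this thread: prev must be a fair task. */
static bool model_fast_path_v49(enum model_class prev,
				unsigned int nr_running,
				unsigned int cfs_h_nr_running)
{
	return prev == MODEL_FAIR && nr_running == cfs_h_nr_running;
}

/* Recent kernels as described in this thread: prev may be fair or idle. */
static bool model_fast_path_recent(enum model_class prev,
				   unsigned int nr_running,
				   unsigned int cfs_h_nr_running)
{
	return (prev == MODEL_FAIR || prev == MODEL_IDLE) &&
	       nr_running == cfs_h_nr_running;
}
```

Either way, with one RT and one fair task runnable (2 vs. 1) neither variant
of the model takes the fast path; they only differ when prev is the idle
task and all runnable tasks are fair.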

It would be very good if you could confirm on the very latest kernel,
instead of on 4.9.