Message-ID: <Y2T72N6nE6UHDPx3@linutronix.de>
Date: Fri, 4 Nov 2022 12:47:36 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: linux-kernel@...r.kernel.org
Cc: Ben Segall <bsegall@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: Latency spikes from within the scheduler.
Hi,
I've been playing with v6.1-rc3-rt2 on a 256 CPU machine. Some of my
findings:
- it can take some time to grab the RQ lock. That is, to get on the CPU
  _or_ to leave it. For instance, a 1ms example:
| cyclictest-4619 [011] d...2.. 1128.288674: idle_cpu <-find_busiest_group
| cyclictest-4619 [011] d...2.. 1128.288674: idle_cpu <-find_busiest_group
| cyclictest-4619 [011] d...2.. 1128.288675: raw_spin_rq_lock_nested <-load_balance
| cyclictest-4619 [011] d...3.. 1128.288675: _raw_spin_lock <-raw_spin_rq_lock_nested
| cyclictest-4619 [011] d...4.. 1128.288676: queued_spin_lock_slowpath <-raw_spin_rq_lock_nested
| cyclictest-4619 [011] d...3.. 1128.289688: update_rq_clock <-load_balance
| cyclictest-4619 [011] d...3.. 1128.289688: can_migrate_task <-load_balance
| cyclictest-4619 [011] d...3.. 1128.289688: kthread_is_per_cpu <-can_migrate_task
https://breakpoint.cc/traces-zen3/trace-load_balance_rq_lock-11.txt.zst [77KiB]
https://breakpoint.cc/traces-zen3/trace-load_balance_rq_lock.txt.zst [34MiB]
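For reference, the ~1ms wait is simply the gap between the queued_spin_lock_slowpath entry and the next event on that CPU. A minimal sketch of how one might scan a function trace for such gaps (the line format and the 100us threshold are my assumptions, not part of the trace tooling):

```python
import re

# Matches ftrace function-tracer lines such as:
#   cyclictest-4619 [011] d...4.. 1128.288676: queued_spin_lock_slowpath <-raw_spin_rq_lock_nested
LINE = re.compile(r"(\d+\.\d+):\s+(\S+)")

def find_gaps(lines, threshold_us=100):
    """Return (prev_func, next_func, gap_us) for consecutive events
    whose timestamp delta is at least threshold_us microseconds."""
    gaps, prev = [], None
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        ts, func = float(m.group(1)), m.group(2)
        if prev and (ts - prev[0]) * 1e6 >= threshold_us:
            gaps.append((prev[1], func, round((ts - prev[0]) * 1e6)))
        prev = (ts, func)
    return gaps

trace = [
    "cyclictest-4619 [011] d...4.. 1128.288676: queued_spin_lock_slowpath <-raw_spin_rq_lock_nested",
    "cyclictest-4619 [011] d...3.. 1128.289688: update_rq_clock <-load_balance",
]
print(find_gaps(trace))  # the slowpath entry above spends ~1012us before load_balance proceeds
```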
- A CPU can be attacked by rto_push_irq_work_func(). A few instances are
  probably okay, but doing it 86 times can add up to ~400us if called from
  flush_smp_call_function_queue() (after waiting for the RQ lock):
| <idle>-0 [032] d...2.. 4069.605763: irq_enter_rcu <-sysvec_apic_timer_interrupt
…
| <idle>-0 [032] d..h6.. 4069.605766: queued_spin_lock_slowpath <-try_to_wake_up
| <idle>-0 [032] dn.h5.. 4069.605923: update_rq_clock <-try_to_wake_up
…
| <idle>-0 [032] dn.h6.. 4069.605929: queued_spin_lock_slowpath <-enqueue_task_rt
| <idle>-0 [032] dn.h5.. 4069.606167: enqueue_top_rt_rq <-enqueue_task_rt
| <idle>-0 [032] dn.h4.. 4069.606169: sched_wakeup: comm=cyclictest pid=8045 prio=9 target_cpu=032
…
| <idle>-0 [032] dn.h2.. 4069.606171: irq_exit_rcu <-sysvec_apic_timer_interrupt
| <idle>-0 [032] .n..2.. 4069.606171: arch_cpu_idle_exit <-do_idle
| <idle>-0 [032] .n..2.. 4069.606171: flush_smp_call_function_queue <-do_idle
| <idle>-0 [032] dn..2.. 4069.606172: __flush_smp_call_function_queue <-flush_smp_call_function_queue
| <idle>-0 [032] dn..2.. 4069.606173: rto_push_irq_work_func <-irq_work_single
…
| <idle>-0 [032] dn.h2.. 4069.606578: irq_exit_rcu <-sysvec_irq_work
| <idle>-0 [032] .n..2.. 4069.606578: schedule_idle <-do_idle
| <idle>-0 [032] dn..2.. 4069.606579: rcu_note_context_switch <-__schedule
https://breakpoint.cc/traces-zen3/trace-rto_push_irq_work_func-32.txt.zst [102KiB]
https://breakpoint.cc/traces-zen3/trace-rto_push_irq_work_func.txt.zst [49MiB]
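A quick way to quantify such a burst from the per-CPU trace is to count the invocations and the wall time they span. A sketch (the three sample lines below are shortened stand-ins with an invented middle timestamp; the real per-CPU trace has 86 of these calls):

```python
import re

TS = re.compile(r"(\d+\.\d+):\s+(\S+)")

def irq_work_burst(lines, func="rto_push_irq_work_func"):
    """Count invocations of func and the wall time (in us) the burst spans."""
    stamps = [float(m.group(1)) for line in lines
              if (m := TS.search(line)) and m.group(2) == func]
    if not stamps:
        return 0, 0.0
    return len(stamps), (stamps[-1] - stamps[0]) * 1e6

# Stand-in lines; the real burst runs from 4069.606173 until the
# irq_exit_rcu at 4069.606578 shown above.
trace = [
    "<idle>-0 [032] dn..2.. 4069.606173: rto_push_irq_work_func <-irq_work_single",
    "<idle>-0 [032] dn..2.. 4069.606350: rto_push_irq_work_func <-irq_work_single",
    "<idle>-0 [032] dn..2.. 4069.606578: rto_push_irq_work_func <-irq_work_single",
]
count, span_us = irq_work_burst(trace)
print(count, round(span_us))  # 3 calls spanning ~405us in this shortened sample
```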
- I also managed to run into sched_rt_period_timer():
| <idle>-0 [027] dn.h1.. 630.600139: hrtimer_expire_entry: hrtimer=00000000602b5d49 function=sched_rt_period_timer now=630600137263
| <idle>-0 [027] dn.h2.. 630.600139: sched_rt_period_timer <-__hrtimer_run_queues
| <idle>-0 [027] dn.h2.. 630.600139: _raw_spin_lock <-sched_rt_period_timer
| <idle>-0 [027] dn.h3.. 630.600140: queued_spin_lock_slowpath <-sched_rt_period_timer
| <idle>-0 [027] dn.h3.. 630.600326: ktime_get <-sched_rt_period_timer
…
| <idle>-0 [027] dn.h3.. 630.606825: hrtimer_forward <-sched_rt_period_timer
| <idle>-0 [027] dn.h2.. 630.606826: _raw_spin_lock_irq <-__hrtimer_run_queues
| <idle>-0 [027] dn.h3.. 630.606826: enqueue_hrtimer <-__hrtimer_run_queues
| <idle>-0 [027] dn.h2.. 630.606826: hrtimer_start: hrtimer=00000000602b5d49 function=sched_rt_period_timer expires=631600000000 softexpires=631600000000 mode=ABS
  That is almost 7ms of runtime just for that one hrtimer, which ensures
  that an RT task does not run longer than allowed.
https://breakpoint.cc/traces-zen3/trace-sched_rt_period_timer-27.txt.zst [101K]
https://breakpoint.cc/traces-zen3/trace-sched_rt_period_timer.txt.zst [48MiB]
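That timer enforces the global RT bandwidth limit: RT tasks may consume at most sched_rt_runtime_us out of every sched_rt_period_us, and the 1s period matches the re-arm visible in the hrtimer_start line above (expires jumps from ~630.6s to 631.6s). A tiny sketch of the arithmetic, using the usual kernel sysctl defaults rather than values read from this machine:

```python
# kernel.sched_rt_period_us / kernel.sched_rt_runtime_us; the values are
# the common defaults, assumed here, not taken from the traced box.
sched_rt_period_us = 1_000_000   # enforcement period: 1s, matching the 1s re-arm
sched_rt_runtime_us = 950_000    # RT runtime budget per period

rt_share = sched_rt_runtime_us / sched_rt_period_us
print(f"RT tasks may use at most {rt_share:.0%} of each period")  # 95%
```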
The -XX trace contains only the events from that one CPU; the other one is
the complete trace. I also have them as an all-in-one archive:
https://breakpoint.cc/traces-zen3/trace-all-in-one.tar.zst
And the dot-config:
https://breakpoint.cc/traces-zen3/dot-config.tar.zst
The first two were run with CONFIG-PERIODIC, the last one with CONFIG-NO_HZ.
Sebastian