linux-kernel - [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241119054432.6405-1-kprateek.nayak@amd.com>
Date: Tue, 19 Nov 2024 05:44:28 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Sebastian Andrzej Siewior
	<bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>, "Steven
 Rostedt" <rostedt@...dmis.org>, <linux-kernel@...r.kernel.org>,
	<linux-rt-devel@...ts.linux.dev>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
	<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
	<vschneid@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, Tejun Heo
	<tj@...nel.org>, NeilBrown <neilb@...e.de>, Jens Axboe <axboe@...nel.dk>,
	Frederic Weisbecker <frederic@...nel.org>, Zqiang
	<qiang.zhang1211@...il.com>, Caleb Sander Mateos <csander@...estorage.com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>, Chen Yu <yu.c.chen@...el.com>,
	Julia Lawall <Julia.Lawall@...ia.fr>, K Prateek Nayak
	<kprateek.nayak@....com>
Subject: [PATCH v5 0/4] sched/fair: Idle load balancer fixes for fallouts from IPI optimization to TIF_POLLING CPUs

Hello everyone,

This is the fifth version with one more patch addition since last
version, and some updated benchmarking data down below. Any and all
feedback is highly appreciated.

This series is based on:

    git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

at commit f9ed1f7c2e26 ("genirq/proc: Use seq_put_decimal_ull_width()
for decimal values")

Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()"), an idle CPU in TIF_POLLING_NRFLAG can
be pulled out of idle by setting TIF_NEED_RESCHED instead of sending an
actual IPI. This affects at least three scenarios that have been
described below:

1. A need_resched() check within a call function does not necessarily
   indicate a task wakeup since a CPU intending to send an IPI to an
   idle target in TIF_POLLING_NRFLAG mode can simply queue the
   SMP-call-function and set the TIF_NEED_RESCHED flag to pull the
   polling target out of idle. The SMP-call-function will be executed by
   flush_smp_call_function_queue() on the idle-exit path. On x86, where
   mwait_idle_with_hints() sets TIF_POLLING_NRFLAG for long idling,
   this leads to idle load balancer bailing out early since
   need_resched() check in nohz_csd_func() returns true in most
   instances. This is also true for softirqs being handled from 
   do_softirq_post_smp_call_flush(), which again, is processed before
   __schedule() clears the TIF_NEED_RESCHED flag.

2. A TIF_POLLING_NRFLAG idling CPU woken up to process an IPI will end
   up calling schedule() even in cases where the call function does not
   wake up a new task on the idle CPU, thus delaying the idle re-entry.

3. Julia Lawall reported a case where a softirq raised from a
   SMP-call-function on an idle CPU will wake up ksoftirqd since
   flush_smp_call_function_queue() executes in the idle thread's
   context. This can throw off the idle load balancer by making the idle
   CPU appear busy since ksoftirqd just woke on the said CPU [1].

Solution to (2.) was sent independently in [2] since it was not
dependent on the changes enclosed in this series which reworks some
PREEMPT_RT specific bits.

(1.) was solved by dropping the need_resched() check in nohz_csd_func()
(please refer Patch 2 and Patch 3 for the full version of the
explanation) which led to a splat on PREEMPT_RT kernels [3].

Since flush_smp_call_function_queue() and the following
do_softirq_post_smp_call_flush() runs with [reem[t disabled, it is not
ideal for the IRQ handlers to raise a SOFTIRQ, prolonging the preempt
disabled section especially on PREEMPT_RT kernels. Patch 1 works around
this by allowing processing of SCHED_SOFTIRQ without raising a warning
since the idle load balancer has checkpoints to detect wakup of tasks
on the idle CPU in which case it immediately bails out of the load
balaning process and proceeds to call schedule()

With the above solution, problem discussed in (3.) is even more
prominent with idle load balancing waking up ksoftirqd to unnecessarily.
Previous versions attempted to solve this by introducing a per-cpu
variable to keep track on an impending call to do_softirq(). Sebastian
suggested using __raise_softirq_irqoff() for SCHED_SOFTIRQ since it
avoids wakeup of ksoftirqd and since SCHED_SOFTIRQ is being raised from
the IPI handler, it is guaranteed to be serviced on IRQ exit or via
do_softirq_post_smp_call_flush()

Sebastian also raised concerns about threadirqs possibly interfering
with load balancing, however in case of idle load balancing, the CPU
performing the idle load balancing on behalf of other nohz idle CPUs
only performs load balance for itself at the very end. If the idle load
balancing was offloaded to ksoftirqd, and the CPU bails out from idle
load balancing, the scheduler will perform a newidle balance if
ksoftirqd was not able to pull any task since it will be the last fair
task on the runqueue and should handle any imbalance. These details have
been added in the commit log of Patch 3 for the record.

Chenyu had reported a regression when running a modified version of
ipistorm that performs a fixed set of IPIs between two CPUs on his
setup with a previous version of this series which updated certain
per-CPU variable at flush_smp_call_function_queue() boundary to prevent
ksoftirqd wakeups. Since these updates are now gone, courtesy of the
new approach suggested by Sebastian, this regression sould no longer be
seen. The results from testing on the patched kernel is identical to
the base kernel.

Julia Lawall reported reduction in the number of load balancing attempts
on v6.11 based tip:sched/core at NUMA level. The issue was root-caused
to commit 3dcac251b066 ("sched/core: Introduce SM_IDLE and an idle
re-entry fast-path in __schedule()") which would skip the
newidle_balance() if schedule_idle() was called without any task wakeups
on the idle CPUs to favor a faster idle re-entry. To rule out any
surprises from this series in particular, I tested the bt.B.x benchmark
where she originally observed this behavior on. Following are the
numbers from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T):

   ==================================================================
   Test          : bt.B.x (OMP variant)
   Units         : % improvement over base kernel in Mop/s throughput
   Interpretation: Higher is better
   ======================= Intel Ice Lake Xeon ======================
   kernel:						[pct imp]
   performance gov, boost en	  			  0.89%
   performance gov, boost dis	  			  0.26%
   ==================================================================

I did not see any discernable difference with this one over the base
kernel.

[1] https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/
[2] https://lore.kernel.org/lkml/20240809092240.6921-1-kprateek.nayak@amd.com/
[3] https://lore.kernel.org/lkml/225e6d74-ed43-51dd-d1aa-c75c86dd58eb@amd.com/
[4] https://lore.kernel.org/lkml/20240710150557.GB27299@noisy.programming.kicks-ass.net/
---
v4..v5:

o Added Patch 3 which modifies another need_resched() check in the idle
  load balancer's SOFTIRQ handler path that I had missed previously.

o Collected Reviewed-by from Sebastian on Patch 4.

o Reworded the commit messages for Patch 1, and Patch 4.

o Added more clarifications around load balancing in the cover letter
  for threadirqs case.

v4: https://lore.kernel.org/lkml/20241030071557.1422-1-kprateek.nayak@amd.com/

v3..v4:

o Use __raise_softirq_irqoff() for SCHED_SOFTIRQ. (Sebastian)

o Updated benchmark numbers and reworded the cover letter.

v3: https://lore.kernel.org/lkml/20241014090339.2478-1-kprateek.nayak@amd.com/

v2..v3:

o Removed ifdefs around local_lock_t. (Peter)

o Reworded Patch 1 to add more details on raising SCHED_SOFTIRQ from
  flush_smp_call_function_queue() and why it should be okay on
  PREEMPT_RT.

o Updated the trace data in Patch 5.

o More benchmarking.

v2: https://lore.kernel.org/lkml/20240904111223.1035-1-kprateek.nayak@amd.com/

v1..v2:

o Broke the PREEMPT_RT unification and idle load balance fixes into
  separate series (this) and post the SM_IDLE fast-path enhancements
  separately.

o Worked around the splat on PREEMPT_RT kernel caused by raising
  SCHED_SOFTIRQ from nohz_csd_func() in context of
  flush_smp_call_function_queue() which is undesirable on PREEMPT_RT
  kernels. (Please refer to commit 1a90bfd22020 ("smp: Make softirq
  handling RT safe in flush_smp_call_function_queue()")

o Reuse softirq_ctrl::cnt from PREEMPT_RT to prevent unnecessary
  wakeups of ksoftirqd. (Peter)
  This unifies should_wakeup_ksoftirqd() and adds an interface to
  indicate an impending call to do_softirq (set_do_softirq_pending())
  and clear it just before fulfilling the promise
  (clr_do_softirq_pending()).

o More benchmarking.

v1: https://lore.kernel.org/lkml/20240710090210.41856-1-kprateek.nayak@amd.com/
--


K Prateek Nayak (4):
  softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT
    kernel
  sched/core: Remove the unnecessary need_resched() check in
    nohz_csd_func()
  sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU
    turning busy
  sched/core: Prevent wakeup of ksoftirqd during idle load balance

 kernel/sched/core.c |  4 ++--
 kernel/sched/fair.c |  2 +-
 kernel/softirq.c    | 15 +++++++++++----
 3 files changed, 14 insertions(+), 7 deletions(-)

-- 
2.34.1