linux-kernel - [RFC] sched/deadline: only mark active cpu as free

Message-Id: <20250110233010.2339521-1-opendmb@gmail.com>
Date: Fri, 10 Jan 2025 15:30:10 -0800
From: Doug Berger <opendmb@...il.com>
To: Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Florian Fainelli <florian.fainelli@...adcom.com>,
	linux-kernel@...r.kernel.org,
	Doug Berger <opendmb@...il.com>
Subject: [RFC] sched/deadline: only mark active cpu as free

There is a hazard in the deadline scheduler where an offlined CPU
can have its free_cpus bit left set in the def_root_domain when
the schedutil cpufreq governor is used. This can allow a deadline
thread to be pushed to the runqueue of a powered down CPU which
breaks scheduling.

This commit works around the issue by only setting the free_cpus
bit for a CPU when it is "active". It is likely that the ordering
of sched_set_rq_online() and set_cpu_active() at the end of the
sched_cpu_deactivate() function should be revisited if this
approach has merit.

Signed-off-by: Doug Berger <opendmb@...il.com>
---

Coffee is recommended before proceeding.

While stress testing CPU hotplug on a quad-core arm64 system I
encountered a deadlock. My specific deadlock appears to depend on
the system having three or more cores and using the schedutil
cpufreq governor, which uses a deadline scheduled thread named
"sugov:n" where n is the CPU number.

The scenario I observe is as follows:
Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
previously offlined so their runqueues are attached to the
def_root_domain.
1) A hot plug is initiated on CPU2.
2) The cpuhp/2 thread invokes the cpufreq governor driver during
   the CPUHP_AP_ONLINE_DYN step.
3) The schedutil cpufreq governor creates the "sugov:2" thread to
   execute on CPU2 with the deadline scheduler.
4) The deadline scheduler clears the free_cpus mask for CPU2 within
   the def_root_domain when "sugov:2" is scheduled.
5) When the "sugov:2" thread blocks, cpudl_clear() gets called to
   clear the deadline which sets the free_cpus mask for CPU2 within
   the def_root_domain.
6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step a new scheduling
   domain is created to include CPU0, CPU1, and CPU2.
   o detach_destroy_domains() invokes rq_attach_root() for CPU0 and
     CPU1 which offlines their runqueues and detaches their current
     dynamic scheduling domain (clearing their deadline free_cpus
     bits there) and attaches the def_root_domain and onlines their
     runqueues (setting their deadline free_cpus bits there).
   o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1,
     and CPU2.
     - Since only CPU0 and CPU1 are online in the def_root_domain
       set_rq_offline() is only called for them to offline their
       runqueues and detach the def_root_domain (clearing their
       deadline free_cpus bits there).
     - The free_cpus bit for CPU2 in def_root_domain is allowed to
       remain set.
     - The newly created dynamic scheduling domain is attached to
       CPU0, CPU1, and CPU2 runqueues and set_rq_online() is used
       to online their runqueues (setting their deadline free_cpus
       bits there).
7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
   CPUHP_AP_ACTIVE step, but since the runqueues are already online
   essentially nothing happens.
8) Some time later CPU2 is hot unplugged.
9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and
   invokes balance_push_set() for CPU2 which migrates "sugov:2" to
   a different CPU through fallback.
10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for
    CPU2 which takes its runqueue offline and clears its deadline
    free_cpus bit in the current dynamic scheduling domain.
11) Also at this step, cpuhp/2 updates the scheduling domain to
    remove CPU2.
    o detach_destroy_domains() invokes rq_attach_root() for CPU0,
      CPU1, and CPU2 to move them back to the def_root_domain.
      - Since only CPU0 and CPU1 are online in the current dynamic
        scheduling domain (CPU2 was removed at 10 above),
        set_rq_offline() is only called for them to clear their
        deadline free_cpus bits.
      - The def_root_domain is attached to CPU0, CPU1, and CPU2
        runqueues and since only CPU0 and CPU1 are marked active
        set_rq_online() is used to online their runqueues (setting
        their deadline free_cpus bits there).
      - The free_cpus bit for CPU2 in def_root_domain is allowed
        to remain set.
    o build_sched_domains() invokes rq_attach_root() for CPU0 and
      CPU1 which offlines their runqueues (clearing their deadline
      free_cpus bits in def_root_domain), attaches a new dynamic
      scheduling domain, and onlines their runqueues (setting their
      deadline free_cpus bits there).
12) The cpuhp/2 thread invokes the cpufreq governor driver during
    the CPUHP_AP_ONLINE_DYN step which attempts to stop the
    "sugov:2" kthread by calling kthread_flush_worker() followed by
    kthread_stop().
13) The "sugov:2" thread can likely be successfully deadline
    scheduled on CPU0 or CPU1 to allow the cpuhp/2 thread to
    complete offlining CPU2 and power it off.
14) Some time later CPU1 is hot unplugged.
15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and
    invokes balance_push_set() for CPU1 which migrates "sugov:1" to
    CPU0 through fallback.
16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for
    CPU1 which takes its runqueue offline and clears its deadline
    free_cpus bit in the current dynamic scheduling domain.
17) Also at this step, cpuhp/1 updates the scheduling domain to
    remove CPU1.
    o detach_destroy_domains() invokes rq_attach_root() for CPU0
      and CPU1 to move them back to the def_root_domain.
      - Since only CPU0 is online in the current dynamic scheduling
        domain (CPU1 was removed at 16 above), set_rq_offline() is
        only called for it to clear its deadline free_cpus bit.
      - The def_root_domain is attached to CPU0 and CPU1 runqueues
        and since only CPU0 is marked active set_rq_online() is
        used to online its runqueue (setting its deadline free_cpus
        bits there).
      - The free_cpus bit for CPU1 is untouched in def_root_domain.
      - The free_cpus bit for CPU2 in def_root_domain remains set
        from the preceding sequence.
18) If CPU0 executes the "sugov:0" deadline thread at this time it
    may see that the "sugov:1" deadline thread is also on its
    runqueue and may call push_dl_task() to attempt to push it to a
    different CPU.
19) The effort to find a later runqueue will find the stale
    free_cpus bit of CPU2 in the currently attached def_root_domain
    and will migrate the "sugov:1" thread to the runqueue of the
    powered down CPU2 where it can never get scheduled.
20) The cpuhp/1 thread invokes the cpufreq governor driver during
    the CPUHP_AP_ONLINE_DYN step which attempts to stop the
    "sugov:1" kthread by calling kthread_flush_worker() followed by
    kthread_stop(). Since "sugov:1" never gets scheduled cpuhp/1
    remains blocked on completion events.
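
As a rough illustration of steps 3-5 and of the guard proposed in the
patch, here is a small userspace sketch. It is not kernel code: every
model_* name is hypothetical, the free_cpus cpumask is reduced to a
plain bitmask, and the cpudl heap handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_cpudl {
	unsigned int free_cpus;	/* bit n set: CPU n has no deadline task queued */
};

static unsigned int model_active_mask;	/* bit n set: CPU n is "active" */

static bool model_cpu_active(int cpu)
{
	return (model_active_mask & (1u << cpu)) != 0;
}

/* Models cpudl_set(): a deadline task is queued, so the CPU is not free. */
static void model_cpudl_set(struct model_cpudl *cp, int cpu)
{
	cp->free_cpus &= ~(1u << cpu);
}

/* Models cpudl_clear(); 'guarded' selects the check proposed in this RFC. */
static void model_cpudl_clear(struct model_cpudl *cp, int cpu, bool guarded)
{
	if (!guarded || model_cpu_active(cpu))
		cp->free_cpus |= 1u << cpu;
}

/* Replays steps 3-5 for one CPU and returns the resulting free_cpus mask. */
static unsigned int model_replay_sugov_start(int cpu, bool guarded)
{
	struct model_cpudl def_root = { .free_cpus = 0 };

	model_active_mask = 0x3;	/* CPU0 and CPU1 active, CPU2 not yet */
	model_cpudl_set(&def_root, cpu);		/* "sugov:2" is scheduled */
	model_cpudl_clear(&def_root, cpu, guarded);	/* "sugov:2" blocks */
	return def_root.free_cpus;
}
```

Calling model_replay_sugov_start(2, false) leaves the CPU2 bit set in
the modeled def_root_domain, while model_replay_sugov_start(2, true)
leaves the mask empty because CPU2 is not yet active at that point.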

Steps 1-13 amount to setting a trap by allowing a free_cpus bit in
the deadline scheduler def_root_domain to remain set for a CPU that
is powered off. The trap can be sprung during the narrow timing
hazard when the def_root_domain is transitionally attached while
changing scheduling domains if the deadline scheduler pushes a
queued task to the powered off CPU.
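
How the trap springs in step 19 can be sketched the same way. This is
a simplified userspace model, not the scheduler's actual search for a
later runqueue: the function names are hypothetical, and the allowed
mask of all four CPUs is an assumption made for illustration.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for "find a later runqueue": returns the
 * lowest-numbered CPU whose bit is set in both masks, or -1 if none.
 */
static int model_find_free_cpu(unsigned int free_cpus, unsigned int allowed)
{
	unsigned int candidates = free_cpus & allowed;
	int cpu;

	for (cpu = 0; cpu < 32; cpu++)
		if (candidates & (1u << cpu))
			return cpu;
	return -1;
}

/*
 * Replays step 19: CPU0 is busy running "sugov:0", so its free_cpus
 * bit is clear and only the stale bits remain as push candidates.
 */
static int model_push_target(unsigned int stale_bits)
{
	unsigned int free_cpus = stale_bits;
	unsigned int allowed = 0xf;	/* assumption: any of the 4 CPUs */

	return model_find_free_cpu(free_cpus, allowed);
}
```

With the stale CPU2 bit present, model_push_target(1u << 2) selects
the powered off CPU2. With no stale bits it returns -1, so the task
stays where it is.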

This problem appears to have been initially introduced by commit
120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
moved the set_rq_offline() handling from sched_cpu_dying() to
sched_cpu_deactivate(). The original sequence allowed the free_cpus
bit to be forcibly cleared in the def_root_domain after all of the
scheduler dust settled. The new location makes the
sched_set_rq_offline() call essentially meaningless for the deadline
scheduler since the management of changed scheduling domains happens
later.

There are likely many different approaches to address this issue
and I'm hopeful that someone more familiar with the scheduler than
I am can propose a better solution than the one suggested here.

Thank you for reading this far. Any advice is appreciated.
-Doug

 kernel/sched/cpudeadline.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 95baa12a1029..6896bbe0e9ae 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
 		cp->elements[cpu].idx = IDX_INVALID;
 		cpudl_heapify(cp, old_idx);
 
-		cpumask_set_cpu(cpu, cp->free_cpus);
+		if (cpu_active(cpu))
+			cpumask_set_cpu(cpu, cp->free_cpus);
 	}
 	raw_spin_unlock_irqrestore(&cp->lock, flags);
 }
-- 
2.34.1