Message-ID: <Z4Tw7Kx8TGDUrm35@jlelli-thinkpadt14gen4.remote.csb>
Date: Mon, 13 Jan 2025 11:54:36 +0100
From: Juri Lelli <juri.lelli@...hat.com>
To: Doug Berger <opendmb@...il.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Florian Fainelli <florian.fainelli@...adcom.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC] sched/deadline: only mark active cpu as free

Hi Doug,

On 10/01/25 15:30, Doug Berger wrote:
> There is a hazard in the deadline scheduler where an offlined CPU
> can have its free_cpus bit left set in the def_root_domain when
> the schedutil cpufreq governor is used. This can allow a deadline
> thread to be pushed to the runqueue of a powered down CPU which
> breaks scheduling.
> 
> This commit works around the issue by only setting the free_cpus
> bit for a CPU when it is "active". It is likely that the ordering
> of sched_set_rq_online() and set_cpu_active() at the end of the
> sched_cpu_deactivate() function should be revisited if this
> approach has merit.
> 
> Signed-off-by: Doug Berger <opendmb@...il.com>
> ---
> 
> Coffee is recommended before proceeding.
> 
> While stress testing CPU hotplug on a quad-core arm64 architecture
> system I encountered a deadlock. My specific deadlock appears to be
> dependent on the system having three or more cores and using the
> schedutil cpufreq governor which uses a deadline scheduled thread
> named "sugov:n" where n is the CPU number.
> 
> The scenario I observe is as follows:
> Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
> previously offlined so their runqueues are attached to the
> def_root_domain.
> 1) A hot plug is initiated on CPU2.
> 2) The cpuhp/2 thread invokes the cpufreq governor driver during
>    the CPUHP_AP_ONLINE_DYN step.
> 3) The schedutil cpufreq governor creates the "sugov:2" thread to
>    execute on CPU2 with the deadline scheduler.
> 4) The deadline scheduler clears the free_cpus mask for CPU2 within
>    the def_root_domain when "sugov:2" is scheduled.
> 5) When the "sugov:2" thread blocks, cpudl_clear() gets called to
>    clear the deadline which sets the free_cpus mask for CPU2 within
>    the def_root_domain.
> 6) When cpuhp/2 reaches the CPUHP_AP_ACTIVE step a new scheduling
>    domain is created to include CPU0, CPU1, and CPU2.
>    o detach_destroy_domains() invokes rq_attach_root() for CPU0 and
>      CPU1 which offlines their runqueues and detaches their current
>      dynamic scheduling domain (clearing their deadline free_cpus
>      bits there) and attaches the def_root_domain and onlines their
>      runqueues (setting their deadline free_cpus bits there).
>    o build_sched_domains() invokes rq_attach_root() for CPU0, CPU1,
>      and CPU2.
>      - Since only CPU0 and CPU1 are online in the def_root_domain
>        set_rq_offline() is only called for them to offline their
>        runqueues and detach the def_root_domain (clearing their
>        deadline free_cpus bits there).
>      - The free_cpus bit for CPU2 in def_root_domain is allowed to
>        remain set.
>      - The newly created dynamic scheduling domain is attached to
>        CPU0, CPU1, and CPU2 runqueues and set_rq_online() is used
>        to online their runqueues (setting their deadline free_cpus
>        bits there).
> 7) The cpuhp/2 thread also invokes sched_set_rq_online() in the
>    CPUHP_AP_ACTIVE step, but since the runqueues are already online
>    essentially nothing happens.
> 8) Some time later CPU2 is hot unplugged.
> 9) At the CPUHP_AP_ACTIVE step, cpuhp/2 marks CPU2 not active and
>    invokes balance_push_set() for CPU2 which migrates "sugov:2" to
>    a different CPU through fallback.
> 10) Also at this step, cpuhp/2 invokes sched_set_rq_offline() for
>     CPU2 which takes its runqueue offline and clears its deadline
>     free_cpus bit in the current dynamic scheduling domain.
> 11) Also at this step, cpuhp/2 updates the scheduling domain to
>     remove CPU2.
>     o detach_destroy_domains() invokes rq_attach_root() for CPU0,
>       CPU1, and CPU2 to move them back to the def_root_domain.
>       - Since only CPU0 and CPU1 are online in the current dynamic
>         scheduling domain (CPU2 was removed at 10 above),
>         set_rq_offline() is only called for them to clear their
>         deadline free_cpus bits.
>       - The def_root_domain is attached to CPU0, CPU1, and CPU2
>         runqueues and since only CPU0 and CPU1 are marked active
>         set_rq_online() is used to online their runqueues (setting
>         their deadline free_cpus bits there).
>       - The free_cpus bit for CPU2 in def_root_domain is allowed
>         to remain set.
>     o build_sched_domains() invokes rq_attach_root() for CPU0 and
>       CPU1 which offlines their runqueues (clearing their deadline
>       free_cpus bits in def_root_domain), attaches a new dynamic
>       scheduling domain, and onlines their runqueues (setting their
>       deadline free_cpus bits there).
> 12) The cpuhp/2 thread invokes the cpufreq governor driver during
>     the CPUHP_AP_ONLINE_DYN step which attempts to stop the
>     "sugov:2" kthread by calling kthread_flush_worker() followed by
>     kthread_stop().
> 13) The "sugov:2" thread can likely be successfully deadline
>     scheduled on CPU0 or CPU1 to allow the cpuhp/2 thread to
>     complete offlining CPU2 and power it off.
> 14) Some time later CPU1 is hot unplugged.
> 15) At the CPUHP_AP_ACTIVE step, cpuhp/1 marks CPU1 not active and
>     invokes balance_push_set() for CPU1 which migrates "sugov:1" to
>     CPU0 through fallback.
> 16) Also at this step, cpuhp/1 invokes sched_set_rq_offline() for
>     CPU1 which takes its runqueue offline and clears its deadline
>     free_cpus bit in the current dynamic scheduling domain.
> 17) Also at this step, cpuhp/1 updates the scheduling domain to
>     remove CPU1.
>     o detach_destroy_domains() invokes rq_attach_root() for CPU0
>       and CPU1 to move them back to the def_root_domain.
>       - Since only CPU0 is online in the current dynamic scheduling
>         domain (CPU1 was removed at 16 above), set_rq_offline() is
>         only called for it to clear its deadline free_cpus bit.
>       - The def_root_domain is attached to CPU0 and CPU1 runqueues
>         and since only CPU0 is marked active set_rq_online() is
>         used to online its runqueue (setting its deadline free_cpus
>         bits there).
>       - The free_cpus bit for CPU1 is untouched in def_root_domain.
>       - The free_cpus bit for CPU2 in def_root_domain remains set
>         from the preceding sequence.
> 18) If CPU0 executes the "sugov:0" deadline thread at this time it
>     may see that the "sugov:1" deadline thread is also on its
>     runqueue and may call push_dl_task() to attempt to push it to a
>     different CPU.
> 19) The effort to find a later runqueue will find the stale
>     free_cpus bit of CPU2 in the currently attached def_root_domain
>     and will migrate the "sugov:1" thread to the runqueue of the
>     powered down CPU2 where it can never get scheduled.
> 20) The cpuhp/1 thread invokes the cpufreq governor driver during
>     the CPUHP_AP_ONLINE_DYN step which attempts to stop the
>     "sugov:1" kthread by calling kthread_flush_worker() followed by
>     kthread_stop(). Since "sugov:1" never gets scheduled cpuhp/1
>     remains blocked on completion events.
> 
> Steps 1-13 amount to setting a trap by allowing a free_cpus bit in
> the deadline scheduler def_root_domain to remain set for a CPU that
> is powered off. The trap can be sprung during the narrow timing
> hazard when the def_root_domain is transitionally attached while
> changing scheduling domains if the deadline scheduler pushes a
> queued task to the powered off CPU.
> 
> This problem appears to have been initially introduced by commit 
> 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") which
> moved the set_rq_offline() handling from sched_cpu_dying() to
> sched_cpu_deactivate(). The original sequence allowed the free_cpus
> bit to be forcibly cleared in the def_root_domain after all of the
> scheduler dust settled. The new location makes the
> sched_set_rq_offline() essentially meaningless for the deadline
> scheduler since the managing of changed scheduling domains happens
> later.
> 
> There are likely many different approaches to address this issue
> and I'm hopeful that someone more familiar with the scheduler than
> I can propose a better solution than the one suggested here.
> 
> Thank you for reading this far. Any advice is appreciated.

Thanks a lot for the detailed analysis!

I actually fear that the issue is due to the cpudl_clear_freecpu() call
in rq_offline_dl() being racy, as we don't hold cp->lock while calling
that. So, I think your solution below might be almost correct. I am
thinking we should do something similar in cpudl_set() and remove the
cpudl_{set,clear}_freecpu() calls altogether.
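
To make the idea a bit more concrete, here is a minimal user-space sketch.
It is not kernel code and only one possible reading of the suggestion.
Everything prefixed with toy_ is hypothetical: a C11 atomic_flag spinlock
stands in for cp->lock, a plain bitmask stands in for the free_cpus
cpumask, and toy_cpu_active[] stands in for cpu_active(). The heap
handling is left out entirely. The only point is that free_cpus gets
modified exclusively from the set/clear paths while the lock is held, so
no separate *_freecpu() helpers are required. Clearing the bit for an
inactive CPU in the else branch is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for cpu_active(); not the kernel API. */
static bool toy_cpu_active[TOY_NR_CPUS];

struct toy_cpudl {
	atomic_flag lock;        /* toy spinlock standing in for cp->lock */
	unsigned int free_cpus;  /* bit n set: CPU n has no deadline task */
};

static void toy_lock(struct toy_cpudl *cp)
{
	while (atomic_flag_test_and_set_explicit(&cp->lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_cpudl *cp)
{
	atomic_flag_clear_explicit(&cp->lock, memory_order_release);
}

static void toy_cpudl_init(struct toy_cpudl *cp)
{
	atomic_flag_clear(&cp->lock);
	cp->free_cpus = 0;
}

/* A deadline task was queued on @cpu: it is no longer free. */
static void toy_cpudl_set(struct toy_cpudl *cp, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_lock(cp);
	cp->free_cpus &= ~(1u << cpu);
	toy_unlock(cp);
}

/* The last deadline task left @cpu: mark it free only if it is active. */
static void toy_cpudl_clear(struct toy_cpudl *cp, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_lock(cp);
	if (toy_cpu_active[cpu])
		cp->free_cpus |= 1u << cpu;
	else
		cp->free_cpus &= ~(1u << cpu); /* assumption of this sketch */
	toy_unlock(cp);
}
```

With this shape, a caller that takes a runqueue offline would not touch
free_cpus directly, so there would be no update outside the lock left to
race with.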

What do you think? If you agree, would you care to update your patch? :)

Best,
Juri

>  kernel/sched/cpudeadline.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 95baa12a1029..6896bbe0e9ae 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -195,7 +195,8 @@ void cpudl_clear(struct cpudl *cp, int cpu)
>  		cp->elements[cpu].idx = IDX_INVALID;
>  		cpudl_heapify(cp, old_idx);
>  
> -		cpumask_set_cpu(cpu, cp->free_cpus);
> +		if (cpu_active(cpu))
> +			cpumask_set_cpu(cpu, cp->free_cpus);
>  	}
>  	raw_spin_unlock_irqrestore(&cp->lock, flags);
>  }
> -- 
> 2.34.1
> 
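
For readers who want to see the effect of the one-line guard in the diff
above, the following is a heavily simplified user-space model of the
quoted scenario. It is a sketch under stated assumptions, not kernel
code: all toy_ names are hypothetical, only the free_cpus mask of a toy
def_root_domain is tracked, the dynamic scheduling domains are ignored,
and the push target is simply the first other CPU whose bit is set.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

static bool toy_active[TOY_NR_CPUS];   /* stand-in for cpu_active() */
static unsigned int toy_def_free_cpus; /* toy def_root_domain free_cpus */

/* Deadline task queued on @cpu: the CPU is no longer free. */
static void toy_set(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_def_free_cpus &= ~(1u << cpu);
}

/* Deadline task left @cpu; @guard models the cpu_active() check. */
static void toy_clear(int cpu, bool guard)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (!guard || toy_active[cpu])
		toy_def_free_cpus |= 1u << cpu;
}

/* Return the first free CPU other than @from, or -1 if there is none. */
static int toy_find_push_target(int from)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (cpu != from && (toy_def_free_cpus & (1u << cpu)))
			return cpu;
	return -1;
}

/* Replay the quoted steps; return the CPU "sugov:1" would be pushed to. */
static int toy_run_scenario(bool guard)
{
	toy_def_free_cpus = 0;
	toy_active[0] = toy_active[1] = true;
	toy_active[2] = toy_active[3] = false;

	/* Steps 3-5: "sugov:2" runs and blocks before CPU2 is active. */
	toy_set(2);
	toy_clear(2, guard);

	/* Steps 6-13: CPU2 goes active, then is unplugged again.
	 * The def_root_domain bit for CPU2 is never revisited. */
	toy_active[2] = true;
	toy_active[2] = false;

	/* Steps 14-17: CPU1 is deactivated; def_root_domain is
	 * transiently attached to CPU0. */
	toy_active[1] = false;

	/* Step 18: "sugov:0" runs on CPU0. */
	toy_set(0);

	/* Step 19: look for a CPU to push "sugov:1" to. */
	return toy_find_push_target(0);
}
```

In this model toy_run_scenario(false) picks the powered down CPU2 through
the stale bit, while toy_run_scenario(true) finds no target, so the task
would stay on CPU0.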

