[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <543d376c-85a7-4628-a38e-52bc117258a5@linux.ibm.com>
Date: Thu, 5 Dec 2024 20:18:28 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Tobias Huschle <huschle@...ux.ibm.com>, linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, linux-s390@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org
Subject: Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type
group_parked
On 12/4/24 16:51, Tobias Huschle wrote:
> Adding a new scheduler group type which allows to remove all tasks
> from certain CPUs through load balancing can help in scenarios where
> such CPUs are currently unfavorable to use, for example in a
> virtualized environment.
>
> Functionally, this works as intended. The question would be, if this
> could be considered to be added and would be worth going forward
> with. If so, which areas would need additional attention?
> Some cases are referenced below.
>
> The underlying concept and the approach of adding a new scheduler
> group type were presented in the Sched MC of the 2024 LPC.
> A short summary:
Thanks for working on this. Yes, we had two possible implementable version.
1. Using new group type. (this RFC)
2. Using group_misfit and use very low CPU capacity set using thermal framework.
Those tricky issues were discussed at plumbers.
I agree using new group type simplifies from implementation perspective.
So for the idea of using this,
Acked-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
>
> Some architectures (e.g. s390) provide virtualization on a firmware
> level. This implies, that Linux kernels running on such architectures
> run on virtualized CPUs.
>
> Like in other virtualized environments, the CPUs are most likely shared
> with other guests on the hardware level. This implies, that Linux
> kernels running in such an environment may encounter 'steal time'. In
> other words, instead of being able to use all available time on a
> physical CPU, some of said available time is 'stolen' by other guests.
>
> This can cause side effects if a guest is interrupted at an unfavorable
> point in time or if the guest is waiting for one of its other virtual
> CPUs to perform certain actions while those are suspended in favour of
> another guest.
>
> Architectures, like arch/s390, address this issue by providing an
> alternative classification for the CPUs seen by the Linux kernel.
>
> The following example is arch/s390 specific:
> In the default mode (horizontal CPU polarization), all CPUs are treated
> equally and can be subject to steal time equally.
> In the alternate mode (vertical CPU polarization), the underlying
> firmware hypervisor assigns the CPUs, visible to the guest, different
> types, depending on how many CPUs the guest is entitled to use. Said
> entitlement is configured by assigning weights to all active guests.
> The three CPU types are:
> - vertical high : On these CPUs, the guest has always highest
> priority over other guests. This means
> especially that if the guest executes tasks on
> these CPUs, it will encounter no steal time.
> - vertical medium : These CPUs are meant to cover fractions of
> entitlement.
> - vertical low : These CPUs will have no priority when being
> scheduled. This implies especially, that while
> all other guests are using their full
> entitlement, these CPUs might not be ran for a
> significant amount of time.
>
> As a consequence, using vertical lows while the underlying hypervisor
> experiences a high load, driven by all defined guests, is to be avoided.
>
> In order to consequently move tasks off of vertical lows, introduce a
> new type of scheduler groups: group_parked.
> Parked implies, that processes should be evacuated as fast as possible
> from these CPUs. This implies that other CPUs should start pulling tasks
> immediately, while the parked CPUs should refuse to pull any tasks
> themselves.
> Adding a group type beyond group_overloaded achieves the expected
> behavior. By making its selection architecture dependent, it has
> no effect on architectures which will not make use of that group type.
>
> This approach works very well for many kinds of workloads. Tasks are
> getting migrated back and forth in line with changing the parked
> state of the involved CPUs.
Likely there could more use-cases. It is basically supposed to be a lightweight
mechanism to remove tasks out of CPUs instead of offline. Right?
>
> There are a couple of issues and corner cases which need further
> considerations:
> - no_hz: While the scheduler tick can and should still be disabled
> on idle CPUs, it should not be disabled on parked CPUs
> which run only one task, as that task will not be
task running on Parked CPUs itself is concern right? unless it is pinned.
> scheduled away in time. Side effects and completeness
> need to be further investigated. One option might be to
> allow dynamic changes to tick_nohz_full_mask. It is also
> possible to handle this in exclusively fair.c, but that
> seems not to be the best environment to do so.
> - pinned tasks: If a task is pinned to CPUs which are all parked, it will
> get moved to other CPUs. Like during CPU hotplug, the
> information about the tasks initial CPU mask gets lost.
Could be a warning instead saying to user or fail?
> - rt & dl: Realtime and deadline scheduling require some additional
> attention.
Ideal would be not run RT and DL there. But in these virtualized environment there is likely a number of CPUS
such a number of Vertical High which is always available (in PowerPC we call these as entitled CPUs) and use those
for RT or DL calculations of admission control?
> - ext: Probably affected as well. Needs some conceptional
> thoughts first.
> - idle vs parked: It could be considered whether an idle parked CPU
> would contribute to the count of idle CPUs. It is
> usually preferable to utilize idle CPUs, but parked CPUs
> should not be used. So a scheduler group with many idle,
> but parked, CPUs, should not be the target for additional
> workload. At this point, some more thought needs to be
> spent to evaluate if it would be ok to not set the idle
> flag on parked CPUs.
I think idle_cpus shouldn't include parked CPUs.
> - optimization: It is probably possible to cut some corners. In order to
> avoid tampering with scheduler statistics too much, the
> actions based on the parkedness on the CPU are not always
> taken on the earliest possible occasion yet.
> - raciness: Right now, there are no synchronization efforts. It needs
> to be considered whether those might be necessary or if
> it is alright that the parked-state of a CPU might change
> during load-balancing.
Next load balancing will take care of this instead right? Similar to CPU capacity can
change on its own even during load balancing. next load balancer takes care.
>
> Patches apply to tip:sched/core
>
> The s390 patch serves as a simplified implementation example.
>
> Tobias Huschle (2):
> sched/fair: introduce new scheduler group type group_parked
> s390/topology: Add initial implementation for selection of parked CPUs
>
> arch/s390/include/asm/topology.h | 3 +
> arch/s390/kernel/topology.c | 5 ++
> include/linux/sched/topology.h | 20 +++++
> kernel/sched/core.c | 10 ++-
> kernel/sched/fair.c | 122 +++++++++++++++++++++++++------
> 5 files changed, 135 insertions(+), 25 deletions(-)
>
tl;dr; debug patch and some testing results with mpstats logs.
So I gave it a try with using a debugfs based hint to say which CPUs are parked.
It is a hack to try it out. patch is below so one could try something similar is their archs
and see if it help if they have a use case.
Notes:
1. Arch shouldn't set cpu_parked for all CPUs at boot. It causes panic.
2. Workload gets unpacked to all CPUs when changing from 40 CPUs to 80 CPUs, but
doesn't get packed when changing the from 80 to 40 CPUs.
===================================debug patch ======================================
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 16bacfe8c7a2..ae7571f86773 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -140,6 +140,9 @@ static inline int cpu_to_coregroup_id(int cpu)
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
+#define arch_cpu_parked cpu_parked
+int cpu_parked(int cpu);
+
#endif
#endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5ac7084eebc0..6715ea78388c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -64,6 +64,7 @@
#include <asm/systemcfg.h>
#include <trace/events/ipi.h>
+#include <linux/debugfs.h>
#ifdef DEBUG
#include <asm/udbg.h>
@@ -77,6 +78,8 @@
static DEFINE_PER_CPU(int, cpu_state) = { 0 };
#endif
+static int vp_manual_hint = NR_CPUS;
+
struct task_struct *secondary_current;
bool has_big_cores __ro_after_init;
bool coregroup_enabled __ro_after_init;
@@ -1727,6 +1730,7 @@ static void __init build_sched_topology(void)
BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
set_sched_topology(powerpc_topology);
+ vp_manual_hint = num_present_cpus();
}
void __init smp_cpus_done(unsigned int max_cpus)
@@ -1807,4 +1811,43 @@ void __noreturn arch_cpu_idle_dead(void)
start_secondary_resume();
}
+int cpu_parked(int cpu) {
+ if (cpu >= vp_manual_hint)
+ return true;
+
+ return false;
+}
+
+static int pv_vp_manual_hint_set(void *data, u64 val)
+{
+ if (val == 0 || vp_manual_hint > num_present_cpus())
+ vp_manual_hint = num_present_cpus();
+
+ if (val != vp_manual_hint) {
+ vp_manual_hint = val;
+ }
+ return 0;
+}
+
+static int pv_vp_manual_hint_get(void *data, u64 *val)
+{
+ *val = vp_manual_hint;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
+
+
+static __init int paravirt_debugfs_init(void)
+{
+ if (is_shared_processor()) {
+ debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
+ }
+
+ return 0;
+}
+
+device_initcall(paravirt_debugfs_init)
========================================= test logs 80 CPUs system ================================================
set the hint as 40 and run 80 stress-ng.
Average: 37 82.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.11
Average: 38 82.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.19
Average: 39 82.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 17.02
Average: 40 0.00 0.00 0.00 0.00 0.00 2.42 0.08 0.00 0.00 97.50
Average: 41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 42 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Set the hint as 20 and run 80 stress-ng
Average: 18 93.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.46
Average: 19 93.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.46
Average: 20 0.00 0.00 0.00 0.00 0.00 1.14 0.00 0.00 0.00 98.86
Average: 21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Set the hint as 40 initially and set to 80 midway.
Average: 38 94.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.48
Average: 39 94.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.47
Average: 40 42.03 0.00 0.00 0.00 0.00 1.31 0.08 0.00 0.00 56.59
Average: 41 43.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 57.00
Set the hint as 80 initially and set to 40 midway -- *not working*
Average: 38 95.27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.73
Average: 39 95.27 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.73
Average: 40 95.24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.76
Average: 41 95.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.75
Powered by blists - more mailing lists