Message-ID: <27c4288d-5617-4195-8424-e6e346acefd0@linux.ibm.com>
Date: Mon, 9 Dec 2024 09:05:44 +0100
From: Tobias Huschle <huschle@...ux.ibm.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>, linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, linux-s390@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org
Subject: Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type
group_parked
On 05/12/2024 15:48, Shrikanth Hegde wrote:
>
>
> On 12/4/24 16:51, Tobias Huschle wrote:
>> Adding a new scheduler group type which allows removing all tasks
>> from certain CPUs through load balancing can help in scenarios where
>> such CPUs are currently unfavorable to use, for example in a
>> virtualized environment.
>>
>> Functionally, this works as intended. The question is whether this
>> approach could be considered for inclusion and would be worth going
>> forward with. If so, which areas would need additional attention?
>> Some cases are referenced below.
>>
>> The underlying concept and the approach of adding a new scheduler
>> group type were presented in the Sched MC of the 2024 LPC.
>> A short summary:
>
> Thanks for working on this. Yes, we had two possible implementable
> versions:
> 1. Using a new group type (this RFC).
> 2. Using group_misfit with a very low CPU capacity set via the thermal
>    framework.
> The tricky issues with both were discussed at Plumbers.
>
> I agree that using a new group type is simpler from an implementation
> perspective. So for the idea of using this,
> Acked-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
>
>>
>> Some architectures (e.g. s390) provide virtualization on a firmware
>> level. This implies that Linux kernels running on such architectures
>> run on virtualized CPUs.
>>
>> Like in other virtualized environments, the CPUs are most likely shared
>> with other guests on the hardware level. This implies that Linux
>> kernels running in such an environment may encounter 'steal time'. In
>> other words, instead of being able to use all available time on a
>> physical CPU, some of said available time is 'stolen' by other guests.
>>
>> This can cause side effects if a guest is interrupted at an unfavorable
>> point in time or if the guest is waiting for one of its other virtual
>> CPUs to perform certain actions while those are suspended in favour of
>> another guest.
>>
>> Architectures like arch/s390 address this issue by providing an
>> alternative classification for the CPUs seen by the Linux kernel.
>>
>> The following example is arch/s390 specific:
>> In the default mode (horizontal CPU polarization), all CPUs are treated
>> equally and can be subject to steal time equally.
>> In the alternate mode (vertical CPU polarization), the underlying
>> firmware hypervisor assigns the CPUs visible to the guest different
>> types, depending on how many CPUs the guest is entitled to use. Said
>> entitlement is configured by assigning weights to all active guests.
>> The three CPU types are:
>> - vertical high   : On these CPUs, the guest always has the highest
>>                     priority over other guests. In particular, if the
>>                     guest executes tasks on these CPUs, it will
>>                     encounter no steal time.
>> - vertical medium : These CPUs are meant to cover fractions of
>>                     entitlement.
>> - vertical low    : These CPUs have no priority when being scheduled.
>>                     In particular, while all other guests are using
>>                     their full entitlement, these CPUs might not be
>>                     run for a significant amount of time.
>>
>> As a consequence, vertical lows should be avoided while the underlying
>> hypervisor experiences high load driven by all defined guests.
>>
>> In order to consistently move tasks off of vertical lows, introduce a
>> new type of scheduler group: group_parked.
>> Parked implies that processes should be evacuated from these CPUs as
>> fast as possible. Other CPUs should start pulling tasks immediately,
>> while the parked CPUs themselves refuse to pull any tasks.
>> Adding a group type beyond group_overloaded achieves the expected
>> behavior. By making its selection architecture dependent, it has
>> no effect on architectures which will not make use of that group type.
>>
>> This approach works very well for many kinds of workloads. Tasks are
>> getting migrated back and forth in line with changing the parked
>> state of the involved CPUs.
>
> Likely there could be more use-cases. It is basically supposed to be a
> lightweight mechanism to move tasks off CPUs instead of offlining them.
> Right?
Correct!
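
To make the shape of the change concrete, a minimal sketch (abridged;
arch_cpu_parked() is the arch hook used by the patches and the enum
ordering is the relevant part, the actual patches differ in detail):

  /* include/linux/sched/topology.h: arch opt-in hook, default off. */
  #ifndef arch_cpu_parked
  #define arch_cpu_parked(cpu) false
  #endif

  /* kernel/sched/fair.c: group_parked orders above group_overloaded,
   * so parked groups are always considered busiest and get drained. */
  enum group_type {
          group_has_spare = 0,
          /* ... existing types up to ... */
          group_overloaded,
          group_parked,   /* evacuate: never pull tasks, only push */
  };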
>
>>
>> There are a couple of issues and corner cases which need further
>> considerations:
>> - no_hz:          While the scheduler tick can and should still be
>>                   disabled on idle CPUs, it should not be disabled on
>>                   parked CPUs which run only one task, as that task
>>                   will not be
> A task running on a parked CPU itself is the concern, right? Unless it
> is pinned.
Exactly. If you run stress-ng with -l 100 for each CPU that you have, it
can happen that a single 100% task runs alone on a parked CPU. It will
then never leave that CPU, because it never sleeps and never receives a
tick. Therefore, this needs to be addressed.
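
For illustration, the shape of that fix as a sketch only (it assumes the
arch_cpu_parked() hook from this series; the patch I actually have is
hackier than this):

  /* kernel/sched/core.c, CONFIG_NO_HZ_FULL -- sketch, not the real patch */
  bool sched_can_stop_tick(struct rq *rq)
  {
          /*
           * Keep the tick alive on parked CPUs: a lone always-running
           * task would otherwise never be interrupted again and could
           * stay on the parked CPU indefinitely.
           */
          if (arch_cpu_parked(cpu_of(rq)))
                  return false;

          /* ... existing fifo_nr_running/nr_running checks ... */
          return true;
  }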
>
>>                   scheduled away in time. Side effects and completeness
>>                   need to be further investigated. One option might be
>>                   to allow dynamic changes to tick_nohz_full_mask. It
>>                   is also possible to handle this exclusively in
>>                   fair.c, but that does not seem to be the best place
>>                   to do so.
>> - pinned tasks:   If a task is pinned to CPUs which are all parked, it
>>                   will get moved to other CPUs. Like during CPU
>>                   hotplug, the information about the task's initial
>>                   CPU mask gets lost.
>
> Could this be a warning to the user instead, or a failure?
A warning + just moving the task would be my intuitive approach. If
someone really relies on pinning, it might be necessary to overrule the
decisions on which CPUs are parked. That could be a scheduler feature or
some other vehicle; I am not sure about the use cases which would need
to be covered here.
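
To make the "warn + move" idea concrete, a hypothetical sketch (helper
name and placement are made up for illustration; a real version would
have to hook into the migration paths, similar to what
select_fallback_rq() does on hotplug):

  /* Hypothetical helper -- not part of the posted patches. */
  static int parked_fallback_cpu(struct task_struct *p)
  {
          int cpu;

          /* Prefer any allowed CPU that is not parked. */
          for_each_cpu(cpu, p->cpus_ptr)
                  if (!arch_cpu_parked(cpu))
                          return cpu;

          /* All allowed CPUs are parked: warn, then overrule affinity. */
          pr_warn_once("task %s/%d pinned to parked CPUs only\n",
                       p->comm, p->pid);
          return cpumask_any(cpu_active_mask);
  }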
>
>> - rt & dl:        Realtime and deadline scheduling require some
>>                   additional attention.
>
> Ideally, RT and DL would not run there. But in these virtualized
> environments there is likely a number of CPUs, such as the vertical
> highs, which is always available (on PowerPC we call these entitled
> CPUs). Could those be used for the RT or DL admission control
> calculations?
>
That would be the ideal way, but I was struggling to get those two
working correctly and wanted to propose the overall idea first, before
going too deep into the rabbit hole.
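
Conceptually, the direction you describe would mean counting only the
non-parked (entitled) CPUs for admission control, roughly like the
hypothetical sketch below, although the actual DL bandwidth plumbing is
considerably more involved:

  /* Hypothetical sketch -- not part of the posted patches. */
  static unsigned int entitled_cpu_count(const struct cpumask *span)
  {
          unsigned int n = 0;
          int cpu;

          for_each_cpu(cpu, span)
                  if (!arch_cpu_parked(cpu))
                          n++;

          return n;
  }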
>> - ext:            Probably affected as well. Needs some conceptual
>>                   thought first.
>> - idle vs parked: It could be considered whether an idle parked CPU
>>                   would contribute to the count of idle CPUs. It is
>>                   usually preferable to utilize idle CPUs, but parked
>>                   CPUs should not be used. So a scheduler group with
>>                   many idle, but parked, CPUs should not be the target
>>                   for additional workload. At this point, some more
>>                   thought needs to be spent to evaluate whether it
>>                   would be OK to not set the idle flag on parked CPUs.
>
> I think idle_cpus shouldn't include parked CPUs.
>
>> - optimization:   It is probably possible to cut some corners. In
>>                   order to avoid tampering with scheduler statistics
>>                   too much, the actions based on the parkedness of the
>>                   CPU are not always taken at the earliest possible
>>                   occasion yet.
>> - raciness:       Right now, there are no synchronization efforts. It
>>                   needs to be considered whether those might be
>>                   necessary or if it is alright that the parked-state
>>                   of a CPU might change during load-balancing.
>
> The next load balancing run will take care of this instead, right?
> Similar to how CPU capacity can change on its own even during load
> balancing; the next load balancer run takes care of it.
>
That would be my intuition too: it could happen that a task gets
misplaced in one iteration, but that should be fixed in the next one.
The only concern could be that it messes with some scheduler statistics.
>>
>> Patches apply to tip:sched/core
>>
>> The s390 patch serves as a simplified implementation example.
>>
>> Tobias Huschle (2):
>> sched/fair: introduce new scheduler group type group_parked
>> s390/topology: Add initial implementation for selection of parked CPUs
>>
>> arch/s390/include/asm/topology.h | 3 +
>> arch/s390/kernel/topology.c | 5 ++
>> include/linux/sched/topology.h | 20 +++++
>> kernel/sched/core.c | 10 ++-
>> kernel/sched/fair.c | 122 +++++++++++++++++++++++++------
>> 5 files changed, 135 insertions(+), 25 deletions(-)
>>
>
> tl;dr: debug patch and some testing results with mpstat logs.
>
>
> So I gave it a try using a debugfs-based hint to say which CPUs are
> parked. It is a hack to try it out. The patch is below, so one could
> try something similar on their arch and see if it helps, should they
> have a use case.
>
> Notes:
> 1. The arch shouldn't set cpu_parked for all CPUs at boot. It causes a
>    panic.
> 2. The workload gets unpacked to all CPUs when changing from 40 CPUs to
>    80 CPUs, but doesn't get packed when changing from 80 to 40 CPUs.
With stress-ng -l 100 this can happen. I tested with stress-ng -l 50 and
that worked well in all cases. As mentioned above, the -l 100 case would
need changes to handle the no-hz scenario. I have a patch for that which
works, but it is a bit hacky (see the sketch above).
If this also happens with non-100% stressors on your end, something
needs to be fixed code-wise.
>
> =================================== debug patch ======================================
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 16bacfe8c7a2..ae7571f86773 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -140,6 +140,9 @@ static inline int cpu_to_coregroup_id(int cpu)
> #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
> #define topology_core_id(cpu) (cpu_to_core_id(cpu))
>
> +#define arch_cpu_parked cpu_parked
> +int cpu_parked(int cpu);
> +
> #endif
> #endif
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5ac7084eebc0..6715ea78388c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -64,6 +64,7 @@
> #include <asm/systemcfg.h>
>
> #include <trace/events/ipi.h>
> +#include <linux/debugfs.h>
>
> #ifdef DEBUG
> #include <asm/udbg.h>
> @@ -77,6 +78,8 @@
> static DEFINE_PER_CPU(int, cpu_state) = { 0 };
> #endif
>
> +static int vp_manual_hint = NR_CPUS;
> +
> struct task_struct *secondary_current;
> bool has_big_cores __ro_after_init;
> bool coregroup_enabled __ro_after_init;
> @@ -1727,6 +1730,7 @@ static void __init build_sched_topology(void)
> BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
>
> set_sched_topology(powerpc_topology);
> + vp_manual_hint = num_present_cpus();
> }
>
> void __init smp_cpus_done(unsigned int max_cpus)
> @@ -1807,4 +1811,43 @@ void __noreturn arch_cpu_idle_dead(void)
> start_secondary_resume();
> }
>
> +int cpu_parked(int cpu)
> +{
> +	return cpu >= vp_manual_hint;
> +}
> +
> +static int pv_vp_manual_hint_set(void *data, u64 val)
> +{
> +	/* Clamp invalid hints to "all present CPUs are usable". */
> +	if (val == 0 || val > num_present_cpus())
> +		val = num_present_cpus();
> +
> +	vp_manual_hint = val;
> +	return 0;
> +}
> +
> +static int pv_vp_manual_hint_get(void *data, u64 *val)
> +{
> + *val = vp_manual_hint;
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get,
> +			pv_vp_manual_hint_set, "%llu\n");
> +
> +
> +static __init int paravirt_debugfs_init(void)
> +{
> +	if (is_shared_processor()) {
> +		debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir,
> +				    NULL, &fops_pv_vp_manual_hint);
> +	}
> +
> + return 0;
> +}
> +
> +device_initcall(paravirt_debugfs_init);
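
(If I read the patch right, the hint is then set at runtime via debugfs,
e.g. 'echo 40 > /sys/kernel/debug/powerpc/vp_manual_hint', which matches
the "set the hint as 40" steps in the logs below.)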
>
> ========================================= test logs 80 CPUs system ================================================
>
> set the hint as 40 and run 80 stress-ng.
> (mpstat columns: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle)
> Average:    37   82.89  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   17.11
> Average:    38   82.81  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   17.19
> Average:    39   82.98  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   17.02
> Average:    40    0.00  0.00  0.00  0.00  0.00  2.42  0.08  0.00  0.00   97.50
> Average:    41    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  100.00
> Average:    42    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  100.00
>
> Set the hint as 20 and run 80 stress-ng
> Average:    18   93.54  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    6.46
> Average:    19   93.54  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    6.46
> Average:    20    0.00  0.00  0.00  0.00  0.00  1.14  0.00  0.00  0.00   98.86
> Average:    21    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  100.00
>
>
> Set the hint as 40 initially and set to 80 midway.
> Average:    38   94.52  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    5.48
> Average:    39   94.53  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    5.47
> Average:    40   42.03  0.00  0.00  0.00  0.00  1.31  0.08  0.00  0.00   56.59
> Average:    41   43.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00   57.00
>
> Set the hint as 80 initially and set to 40 midway -- *not working*
> Average:    38   95.27  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    4.73
> Average:    39   95.27  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    4.73
> Average:    40   95.24  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    4.76
> Average:    41   95.25  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00    4.75