Message-ID: <27c4288d-5617-4195-8424-e6e346acefd0@linux.ibm.com>
Date: Mon, 9 Dec 2024 09:05:44 +0100
From: Tobias Huschle <huschle@...ux.ibm.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>, linux-kernel@...r.kernel.org
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        vschneid@...hat.com, linux-s390@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org
Subject: Re: [RFC PATCH 0/2] sched/fair: introduce new scheduler group type
 group_parked



On 05/12/2024 15:48, Shrikanth Hegde wrote:
> 
> 
> On 12/4/24 16:51, Tobias Huschle wrote:
>> Adding a new scheduler group type which allows removing all tasks
>> from certain CPUs through load balancing can help in scenarios where
>> such CPUs are currently unfavorable to use, for example in a
>> virtualized environment.
>>
>> Functionally, this works as intended. The question is whether this
>> could be considered for inclusion and would be worth going forward
>> with. If so, which areas would need additional attention?
>> Some cases are referenced below.
>>
>> The underlying concept and the approach of adding a new scheduler
>> group type were presented at the Sched MC of the 2024 LPC.
>> A short summary:
> 
> Thanks for working on this. Yes, we had two possible implementable versions:
> 1. Using a new group type (this RFC).
> 2. Using group_misfit with a very low CPU capacity set via the thermal
> framework.
> The tricky issues with these were discussed at Plumbers.
> 
> I agree that using a new group type is simpler from an implementation
> perspective. So, for the idea of using this approach:
> Acked-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
> 
>>
>> Some architectures (e.g. s390) provide virtualization on a firmware
>> level. This implies that Linux kernels running on such architectures
>> run on virtualized CPUs.
>>
>> Like in other virtualized environments, the CPUs are most likely shared
>> with other guests on the hardware level. This implies that Linux
>> kernels running in such an environment may encounter 'steal time'. In
>> other words, instead of being able to use all available time on a
>> physical CPU, some of said available time is 'stolen' by other guests.
>>
>> This can cause side effects if a guest is interrupted at an unfavorable
>> point in time or if the guest is waiting for one of its other virtual
>> CPUs to perform certain actions while those are suspended in favour of
>> another guest.
>>
>> Architectures like arch/s390 address this issue by providing an
>> alternative classification for the CPUs seen by the Linux kernel.
>>
>> The following example is arch/s390 specific:
>> In the default mode (horizontal CPU polarization), all CPUs are treated
>> equally and can be subject to steal time equally.
>> In the alternate mode (vertical CPU polarization), the underlying
>> firmware hypervisor assigns the CPUs visible to the guest different
>> types, depending on how many CPUs the guest is entitled to use. Said
>> entitlement is configured by assigning weights to all active guests.
>> The three CPU types are:
>>      - vertical high   : On these CPUs, the guest always has the highest
>>                          priority over other guests. In particular, if
>>                          the guest executes tasks on these CPUs, it will
>>                          encounter no steal time.
>>      - vertical medium : These CPUs are meant to cover fractions of
>>                          entitlement.
>>      - vertical low    : These CPUs have no priority when being
>>                          scheduled. In particular, while all other
>>                          guests are using their full entitlement, these
>>                          CPUs might not run for a significant amount of
>>                          time.
>>
>> As a consequence, using vertical lows while the underlying hypervisor
>> experiences a high load, driven by all defined guests, is to be avoided.
>>
>> In order to consistently move tasks off of vertical lows, introduce a
>> new scheduler group type: group_parked.
>> Parked implies that processes should be evacuated as fast as possible
>> from these CPUs. This means that other CPUs should start pulling tasks
>> immediately, while the parked CPUs should refuse to pull any tasks
>> themselves.
>> Adding a group type beyond group_overloaded achieves the expected
>> behavior. By making its selection architecture dependent, it has
>> no effect on architectures which do not make use of that group type.
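
(To make the ordering concrete: the series extends the scheduler's
group_type enum so that group_parked ranks above group_overloaded, and a
parked group is therefore the preferred source when balancing. A rough
sketch, with the existing values abbreviated from kernel/sched/fair.c:

	enum group_type {
		group_has_spare = 0,
		group_fully_busy,
		group_misfit_task,
		/* ... further existing group types ... */
		group_imbalanced,
		group_overloaded,
		group_parked,	/* proposed: evacuate these CPUs first */
	};

Load balancing prefers pulling from groups with a higher group_type, so
placing group_parked highest makes parked CPUs the most urgent source
without touching the other classifications.)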
>>
>> This approach works very well for many kinds of workloads. Tasks are
>> migrated back and forth in line with changes to the parked state of
>> the involved CPUs.
> 
> Likely there could be more use cases. It is basically supposed to be a
> lightweight mechanism to move tasks off CPUs instead of offlining them.
> Right?

Correct!

> 
>>
>> There are a couple of issues and corner cases which need further
>> considerations:
>> - no_hz:        While the scheduler tick can and should still be disabled
>>                  on idle CPUs, it should not be disabled on parked CPUs
>>                  which run only one task, as that task will not be
> A task running on a parked CPU is itself the concern, right? Unless it is pinned.

Exactly. If you run stress-ng with -l 100 for each CPU that you have, it
can happen that a single 100% task runs alone on a parked CPU. It will
then never leave that CPU because it never sleeps and never receives a
tick. Therefore, this needs to be addressed.
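
One conceivable way to address it (sketch only, assuming the
arch_cpu_parked() predicate from this series) is to simply refuse to
stop the tick on parked CPUs, e.g. in sched_can_stop_tick():

	/*
	 * Keep the tick alive on parked CPUs so that a lone 100% task
	 * still gets interrupted and can be balanced away.
	 */
	if (arch_cpu_parked(cpu_of(rq)))
		return false;

With the tick still firing, periodic load balancing gets a chance to
push the task off the parked CPU even though it never sleeps.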

> 
>>                  scheduled away in time. Side effects and completeness
>>                  need to be further investigated. One option might be to
>>                  allow dynamic changes to tick_nohz_full_mask. It is also
>>                  possible to handle this exclusively in fair.c, but that
>>                  does not seem to be the best place to do so.
>> - pinned tasks: If a task is pinned to CPUs which are all parked, it will
>>                  get moved to other CPUs. Like during CPU hotplug, the
>>                  information about the task's initial CPU mask gets lost.
> 
> Could this instead be a warning to the user, or a failure?

A warning plus just moving the task would be my intuitive approach. If
someone really relies on pinning, it might be necessary to overrule the
decisions on which CPUs are parked. This could be a scheduler feature or
some other vehicle; I am not sure about the use cases that would need to
be covered here.
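
To sketch the warn-and-move idea (illustrative only; cpu_parked_mask is
a hypothetical helper mask, not part of this series):

	/* Pick a CPU for a task whose affinity mask may be fully parked. */
	static int select_unparked_cpu(struct task_struct *p)
	{
		struct cpumask allowed;

		/* Prefer any allowed CPU that is not parked. */
		if (cpumask_andnot(&allowed, p->cpus_ptr, cpu_parked_mask))
			return cpumask_any(&allowed);

		/* All allowed CPUs are parked: warn once, overrule parking. */
		pr_warn_once("task %s/%d is pinned to parked CPUs only\n",
			     p->comm, p->pid);
		return cpumask_any(p->cpus_ptr);
	}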

> 
>> - rt & dl:      Realtime and deadline scheduling require some additional
>>                  attention.
> 
> Ideally, RT and DL tasks would not run there at all. But in these
> virtualized environments there is likely a number of CPUs, such as the
> vertical highs, which are always available (on PowerPC we call these
> entitled CPUs). Could those be used for the RT/DL admission-control
> calculations?
> 

That would be the ideal way, but I was struggling to get those two 
working correctly and wanted to propose the overall idea first before 
going too deep into the rabbit hole.

>> - ext:          Probably affected as well. Needs some conceptual
>>                  thought first.
>> - idle vs parked: It could be considered whether an idle parked CPU
>>                  should contribute to the count of idle CPUs. It is
>>                  usually preferable to utilize idle CPUs, but parked CPUs
>>                  should not be used. So a scheduler group with many idle,
>>                  but parked, CPUs should not be the target for additional
>>                  workload. At this point, some more thought needs to be
>>                  spent to evaluate whether it would be ok to not set the
>>                  idle flag on parked CPUs.
> 
> I think idle_cpus shouldn't include parked CPUs.
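
Conceptually, excluding parked CPUs from that count might be as simple
as the following in update_sg_lb_stats() (sketch only, using the
arch_cpu_parked() hook from this series):

	if (idle_cpu(i) && !arch_cpu_parked(i))
		sgs->idle_cpus++;

The open question from the cover letter remains whether skipping the
idle accounting there has side effects elsewhere.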
> 
>> - optimization: It is probably possible to cut some corners. In order to
>>                  avoid tampering with scheduler statistics too much, the
>>                  actions based on the parkedness of a CPU are not always
>>                  taken at the earliest possible occasion yet.
>> - raciness:     Right now, there are no synchronization efforts. It needs
>>                  to be considered whether those might be necessary or if
>>                  it is alright that the parked state of a CPU might change
>>                  during load balancing.
> 
> The next load balance will take care of this instead, right? Similar to
> how CPU capacity can change on its own even during load balancing, and
> the next load balance takes care of it.
> 

That would be my intuition too. It could happen that a task gets 
misplaced in one iteration, but that should be fixed in the next one. 
The only concern would be that it messes with some scheduler statistics.

>>
>> Patches apply to tip:sched/core
>>
>> The s390 patch serves as a simplified implementation example.
>>
>> Tobias Huschle (2):
>>    sched/fair: introduce new scheduler group type group_parked
>>    s390/topology: Add initial implementation for selection of parked CPUs
>>
>>   arch/s390/include/asm/topology.h |   3 +
>>   arch/s390/kernel/topology.c      |   5 ++
>>   include/linux/sched/topology.h   |  20 +++++
>>   kernel/sched/core.c              |  10 ++-
>>   kernel/sched/fair.c              | 122 +++++++++++++++++++++++++------
>>   5 files changed, 135 insertions(+), 25 deletions(-)
>>
> 
> tl;dr: debug patch and some testing results with mpstat logs.
> 
> 
> So I gave it a try using a debugfs-based hint to say which CPUs are
> parked. It is a hack to try it out. The patch is below, so one could try
> something similar on their arch and see if it helps if they have a use
> case.
> 
> Notes:
> 1. An arch shouldn't set cpu_parked for all CPUs at boot; that causes a panic.
> 2. The workload gets unpacked to all CPUs when changing from 40 CPUs to 80 CPUs,
>     but doesn't get packed when changing from 80 back to 40 CPUs.

With stress-ng -l 100 this can happen. I tested with stress-ng -l 50 and 
that worked well in all cases. As mentioned above, the -l 100 case would 
need changes to handle the no-hz scenario. I have a patch for that which 
works, but it is a bit hacky.
If this also happens with non-100% stressors on your end, something 
needs to be fixed code-wise.

> 
> =================================== debug patch ======================================
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 16bacfe8c7a2..ae7571f86773 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -140,6 +140,9 @@ static inline int cpu_to_coregroup_id(int cpu)
>   #define topology_core_cpumask(cpu)     (per_cpu(cpu_core_map, cpu))
>   #define topology_core_id(cpu)          (cpu_to_core_id(cpu))
> 
> +#define arch_cpu_parked cpu_parked
> +int cpu_parked(int cpu);
> +
>   #endif
>   #endif
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5ac7084eebc0..6715ea78388c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -64,6 +64,7 @@
>   #include <asm/systemcfg.h>
> 
>   #include <trace/events/ipi.h>
> +#include <linux/debugfs.h>
> 
>   #ifdef DEBUG
>   #include <asm/udbg.h>
> @@ -77,6 +78,8 @@
>   static DEFINE_PER_CPU(int, cpu_state) = { 0 };
>   #endif
> 
> +static int vp_manual_hint = NR_CPUS;
> +
>   struct task_struct *secondary_current;
>   bool has_big_cores __ro_after_init;
>   bool coregroup_enabled __ro_after_init;
> @@ -1727,6 +1730,7 @@ static void __init build_sched_topology(void)
>          BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
> 
>          set_sched_topology(powerpc_topology);
> +       vp_manual_hint = num_present_cpus();
>   }
> 
>   void __init smp_cpus_done(unsigned int max_cpus)
> @@ -1807,4 +1811,43 @@ void __noreturn arch_cpu_idle_dead(void)
>          start_secondary_resume();
>   }
> 
> +int cpu_parked(int cpu) {
> +       if (cpu >= vp_manual_hint)
> +               return true;
> +
> +       return false;
> +}
> +
> +static int pv_vp_manual_hint_set(void *data, u64 val)
> +{
> +       if (val == 0 || val > num_present_cpus())
> +               val = num_present_cpus();
> +
> +       if (val != vp_manual_hint) {
> +               vp_manual_hint = val;
> +       }
> +       return 0;
> +}
> +
> +static int pv_vp_manual_hint_get(void *data, u64 *val)
> +{
> +       *val = vp_manual_hint;
> +       return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
> +
> +
> +static __init int paravirt_debugfs_init(void)
> +{
> +       if (is_shared_processor()) {
> +               debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
> +       }
> +
> +       return 0;
> +}
> +
> +device_initcall(paravirt_debugfs_init);
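
(With a hack like the above, the parked range can then be adjusted at
runtime through debugfs, presumably via something like
/sys/kernel/debug/powerpc/vp_manual_hint since the file is created under
arch_debugfs_dir; all CPUs numbered at or above the hint are treated as
parked.)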
> 
> ========================================= test logs, 80 CPU system ================================================
> 
> Set the hint to 40 and run 80 stress-ng workers.
> (mpstat columns: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle)
> Average:      37   82.89    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   17.11
> Average:      38   82.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   17.19
> Average:      39   82.98    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   17.02
> Average:      40    0.00    0.00    0.00    0.00    0.00    2.42    0.08    0.00    0.00   97.50
> Average:      41    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> Average:      42    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> 
> Set the hint to 20 and run 80 stress-ng workers.
> Average:      18   93.54    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.46
> Average:      19   93.54    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.46
> Average:      20    0.00    0.00    0.00    0.00    0.00    1.14    0.00    0.00    0.00   98.86
> Average:      21    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
> 
> Set the hint to 40 initially, then to 80 midway.
> Average:      38   94.52    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.48
> Average:      39   94.53    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.47
> Average:      40   42.03    0.00    0.00    0.00    0.00    1.31    0.08    0.00    0.00   56.59
> Average:      41   43.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   57.00
> 
> Set the hint to 80 initially, then to 40 midway -- *not working*
> Average:      38   95.27    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.73
> Average:      39   95.27    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.73
> Average:      40   95.24    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.76
> Average:      41   95.25    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.75

