lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 27 Sep 2023 13:44:22 +0530
From:   Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>
To:     Dietmar Eggemann <dietmar.eggemann@....com>, mingo@...hat.com,
        peterz@...radead.org, vincent.guittot@...aro.org,
        vschneid@...hat.com
Cc:     linux-kernel@...r.kernel.org, ionela.voinescu@....com,
        quentin.perret@....com, srikar@...ux.vnet.ibm.com,
        mgorman@...hsingularity.net, mingo@...nel.org,
        pierre.gondois@....com, yu.c.chen@...el.com,
        tim.c.chen@...ux.intel.com, pauld@...hat.com
Subject: Re: [PATCH v4] sched/topology: change behaviour of sysctl
 sched_energy_aware based on the platform



On 9/27/23 2:59 AM, Dietmar Eggemann wrote:
> On 26/09/2023 12:00, Shrikanth Hegde wrote:
Thank you very much Dietmar, for taking a look at this and trying this patch.

>> sysctl sched_energy_aware is available for the admin to disable/enable
>> energy aware scheduling(EAS). EAS is enabled only if few conditions are
>> met by the platform. They are, asymmetric CPU capacity, no SMT,
>> valid cpufreq policy, frequency invariant load tracking etc. A platform may
> 
> s/valid cpufreq policy/Schedutil CPUfreq governor
> 
> + EM complexity < EM_MAX_COMPLEXITY

ok. will add. 
> 
>> boot without EAS capability, but could gain such capability at runtime
>> For example, changing/registering the cpufreq policy.
> 
> s/cpufreq policy/CPUfreq governor to Schedutil

ok will change.

> 
>> At present, though platform doesn't support EAS, this sysctl returns 1
>> and it ends up calling rebuild of sched domain on write to 1 and
> 
> sched domains are not rebuild in this case, i.e.
> partition_sched_domains_locked() does not call detach_destroy_domains()
> or build_sched_domains(). Only build_perf_domains() is called which
> bails out if !sysctl_sched_energy_aware or one of the EAS conditions is
> not true.
> 

ok. that's because it goes to match1 and match2 right?

>> NOP when writing to 0. That is confusing and un-necessary.
>>
>> Desired behavior would be to, have this sysctl to enable/disable the EAS
>> on supported platform. On Non supported platform write to the sysctl
>> would return not supported error and read of the sysctl would return
>> empty. So
>> sched_energy_aware returns empty - EAS is not possible at this moment
>> sched_energy_aware returns 0 - EAS is supported but disabled by admin.
>> sched_energy_aware returns 1 - EAS is supported and enabled.
>> User can find out the reason why EAS is not possible by checking
>> info messages.
> 
> This will include EAS capable platforms which have at least one EAS
> condition false during startup, e.g. using a Performance CPUfreq governor:
> 

ok. will include that in the changelog.

> root@...o:~# cat /proc/sys/kernel/sched_energy_aware
> 
> root@...o:~# echo 1 > /proc/sys/kernel/sched_energy_aware
> echo: write error: Operation not supported
> 
> log messages:
> ...
> [  160.902138] rd 0-5: Checking EAS, schedutil is mandatory
> ...
> [  324.467341] rd 0-5: Checking EAS, schedutil is mandatory
> ...
> 
> root@...o:~# echo schedutil >
> /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
> root@...o:~# echo schedutil >
> /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
> 
> log messages:
> ...
> [  414.195073] root_domain 0-5: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> [  414.210877] sched_energy_set: starting EAS
> ...
> 
> root@...o:~# cat /proc/sys/kernel/sched_energy_aware
> 1
> 
> root@...o:~# echo 0 > /proc/sys/kernel/sched_energy_aware
> 
> log messages:
> ...
> [  414.195073] root_domain 0-5: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> [  414.210877] sched_energy_set: starting EAS
> ...
> [  544.482690] rd 0-5: Disabling EAS
> [  544.493729] sched_energy_set: stopping EAS
> 

Thanks for testing.  Glad to see it works.

>> sched_is_eas_possible return if the platform can do EAS at this moment.
> 
> sched_is_eas_possible() returns
> 
>> It takes most of the cases into account except one where EM complexity is
>> too high as the code was bit tricky to separate that.
> 
> I remember that there was a patch from Pierre to get rid of the EM
> complexity check:
> 
> https://lkml.kernel.org/r/20221121094336.3250917-1-pierre.gondois@arm.com
> 
> If this makes this patch easier, maybe both patches can go in as a
> patch-set together?
> 
> Although I don't see that not checking EM complexity in
> sched_energy_aware_handler() -> sched_is_eas_possible() is an issue.
> 

ok. I can combine it with this patch and send it as patchset. 

>>
>> v3->v4:
>> valentin suggested it would be better to consider simpler approach that
>> was mentioned in v2. It is a standard approach to keep the knob visible
>> but change how read and write are handled. Did that and Refactored the
>> code to use a common function in build_perf_domains and in sysctl handler.
>> v2->v3:
>> Chen Yu and Pierre Gondois both pointed out that if platform becomes
>> capable of EAS later, this patch was not allowing that to happen.
>> Addressed that by using a variable to indicate the sysctl change
>> and re-worded the commit message with desired behaviour,
>> v1->v2:
>> Chen Yu had pointed out that this will not destroy the perf domains on
>> architectures where EAS is supported by changing the sysctl.
>> [v1] Link: https://lore.kernel.org/lkml/20230829065040.920629-1-sshegde@linux.vnet.ibm.com/
>> [v2] Link: https://lore.kernel.org/lkml/20230901065249.137242-1-sshegde@linux.vnet.ibm.com/
>> [v3] Link: https://lore.kernel.org/lkml/20230913114807.665094-1-sshegde@linux.vnet.ibm.com/
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>
> 
> [...]
> 
>> @@ -231,6 +289,14 @@ static int sched_energy_aware_handler(struct ctl_table *table, int write,
>>  		return -EPERM;
>>
>>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> +	if (!sched_is_eas_possible(cpu_active_mask)) {
> 
> This is using `cpu_active_mask` so EAS can only be disabled system-wide.
> 
> So I experimented with an exclusive cpuset setup creating a symmetric
> (cs1) and an asymmetric (cs2) island on my Juno with its cpumask = [l B
> B l l l] (l - little CPU, B - Big CPU).
> 
> root@...o:~# cd /sys/fs/cgroup/cpuset
> root@...o:/sys/fs/cgroup/cpuset# mkdir cs1
> root@...o:/sys/fs/cgroup/cpuset# echo 1 > cs1/cpuset.cpu_exclusive
> root@...o:/sys/fs/cgroup/cpuset# echo 0 > cs1/cpuset.mems
> root@...o:/sys/fs/cgroup/cpuset# echo 4,5 > cs1/cpuset.cpus
> root@...o:/sys/fs/cgroup/cpuset# mkdir cs2
> root@...o:/sys/fs/cgroup/cpuset# echo 1 > cs2/cpuset.cpu_exclusive
> root@...o:/sys/fs/cgroup/cpuset# echo 0 > cs2/cpuset.mems
> root@...o:/sys/fs/cgroup/cpuset# echo 0-3 > cs2/cpuset.cpus
> root@...o:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> 
> [ 3021.761278] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> 
> root@...o:~# echo 0 > /proc/sys/kernel/sched_energy_aware
> 
> log messages:
> ...
> [ 3143.538583] rd 4-5: Disabling EAS
> [ 3143.549569] rd 0-3: Disabling EAS
> [ 3143.560681] sched_energy_set: stopping EAS
> ...
> 
> root@...o:~# echo 1 > /proc/sys/kernel/sched_energy_aware
> 
> log messages:
> ...
> [ 3223.277521] root_domain 0-3: pd1:{ cpus=1-2 nr_pstate=5 } pd0:{
> cpus=0,3-5 nr_pstate=5 }
> [ 3223.293409] sched_energy_set: starting EAS
> 
> Seems still to work correctly.

I see that can be a issue. using first cpu here check to asymmetric cpu capacity. 
It would have worked here, since the first group had asymmetry. ( l B B l ). 
It wouldn't have worked if the first group had like ( l l ) and second group has ( l B B l )


Instead of cpu_active_mask, I can make use of ndoms_cur and doms_cur[i] logic to 
traverse through possible doms, and if none of the domains have asymmetric cpu capacity
return false.  Does that look right?

> 
> [...]
> 
>> @@ -458,6 +487,8 @@ static bool build_perf_domains(const struct cpumask *cpu_map)
>>  	return !!pd;
>>
>>  free:
>> +	if (sched_debug())
>> +		pr_warn("rd %*pbl: Disabling EAS", cpumask_pr_args(cpu_map));
> 
> Shouldn't this be used in condition `if (!sysctl_sched_energy_aware)`?
> Otherwise we have 2 warnings when the other conditions which leads to
> `goto free` aren't met.
Since sched_energy_set has the info messages about start and stop of EAS, maybe 
this debug is not needed. Will remove it. Doing it only `if (!sysctl_sched_energy_aware)`
also doesn't seem correct, as calling free in this function would free the perf_domains.  

> 
> [...]

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ