linux-kernel - Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b7b6e534-9402-458f-af1f-2a8af46ce190@arm.com>
Date: Fri, 8 Nov 2024 14:10:50 +0100
From: Pierre Gondois <pierre.gondois@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com,
 rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org,
 qyousef@...alina.io, hongyan.xia2@....com,
 Christian Loehle <Christian.Loehle@....com>
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases



On 11/8/24 10:27, Vincent Guittot wrote:
> Hi Pierre,
> 
> On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@....com> wrote:
>>
>> Hello Vincent,
>> Related to feec(), but not to this patchset, I think there might be a
>> concurrency issue while running feec().
> 
> yes, this is a know limitation
> 
>>
>> Feec() doesn't have any locking mechanism. This means that multiple CPUs
>> might run the function at the same time.
> 
> this is done on purpose as we don't want to lock and slow down the wakeup path

Yes right, this is understandable. However there could be a way to bail out of
feec() when such case is detected without actually waiting for a lock (cf. the
prototype).
We already bail out of feec() when the utilization of a CPU without a task is
higher than with the task in the energy computation.

> 
>> If:
>> - 2 tasks with approximately the same utilization wake up at the same time
>> - some space on an energy efficient CPU is available
>> feec() will likely select the same target for the 2 tasks.
> 
> yes
> 
>>
>> Once feec() determined a target for a task, util signals are updated in
>> enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
>> is ~20us (on a Pixel6). This is not much, but this still allows some other
> 
> 20us is quite long. this is the worst case on little core lowest freq ?

I only kept the occurrences where feec() ends up with a target != prev_cpu.
In these case enqueuing is done on the target CPU (cf. __ttwu_queue_wakelist),
which might take more time.

In the other case, the delta is effectively lower (~10us).

> 
>> CPUs to run feec() util signals that will be wrong in no time.
>>
>> Note that it is also possible for one CPU to run feec() for 2 different tasks,
>> decide to migrate the 2 tasks to another target CPU, and then start enqueueing
>> the tasks. Meaning one single CPU will run feec() using util signals it knows
>> are wrong.
> 
> isn't this case serialized because cpu selection for next task will
> happen after enqueuing the 1st one

I'm not sure I understand the question, but if the enqueue is done on the
target CPU, the running CPU might call feec() in the meantime.

> 
>>
>> The issue is problematic as it creates some instability. Once a
>> 'parallel selection' is done, the following scenarios can happen:
>> - the system goes overutilized, and EAS is disabled
>> - a frequency spike happen to handle the unexpected load.
>>     Then the perf. domain becomes less energy efficient compared to other
>>     perf. domains, and tasks are migrated out of this perf. domain
>>
>> I made the following prototype to avoid 'parallel selections'. The goal here
>> is to tag CPUs that are under pending migration.
>> A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
>> CPUs should therefore not consider this CPU as valid candidate.
>>
>> The implementation is a bit raw, but it gives some good results. Using rt-app
>> workloads, and trying not to have tasks waking up at the same timing during
>> the whole test:
>> Workload1:
>> N tasks with a period of 16ms and a util of 4/8. Each task starts with a
>> 4ms delay. Each workload lasts 20s and is run over 5 iterations.
>>
>> Workload2:
>> N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
>> has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
>> 20s and is run over 5 iterations.
>>
>> Are presented:
>> - the measured energy consumed, according to the Pixel6 energy meters
>> - the estimated energy consumed, lisa uses the util signals along with
>>     the CPU frequencies and the Energy Model to do an estimation.
>> - the amount of time spent in the overutilized state, in percentage.
>>
>> ------
>>
>> Workload1:
>>
>> Measured energy:
>> +------+-------+--------------+--------------+------------+
>> | util | count | without      | with         | ratio      |
>> +------+-------+--------------+--------------+------------+
>> | 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
>> | 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
>> | 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
>> | 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
>> | 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
>> | 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
>> +------+-------+--------------+--------------+------------+
>> Std:
>> +------+-------+-------------+-------------+
>> | util | count | without     | with        |
>> +------+-------+-------------+-------------+
>> | 4    | 8     | 165.563394  | 48.903514   |
>> | 4    | 12    | 518.609612  | 81.170952   |
>> | 4    | 16    | 329.729882  | 192.739245  |
>> | 8    | 8     | 105.144497  | 336.796522  |
>> | 8    | 12    | 384.615323  | 339.86986   |
>> | 8    | 16    | 1252.735561 | 2563.268952 |
>> +------+-------+-------------+-------------+
>>
>> Estimated energy:
>> +------+-------+-----------+-----------+------------+
>> | util | count | without   | with      | ratio      |
>> +------+-------+-----------+-----------+------------+
>> | 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
>> | 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
>> | 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
>> | 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
>> | 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
>> | 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
>> +------+-------+-----------+-----------+------------+
>> Std:
>> +------+-------+----------+-----------+
>> | util | count | without  | with      |
>> +------+-------+----------+-----------+
>> | 4    | 8     | 1.3896e9 | 5.4265e8  |
>> | 4    | 12    | 4.7511e9 | 5.1521e8  |
>> | 4    | 16    | 3.5486e9 | 1.2625e9  |
>> | 8    | 8     | 3.0033e8 | 2.3168e9  |
>> | 8    | 12    | 8.7739e9 | 3.0743e9  |
>> | 8    | 16    | 6.7982e9 | 2.2393e10 |
>> +------+-------+----------+-----------+
>>
>> Overutilized ratio (in % of the 20s test):
>> +------+-------+-----------+-----------+------------+
>> | util | count | without   | with      | ratio      |
>> +------+-------+-----------+-----------+------------+
>> | 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
>> | 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
>> | 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
>> | 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
>> | 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
>> | 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
>> +------+-------+-----------+-----------+------------+
>> Std:
>> +------+-------+-----------+-----------+
>> | util | count | without   | with      |
>> +------+-------+-----------+-----------+
>> | 4    | 8     | 0.232077  | 0.016531  |
>> | 4    | 12    | 0.338637  | 0.040252  |
>> | 4    | 16    | 0.729743  | 6.368214  |
>> | 8    | 8     | 1.702964  | 1.722589  |
>> | 8    | 12    | 34.436278 | 17.314564 |
>> | 8    | 16    | 14.540217 | 33.77831  |
>> +------+-------+-----------+-----------+
>>
>> ------
>>
>> Workload2:
>>
>> Measured energy:
>> +------+-------+--------------+--------------+-----------+
>> | util | count | without      | with         | ratio     |
>> +------+-------+--------------+--------------+-----------+
>> | 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
>> | 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
>> | 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
>> | 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
>> | 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
>> | 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
>> +------+-------+--------------+--------------+-----------+
>> Std:
>> +------+-------+------------+------------+
>> | util | count | without    | with       |
>> +------+-------+------------+------------+
>> | 4    | 8     | 87.450628  | 76.955783  |
>> | 4    | 12    | 106.062839 | 116.882891 |
>> | 4    | 16    | 182.525881 | 172.819307 |
>> | 8    | 8     | 874.292359 | 162.790237 |
>> | 8    | 12    | 151.830636 | 339.286741 |
>> | 8    | 16    | 904.751446 | 154.419644 |
>> +------+-------+------------+------------+
>>
>> Estimated energy:
>> +------+-------+-----------+-----------+------------+
>> | util | count | without   | with      | ratio      |
>> +------+-------+-----------+-----------+------------+
>> | 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
>> | 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
>> | 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
>> | 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
>> | 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
>> | 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
>> +------+-------+-----------+-----------+------------+
>> Std:
>> +------+-------+----------+----------+
>> | util | count | without  | with     |
>> +------+-------+----------+----------+
>> | 4    | 8     | 3.5449e8 | 8.2454e8 |
>> | 4    | 12    | 9.4248e8 | 1.1364e9 |
>> | 4    | 16    | 8.3240e8 | 1.2084e9 |
>> | 8    | 8     | 9.0364e9 | 5.0381e8 |
>> | 8    | 12    | 9.9112e8 | 3.0836e9 |
>> | 8    | 16    | 4.9429e8 | 1.9533e9 |
>> +------+-------+----------+----------+
>>
>> Overutilized ratio (in % of the 20s test):
>> +------+-------+-----------+----------+------------+
>> | util | count | without   | with     | ratio      |
>> +------+-------+-----------+----------+------------+
>> | 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
>> | 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
>> | 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
>> | 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
>> | 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
>> | 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
>> +------+-------+-----------+----------+------------+
>> Std:
>> +------+-------+----------+----------+
>> | util | count | without  | with     |
>> +------+-------+----------+----------+
>> | 4    | 8     | 0.212919 | 0.036856 |
>> | 4    | 12    | 0.069696 | 0.060257 |
>> | 4    | 16    | 0.63995  | 0.542028 |
>> | 8    | 8     | 2.158079 | 0.211775 |
>> | 8    | 12    | 0.089159 | 0.187436 |
>> | 8    | 16    | 0.798565 | 1.669003 |
>> +------+-------+----------+----------+
>>
>> ------
>>
>> Analysis:
>>
>> - [1]
>> Without the patch, 2 tasks end up on one little CPU. This consumes
>> less energy than using the medium/big CPU according to the energy model,
>> but EAS should not be capable of doing such task placement as the little
>> CPU becomes overutilized.
>> Without the patch, the system is overutilized a lot more than with the patch.
>>
>> -
>> Looking at the overutilized ratio, being overutilized 0.5% of the time or
>> 0.05% of the time might seem close, but it means that EAS ended up
>> doing a bad task placement multiple, independent times.
>>
>> -
>> The overutilized ratio should be checked along the energy results as it
>> shows how much EAS was involved in the task placement.
>>
>> -
>> Overall, the energy consumed is less. The quantity of energy saved varies
>> with the workload.
>>
>> ------
>>
>> On another note, I wanted to ask if there would be a v2 of this present
>> patchset (sched/fair: Rework EAS to handle more cases),
> 
> yes, I have been side tracked by other stuff since LPC and haven't
> been able to finalize test on v2 but it's ongoing

Ok thanks!

> 
>>
>> Regards,
>> Pierre
>>
>> ------
>>
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index bb343136ddd0..812d5bf88875 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1592,6 +1592,12 @@ struct task_struct {
>>           struct user_event_mm            *user_event_mm;
>>    #endif
>>
>> +       /*
>> +        * Keep track of the CPU feec() migrated this task to.
>> +        * There is a per-cpu 'eas_pending_enqueue' value to reset.
>> +        */
>> +       int eas_target_cpu;
>> +
>>           /*
>>            * New fields for task_struct should be added above here, so that
>>            * they are included in the randomized portion of task_struct.
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c157d4860a3b..34911eb059cf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
>>           se->sched_delayed = 0;
>>    }
>>
>> +DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
>> +
>>    /*
>>     * The enqueue_task method is called before nr_running is
>>     * increased. Here we update the fair scheduling stats and
>> @@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>                   check_update_overutilized_status(rq);
>>
>>    enqueue_throttle:
>> +       if (p->eas_target_cpu != -1) {
>> +               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
>> +               p->eas_target_cpu = -1;
>> +       }
>> +
>>           assert_list_leaf_cfs_rq(rq);
>>
>>           hrtick_update(rq);
>> @@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>                           if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>>                                   continue;
>>
>> +                       /* Skip this CPU as its util signal will be invalid soon. */
>> +                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
>> +                           cpu != prev_cpu)
>> +                               continue;
>> +
>>                           util = cpu_util(cpu, p, cpu, 0);
>>                           cpu_cap = capacity_of(cpu);
>>
>> @@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>               ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
>>                   target = best_energy_cpu;
>>
>> +       /*
>> +        *'Lock' the target CPU if there is a migration. Prevent other feec()
>> +        * calls to use the same target CPU until util signals are not updated.
>> +        */
>> +       if (prev_cpu != target) {
>> +               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
>> +                       p->eas_target_cpu = target;
>> +               else
>> +                       target = prev_cpu;
>> +       }
>> +
>>           return target;
>>
>>    unlock: