Message-ID: <CAKfTPtBjc+x0_8uGEp=f8nJxP0t2Z86kpjSEBY8HdDfa44ZF3g@mail.gmail.com>
Date: Fri, 8 Nov 2024 10:27:00 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Pierre Gondois <pierre.gondois@....com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, lukasz.luba@....com,
rafael.j.wysocki@...el.com, linux-kernel@...r.kernel.org, qyousef@...alina.io,
hongyan.xia2@....com
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases
Hi Pierre,
On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@....com> wrote:
>
> Hello Vincent,
> Related to feec(), but not to this patchset, I think there might be a
> concurrency issue while running feec().
yes, this is a known limitation
>
> Feec() doesn't have any locking mechanism. This means that multiple CPUs
> might run the function at the same time.
this is done on purpose as we don't want to lock and slow down the wakeup path
> If:
> - 2 tasks with approximately the same utilization wake up at the same time
> - some space on an energy efficient CPU is available
> feec() will likely select the same target for the 2 tasks.
yes
>
> Once feec() has determined a target for a task, the util signals are
> updated in enqueue_task_fair(). The delta between feec() and
> enqueue_task_fair() is ~20us (on a Pixel6). This is not much, but it
> still allows some other
20us is quite long. Is this the worst case, on a little core at the lowest freq?
> CPUs to run feec() with util signals that will be wrong in no time.
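> 
> As a rough illustration of the window (using the ~20us figure above,
> with arbitrary CPU numbers):
> 
>   CPU0: feec(taskA)  -> selects CPU4
>   CPU1: feec(taskB)  -> also selects CPU4, based on stale util
>   CPU0: enqueue taskA on CPU4  <- CPU4's util updated ~20us after feec()
>   CPU1: enqueue taskB on CPU4  <- CPU4 now holds both tasks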
>
> Note that it is also possible for one CPU to run feec() for 2 different
> tasks, decide to migrate both tasks to another target CPU, and only then
> start enqueueing them, meaning a single CPU will run feec() using util
> signals it knows are wrong.
isn't this case serialized, because CPU selection for the next task will
happen after enqueuing the 1st one?
>
> The issue is problematic as it creates some instability. Once a
> 'parallel selection' is done, the following scenarios can happen:
> - the system goes overutilized, and EAS is disabled
> - a frequency spike happens to handle the unexpected load.
> Then the perf. domain becomes less energy efficient compared to other
> perf. domains, and tasks are migrated out of this perf. domain.
>
> I made the following prototype to avoid 'parallel selections'. The goal
> is to tag CPUs that have a pending migration.
> A target CPU is tagged as 'eas_pending_enqueue' at the end of feec();
> other CPUs should therefore not consider this CPU as a valid candidate.
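> 
> As a distilled, self-contained sketch of this claim/release idea (plain
> C11 userspace code with illustrative names, not the kernel code; the
> actual prototype is in the diff at the end of this mail):
> 
>   #include <stdatomic.h>
>   #include <stdbool.h>
> 
>   #define NR_CPUS 8
> 
>   /* One flag per CPU: set while a migration decided by feec() has not
>    * yet been reflected in that CPU's util signals. */
>   static atomic_int pending_enqueue[NR_CPUS];
> 
>   /* End of feec(), when target != prev_cpu: try to claim the target.
>    * Returns true if we won the race; a loser falls back to prev_cpu. */
>   bool claim_cpu(int cpu)
>   {
>           int expected = 0;
> 
>           return atomic_compare_exchange_strong(&pending_enqueue[cpu],
>                                                 &expected, 1);
>   }
> 
>   /* enqueue_task_fair(): the task is enqueued and the util signals are
>    * up to date again, so other feec() calls may consider this CPU. */
>   void release_cpu(int cpu)
>   {
>           atomic_store(&pending_enqueue[cpu], 0);
>   }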
>
> The prototype is a bit raw, but it gives some good results. Using rt-app
> workloads, and trying not to have tasks waking up at the same time
> during the whole test:
> Workload1:
> N tasks with a period of 16ms and a util of 4 or 8. Each task starts with a
> 4ms delay. Each workload lasts 20s and is run over 5 iterations.
>
> Workload2:
> N tasks with a period of (8+n)ms and a util of 4 or 8, i.e. the first task
> has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
> 20s and is run over 5 iterations.
>
> The following are presented:
> - the measured energy consumed, according to the Pixel6 energy meters
> - the estimated energy consumed: lisa combines the util signals with the
> CPU frequencies and the Energy Model to produce an estimate (see the
> note after this list)
> - the amount of time spent in the overutilized state, in percentage.
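> 
> (For reference, the kernel's own Energy Model estimate, em_cpu_energy(),
> boils down per performance domain to roughly:
> 
>   energy ~= ps->cost * sum(cpu_util) / scale_cpu
> 
> where ps is the performance state needed to serve the domain's max util.
> I assume lisa's estimation follows the same logic, though its exact
> computation may differ.)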
>
> ------
>
> Workload1:
>
> Measured energy:
> +------+-------+--------------+--------------+------------+
> | util | count | without | with | ratio |
> +------+-------+--------------+--------------+------------+
> | 4 | 8 | 3220.970324 | 3312.097508 | 2.829184 |
> | 4 | 12 | 5942.486726 | 5016.106047 | -15.589108 |
> | 4 | 16 | 10412.26692 | 10017.633658 | -3.79008 |
> | 8 | 8 | 7524.271751 | 7479.451427 | -0.595677 |
> | 8 | 12 | 14782.214144 | 14567.282266 | -1.45399 |
> | 8 | 16 | 21452.863497 | 19561.143385 | -8.818031 |
> +------+-------+--------------+--------------+------------+
> Std:
> +------+-------+-------------+-------------+
> | util | count | without | with |
> +------+-------+-------------+-------------+
> | 4 | 8 | 165.563394 | 48.903514 |
> | 4 | 12 | 518.609612 | 81.170952 |
> | 4 | 16 | 329.729882 | 192.739245 |
> | 8 | 8 | 105.144497 | 336.796522 |
> | 8 | 12 | 384.615323 | 339.86986 |
> | 8 | 16 | 1252.735561 | 2563.268952 |
> +------+-------+-------------+-------------+
>
> Estimated energy:
> +------+-------+-----------+-----------+------------+
> | util | count | without | with | ratio |
> +------+-------+-----------+-----------+------------+
> | 4 | 8 | 1.4372e10 | 1.2791e10 | -11.000273 |
> | 4 | 12 | 3.1881e10 | 2.3743e10 | -25.526193 |
> | 4 | 16 | 5.7663e10 | 5.4079e10 | -6.215679 |
> | 8 | 8 | 2.5622e10 | 2.5337e10 | -1.109823 |
> | 8 | 12 | 6.4332e10 | 6.9335e10 | 7.776814 | [1]
> | 8 | 16 | 9.5285e10 | 8.2331e10 | -13.594508 |
> +------+-------+-----------+-----------+------------+
> Std:
> +------+-------+----------+-----------+
> | util | count | without | with |
> +------+-------+----------+-----------+
> | 4 | 8 | 1.3896e9 | 5.4265e8 |
> | 4 | 12 | 4.7511e9 | 5.1521e8 |
> | 4 | 16 | 3.5486e9 | 1.2625e9 |
> | 8 | 8 | 3.0033e8 | 2.3168e9 |
> | 8 | 12 | 8.7739e9 | 3.0743e9 |
> | 8 | 16 | 6.7982e9 | 2.2393e10 |
> +------+-------+----------+-----------+
>
> Overutilized ratio (in % of the 20s test):
> +------+-------+-----------+-----------+------------+
> | util | count | without | with | ratio |
> +------+-------+-----------+-----------+------------+
> | 4 | 8 | 0.187941 | 0.015834 | -91.575158 |
> | 4 | 12 | 0.543073 | 0.045483 | -91.624815 |
> | 4 | 16 | 8.510734 | 8.389077 | -1.429448 |
> | 8 | 8 | 1.056678 | 0.876095 | -17.089643 |
> | 8 | 12 | 36.457757 | 9.260862 | -74.598378 | [1]
> | 8 | 16 | 72.327933 | 78.693558 | 8.801061 |
> +------+-------+-----------+-----------+------------+
> Std:
> +------+-------+-----------+-----------+
> | util | count | without | with |
> +------+-------+-----------+-----------+
> | 4 | 8 | 0.232077 | 0.016531 |
> | 4 | 12 | 0.338637 | 0.040252 |
> | 4 | 16 | 0.729743 | 6.368214 |
> | 8 | 8 | 1.702964 | 1.722589 |
> | 8 | 12 | 34.436278 | 17.314564 |
> | 8 | 16 | 14.540217 | 33.77831 |
> +------+-------+-----------+-----------+
>
> ------
>
> Workload2:
>
> Measured energy:
> +------+-------+--------------+--------------+-----------+
> | util | count | without | with | ratio |
> +------+-------+--------------+--------------+-----------+
> | 4 | 8 | 3357.578785 | 3324.890715 | -0.973561 |
> | 4 | 12 | 5024.573746 | 4903.394533 | -2.411731 |
> | 4 | 16 | 10114.715431 | 9762.803821 | -3.479204 |
> | 8 | 8 | 7485.230678 | 6961.782086 | -6.993086 |
> | 8 | 12 | 13720.482516 | 13374.765825 | -2.519712 |
> | 8 | 16 | 24846.806317 | 24444.012805 | -1.621108 |
> +------+-------+--------------+--------------+-----------+
> Std:
> +------+-------+------------+------------+
> | util | count | without | with |
> +------+-------+------------+------------+
> | 4 | 8 | 87.450628 | 76.955783 |
> | 4 | 12 | 106.062839 | 116.882891 |
> | 4 | 16 | 182.525881 | 172.819307 |
> | 8 | 8 | 874.292359 | 162.790237 |
> | 8 | 12 | 151.830636 | 339.286741 |
> | 8 | 16 | 904.751446 | 154.419644 |
> +------+-------+------------+------------+
>
> Estimated energy:
> +------+-------+-----------+-----------+------------+
> | util | count | without | with | ratio |
> +------+-------+-----------+-----------+------------+
> | 4 | 8 | 1.4778e10 | 1.4805e10 | 0.184658 |
> | 4 | 12 | 2.6105e10 | 2.5485e10 | -2.374486 |
> | 4 | 16 | 5.8394e10 | 5.7177e10 | -2.083208 |
> | 8 | 8 | 3.0275e10 | 2.5973e10 | -14.211178 |
> | 8 | 12 | 7.0616e10 | 6.9085e10 | -2.168347 |
> | 8 | 16 | 1.3133e11 | 1.2891e11 | -1.839725 |
> +------+-------+-----------+-----------+------------+
> Std:
> +------+-------+----------+----------+
> | util | count | without | with |
> +------+-------+----------+----------+
> | 4 | 8 | 3.5449e8 | 8.2454e8 |
> | 4 | 12 | 9.4248e8 | 1.1364e9 |
> | 4 | 16 | 8.3240e8 | 1.2084e9 |
> | 8 | 8 | 9.0364e9 | 5.0381e8 |
> | 8 | 12 | 9.9112e8 | 3.0836e9 |
> | 8 | 16 | 4.9429e8 | 1.9533e9 |
> +------+-------+----------+----------+
>
> Overutilized ratio (in % of the 20s test):
> +------+-------+-----------+----------+------------+
> | util | count | without | with | ratio |
> +------+-------+-----------+----------+------------+
> | 4 | 8 | 0.154992 | 0.049429 | -68.108419 |
> | 4 | 12 | 0.132593 | 0.061762 | -53.420202 |
> | 4 | 16 | 6.798091 | 4.606102 | -32.244179 |
> | 8 | 8 | 1.360703 | 0.174626 | -87.166465 |
> | 8 | 12 | 0.519704 | 0.250469 | -51.805502 |
> | 8 | 16 | 12.114269 | 8.969281 | -25.961019 |
> +------+-------+-----------+----------+------------+
> Std:
> +------+-------+----------+----------+
> | util | count | without | with |
> +------+-------+----------+----------+
> | 4 | 8 | 0.212919 | 0.036856 |
> | 4 | 12 | 0.069696 | 0.060257 |
> | 4 | 16 | 0.63995 | 0.542028 |
> | 8 | 8 | 2.158079 | 0.211775 |
> | 8 | 12 | 0.089159 | 0.187436 |
> | 8 | 16 | 0.798565 | 1.669003 |
> +------+-------+----------+----------+
>
> ------
>
> Analysis:
>
> - [1]
> Without the patch, 2 tasks end up on one little CPU. This consumes
> less energy than using the medium/big CPU according to the energy model,
> but EAS should not be able to make such a placement, as the little
> CPU becomes overutilized.
> Without the patch, the system is overutilized a lot more than with the patch.
>
> -
> Looking at the overutilized ratio, being overutilized 0.5% of the time
> rather than 0.05% of the time might seem close, but it means that EAS
> ended up making a bad task placement multiple, independent times.
>
> -
> The overutilized ratio should be checked alongside the energy results, as
> it shows how much EAS was actually involved in the task placement.
>
> -
> Overall, less energy is consumed. The amount of energy saved varies
> with the workload.
>
> ------
>
> On another note, I wanted to ask whether there will be a v2 of the present
> patchset (sched/fair: Rework EAS to handle more cases).
yes, I have been sidetracked by other work since LPC and haven't
been able to finalize testing of v2, but it's ongoing
>
> Regards,
> Pierre
>
> ------
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bb343136ddd0..812d5bf88875 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1592,6 +1592,12 @@ struct task_struct {
> struct user_event_mm *user_event_mm;
> #endif
>
> + /*
> + * Keep track of the CPU feec() decided to migrate this task to, so
> + * the per-CPU 'eas_pending_enqueue' flag can be reset at enqueue.
> + */
> + int eas_target_cpu;
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c157d4860a3b..34911eb059cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
> se->sched_delayed = 0;
> }
>
> +DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
> +
> /*
> * The enqueue_task method is called before nr_running is
> * increased. Here we update the fair scheduling stats and
> @@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> check_update_overutilized_status(rq);
>
> enqueue_throttle:
> + if (p->eas_target_cpu != -1) {
> + atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
> + p->eas_target_cpu = -1;
> + }
> +
> assert_list_leaf_cfs_rq(rq);
>
> hrtick_update(rq);
> @@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
>
> + /* Skip this CPU as its util signal will be invalid soon. */
> + if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
> + cpu != prev_cpu)
> + continue;
> +
> util = cpu_util(cpu, p, cpu, 0);
> cpu_cap = capacity_of(cpu);
>
> @@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
> target = best_energy_cpu;
>
> + /*
> + * 'Lock' the target CPU if there is a migration: prevent other feec()
> + * calls from using the same target until the util signals are updated.
> + */
> + if (prev_cpu != target) {
> + if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
> + p->eas_target_cpu = target;
> + else
> + target = prev_cpu;
> + }
> +
> return target;
>
> unlock: