linux-kernel - Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <22c7f8cb-41a1-4004-9017-fb9a313f80ae@arm.com>
Date: Thu, 7 Nov 2024 11:14:40 +0100
From: Pierre Gondois <pierre.gondois@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, lukasz.luba@....com, rafael.j.wysocki@...el.com,
 linux-kernel@...r.kernel.org
Cc: qyousef@...alina.io, hongyan.xia2@....com
Subject: Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

Hello Vincent,
Related to feec(), but not to this patchset, I think there might be a
concurrency issue while running feec().

Feec() doesn't have any locking mechanism. This means that multiple CPUs
might run the function at the same time.
If:
- 2 tasks with approximately the same utilization wake up at the same time
- some space on an energy efficient CPU is available
feec() will likely select the same target for the 2 tasks.

Once feec() determined a target for a task, util signals are updated in
enqueue_task_fair(). The delta between running feec() <-> enqueue_task_fair()
is ~20us (on a Pixel6). This is not much, but this still allows some other
CPUs to run feec() util signals that will be wrong in no time.

Note that it is also possible for one CPU to run feec() for 2 different tasks,
decide to migrate the 2 tasks to another target CPU, and then start enqueueing
the tasks. Meaning one single CPU will run feec() using util signals it knows
are wrong.

The issue is problematic as it creates some instability. Once a
'parallel selection' is done, the following scenarios can happen:
- the system goes overutilized, and EAS is disabled
- a frequency spike happen to handle the unexpected load.
   Then the perf. domain becomes less energy efficient compared to other
   perf. domains, and tasks are migrated out of this perf. domain

I made the following prototype to avoid 'parallel selections'. The goal here
is to tag CPUs that are under pending migration.
A target CPU is tagged as 'eas_pending_enqueue' at the end of feec(). Other
CPUs should therefore not consider this CPU as valid candidate.

The implementation is a bit raw, but it gives some good results. Using rt-app
workloads, and trying not to have tasks waking up at the same timing during
the whole test:
Workload1:
N tasks with a period of 16ms and a util of 4/8. Each task starts with a
4ms delay. Each workload lasts 20s and is run over 5 iterations.

Workload2:
N tasks with a period of (8 +n)ms and a util of 4/8. I.e. the first task
has a period of 8ms, the second task a period of 9ms, etc. Each workload lasts
20s and is run over 5 iterations.

Are presented:
- the measured energy consumed, according to the Pixel6 energy meters
- the estimated energy consumed, lisa uses the util signals along with
   the CPU frequencies and the Energy Model to do an estimation.
- the amount of time spent in the overutilized state, in percentage.

------

Workload1:

Measured energy:
+------+-------+--------------+--------------+------------+
| util | count | without      | with         | ratio      |
+------+-------+--------------+--------------+------------+
| 4    | 8     | 3220.970324  | 3312.097508  | 2.829184   |
| 4    | 12    | 5942.486726  | 5016.106047  | -15.589108 |
| 4    | 16    | 10412.26692  | 10017.633658 | -3.79008   |
| 8    | 8     | 7524.271751  | 7479.451427  | -0.595677  |
| 8    | 12    | 14782.214144 | 14567.282266 | -1.45399   |
| 8    | 16    | 21452.863497 | 19561.143385 | -8.818031  |
+------+-------+--------------+--------------+------------+
Std:
+------+-------+-------------+-------------+
| util | count | without     | with        |
+------+-------+-------------+-------------+
| 4    | 8     | 165.563394  | 48.903514   |
| 4    | 12    | 518.609612  | 81.170952   |
| 4    | 16    | 329.729882  | 192.739245  |
| 8    | 8     | 105.144497  | 336.796522  |
| 8    | 12    | 384.615323  | 339.86986   |
| 8    | 16    | 1252.735561 | 2563.268952 |
+------+-------+-------------+-------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 1.4372e10 | 1.2791e10 | -11.000273 |
| 4    | 12    | 3.1881e10 | 2.3743e10 | -25.526193 |
| 4    | 16    | 5.7663e10 | 5.4079e10 | -6.215679  |
| 8    | 8     | 2.5622e10 | 2.5337e10 | -1.109823  |
| 8    | 12    | 6.4332e10 | 6.9335e10 | 7.776814   | [1]
| 8    | 16    | 9.5285e10 | 8.2331e10 | -13.594508 |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+-----------+
| util | count | without  | with      |
+------+-------+----------+-----------+
| 4    | 8     | 1.3896e9 | 5.4265e8  |
| 4    | 12    | 4.7511e9 | 5.1521e8  |
| 4    | 16    | 3.5486e9 | 1.2625e9  |
| 8    | 8     | 3.0033e8 | 2.3168e9  |
| 8    | 12    | 8.7739e9 | 3.0743e9  |
| 8    | 16    | 6.7982e9 | 2.2393e10 |
+------+-------+----------+-----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 0.187941  | 0.015834  | -91.575158 |
| 4    | 12    | 0.543073  | 0.045483  | -91.624815 |
| 4    | 16    | 8.510734  | 8.389077  | -1.429448  |
| 8    | 8     | 1.056678  | 0.876095  | -17.089643 |
| 8    | 12    | 36.457757 | 9.260862  | -74.598378 | [1]
| 8    | 16    | 72.327933 | 78.693558 | 8.801061   |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+-----------+-----------+
| util | count | without   | with      |
+------+-------+-----------+-----------+
| 4    | 8     | 0.232077  | 0.016531  |
| 4    | 12    | 0.338637  | 0.040252  |
| 4    | 16    | 0.729743  | 6.368214  |
| 8    | 8     | 1.702964  | 1.722589  |
| 8    | 12    | 34.436278 | 17.314564 |
| 8    | 16    | 14.540217 | 33.77831  |
+------+-------+-----------+-----------+

------

Workload2:

Measured energy:
+------+-------+--------------+--------------+-----------+
| util | count | without      | with         | ratio     |
+------+-------+--------------+--------------+-----------+
| 4    | 8     | 3357.578785  | 3324.890715  | -0.973561 |
| 4    | 12    | 5024.573746  | 4903.394533  | -2.411731 |
| 4    | 16    | 10114.715431 | 9762.803821  | -3.479204 |
| 8    | 8     | 7485.230678  | 6961.782086  | -6.993086 |
| 8    | 12    | 13720.482516 | 13374.765825 | -2.519712 |
| 8    | 16    | 24846.806317 | 24444.012805 | -1.621108 |
+------+-------+--------------+--------------+-----------+
Std:
+------+-------+------------+------------+
| util | count | without    | with       |
+------+-------+------------+------------+
| 4    | 8     | 87.450628  | 76.955783  |
| 4    | 12    | 106.062839 | 116.882891 |
| 4    | 16    | 182.525881 | 172.819307 |
| 8    | 8     | 874.292359 | 162.790237 |
| 8    | 12    | 151.830636 | 339.286741 |
| 8    | 16    | 904.751446 | 154.419644 |
+------+-------+------------+------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without   | with      | ratio      |
+------+-------+-----------+-----------+------------+
| 4    | 8     | 1.4778e10 | 1.4805e10 | 0.184658   |
| 4    | 12    | 2.6105e10 | 2.5485e10 | -2.374486  |
| 4    | 16    | 5.8394e10 | 5.7177e10 | -2.083208  |
| 8    | 8     | 3.0275e10 | 2.5973e10 | -14.211178 |
| 8    | 12    | 7.0616e10 | 6.9085e10 | -2.168347  |
| 8    | 16    | 1.3133e11 | 1.2891e11 | -1.839725  |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without  | with     |
+------+-------+----------+----------+
| 4    | 8     | 3.5449e8 | 8.2454e8 |
| 4    | 12    | 9.4248e8 | 1.1364e9 |
| 4    | 16    | 8.3240e8 | 1.2084e9 |
| 8    | 8     | 9.0364e9 | 5.0381e8 |
| 8    | 12    | 9.9112e8 | 3.0836e9 |
| 8    | 16    | 4.9429e8 | 1.9533e9 |
+------+-------+----------+----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+----------+------------+
| util | count | without   | with     | ratio      |
+------+-------+-----------+----------+------------+
| 4    | 8     | 0.154992  | 0.049429 | -68.108419 |
| 4    | 12    | 0.132593  | 0.061762 | -53.420202 |
| 4    | 16    | 6.798091  | 4.606102 | -32.244179 |
| 8    | 8     | 1.360703  | 0.174626 | -87.166465 |
| 8    | 12    | 0.519704  | 0.250469 | -51.805502 |
| 8    | 16    | 12.114269 | 8.969281 | -25.961019 |
+------+-------+-----------+----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without  | with     |
+------+-------+----------+----------+
| 4    | 8     | 0.212919 | 0.036856 |
| 4    | 12    | 0.069696 | 0.060257 |
| 4    | 16    | 0.63995  | 0.542028 |
| 8    | 8     | 2.158079 | 0.211775 |
| 8    | 12    | 0.089159 | 0.187436 |
| 8    | 16    | 0.798565 | 1.669003 |
+------+-------+----------+----------+

------

Analysis:

- [1]
Without the patch, 2 tasks end up on one little CPU. This consumes
less energy than using the medium/big CPU according to the energy model,
but EAS should not be capable of doing such task placement as the little
CPU becomes overutilized.
Without the patch, the system is overutilized a lot more than with the patch.

-
Looking at the overutilized ratio, being overutilized 0.5% of the time or
0.05% of the time might seem close, but it means that EAS ended up
doing a bad task placement multiple, independent times.

-
The overutilized ratio should be checked along the energy results as it
shows how much EAS was involved in the task placement.

-
Overall, the energy consumed is less. The quantity of energy saved varies
with the workload.

------

On another note, I wanted to ask if there would be a v2 of this present
patchset (sched/fair: Rework EAS to handle more cases),

Regards,
Pierre

------


diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb343136ddd0..812d5bf88875 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1592,6 +1592,12 @@ struct task_struct {
         struct user_event_mm            *user_event_mm;
  #endif
  
+       /*
+        * Keep track of the CPU feec() migrated this task to.
+        * There is a per-cpu 'eas_pending_enqueue' value to reset.
+        */
+       int eas_target_cpu;
+
         /*
          * New fields for task_struct should be added above here, so that
          * they are included in the randomized portion of task_struct.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c157d4860a3b..34911eb059cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
         se->sched_delayed = 0;
  }
  
+DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
+
  /*
   * The enqueue_task method is called before nr_running is
   * increased. Here we update the fair scheduling stats and
@@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                 check_update_overutilized_status(rq);
  
  enqueue_throttle:
+       if (p->eas_target_cpu != -1) {
+               atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
+               p->eas_target_cpu = -1;
+       }
+
         assert_list_leaf_cfs_rq(rq);
  
         hrtick_update(rq);
@@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
                         if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                                 continue;
  
+                       /* Skip this CPU as its util signal will be invalid soon. */
+                       if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
+                           cpu != prev_cpu)
+                               continue;
+
                         util = cpu_util(cpu, p, cpu, 0);
                         cpu_cap = capacity_of(cpu);
  
@@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
             ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
                 target = best_energy_cpu;
  
+       /*
+        *'Lock' the target CPU if there is a migration. Prevent other feec()
+        * calls to use the same target CPU until util signals are not updated.
+        */
+       if (prev_cpu != target) {
+               if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
+                       p->eas_target_cpu = target;
+               else
+                       target = prev_cpu;
+       }
+
         return target;
  
  unlock: