Message-Id: <20250221084437.31284-1-15645113830zzh@gmail.com>
Date: Fri, 21 Feb 2025 16:44:38 +0800
From: zihan zhou <15645113830zzh@...il.com>
To: mingo@...hat.com,
	peterz@...radead.org,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	dietmar.eggemann@....com,
	rostedt@...dmis.org,
	bsegall@...gle.com,
	mgorman@...e.de,
	vschneid@...hat.com
Cc: linux-kernel@...r.kernel.org,
	zihan zhou <15645113830zzh@...il.com>
Subject: [PATCH V1 0/4] sched: Predict load based on conditional probability

This patch series predicts a task's load after a period of time in the
future based on conditional probability, in order to optimize load
balancing, wake-up preemption, cpu frequency adjustment and other
scheduler decisions. PATCH 4/4 is a simple example that optimizes
wake-up preemption and improves hackbench performance a little.

I am optimistic about the potential of this approach, but my ability and
my time are limited. I sincerely hope that more people can join in and
help.

In the kernel, PELT works very well, but I think it is not perfect. For
example, consider a task 1 that sleeps for 10s and then runs for 10s. In
this scenario, load balancing is likely to make the wrong choice.

Suppose we currently have two cpus (that is, two rqs). The load of cpu 0
is 0, and the load of cpu 1 is 512, contributed by task 2:

cpu 0 (0)       cpu 1 (512)
                task 2 (512)

After task 1 wakes up, it selects cpu 0 because of the lower load. Since
task 1's load has decayed to 0 during its long sleep, the per-cpu loads
remain:

cpu 0 (0)       cpu 1 (512)
task 1 (0)      task 2 (512)

At this time, a task 3 wants to select a cpu. Since the load of cpu 0 is
lower than that of cpu 1, it chooses cpu 0, ending up in competition
with task 1, whose load is about to grow large:

cpu 0 (128)       cpu 1 (512)
task 1 (0)      task 2 (512)
task 3 (128)

After a while, once task 1's load has built up, the load of the two cpus
is unbalanced, and a task has to be migrated again:

cpu 0 (1152)       cpu 1 (512)
task 1 (1024)      task 2 (512)
task 3 (128)

This scenario is a thought experiment of mine, and some of the guesses
may be wrong. I just want to illustrate my point through this example:
if we can accurately predict the future load of a task, then we can
optimize many things in the scheduler.

My idea is to predict the load at dequeue time based on the load at
enqueue time. The algorithm is based on conditional probability:
p(A|B) = p(AB) / p(B)

To do this, we count each task's load when it enters or leaves the rq.
For detailed statistics, we would need a two-dimensional array to
record:

  record_load_array[load_when_enqueue][load_when_dequeue] = count;

At enqueue time, the most likely dequeue load load2 can be predicted
from the enqueue load load1, where

  record_load_array[load1][load2] == max(record_load_array[load1])
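The naive scheme above can be sketched as follows. This is an
illustrative userspace sketch, not code from the patch; the function
names (record_load(), predict_load()) are made up for the example.

```c
#include <assert.h>

/* Full 2D histogram indexed by [enqueue load][dequeue load].
 * One byte per counter: 1024 * 1024 * 1B = 1MB, as noted below. */
#define MAX_LOAD 1024

static unsigned char record_load_array[MAX_LOAD][MAX_LOAD];

static void record_load(int load_enq, int load_deq)
{
	record_load_array[load_enq][load_deq]++;
}

/* Predict the dequeue load most often observed after enqueueing
 * at load1: the column with the maximum count in row load1. */
static int predict_load(int load1)
{
	int best = 0, load2;

	for (load2 = 1; load2 < MAX_LOAD; load2++)
		if (record_load_array[load1][load2] >
		    record_load_array[load1][best])
			best = load2;
	return best;
}
```

The linear scan over 1024 columns per prediction is what the two
optimizations below avoid.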

Ignoring special cases, the maximum load is 1024, so the array needs
1024*1024 one-byte (8-bit) counters = 1MB.

Since every se would maintain such a two-dimensional array, this cost is
unacceptable; and given the complexity of real workloads, the
probability of accurately predicting the load with this array is
probably small anyway.

So I made two optimizations.

The first optimization is to normalize the load to 0~1024 (the maximum
value of se->avg.load_avg depends on the weight), and then shift it
right by LOAD_GRAN_SHIFT (4), so that each bucket represents a range of
16.

For example, if the load is 893, then 893 >> 4 = 55 and 55 << 4 = 880,
so the bucket covers loads from 880 to 896 (880 + (1 << 4)).
For a normal task with se_weight = 1024, the load change per 1ms is
about 1024*1000 / 47742 = 21 > 16, so one bucket spans less than 1ms of
load change; theoretically, the statistical error will not exceed 1ms.
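The bucketing can be sketched as below; a minimal illustration, with
helper names of my own choosing rather than the patch's:

```c
#include <assert.h>

/* Each bucket covers 1 << LOAD_GRAN_SHIFT = 16 load values,
 * so a normalized load of 0~1024 maps to 64 buckets. */
#define LOAD_GRAN_SHIFT 4

static int load_to_bucket(int load)
{
	return load >> LOAD_GRAN_SHIFT;
}

/* Lower bound of the load range a bucket represents. */
static int bucket_to_load(int bucket)
{
	return bucket << LOAD_GRAN_SHIFT;
}
```

With 64 buckets instead of 1024 distinct load values, the per-row state
shrinks by a factor of 16 in each dimension.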

The second optimization is to use the Boyer-Moore majority vote
algorithm to track the most likely dequeue load. For details, refer to
the implementation of record_predict_load_data().
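The majority-vote idea is that each enqueue bucket keeps only one
candidate dequeue value and a counter, instead of a full histogram. A
minimal sketch, assuming per-bucket state of my own naming (not the
patch's record_predict_load_data()):

```c
#include <assert.h>

/* Boyer-Moore majority vote: one candidate plus a counter.
 * Matching values increment the counter, mismatches decrement it,
 * and a new candidate is adopted when the counter hits zero. */
struct majority_vote {
	int candidate;
	int count;
};

static void vote(struct majority_vote *v, int value)
{
	if (v->count == 0) {
		v->candidate = value;
		v->count = 1;
	} else if (v->candidate == value) {
		v->count++;
	} else {
		v->count--;
	}
}
```

Two small integers per bucket is what brings each record_load_array
down to 128B; the trade-off is that the surviving candidate is only
guaranteed to be the majority element if one exists.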

With these two optimizations, each record_load_array needs only 128B of
memory; including some other data structures, 176B in total. On my
machine this uses about 3MB of memory:

[root@...t tip]# cat /proc/slabinfo | grep predict_load_data
predict_load_data  14652  18584    176   ...

In my experiments, the accuracy of the load prediction is about 50%~70%,
which is not great, but it is already enough to try doing something
with it!

zihan zhou (4):
  kconfig: Add kconfig of predict load.
  sched: do predict load.
  sched/debug: add debug for predict load.
  sched/fair: add sched feature PREDICT_NO_PREEMPT.

---
 fs/proc/base.c              |  39 ++++++
 include/linux/sched.h       |  46 +++++++
 include/linux/sched/debug.h |   5 +
 include/linux/sched/task.h  |   2 +-
 init/Kconfig                |  10 ++
 init/init_task.c            |   3 +
 kernel/fork.c               |   6 +-
 kernel/sched/core.c         |  15 ++-
 kernel/sched/debug.c        |  39 ++++++
 kernel/sched/fair.c         | 240 +++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h     |   4 +
 kernel/sched/sched.h        |   2 +
 lib/Kconfig.debug           |  12 ++
 13 files changed, 414 insertions(+), 9 deletions(-)

-- 
2.33.0

