linux-kernel - Re: [RFC PATCH] sched/eevdf: Use tunable knob sysctl_sched_base

Message-ID: <CAD8CoPAJh9ggK8ODYFiUaF2WXPG4d5ERDUdpL532N5kc=-xuSw@mail.gmail.com>
Date: Wed, 24 Jan 2024 10:32:08 +0800
From: Ze Gao <zegao2021@...il.com>
To: Vishal Chourasia <vishalc@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ben Segall <bsegall@...gle.com>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>, 
	Steven Rostedt <rostedt@...dmis.org>, Valentin Schneider <vschneid@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, linux-kernel@...r.kernel.org, 
	Ze Gao <zegao@...cent.com>
Subject: Re: [RFC PATCH] sched/eevdf: Use tunable knob sysctl_sched_base_slice
 as explicit time quanta

On Tue, Jan 23, 2024 at 8:42 PM Vishal Chourasia <vishalc@...ux.ibm.com> wrote:
>
> On Thu, Jan 11, 2024 at 06:57:46AM -0500, Ze Gao wrote:
> > AFAIS, we've overlooked the role that the concept of a time quantum
> > plays in EEVDF. According to Theorem 1 in [1], we have
> >
> >       -r_max < lag_k(t) < max(r_max, q)
> >
> > Clearly we don't want either r_max (the maximum user request) or q (the
> > time quantum) to be too big.
> >
> > To trade for throughput, [2] chooses to do tick preemption only at
> > request boundaries (i.e., once a given request is fulfilled), which
> > means we literally have no concept of a time quantum defined anymore.
> > Obviously this is no problem if we make
> >
> >       q = r_i = sysctl_sched_base_slice
> >
> > which is exactly what we have for now; it effectively creates an
> > implicit quantum for us and works well.
> >
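To make the bound quoted above concrete, here is a minimal sketch that simply evaluates the inequality for a given r_max and q. The struct and function names are invented for illustration only; nothing like this exists in the patch or in the kernel, and the units are whatever the caller passes in.

```c
/* Open interval (lo, hi) that the lag is bounded by. Hypothetical type. */
struct lag_bound {
        double lo;
        double hi;
};

/*
 * Evaluate  -r_max < lag < max(r_max, q)  for the given maximum
 * request length and time quantum. Hypothetical helper.
 */
struct lag_bound eevdf_lag_bound(double r_max, double q)
{
        struct lag_bound b;

        b.lo = -r_max;
        b.hi = r_max > q ? r_max : q;
        return b;
}
```

With r_max = q = 3 (ms), as in the default setup, the interval is (-3, 3). With a 100ms custom request and q = 3 it widens to (-100, 100), which is why neither value should grow too large.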
> > However, with custom slices being possible, the lag bound depends
> > only on the distribution of user-requested slices, given that no
> > time quantum is available now, and due to [2] we would pay the cost
> > of losing many scheduling opportunities to maintain fairness and
> > responsiveness. What's worse, we may suffer unexpected unfairness and
> > latency.
> >
> > For example, take two CPU-bound processes with the same weight, bind
> > them to the same CPU, and let process A request 100ms while B
> > requests 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
> > nr_cpu=42).  We can clearly see that playing with custom slices can
> > actually cause unfair CPU bandwidth allocation: 10706, whose request
> > length is 0.1ms, gets more CPU time as well as better latency than
> > 10705.  (Note you might see it the other way around on different
> > machines, but the allocation inaccuracy remains, and even top shows a
> > noticeable difference in CPU utilization with per-second reporting.)
> > This is obviously not what we want, because it would mess up the nice
> > system and fairness would not hold.
>
> Hi, how are you setting custom request values for processes A and B?

I cherry-picked Peter's commit [1] and added a SCHED_QUANTA feature control
so I can test with and without my patch.  You can check out [2] to see how
it works.

The userspace part that sets/gets the slice per process looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>            /* Definition of SCHED_* constants */
#include <sys/syscall.h>      /* Definition of SYS_* constants */
#include <unistd.h>
#include <linux/sched/types.h>
/*
int syscall(SYS_sched_setattr, pid_t pid, struct sched_attr *attr,
                unsigned int flags);
int syscall(SYS_sched_getattr, pid_t pid, struct sched_attr *attr,
                unsigned int size, unsigned int flags);
*/

int main(int argc, char *argv[])
{
        int pid, slice = 0;
        int ecode = 0;
        struct sched_attr attr = {0};
        if (argc < 2) {
                printf("please specify pid [slice]\n");
                ecode = -1;
                goto out;
        }
        pid = atoi(argv[1]);
        if (!pid || pid == 1) {
                printf("pid %d is not valid\n", pid);
                ecode = -1;
                goto out;
        }

        if (argc >= 3)
                slice = atoi(argv[2]);

        if (slice) {
                if (slice < 100 || slice > 100000) {
                        printf("slice %d[us] is not valid\n", slice);
                        ecode = -1;
                        goto out;
                }
                attr.sched_runtime = slice * 1000;
                ecode = syscall(SYS_sched_setattr, pid, &attr, 0);
                if (ecode) {
                        printf("change pid %d failed\n", pid);
                } else {
                        printf("change pid %d succeed\n", pid);
                }
        }

        ecode = syscall(SYS_sched_getattr, pid, &attr,
                        sizeof(struct sched_attr), 0);
        if (!ecode) {
                /* sched_runtime is a 64-bit nanosecond count; print it in us */
                printf("pid: %d slice: %llu\n", pid,
                       (unsigned long long)attr.sched_runtime / 1000);
        } else {
                printf("pid: %d getattr failed\n", pid);
        }
out:
        return ecode;
}

Note: I use microseconds as the time unit here for convenience.
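For reference, the unit handling in the program above could be factored into a small helper. This is only a sketch: the name slice_us_to_ns and the SLICE_MIN_US/SLICE_MAX_US macros are made up here, while the 100us..100000us range and the factor of 1000 are taken from the program itself.

```c
#include <stdint.h>

/* Hypothetical names; the limits mirror the range check in set_slice. */
#define SLICE_MIN_US 100L       /* 0.1 ms */
#define SLICE_MAX_US 100000L    /* 100 ms */

/*
 * Convert a slice given in microseconds into the nanosecond value
 * that goes into attr.sched_runtime. Returns 0 if the input is
 * outside the accepted range.
 */
uint64_t slice_us_to_ns(long slice_us)
{
        if (slice_us < SLICE_MIN_US || slice_us > SLICE_MAX_US)
                return 0;
        return (uint64_t)slice_us * 1000u;
}
```

Doing the multiplication in a 64-bit type keeps the result in the same width as sched_runtime, so the helper stays correct even if the upper limit is raised later.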

And the tests run like this:


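#!/bin/bash

# Run two CPU-bound stress-ng workers pinned to CPU 1, one with a 100us
# slice and one with a 100000us (100ms) slice, then report latencies.
# Named run_test so it does not shadow the shell builtin "test".
run_test() {

        echo -e "-----------------------------------------\n"
        pkill stress-ng

        sleep 1

        taskset -c 1 stress-ng -c 1  &
        ./set_slice $! 100
        taskset -c 1 stress-ng -c 1  &
        ./set_slice $! 100000

        perf sched record -- sleep 10
        perf sched latency -p -C 1
        echo -e "-----------------------------------------\n"

}

# First without the explicit quantum, then with it.
echo NO_SCHED_QUANTA > /sys/kernel/debug/sched/features
run_test
sleep 2
echo SCHED_QUANTA > /sys/kernel/debug/sched/features
run_test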


[1]: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf&id=98866150f92f268a2f08eb1d884de9677eb4ec8f
[2]: https://github.com/zegao96/linux/tree/sched-eevdf


Regards,
        -- Ze

> >
> >                       stress-ng-cpu:10705     stress-ng-cpu:10706
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4934.206                5025.048
> > Switches              58                      67
> > Average delay(ms)     87.074                  73.863
> > Maximum delay(ms)     101.998                 101.010
> >
> > In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> > in this patch gives us better control over the allocation accuracy and
> > the average latency:
> >
> >                       stress-ng-cpu:10584     stress-ng-cpu:10583
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4980.309                4981.356
> > Switches              1253                    1254
> > Average delay(ms)     3.990                   3.990
> > Maximum delay(ms)     5.001                   4.014
> >
> > Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> > fewer switches at the cost of worse delay:
> >
> >                       stress-ng-cpu:11208     stress-ng-cpu:11207
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4983.722                4977.035
> > Switches              456                     456
> > Average delay(ms)     10.963                  10.939
> > Maximum delay(ms)     19.002                  21.001
> >
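As a rough cross-check of the three tables above, the sketch below computes two derived figures from the reported columns: the fraction of combined runtime that went to the 100ms task, and the mean on-CPU stretch (combined runtime divided by combined switches). The function names are illustrative and are not part of perf or of the patch; treating runtime/switches as the mean run length is an assumption about how to read the perf sched latency output.

```c
#include <stdio.h>

/* Fraction of the combined runtime that went to the first task. */
double cpu_share(double runtime_a_ms, double runtime_b_ms)
{
        double total = runtime_a_ms + runtime_b_ms;

        return total > 0.0 ? runtime_a_ms / total : 0.0;
}

/* Mean on-CPU stretch: combined runtime over combined switch count. */
double mean_run_ms(double runtime_a_ms, double runtime_b_ms,
                   unsigned int switches_a, unsigned int switches_b)
{
        unsigned int switches = switches_a + switches_b;

        return switches ? (runtime_a_ms + runtime_b_ms) / switches : 0.0;
}

/* Print both figures for one table; the label is free-form. */
void report(const char *label, double ra, double rb,
            unsigned int sa, unsigned int sb)
{
        printf("%-16s share(100ms task)=%.4f  mean run=%.3f ms\n",
               label, cpu_share(ra, rb), mean_run_ms(ra, rb, sa, sb));
}
```

For instance, report("no quantum", 4934.206, 5025.048, 58, 67) covers the first table and report("3ms quantum", 4980.309, 4981.356, 1253, 1254) the second. In the first case the share falls visibly short of 0.5, while in the second it is almost exactly 0.5 and the mean run comes out close to the reported average delay.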
> > By being able to tune the sysctl_sched_base_slice knob, we can strike
> > a good balance between throughput and latency by adjusting the
> > frequency of context switches, and the conclusions are very close to
> > what's covered in [1] with the explicit definition of a time quantum.
> > It also gives more freedom to choose the eligible request length
> > range (either through nice value or raw value) without worrying about
> > overscheduling or underscheduling too much.
> >
> > Note this change should introduce no obvious regression, because all
> > processes have the same request length, sysctl_sched_base_slice, as
> > in the status quo. The benchmark results confirm this as well.
> >
> > schbench -m2 -F128 -n10 -r90          w/patch   tip/6.7-rc7
> > Wakeup  (usec): 99.0th:               3028      95
> > Request (usec): 99.0th:               14992     21984
> > RPS    (count): 50.0th:               5864      5848
> >
> > hackbench -s 512 -l 200 -f 25 -P      w/patch   tip/6.7-rc7
> > -g 10                                 0.212     0.223
> > -g 20                                 0.415     0.432
> > -g 30                                 0.625     0.639
> > -g 40                                 0.852     0.858
> >
> > [1]: https://dl.acm.org/doi/10.5555/890606
> > [2]: https://lore.kernel.org/all/20230420150537.GC4253@hirez.programming.kicks-ass.net/T/#u
> >
> > Signed-off-by: Ze Gao <zegao@...cent.com>
> > ---
>