Date: Wed, 7 Feb 2024 11:05:15 +0800
From: Ze Gao <zegao2021@...il.com>
To: Luis Machado <luis.machado@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ben Segall <bsegall@...gle.com>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, Mel Gorman <mgorman@...e.de>, 
	Steven Rostedt <rostedt@...dmis.org>, Valentin Schneider <vschneid@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, linux-kernel@...r.kernel.org, 
	Ze Gao <zegao@...cent.com>
Subject: Re: [RFC PATCH] sched/eevdf: Use tunable knob sysctl_sched_base_slice
 as explicit time quanta

On Tue, Feb 6, 2024 at 9:09 PM Luis Machado <luis.machado@....com> wrote:
>
> Hi,
>
> On 1/11/24 11:57, Ze Gao wrote:
> > AFAIS, we've overlooked what role the concept of time quanta plays
> > in EEVDF. According to Theorem 1 in [1], we have
> >
> >       -r_max < lag_k(t) < max(r_max, q)
> >
> > so clearly we don't want either r_max (the maximum user request) or
> > q (the time quantum) to be too big.
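> >
> > Plugging in the numbers from the experiment below as a rough sanity
> > check (my own back-of-envelope, not from [1]; weights are equal, so
> > lag can be read directly as wall-clock time):
> >
> >       -r_max < lag_k(t) < max(r_max, q)
> >       -100ms < lag_k(t) < max(100ms, 3ms) = 100ms
> >
> > i.e., once a 100ms request is in play, a client's lag may swing by up
> > to ~100ms, which lines up with the ~101ms maximum delays measured
> > below.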
> >
> > To trade for throughput, [2] chooses to do tick preemption at the
> > per-request boundary (i.e., once a certain request is fulfilled), which
> > means we literally have no concept of a time quantum defined anymore.
> > Obviously this is no problem if we make
> >
> >       q = r_i = sysctl_sched_base_slice
> >
> > which is exactly what we have for now; that actually creates an
> > implicit quantum for us and works well.
> >
> > However, with custom slices being possible, the lag bound is subject
> > only to the distribution of user-requested slices, given that no
> > time quantum is available any more, and due to [2] we pay the cost of
> > losing many scheduling opportunities needed to maintain fairness and
> > responsiveness. What's worse, we may suffer unexpected unfairness and
> > latency.
> >
> > For example, take two cpu-bound processes with the same weight, bind
> > them to the same cpu, and let process A request 100ms whereas B
> > requests 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
> > nr_cpu=42); a sketch of how such per-task slices might be requested
> > follows the table below. We can clearly see that playing with custom
> > slices can actually incur unfair cpu bandwidth allocation (10706, whose
> > request length is 0.1ms, gets more cpu time as well as better latency
> > than 10705. Note you might see the opposite on a different machine, but
> > the allocation inaccuracy remains, and even top can show you the
> > noticeable difference in cpu utilization in its per-second reporting),
> > which is obviously not what we want because that would mess up the nice
> > system and fairness would not hold.
> >
> >                       stress-ng-cpu:10705     stress-ng-cpu:10706
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4934.206                5025.048
> > Switches              58                      67
> > Average delay(ms)     87.074                  73.863
> > Maximum delay(ms)     101.998                 101.010
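> >
> > For reference, here is a minimal userspace sketch of how such a
> > per-task slice could be requested. It assumes a kernel that honours
> > sched_attr::sched_runtime as the request/slice length for fair tasks
> > (i.e., one carrying the custom-slice patches); the helper name and
> > the locally-defined struct are mine, since glibc has no
> > sched_setattr() wrapper:
> >
> > /* set_slice.c -- hypothetical helper: pin the calling task to CPU0,
> >  * request a custom slice, then burn cpu like stress-ng --cpu would.
> >  */
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <stdint.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/syscall.h>
> > #include <unistd.h>
> >
> > struct sched_attr {                /* local copy of the uapi layout */
> >         uint32_t size;
> >         uint32_t sched_policy;
> >         uint64_t sched_flags;
> >         int32_t  sched_nice;
> >         uint32_t sched_priority;
> >         uint64_t sched_runtime;    /* here: requested slice, in ns  */
> >         uint64_t sched_deadline;
> >         uint64_t sched_period;
> > };
> >
> > int main(int argc, char **argv)
> > {
> >         double slice_ms = (argc > 1) ? atof(argv[1]) : 100.0;
> >
> >         cpu_set_t set;             /* share one runqueue: CPU0 only */
> >         CPU_ZERO(&set);
> >         CPU_SET(0, &set);
> >         if (sched_setaffinity(0, sizeof(set), &set))
> >                 perror("sched_setaffinity");
> >
> >         struct sched_attr attr;
> >         memset(&attr, 0, sizeof(attr));
> >         attr.size = sizeof(attr);
> >         attr.sched_policy = SCHED_OTHER;
> >         attr.sched_runtime = (uint64_t)(slice_ms * 1e6);
> >         if (syscall(SYS_sched_setattr, 0, &attr, 0))
> >                 perror("sched_setattr");
> >
> >         for (;;)                   /* cpu-bound loop */
> >                 ;
> > }
> >
> > Running './set_slice 100 &' and './set_slice 0.1 &' and watching
> > /proc/<pid>/schedstat (or perf sched) should then give numbers
> > comparable to the tables here.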
> >
> > In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> > in this patch gives us better control of the allocation accuracy and
> > the average latency:
> >
> >                       stress-ng-cpu:10584     stress-ng-cpu:10583
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4980.309                4981.356
> > Switches              1253                    1254
> > Average delay(ms)     3.990                   3.990
> > Maximum delay(ms)     5.001                   4.014
> >
> > Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> > fewer switches at the cost of worse delay:
> >
> >                       stress-ng-cpu:11208     stress-ng-cpu:11207
> > ---------------------------------------------------------------------
> > Slices(ms)            100                     0.1
> > Runtime(ms)           4983.722                4977.035
> > Switches              456                     456
> > Average delay(ms)     10.963                  10.939
> > Maximum delay(ms)     19.002                  21.001
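> >
> > As a rough, hedged reading of these runs (my own arithmetic, not part
> > of the measurements): with two always-runnable tasks, each waits about
> > one quantum while the other runs, plus up to one tick (1ms at HZ=1000)
> > before preemption takes effect, so the expected average delay is
> > roughly q + 1ms:
> >
> >       3ms  + 1ms ~ 4ms     (measured: 3.990ms)
> >       10ms + 1ms ~ 11ms    (measured: ~10.95ms)
> >
> > i.e., sysctl_sched_base_slice directly sets the latency/switch-count
> > trade-off.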
>
> Thanks for the write-up, those are interesting results.
>
> While the fairness is re-established (important, no doubt), I'm wondering if the much larger number of switches is of any concern.

This patch should introduce no change against the status quo (assuming
I understand and implement it correctly), since, as I said in the
changelog, custom slices are not supported right now.

If we run the same experiment without setting custom slices
(for 10 secs with HZ=1000 and sysctl_sched_base_slice=3ms),
the number of switches is likely to be almost 1253 as well, from which
we can conclude that if no regressions are spotted w/o this patch,
then there should be none w/ it, if your concern was about
the throughput it might affect.
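
To make the switch-count point concrete, below is a toy userspace model
(my own sketch, not the patch and not kernel code): two always-runnable
tasks of equal weight, requests of 100ms and 0.1ms, and a pick of the
eligible task (the one not above the average service) with the earliest
virtual deadline. The only difference between the two runs is whether a
pick happens solely at request boundaries (the status quo once custom
slices exist) or additionally every quantum (what this patch does). It
reproduces the order of magnitude of the switch counts above, though it
models neither tick granularity nor the accounting error behind the
unfairness:

/* quanta_toy.c -- toy model of EEVDF picks with/without an explicit quantum. */
#include <stdio.h>

struct task {
        const char *name;
        double slice;      /* requested slice r_i (ms)             */
        double vruntime;   /* service received so far (ms)         */
        double deadline;   /* virtual deadline of current request  */
        long switches;
};

static void simulate(double quantum, double total)
{
        struct task t[2] = {
                { "A(100ms)", 100.0, 0.0, 100.0, 0 },
                { "B(0.1ms)",   0.1, 0.0,   0.1, 0 },
        };
        struct task *curr = NULL;
        double now = 0.0;

        while (now < total) {
                /* EEVDF pick: earliest deadline among eligible tasks
                 * (equal weights, so eligible == not above the average). */
                double avg = (t[0].vruntime + t[1].vruntime) / 2.0;
                struct task *next = NULL;
                for (int i = 0; i < 2; i++)
                        if (t[i].vruntime <= avg &&
                            (!next || t[i].deadline < next->deadline))
                                next = &t[i];
                if (next != curr) {
                        next->switches++;
                        curr = next;
                }
                /* Run until the next pick point: the end of the current
                 * request, or at most one quantum if one is enforced.    */
                double left = curr->deadline - curr->vruntime;
                double run = (quantum > 0.0 && quantum < left) ? quantum : left;
                curr->vruntime += run;
                now += run;
                if (curr->vruntime >= curr->deadline)  /* request fulfilled */
                        curr->deadline += curr->slice; /* issue the next one */
        }
        printf("quantum=%4.1fms: %s ran %7.1fms with %5ld switches, "
               "%s ran %7.1fms with %5ld switches\n",
               quantum, t[0].name, t[0].vruntime, t[0].switches,
               t[1].name, t[1].vruntime, t[1].switches);
}

int main(void)
{
        simulate(0.0, 10000.0);    /* pick only at request boundaries   */
        simulate(3.0, 10000.0);    /* pick (at least) every 3ms quantum */
        return 0;
}

With a quantum the toy's switch count jumps by an order of magnitude,
which mirrors the jump from ~60 to ~1250 switches per task in the
measurements above.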

> I'm planning on giving this patch a try as well.

Cheers!
        -- Ze


