Date:   Wed, 23 Aug 2023 20:52:26 -0400
From:   Daniel Jordan <daniel.m.jordan@...cle.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...nel.org, vincent.guittot@...aro.org,
        linux-kernel@...r.kernel.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, corbet@....net,
        qyousef@...alina.io, chris.hyser@...cle.com,
        patrick.bellasi@...bug.net, pjt@...gle.com, pavel@....cz,
        qperret@...gle.com, tim.c.chen@...ux.intel.com, joshdon@...gle.com,
        timj@....org, kprateek.nayak@....com, yu.c.chen@...el.com,
        youssefesmat@...omium.org, joel@...lfernandes.org, efault@....de,
        tglx@...utronix.de
Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

Hi Peter,

On Wed, May 31, 2023 at 01:58:39PM +0200, Peter Zijlstra wrote:
> 
> Hi!
> 
> Latest version of the EEVDF [1] patches.
> 
> The only real change since last time is the fix for tick-preemption [2], and a 
> simple safe-guard for the mixed slice heuristic.

We're seeing regressions from EEVDF with SPEC CPU, a database workload,
and a Java workload.  We tried SPEC CPU on five systems; here are the
numbers from one of them (a high-core-count, two-socket x86 machine).

    SPECrate2017 oversubscribed by 2x (two copies of the test per CPU)

    Base: v6.3-based kernel
    EEVDF: Base + patches from May 31 [0]

    Performance comparison: >0 if EEVDF wins

    Integer
     
     -0.5% 500.perlbench_r
     -6.6% 502.gcc_r
     -8.7% 505.mcf_r
     -9.2% 520.omnetpp_r
     -6.6% 523.xalancbmk_r
     -0.7% 525.x264_r
     -2.1% 531.deepsjeng_r
     -0.4% 541.leela_r
     -0.3% 548.exchange2_r
     -2.6% 557.xz_r
     
     -3.8% Est(*) SPECrate2017_int_base
     
    Floating Point
     
     -0.6% 503.bwaves_r
     -1.3% 507.cactuBSSN_r
     -0.8% 508.namd_r
    -17.8% 510.parest_r
      0.3% 511.povray_r
     -1.0% 519.lbm_r
     -7.7% 521.wrf_r
     -2.4% 526.blender_r
     -6.1% 527.cam4_r
     -2.0% 538.imagick_r
      0.1% 544.nab_r
     -0.7% 549.fotonik3d_r
    -11.3% 554.roms_r
     
     -4.1% Est(*) SPECrate2017_fp_base
     
    (*) SPEC CPU Fair Use rules require that tests with non-production
        components must be marked as estimates.
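
As a sanity check on the estimates (my own arithmetic, not output from the
SPEC tools): the overall score is a geometric mean of the per-benchmark
rates, so the estimated delta should be the geometric mean of the
per-benchmark ratios.  For the integer suite above:

/*
 * Recompute the estimated integer delta as the geometric mean of the
 * per-benchmark EEVDF-vs-Base ratios.  Prints -3.8%, matching the
 * Est SPECrate2017_int_base line above.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Per-benchmark deltas from the integer table, in percent. */
	const double delta[] = { -0.5, -6.6, -8.7, -9.2, -6.6,
				 -0.7, -2.1, -0.4, -0.3, -2.6 };
	const int n = sizeof(delta) / sizeof(delta[0]);
	double log_sum = 0.0;

	for (int i = 0; i < n; i++)
		log_sum += log(1.0 + delta[i] / 100.0);

	/* Geometric mean of the ratios, expressed again as a percent delta. */
	printf("%.1f%%\n", (exp(log_sum / n) - 1.0) * 100.0);
	return 0;
}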

The other machines show similarly consistent regressions.  We've also tried a
v6.5-rc4-based kernel with the latest EEVDF patches from tip/sched/core,
including the recent fixlet "sched/eevdf: Curb wakeup-preemption".  I can post
the rest of the numbers, but I'm trying to keep this on the shorter side for
now.

Running the database workload on a two-socket x86 server, we see
regressions of up to 6% when the number of users exceeds the number of
CPUs.

With the Java workload on another two-socket x86 server, we see a 10%
regression.

We're investigating the other benchmarks, but here's what I've found so far
with SPEC CPU.  Some schedstats showed that EEVDF is tick-preemption happy
(schedstat patches below).  These stats were taken over 1 minute near the
middle of a ~26-minute benchmark (502.gcc_r).

    Base: v6.5-rc4-based kernel
    EEVDF: Base + the latest EEVDF patches from tip/sched/core

    schedstat                     Base            EEVDF

    sched                    1,243,911        3,947,251

    tick_check_preempts     12,899,049
    tick_preempts            1,022,998

    check_deadline                           15,878,463
    update_deadline                           3,895,530
    preempt_deadline                          3,751,580

In both kernels, tick preemption is primarily what drives schedule()s.
Preemptions happen over three times more often for EEVDF because, in the base,
tick preemption happens after a task has run through its ideal timeslice as a
fraction of sched_latency (so two tasks sharing a CPU each get 12ms on a server
with enough CPUs, sched_latency being 24ms), whereas with EEVDF a task's base
slice determines when it gets tick-preempted, and that's 3ms by default.  It
seems SPEC CPU doesn't like the increased scheduling of EEVDF in a cpu-bound
load like this.  When I set the base_slice_ns sysctl to 12000000 (12ms), the
regression disappears.
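
To make the arithmetic concrete, here's a rough userspace sketch of the two
slice calculations as I understand them, using the defaults mentioned above
(24ms sched_latency, 3ms base slice).  The simplified formulas and the
equal-weight assumption are mine; the real kernel code also scales by task
weight and handles min_granularity.

/*
 * Rough sketch of the two tick-preemption slice calculations described
 * above, assuming equal-weight tasks and the defaults from this thread.
 */
#include <stdio.h>

#define NSEC_PER_MSEC	1000000ULL

static const unsigned long long sched_latency = 24 * NSEC_PER_MSEC;
static const unsigned long long base_slice    =  3 * NSEC_PER_MSEC;

/* Pre-EEVDF: each task gets a share of sched_latency (cf. sched_slice()). */
static unsigned long long cfs_slice(unsigned int nr_running)
{
	return sched_latency / nr_running;
}

/*
 * EEVDF: the tick preempts once the task has consumed its fixed base
 * slice, no matter how many other tasks share the CPU.
 */
static unsigned long long eevdf_slice(unsigned int nr_running)
{
	(void)nr_running;
	return base_slice;
}

int main(void)
{
	for (unsigned int nr = 1; nr <= 4; nr++)
		printf("nr_running=%u  cfs=%llums  eevdf=%llums\n", nr,
		       cfs_slice(nr) / NSEC_PER_MSEC,
		       eevdf_slice(nr) / NSEC_PER_MSEC);
	return 0;
}

With two CPU-bound tasks per CPU that's 12ms vs 3ms per scheduling decision,
which lines up with the >3x jump in preemptions in the schedstats above.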

I'm still thinking about how to fix it.  Pre-EEVDF, tick preemption was
more flexible in that a task's timeslice could change depending on how
much competition it had on the same CPU.  With EEVDF, the timeslice is
fixed no matter what else is running, and growing or shrinking it
depending on nr_running wouldn't honor whatever deadline was already set
for the task.
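
Just to illustrate that last point with toy userspace code (made-up numbers,
not a proposal): once a deadline has been computed from the slice in force
at request time, resizing the slice afterward leaves that deadline stale.

#include <stdio.h>

int main(void)
{
	/* EEVDF: vd_i = ve_i + r_i / w_i, with weight taken as 1 here. */
	unsigned long long vruntime = 0;
	unsigned long long slice    = 3000000ULL;	/* 3ms request */
	unsigned long long deadline = vruntime + slice;

	/* Suppose more tasks arrive and we shrink the slice in reaction. */
	slice /= 2;

	/* The deadline set above still encodes the original 3ms request. */
	printf("slice is now %llu ns, but the deadline is still %llu ns\n",
	       slice, deadline);
	return 0;
}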

The schedstat patch for the base:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..fb5a35aa07ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4996,6 +4996,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        struct sched_entity *se;
        s64 delta;

+       schedstat_inc(rq_of(cfs_rq)->tick_check_preempts);
+
        /*
         * When many tasks blow up the sched_period; it is possible that
         * sched_slice() reports unusually large results (when many tasks are
@@ -5005,6 +5007,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
+               schedstat_inc(rq_of(cfs_rq)->tick_preempts);
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
@@ -5028,8 +5031,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        if (delta < 0)
                return;

-       if (delta > ideal_runtime)
+       if (delta > ideal_runtime) {
+               schedstat_inc(rq_of(cfs_rq)->tick_preempts);
                resched_curr(rq_of(cfs_rq));
+       }
 }

 static void
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..1bf12e271756 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,6 +1123,10 @@ struct rq {
        /* try_to_wake_up() stats */
        unsigned int            ttwu_count;
        unsigned int            ttwu_local;
+
+       /* tick preempt stats */
+       unsigned int            tick_check_preempts;
+       unsigned int            tick_preempts;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..7997b8538b72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,13 @@ static int show_schedstat(struct seq_file *seq, void *v)

                /* runqueue-specific stats */
                seq_printf(seq,
-                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u",
                    cpu, rq->yld_count,
                    rq->sched_count, rq->sched_goidle,
                    rq->ttwu_count, rq->ttwu_local,
                    rq->rq_cpu_time,
-                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+                   rq->tick_check_preempts, rq->tick_preempts);

                seq_printf(seq, "\n");


The schedstat patch for eevdf:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cffec98724f3..675f4bbac471 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -975,18 +975,21 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+       schedstat_inc(rq_of(cfs_rq)->check_deadline);
        if ((s64)(se->vruntime - se->deadline) < 0)
                return;

        /*
         * EEVDF: vd_i = ve_i + r_i / w_i
         */
+       schedstat_inc(rq_of(cfs_rq)->update_deadline);
        se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

        /*
         * The task has consumed its request, reschedule.
         */
        if (cfs_rq->nr_running > 1) {
+               schedstat_inc(rq_of(cfs_rq)->preempt_deadline);
                resched_curr(rq_of(cfs_rq));
                clear_buddies(cfs_rq, se);
        }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93c2dc80143f..c44b59556367 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,11 @@ struct rq {
        /* try_to_wake_up() stats */
        unsigned int            ttwu_count;
        unsigned int            ttwu_local;
+
+       /* update_deadline() stats */
+       unsigned int            check_deadline;
+       unsigned int            update_deadline;
+       unsigned int            preempt_deadline;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..2a8bd742507d 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)

                /* runqueue-specific stats */
                seq_printf(seq,
-                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
                    cpu, rq->yld_count,
                    rq->sched_count, rq->sched_goidle,
                    rq->ttwu_count, rq->ttwu_local,
                    rq->rq_cpu_time,
-                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+                   rq->check_deadline, rq->update_deadline,
+                   rq->preempt_deadline);

                seq_printf(seq, "\n");


[0] https://lore.kernel.org/all/20230531115839.089944915@infradead.org/
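
In case it helps with reproducing, a minimal sketch for sampling the counters
either patch appends: it just echoes the per-cpu lines of /proc/schedstat (the
new counters are the trailing fields emitted by the patched seq_printf above),
so two samples taken a minute apart can be subtracted by hand.

/*
 * Dump the per-cpu lines of /proc/schedstat so the counters appended by
 * the patches above can be sampled and diffed.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f) {
		perror("/proc/schedstat");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "cpu", 3))
			fputs(line, stdout);
	}

	fclose(f);
	return 0;
}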
