linux-kernel - Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtBHm+CjBTA614P9F2Vx3Bj7vv9Pt0CGFsiwqcrTFmKzjg@mail.gmail.com>
Date:   Mon, 26 Apr 2021 16:58:36 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Odin Ugedal <odin@...d.al>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        cgroups@...r.kernel.org,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

On Sun, 25 Apr 2021 at 10:11, Odin Ugedal <odin@...d.al> wrote:
>
> TL;DR: Read the patch msg first; hopefully it makes sense. Feel free
> to cc more relevant people, and (as I am pretty new to this) feedback
> is highly appreciated!
>
> I thought it was smart to split the discussion of the patch and the issuse, so
> here is some (wall of text, sorry) discussion around it:
>
> This patch fixes an extremely strange edge case, discovered more or less by
> accident, where the fairness of the cfs scheduler get skewed quite
> extensively. This often leads to two cgroups who are supposed to get
> share cpu time 50/50, instead getting something like 90/10 or
> even 99/1 (!!) in some rare cases (see below for example). I find this
> edge case quite interesting, especially since I am able to so easily
> reproduce it on all machines and all (4.8+) kernels.
>
> The issue can be easily reproduced on all kernels 4.8+, and afaik. most
> "container runtimes" are affected. I do believe that I have wondered about
> this issue earlier, but not actually understood what was causing it. There
> is really (afaik.) no way to find the issue from userspace without using
> a tracing tool like bpftrace. I am _sure_ this affects real production
> environments _everywhere_, at least those running the normal container/using
> cgroup tools together with cpu pinning; but verifying that the
> fairness is skewed is quite hard, or more or less impossible in such cases,
> especially when running multiple workloads simultaneously. I did inspect some
> of my production servers, and found a set of this on multiple cgroups.
>
> It _is_ possible to read '/proc/sched_debug', but since that list only prints
> the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for,
> are missing; making it look like the problem isn't becuase of cfs.
>
> I think the current patch is the best way solve the issue, but I will happily
> discuss other possible solutions. One could also just "don't" propagate the
> load to the task_group, but I think the current implementation on that part
> is the correct way to do it. This is at least my best solution to avoid adding
> logic to a "hotter" path that also would require more logic.
>
> The only thing I am thinking a bit about is if the cfs_rq (or one of its ancestors) is
> throttled. In case the cfs_rq->throttled==1, I don't think we currently can skip, since
> it only adds to the leaf_list in case (cfs_rq->nr_running >= 1), and the same applies for
> (cfs_rq->throttle_count >= 1), since we cannot guarantee that it will be added again. Will
> this code however interfere with the throttling mechanism? I have tested inserting new procs
> to throttled cgroups (or children cgroups of throttled ones) with the patch applied,
> and it works as I expect it to do (but other people might have more insight).
>
> Also, another solution may be to don't add the load before the task is actually
> enqueued, avoiding this issue all together. But is that better (I assume that will
> be more in the "hot path", and require quite a bit more code)?
>
> There may definetly be a better solution to this that I still haven't found, and that
> other more experienced kernel devs see right away. :)
>
>
> Also feel free to suggest additions/wording of the patch
> message and title, and/or the comment in the code to make it more clear.
>
> This issue was introduced in 4.8, so all stable (4.9+) releases should probably
> get this (the final solution at least), or?
>

Have you been able to reproduce this on mainline ?

>
>
> Below is various ways/scripts to reproduce - sorry this is so long (and
> sorry for bash in general), but thought people might be interested in them:
>
> note: due to the nature of the issue, the "lost" load is different each time, so
> the values change each time, and sometimes/often end up at ~50/50; but my testing
> shows that it keep happening almost every time:
>
>
> Example on cgruoup v1. Often results in 60/40 load:
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       18026 63.0  0.0   3676   100 pts/7    R+   13:09   0:06 stress --cpu 1
> root       18036 36.6  0.0   3676   100 pts/7    R+   13:09   0:04 stress --cpu 1

When running the script below on v5.12, I'm not able to reproduce your problem

> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CPU=0
>
> function run_sandbox {
>   local CG_CPUSET="$1"
>   local CG_CPU="$2"
>   local CMD="$3"
>
>   local PIPE="$(mktemp -u)"
>   mkfifo "$PIPE"
>   sh -c "read < $PIPE ; exec $CMD" &
>   local TASK="$!"
>   sleep .1
>   mkdir -p "$CG_CPUSET"
>   mkdir -p "$CG_CPU"
>   tee "$CG_CPU"/cgroup.procs <<< "$TASK"
>
>   tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
>   tee "$PIPE" <<< sandox_done
>   rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"
>
> read # click enter to cleanup
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
> --- bash end
>
> Example on cgroup v2 with sub cgroup (same as described in commit message),
> where both should get 50/50, but instead getting 99% and 1% (!!).
>
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
> root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1
>
> (in case of systemd on vg v2- make sure some slice/scope is delegated(?)/use
> cpusets, otherwise systemd will fight you)
> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CPU=0
>
> function run_sandbox {
>   local CG_CPUSET="$1"
>   local CG_CPU="$2"
>   local CMD="$3"
>
>   local PIPE="$(mktemp -u)"
>   mkfifo "$PIPE"
>   sh -c "read < $PIPE ; exec $CMD" &
>   local TASK="$!"
>   sleep .01
>   mkdir -p "$CG_CPUSET"
>   mkdir -p "$CG_CPU"
>   tee "$CG_CPU"/cgroup.procs <<< "$TASK"
>
>   tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
>   tee "$PIPE" <<< sandox_done
>   rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"
>
> read # click enter to cleanup
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
> --- bash end
>
>
>
> For those who only want to run docker stuff:
> (on cgroup v1, you must use systemd driver)
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root        9291 60.1  0.0   7320    96 pts/0    R+   13:18   0:07 /usr/bin/stress --verbose --cpu 1
> root        9388 45.5  0.0   7320    96 pts/0    R+   13:18   0:04 /usr/bin/stress --verbose --cpu 1
> --- bash start
> docker run --cpuset-cpus=0 --rm -it an-image-with-stress
> docker run --cpuset-cpus=0 --rm -it an-image-with-stress
> --- bash end
>
>
> Here is a bpftrace script that can show what is happening (again. sorry for bash,
> but bpftrace only allow constants as array indexes as of now, so this is the best I can do).
> This script also needs bpftrace v0.13 (not released as of now, so currently you have
> to compile master) or newer.
>
>
> --- bash start
> PROBE='kfunc:sched_group_set_shares{
>   printf("cgroup: %s/%s\n",
>     str(args->tg->css.cgroup->kn->parent->name),
>     str(args->tg->css.cgroup->kn->name));
>   printf(
> "cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg\n"
> );
> }'
> for i in $(seq 0 $(($(nproc)-1))); do
>     PROBE="$PROBE""$(sed "s/cpu_nr/$i/" <<<' kfunc:sched_group_set_shares{
>     printf("%-4d %-12llu %-13llu %-17llu %-11d %-16d %-20llu %d\n",
>     cpu_nr,
>     (args->tg->cfs_rq[cpu_nr])->load.weight,
>     (args->tg->cfs_rq[cpu_nr])->avg.load_avg,
>     (args->tg->cfs_rq[cpu_nr])->removed.load_avg,
>     (args->tg->cfs_rq[cpu_nr])->removed.nr,
>     (args->tg->cfs_rq[cpu_nr])->on_list,
>     (args->tg->cfs_rq[cpu_nr])->tg_load_avg_contrib,
>     args->tg->load_avg.counter
>     );
> }')"
> done
> PROBE="$PROBE"'kfunc:sched_group_set_shares{
>   printf("\n");
> }'
>
> bpftrace -e "$PROBE"
> --- bash end
>
>
> When running the bpftrace script when the sub cgroup example is running, and
> executing (just setting the weight of the cgroup the same value as before, no change):
>
> --- bash start
> tee /sys/fs/cgroup/slice/cg-1/sub/cpu.weight <<< 1
> tee /sys/fs/cgroup/slice/cg-2/sub/cpu.weight <<< 10000
> tee /sys/fs/cgroup/slice/cg-1/cpu.weight <<< 100
> tee /sys/fs/cgroup/slice/cg-2/cpu.weight <<< 100
> --- bash end
>
> the output is:
>
> --- output start
> Attaching 6 probes...
> cgroup: cg-1/sub
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    1048576      1023          0                 0           1                1034                 1662
> 1    0            0             0                 0           0                0                    1662
> 2    0            0             0                 0           0                0                    1662
> 3    0            628           628               1           0                628                  1662
>
> cgroup: cg-2/sub
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    1048576      1023          0                 0           1                1023                 1830
> 1    0            0             0                 0           0                0                    1830
> 2    0            0             0                 0           0                0                    1830
> 3    0            807           807               1           0                807                  1830
>
> cgroup: slice/cg-1
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    6347         5             0                 0           1                5                    593
> 1    0            0             0                 0           0                0                    593
> 2    0            0             0                 0           0                0                    593
> 3    0            5             0                 0           0                588                  593
>
> cgroup: slice/cg-2
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    58642371     57263         0                 0           1                57909                58665
> 1    0            0             0                 0           0                0                    58665
> 2    0            0             0                 0           0                0                    58665
> 3    0            75615         0                 0           0                756                  58665
>
> --- output end
>
> We can clearly see that both cg-1/sub and cg-2/sub have removed.nr==1 on cpu 3,
> and therefore still contribute to the tg->load_avg. Since removed.on_list==0, the load would
> never be "cleaned" up unless a new task starts on that cpu, but due to the cpuset, that would not
> be the case. slice/cg-1 and slice/cg-2 also have load attached to cpu 3 that isn't removed (but
> the patch will properly decay the load on them as well).
>
> With this path, all of these examples just end up sharing cpu time in a fair 50/50 way, as expected.
>
> Odin Ugedal (1):
>   sched/fair: Fix unfairness caused by missing load decay
>
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> --
> 2.31.1
>