Message-ID: <CAFpoUr1FgZhuBmor2vCFqC9z7wao+XSybPxJZKFfK-wvZOagCA@mail.gmail.com>
Date:   Mon, 26 Apr 2021 18:33:19 +0200
From:   Odin Ugedal <odin@...dal.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Odin Ugedal <odin@...d.al>, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        "open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

Hi,

> Have you been able to reproduce this on mainline ?

Yes. I have been debugging and testing with v5.12-rc8. After I found the
suspected commit in ~v4.8, I compiled both v4.4.267 and v4.9.267, and
was able to reproduce it on v4.9.267 but not on v4.4.267. It is also
reproducible on 5.11.16-arch1-1, which my distro ships, and on all the
machines I have tested.

> When running the script below on v5.12, I'm not able to reproduce your problem

v5.12 is pretty fresh, so until now v5.12-rc8 was the newest kernel I
had tested. I have now compiled v5.12.0, and I am able to reproduce it
there as well.

Which version did you try (the one for cgroup v1 or v2)? And/or did you
try to run the inspection bpftrace script? If you tested the cgroup v1
version, it will often end up at 50/50, 51/49 etc., and sometimes around
60/40, making it hard to verify without inspection.

I have attached a version of the "sub cgroup" example for cgroup v1 that
also forces the process to start on cpu 1 (CPU_ME), and sends it over to
cpu 0 (CPU) after attaching it to the new cgroup. That will make it
evident each time. This example should also always end up with 50/50 per
stress process, but "always" ends up more like 99/1.
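
A rough way to check the split without bpftrace is to sample the CPU
time of the two stress workers over a few seconds and compare the
deltas. The sketch below is only an illustration and not part of the
script in this mail; the helper names (cpu_ticks, split_pct,
sample_split) are made up for the example, and it assumes a Linux /proc
with the usual /proc/<pid>/stat layout.

```bash
# Print utime + stime (in clock ticks) for a PID, from /proc/<pid>/stat.
cpu_ticks() {
  local stat rest
  local -a f
  stat="$(< "/proc/$1/stat")" || return 1
  rest="${stat##*) }"        # drop "pid (comm) " so a comm with spaces is harmless
  read -r -a f <<< "$rest"
  echo $(( f[11] + f[12] ))  # fields 14 (utime) and 15 (stime) of the full line
}

# Turn two tick counts into a rounded "A/B" percentage split.
split_pct() {
  local a="$1" b="$2"
  local total=$(( a + b ))
  if (( total == 0 )); then
    echo "0/0"
    return
  fi
  local pa=$(( (a * 100 + total / 2) / total ))
  echo "${pa}/$(( 100 - pa ))"
}

# Sample two PIDs for N seconds (default 5) and print their CPU split.
sample_split() {
  local p1="$1" p2="$2" secs="${3:-5}"
  local a0 b0 a1 b1
  a0="$(cpu_ticks "$p1")" && b0="$(cpu_ticks "$p2")" || return 1
  sleep "$secs"
  a1="$(cpu_ticks "$p1")" && b1="$(cpu_ticks "$p2")" || return 1
  split_pct $(( a1 - a0 )) $(( b1 - b0 ))
}
```

Usage would be something like "sample_split <pid1> <pid2> 5", where the
two PIDs are the processes actually burning CPU (stress forks a worker
per --cpu, so the worker PIDs are the ones to pass). A healthy run
should print something close to 50/50; the problem described here shows
up as a heavily skewed split.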

Can you confirm if you are able to reproduce with this version?

--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CGROUP_CPUSET_ME=/sys/fs/cgroup/cpuset/me
CPU=0
CPU_ME=1

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local INNER_SHARES="$3"
  local CMD="$4"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .1
  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"/sub
  tee "$CG_CPU"/sub/cgroup.procs <<< "$TASK"
  tee "$CG_CPU"/sub/cpu.shares <<< "$INNER_SHARES"

  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  tee "$PIPE" <<< sandbox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
mkdir -p "$CGROUP_CPUSET_ME"

tee "$CGROUP_CPUSET"/cpuset.cpus <<< "$CPU"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "$CPU"

tee "$CGROUP_CPUSET_ME"/cpuset.cpus <<< "$CPU_ME"
echo $$ | tee "$CGROUP_CPUSET_ME"/cgroup.procs

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" 50000 "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" 2     "stress --cpu 1"

read # press Enter to clean up and stop all stress procs
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-{1,2}{/sub,},}
--- bash end

Thanks
Odin