Message-ID: <e2e558b863c929c5019264b2ddefd4c0@codethink.co.uk>
Date: Thu, 25 Sep 2025 15:33:54 +0200
From: Matteo Martelli <matteo.martelli@...ethink.co.uk>
To: Aaron Lu <ziqianlu@...edance.com>, K Prateek Nayak
<kprateek.nayak@....com>
Cc: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>, Chengming Zhou
<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
<mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang
<xii@...gle.com>, linux-kernel@...r.kernel.org, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>, Songtang Liu
<liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>,
Michal Koutný <mkoutny@...e.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Matteo Martelli <matteo.martelli@...ethink.co.uk>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
Hi Aaron,
On Thu, 25 Sep 2025 20:05:04 +0800, Aaron Lu <ziqianlu@...edance.com> wrote:
> On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
> >
> > On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > > Hi Prateek,
> > >
> > > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> > >> Hello Aaron, Matteo,
> > >>
> > >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> > >>>> ...
> > >>>> The test setup is the same used in my previous testing for v3 [2], where
> > >>>> the CFS throttling events are mostly triggered by the first ssh logins
> > >>>> into the system as the systemd user slice is configured with CPUQuota of
> > >>>> 25%. Also note that the same systemd user slice is configured with CPU
> > >>>
> > >>> I tried to replicate this setup, below is my setup using a 4 cpu VM
> > >>> and rt kernel at commit fe8d238e646e("sched/fair: Propagate load for
> > >>> throttled cfs_rq"):
> > >>> # pwd
> > >>> /sys/fs/cgroup/user.slice
> > >>> # cat cpu.max
> > >>> 25000 100000
> > >>> # cat cpuset.cpus
> > >>> 0
> > >>>
> > >>> I then login using ssh as a normal user and I can see throttle happened
> > >>> but couldn't hit this warning. Do you have to do something special to
> > >>> trigger it?
It wasn't very reproducible in my setup either, but I found that the
warning triggered more often when I ssh'd into the system just after
boot, probably due to additional load from processes spawned during the
boot phase. I therefore prepared a reproducer script that resembles my
initial setup and adds a stress-ng worker running in the background
while connecting to the system over ssh. I also reduced the CPUQuota to
10%, which seemed to increase the probability of triggering the warning.
With this script I can reproduce the warning about once or twice every
10 ssh executions. See the script at the end of this email.
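For reference, a minimal sketch of how a CPUQuota percentage maps to the
cpu.max value seen in the cgroup, assuming the default 100000us (100ms)
period. The helper name quota_to_cpu_max is hypothetical and only
illustrates the arithmetic:

```shell
# Hypothetical helper: convert a CPUQuota percentage into a
# "<quota_us> <period_us>" pair as shown by cpu.max.
# Assumption: the period is the default 100000us unless given as $2.
quota_to_cpu_max() {
    pct=$1
    period_us=${2:-100000}
    echo "$(( pct * period_us / 100 )) ${period_us}"
}

quota_to_cpu_max 25   # the 25% setup quoted above
quota_to_cpu_max 10   # the reduced quota used in the reproducer
```

The 25% case gives "25000 100000", matching the cpu.max output quoted
above, while 10% leaves only 10ms of runtime per 100ms period.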
> > >>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> > >>>
> > >>> I stared at the code and haven't been able to figure out when
> > >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> > >>>
> > >>
> > >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> > >> couldn't trigger it :(
> > >>
> > >> But I'm wondering if all we are missing is:
> > >>
> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >> index f993de30e146..5f9e7b4df391 100644
> > >> --- a/kernel/sched/fair.c
> > >> +++ b/kernel/sched/fair.c
> > >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> > >>
> > >> cfs_rq->throttle_count = pcfs_rq->throttle_count;
> > >> cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> > >> + cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> > >> }
> > >>
> > >> /* conditionally throttle active cfs_rq's from put_prev_entity() */
> > >> ---
> > >>
> > >> This is the only way we can currently have a break in
> > >> cfs_rq_pelt_clock_throttled() hierarchy.
> > >>
> ...
>
> Hi Matteo,
>
> Can you test the above diff Prateek sent in his last email? Thanks.
>
I have just tested the diff that Prateek sent in [1] (also quoted
above), which changes sync_throttle(), using the same script below, and
I couldn't reproduce the warning.
Here's the script (I hope it doesn't add too much noise to the email
thread).
---
#!/usr/bin/bash
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
script_path=$(realpath "$0")
wrk_dir=$(pwd)
kernel_version=fe8d238e646e
kernel_dir=${wrk_dir}/../linux
kernel_image=${kernel_dir}/arch/x86_64/boot/bzImage
qemu_image_url=https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
qemu_image_src=${wrk_dir}/$(basename "${qemu_image_url}")
qemu_image=${wrk_dir}/image.qcow2
guest_pkgs="stress-ng" # comma separated list of additional packages to install
run_qemu() {
    qemu-system-x86_64 \
        -m 2G -smp 4 \
        -nographic \
        -nic user,hostfwd=tcp::2222-:22 \
        -M q35,accel=kvm \
        -drive format=qcow2,file="${qemu_image}" \
        -virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
        -serial file:console.log \
        -monitor none \
        -append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
        -kernel "${kernel_image}"
}
run_ssh() {
    local cmd=$1
    ssh root@...alhost -p 2222 \
        -i ${wrk_dir}/id_ed25519 -F none \
        -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
        $cmd
}
setup_kernel() {
    echo "Building kernel ..."
    pushd $kernel_dir
    git reset --hard $kernel_version && git clean -f -d
    make mrproper
    make defconfig
    scripts/config -k -e EXPERT
    scripts/config -k -e PREEMPT_RT
    scripts/config -k -e CFS_BANDWIDTH
    scripts/config -k -e FUNCTION_TRACER
    make olddefconfig
    make -j8
    popd
}
setup_image() {
    echo "Preparing qemu image ..."
    killall -9 qemu-system-x86_64
    if [ ! -f "${qemu_image_src}" ] ; then
        wget ${qemu_image_url} -O ${qemu_image_src}
    fi
    cp ${qemu_image_src} ${qemu_image}
    echo "[Unit]
Description=User and Session Slice
Before=slices.target
[Slice]
CPUQuota=10%
AllowedCPUs=0
" > user.slice
    yes | ssh-keygen -t ed25519 -f id_ed25519 -N ''
    # https://wiki.debian.org/ThomasChung/CloudImage
    virt-customize -a ${qemu_image} \
        --install ssh,${guest_pkgs} \
        --append-line "/root/.ssh/authorized_keys:$(cat id_ed25519.pub)" \
        --upload "user.slice:/etc/systemd/system/user.slice" \
        --chown "0:0:/etc/systemd/system/user.slice"
}
setup_kernel
setup_image
echo "Run test..."
ts=$(date --utc "+%FT%TZ")
out_dir=${wrk_dir}/out/${ts}
out_log_dir=${out_dir}/logs
mkdir -p ${out_dir} ${out_log_dir}
cp ${script_path} ${out_dir}
cp ${kernel_dir}/.config ${out_dir}/kernel.config
trap "exit" INT TERM
trap "kill 0" EXIT
export -f run_ssh
export wrk_dir
while true; do
    run_qemu &
    qemu_pid=$!
    sleep 10
    run_ssh 'stress-ng --cpu 1 --timeout 60' &
    for i in $(seq 1 10) ; do
        echo "running ssh $i"
        timeout 3 bash -c run_ssh #launch interactive ssh
    done
    run_ssh "systemctl poweroff"
    wait $qemu_pid
    serial_out_file=${out_log_dir}/serial-$(date "+%F-%T").log
    mv ${wrk_dir}/console.log ${serial_out_file}
    grep -e "Kernel panic" -e "Call Trace" -e "Kernel BUG" -e "cut here" ${serial_out_file} && \
        echo "${serial_out_file}" >> ${out_dir}/panics-warns
    grep -e "rcu.*detected.*stall" ${serial_out_file} && \
        echo "${serial_out_file}" >> ${out_dir}/rcu-stalls
done
---
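To put a number on how often a run hits the warning, something like the
following could be run against the logs directory the script writes. This
is only a sketch: the helper name count_hits is hypothetical, and it
assumes the serial-*.log naming and the grep patterns used by the script
above.

```shell
# Hypothetical helper: print "<hits>/<total>", where total is the number
# of serial-*.log files in the given directory and hits is how many of
# them contain a warning/panic marker (same patterns as the script).
count_hits() {
    log_dir=$1
    total=0
    hits=0
    for f in "${log_dir}"/serial-*.log; do
        [ -e "$f" ] || continue   # glob did not match: no logs yet
        total=$((total + 1))
        if grep -q -e "Kernel panic" -e "Call Trace" \
                   -e "Kernel BUG" -e "cut here" "$f"; then
            hits=$((hits + 1))
        fi
    done
    echo "${hits}/${total}"
}

# Usage (path layout as created by the script): count_hits out/<timestamp>/logs
```

Note that this counts per boot (each boot does 10 ssh logins), not per
ssh execution.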
[1]: https://lore.kernel.org/all/db7fc090-5c12-450b-87a4-bcf06e10ef68@amd.com/
Best regards,
Matteo Martelli