linux-kernel - Re: [RFC PATCH 0/7] Defer throttle when task exits to user

Message-ID: <CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com>
Date: Thu, 13 Mar 2025 01:31:16 -0700
From: Aaron Lu <ziqianlu@...edance.com>
To: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>, 
	K Prateek Nayak <kprateek.nayak@....com>, Peter Zijlstra <peterz@...radead.org>, 
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Mel Gorman <mgorman@...e.de>, Chengming Zhou <chengming.zhou@...ux.dev>, 
	Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 0/7] Defer throttle when task exits to user

It appears this mail's Message-ID was changed, so it ended up as a
separate thread. I'll check what went wrong, sorry about this.

On Thu, Mar 13, 2025 at 02:20:59AM -0500, Aaron Lu wrote:
> Tests:
> - A basic test to verify functionality such as limiting cgroup CPU time
>   and changing the task group, affinity, etc.

Here is the basic test script:

pid=$$
CG_PATH1=/sys/fs/cgroup/1
CG_PATH2=/sys/fs/cgroup/2

[ -d $CG_PATH1 ] && sudo rmdir $CG_PATH1
[ -d $CG_PATH2 ] && sudo rmdir $CG_PATH2

sudo mkdir -p $CG_PATH1
sudo mkdir -p $CG_PATH2

sudo sh -c "echo $pid > $CG_PATH1/cgroup.procs"

echo "start nop"
~/src/misc/nop &
nop_pid=$!
cat /proc/$nop_pid/cgroup
pidstat -p $nop_pid 1 &
sleep 5

echo "limit $CG_PATH1 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH1/cpu.max"
sleep 5

echo "limit $CG_PATH1 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH1/cpu.max"
sleep 5

echo "move to $CG_PATH2"
sudo sh -c "echo $nop_pid > $CG_PATH2/cgroup.procs"
sleep 5

echo "limit $CG_PATH2 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH2/cpu.max"
sleep 5

echo "limit $CG_PATH2 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH2/cpu.max"
sleep 5

echo "set affinity to cpu3"
taskset -p 0x8 $nop_pid
sleep 5

echo "set affinity to cpu10"
taskset -p 0x400 $nop_pid
sleep 5

echo "unlimit $CG_PATH2"
sudo sh -c "echo max 100000 > $CG_PATH2/cpu.max"
sleep 5

echo "move to $CG_PATH1"
sudo sh -c "echo $nop_pid > $CG_PATH1/cgroup.procs"
sleep 5

echo "change to rr with priority 10"
sudo chrt -r -p 10 $nop_pid
sleep 5

echo "change to fifo with priority 10"
sudo chrt -f -p 10 $nop_pid
sleep 5

echo "change back to fair"
sudo chrt -o -p 0 $nop_pid
sleep 5

echo "unlimit $CG_PATH1"
sudo sh -c "echo max 100000 > $CG_PATH1/cpu.max"
sleep 5

kill $nop_pid

Note: nop is a CPU hog that simply does: while (1) spin();
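
Besides watching pidstat, one way to confirm that a limit took effect is
to read the throttling counters in the cgroup's cpu.stat file. The helper
below is only a sketch: cpu_stat_field is a hypothetical name that is not
part of the test above, and it assumes the cgroup v2 cpu.stat layout of
one "key value" pair per line (nr_periods, nr_throttled, throttled_usec).

```shell
# Print the value of one field from a cgroup v2 cpu.stat file.
# cpu_stat_field is a hypothetical helper, not part of the original test.
cpu_stat_field() {
	stat_file=$1
	field=$2
	# cpu.stat holds one "key value" pair per line.
	awk -v f="$field" '$1 == f { print $2 }' "$stat_file"
}

# Possible usage against a live cgroup (path taken from the script above):
#   before=$(cpu_stat_field /sys/fs/cgroup/1/cpu.stat nr_throttled)
#   sleep 5
#   after=$(cpu_stat_field /sys/fs/cgroup/1/cpu.stat nr_throttled)
#   echo "throttled periods in 5s: $((after - before))"
```

With a 100ms period, a fully throttled cgroup should add roughly 10
throttled periods per second.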

> - A script that tries to mimic a large cgroup setup is used to see how
>   bad it is to unthrottle cfs_rqs and enqueue back a large number of
>   tasks in hrtimer context.

Here are the test scripts:

CG_ROOT=/sys/fs/cgroup

nr_level1=2
nr_level2=100
nr_level3=10

for i in `seq $nr_level1`; do
	CG_LEVEL1=$CG_ROOT/$i
	echo "cg_level1: $CG_LEVEL1"
	[ -d $CG_LEVEL1 ] || sudo mkdir -p $CG_LEVEL1
	sudo sh -c "echo +cpu > $CG_LEVEL1/cgroup.subtree_control"

	for j in `seq $nr_level2`; do
		CG_LEVEL2=$CG_LEVEL1/${i}_$j
		echo "cg_level2: $CG_LEVEL2"
		[ -d $CG_LEVEL2 ] || sudo mkdir -p $CG_LEVEL2
		sudo sh -c "echo +cpu > $CG_LEVEL2/cgroup.subtree_control"

		for k in `seq $nr_level3`; do
			CG_LEVEL3=$CG_LEVEL2/${i}_${j}_$k
			[ -d $CG_LEVEL3 ] || sudo mkdir -p $CG_LEVEL3
			~/test/run_in_cg.sh $CG_LEVEL3
		done
	done
done

function set_quota()
{
	quota=$1

	for i in `seq $nr_level1`; do
		CG_LEVEL1=$CG_ROOT/$i
		sudo sh -c "echo $quota 100000 > $CG_LEVEL1/cpu.max"
		echo "$CG_LEVEL1: `cat $CG_LEVEL1/cpu.max`"
	done
}

while true; do
	echo "sleep 20"
	sleep 20

	echo "set 20cpu quota to first level cgroups"
	set_quota 2000000
	echo "sleep 20"
	sleep 20

	echo "set 10cpu quota to first level cgroups"
	set_quota 1000000
	echo "sleep 20"
	sleep 20

	echo "set 5cpu quota to first level cgroups"
	set_quota 500000
	echo "sleep 20"
	sleep 20

	echo "unlimit first level cgroups"
	set_quota max
done
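
The quota values passed to set_quota above are CPU counts multiplied by
the 100000us period written to cpu.max. A small sketch of that
conversion, where PERIOD_US and cpus_to_quota are hypothetical names
introduced only for illustration:

```shell
# Period used for cpu.max in the script above, in microseconds.
PERIOD_US=100000

# Convert a CPU count into a cpu.max quota in microseconds.
# cpus_to_quota is a hypothetical helper, not part of the original script.
cpus_to_quota() {
	echo $(( $1 * PERIOD_US ))
}

cpus_to_quota 20   # prints 2000000
cpus_to_quota 10   # prints 1000000
cpus_to_quota 5    # prints 500000
```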

run_in_cg.sh:

set -e

CG_PATH=$1
[ -z "$CG_PATH" ] && {
	echo "need cgroup path"
	exit 1
}

echo "CG_PATH: $CG_PATH"

sudo sh -c "echo $$ > $CG_PATH/cgroup.procs"

for i in `seq 10`; do
	~/src/misc/nop &
done
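
For reference, the size of the hierarchy these two scripts build can be
worked out from the loop parameters. This is only a sketch of the
arithmetic; tasks_per_leaf and the other derived variable names are
introduced here for illustration.

```shell
# Loop parameters copied from the scripts above.
nr_level1=2
nr_level2=100
nr_level3=10
tasks_per_leaf=10   # run_in_cg.sh starts 10 nop tasks per leaf cgroup

level2_total=$((nr_level1 * nr_level2))
leaf_total=$((level2_total * nr_level3))
all_cgroups=$((nr_level1 + level2_total + leaf_total))
hog_tasks=$((leaf_total * tasks_per_leaf))

echo "level-2 cgroups: $level2_total"   # 200
echo "leaf cgroups:    $leaf_total"     # 2000
echo "all cgroups:     $all_cgroups"    # 2202
echo "cpu hog tasks:   $hog_tasks"      # 20000
```

So each quota change on the two first-level cgroups affects 1000 leaf
cgroups and 10000 tasks per first-level cgroup.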

>   The test was done on a 2-socket/384-thread AMD CPU with the following
>   cgroup setup: 2 first level cgroups with a quota setting, each with
>   100 child cgroups, and each child cgroup with 10 leaf child cgroups,
>   for a total of 2000 leaf cgroups. 10 cpu hog tasks are created in
>   each leaf child cgroup. Below are the durations (in ns) of
>   distribute_cfs_runtime() during a 1 minute window:

@durations:
[8K, 16K)            274 |@@@@@@@@@@@@@@@@@@@@@                               |
[16K, 32K)           132 |@@@@@@@@@@                                          |
[32K, 64K)             6 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           2 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)           117 |@@@@@@@@@                                           |
[1M, 2M)             665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)              10 |                                                    |

The bpftrace script used to capture this:

kfunc:distribute_cfs_runtime
{
	@start[args->cfs_b] = nsecs;
}

kretfunc:distribute_cfs_runtime
{
	if (@start[args->cfs_b]) {
		$duration = nsecs - @start[args->cfs_b];
		@durations = hist($duration);
		delete(@start[args->cfs_b]);
	}
}

interval:s:60
{
	exit();
}
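
To put numbers on the histogram above, the buckets can be summed with a
few lines of awk. This is a sketch only: summarize_durations is a
hypothetical helper, the counts are copied from the @durations output
above, bucket lower bounds are in nanoseconds (bpftrace's K and M are
powers of two), and the 512K threshold for "slow" is an arbitrary choice
made for illustration.

```shell
# Read "lower_bound_ns count" pairs on stdin and report the total sample
# count plus the share of samples in buckets starting at 512K ns or above.
# summarize_durations is a hypothetical helper name.
summarize_durations() {
	awk '
	{ total += $2; if ($1 >= 512 * 1024) slow += $2 }
	END { printf "samples=%d slow=%d slow_pct=%.1f\n", total, slow, 100 * slow / total }
	'
}

summarize_durations <<'EOF'
8192 274
16384 132
32768 6
65536 0
131072 2
262144 0
524288 117
1048576 665
2097152 10
EOF
# prints: samples=1206 slow=792 slow_pct=65.7
```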

>   So the biggest duration is in the 2-4ms range in this hrtimer context.
>   How bad is this number? I think it is acceptable, but maybe the setup
>   I created is not complex enough?
>   In older kernels where async unthrottle is not available, the largest
>   duration can be 100ms or more.