linux-kernel - Re: [PATCH] sched: Forward deadline for early tick

Message-Id: <20241217061317.92811-1-zhouzihan30@jd.com>
Date: Tue, 17 Dec 2024 14:13:19 +0800
From: zhouzihan30 <15645113830zzh@...il.com>
To: vincent.guittot@...aro.org
Cc: 15645113830zzh@...il.com,
	bsegall@...gle.com,
	dietmar.eggemann@....com,
	juri.lelli@...hat.com,
	linux-kernel@...r.kernel.org,
	mgorman@...e.de,
	mingo@...hat.com,
	peterz@...radead.org,
	rostedt@...dmis.org,
	vschneid@...hat.com,
	yaozhenguo@...com,
	zhouzihan30@...com
Subject: Re: [PATCH] sched: Forward deadline for early tick


Thank you, Vincent Guittot, for clearing up my confusion about the tick
error: why the tick is always less than 1ms on some machines.

It is normal for a tick not to be exactly 1ms because of software or
hardware effects, but on some machines the tick is always less than 1ms,
which is a bit strange. I had no good explanation for it before, but now
I know the reason: the root cause is CONFIG_IRQ_TIME_ACCOUNTING.

I used bpftrace to monitor changes in the rq clock (task) in the system:

// On entry: remember the rq and its clock_task, and print the rq clock delta.
kprobe:update_rq_clock_task /pid == 6388/
{
    @rq = (struct rq *)arg0;
    $delta = (int64)arg1;
    @clock_pre = @rq->clock_task;
    printf("rq clock delta is %lld\n", $delta);
}

// On return: print how far clock_task advanced during the call.
kretprobe:update_rq_clock_task /pid == 6388/
{
    $clock_post = @rq->clock_task;
    printf("rq clock task delta: %llu\n", $clock_post - @clock_pre);
}


result:
rq clock delta is 999994                                                          
rq clock task delta: 996616                    
rq clock delta is 1000026
rq clock task delta: 996550
rq clock delta is 1000047
rq clock task delta: 996716
rq clock delta is 999995 
rq clock task delta: 996454
rq clock delta is 1000058
rq clock task delta: 996621
rq clock delta is 999987 
rq clock task delta: 996457
rq clock delta is 1000047 
rq clock task delta: 996621
rq clock delta is 999966 
rq clock task delta: 996594
rq clock delta is 1000071
rq clock task delta: 996470
rq clock delta is 1000073
rq clock task delta: 996586
rq clock delta is 999958                       
rq clock task delta: 996446                    
rq clock delta is 1000018
rq clock task delta: 996574
rq clock delta is 999993 
rq clock task delta: 996908
rq clock delta is 1000037
rq clock task delta: 996547


As Vincent Guittot said:

< the delta of rq_clock_task is always
< less than 1ms on my system but the delta of rq_clock is sometimes
< above and sometime below 1ms

According to the kernel function update_rq_clock_task(), both
CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING often
make the delta of rq_clock_task lower than 1ms. I counted 13016 deltas:
47% of the rq_clock deltas were less than 1ms, but every rq_clock_task
delta was less than 1ms.
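
As a rough cross-check, the same kind of count can be done on the 14 sample
pairs printed above. This is only a sketch in Python: the names TRACE,
parse_deltas and count_below are made up for illustration, and the small
sample is not expected to reproduce the 47% figure from the full 13016
samples.

```python
import re

# Sample lines copied from the trace output above (14 calls).
TRACE = """\
rq clock delta is 999994
rq clock task delta: 996616
rq clock delta is 1000026
rq clock task delta: 996550
rq clock delta is 1000047
rq clock task delta: 996716
rq clock delta is 999995
rq clock task delta: 996454
rq clock delta is 1000058
rq clock task delta: 996621
rq clock delta is 999987
rq clock task delta: 996457
rq clock delta is 1000047
rq clock task delta: 996621
rq clock delta is 999966
rq clock task delta: 996594
rq clock delta is 1000071
rq clock task delta: 996470
rq clock delta is 1000073
rq clock task delta: 996586
rq clock delta is 999958
rq clock task delta: 996446
rq clock delta is 1000018
rq clock task delta: 996574
rq clock delta is 999993
rq clock task delta: 996908
rq clock delta is 1000037
rq clock task delta: 996547
"""

TICK_NS = 1_000_000  # nominal 1ms tick, in nanoseconds


def parse_deltas(trace):
    """Split the trace into (rq_clock deltas, rq_clock_task deltas)."""
    clock = [int(v) for v in re.findall(r"rq clock delta is (\d+)", trace)]
    clock_task = [int(v) for v in re.findall(r"rq clock task delta: (\d+)", trace)]
    return clock, clock_task


def count_below(deltas, limit=TICK_NS):
    """Count how many deltas are strictly shorter than the limit."""
    return sum(1 for d in deltas if d < limit)


clock, clock_task = parse_deltas(TRACE)
print(f"rq_clock below 1ms:      {count_below(clock)}/{len(clock)}")
print(f"rq_clock_task below 1ms: {count_below(clock_task)}/{len(clock_task)}")
```

On this sample, some rq_clock deltas fall on each side of 1ms, while every
rq_clock_task delta is below it.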

As a comparative experiment, I turned off those CONFIG options and
rechecked the clocks. The values of rq clock and rq clock task became
completely consistent. However, according to perf, there are still tick
errors (slice=3ms):

      time    cpu  task name     wait time  sch delay   run time
                   [tid/pid]        (msec)     (msec)     (msec)
---------- ------  ------------  ---------  ---------  ---------
110.436513 [0001]  perf[1414]        0.000      0.000      0.000 
110.440490 [0001]  bash[1341]        0.000      0.000      3.977 
110.441490 [0001]  bash[1344]        0.000      0.000      0.999 
110.441548 [0001]  perf[1414]        4.976      0.000      0.058 
110.445491 [0001]  bash[1344]        0.058      0.000      3.942 
110.449490 [0001]  bash[1341]        5.000      0.000      3.999 
110.452490 [0001]  bash[1344]        3.999      0.000      2.999 
110.456491 [0001]  bash[1341]        2.999      0.000      4.000 
110.460489 [0001]  bash[1344]        4.000      0.000      3.998 
110.463490 [0001]  bash[1341]        3.998      0.000      3.001 
110.467493 [0001]  bash[1344]        3.001      0.000      4.002 
110.471490 [0001]  bash[1341]        4.002      0.000      3.996 
110.474489 [0001]  bash[1344]        3.996      0.000      2.999 
110.477490 [0001]  bash[1341]        2.999      0.000      3.000 


It seems that, regardless of whether CONFIG_IRQ_TIME_ACCOUNTING is
enabled, tick errors can cause the run time to vary randomly between
3 and 4ms.
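
A simplified model may help show why a few microseconds of tick jitter turn
into a whole extra millisecond of run time. The sketch below assumes that
the slice is only checked at tick time and that the task is preempted at the
first tick where its accumulated run time has reached the slice; the
function name is hypothetical, and the tick deltas are reused from the
rq clock trace earlier in this mail purely as example values.

```python
def run_time_until_preempt(tick_deltas_ns, slice_ns):
    """Return the run time at the first tick where run time >= slice.

    Simplified model: preemption is only considered at tick boundaries.
    """
    run_time = 0
    for delta in tick_deltas_ns:
        run_time += delta
        if run_time >= slice_ns:
            return run_time
    raise ValueError("slice not reached within the given ticks")


SLICE_NS = 3_000_000  # slice=3ms

# Three ticks that add up to slightly more than 3ms: preempted at tick 3.
late = run_time_until_preempt([999994, 1000026, 1000047, 999995], SLICE_NS)

# Three ticks that add up to slightly less than 3ms: runs until tick 4.
early = run_time_until_preempt([999994, 999995, 999987, 999966], SLICE_NS)

print(late, early)
```

In this model the first task runs for about 3ms and the second for about
4ms, although the individual ticks differ by well under 0.01%.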



< This means that the task didn't effectively get its slice because of
< time spent in IRQ context. Would it be better to set a default slice
< slightly lower than an integer number of tick

We once considered subtracting a little from the slice when it is set:
for example, if someone sets 3ms, we could subtract 0.1ms and make it
2.9ms. But this is not a good solution. If someone sets it to 3.1ms,
should we use 2.9ms or 3ms? There doesn't seem to be a particularly good
option, and it may lead to even greater system errors.
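
To make the trade-off concrete, here is a small sketch of how many ticks
each candidate slice would take to expire if the slice is only checked at
tick time. The helper name ticks_to_expire is hypothetical, and the short
tick of 996600ns is an assumed value, chosen to be close to the
rq_clock_task deltas in the trace.

```python
def ticks_to_expire(slice_ns, tick_ns):
    """Number of whole ticks needed before run time reaches the slice."""
    return -(-slice_ns // tick_ns)  # ceiling division on integers


SHORT_TICK_NS = 996_600  # assumed: tick as seen by rq_clock_task

for slice_ns in (2_900_000, 3_000_000, 3_100_000):
    n = ticks_to_expire(slice_ns, SHORT_TICK_NS)
    print(f"slice {slice_ns} ns -> {n} ticks, ~{n * SHORT_TICK_NS} ns of run time")
```

With the short tick, 2.9ms expires after 3 ticks while both 3ms and 3.1ms
need 4, so trimming 0.1ms helps the 3ms case but gives no obvious answer
for a requested 3.1ms.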

Changing the default value is a simple solution; in fact, we did that on
the old kernel we used (we just set it to 2.9ms). On our old 6.6 kernel,
the tick error caused processes with the same weight to get different
run times. The new kernel did not have this problem, but we still
submitted this patch because we thought unexpected behavior might occur
in other scenarios. However, apart from the kernel's default value,
different OSes seem to behave differently, and the default value is
often an integer number of ticks... so we still hope to solve this
problem in the kernel.