lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZiAWTU5xb/JMn/Hs@chenyu5-mobl2>
Date: Thu, 18 Apr 2024 02:34:53 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Abel Wu <wuyun.abel@...edance.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli
	<juri.lelli@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Tiwei Bie
	<tiwei.btw@...group.com>, Honglei Wang <wanghonglei@...ichuxing.com>, "Aaron
 Lu" <aaron.lu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Yujie Liu
	<yujie.liu@...el.com>, <linux-kernel@...r.kernel.org>, kernel test robot
	<oliver.sang@...el.com>
Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf()
 if no eligible entity is found

On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > 
> > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > thus:
> > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > >      delta * 2677 = 7733399554989275921
> > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > negative value into a positive value, thus eligible check fails.
> > > > 
> > > > Indeed.
> > > 
> > > From the data presented it looks like min_vruntime is wrong and needs
> > > update. If you can readily reproduce this, dump the vruntime of all
> > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > >
> > 
> > This was the dump of all the entities on the tree, from left to right,
> 
> Oh, my bad, I thought it was the pick path.
> 
> > and also from top down in middle order traverse, when this issue happens:
> > 
> > [  514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > [  514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > 			min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > 
> > 
> > [  514.536772][ T8390] Traverse rb-tree from left to right
> > [  514.537138][ T8390]  se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible  <-- leftmost se
> > [  514.537835][ T8390]  se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > 
> > [  514.538539][ T8390] Traverse rb-tree from topdown
> > [  514.538877][ T8390]  middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible   <-- root se
> > [  514.539605][ T8390]  middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > 
> > The tree looks like:
> > 
> >           se (0xec1234e0)
> >                   |
> >                   |
> >                   ----> se (0xec4fcf20)
> > 
> > 
> > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > vruntime are both 763383370431, which is aligned with
> > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > updated correctly, because it is monotonic increasing.
> 
> Right.
> 
> > My guess is that, for some reason, one newly forked se in a newly
> > created task group, in the rb-tree has not been picked for a long
> > time(maybe not eligible). Its vruntime stopped at the negative
> > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > is long behind the cfs_rq->vruntime, thus the overflow happens.
> 
> I'll have to do the math again, but that's something in the order of not
> picking a task in about a day, that would be 'bad' :-)
> 
> Is there any sane way to reproduce this, and how often does it happen?

After adding some ftrace in place_entity() and pick_eevdf(), with the
help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
is very small seems to be related to task group's reweight_entity():

vlag = (s64)(avruntime - se->vruntime);                             
vlag = div_s64(vlag * old_weight, weight);
se->vruntime = avruntime - vlag;

The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
if the new weight is very small, which is very likely, then the vlag
could be very large, results in a very small vruntime.


The followings are the details why I think above could bring problems:

Here is the debug log printed by place_entity():


[  397.597268]cfs_rq:0xe75f7100
              cfs_rq.avg_vruntime:-1111846207333767
              cfs_rq.min_vruntime:810640668779
              avg_vruntime():686982466017
              curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
              leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
..

[  397.877251]cfs_rq:0xe75f7100
              cfs_rq.avg_vruntime:-759390883821798
              cfs_rq.min_vruntime:810640668779
              avg_vruntime(): 689577229374
              curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
              leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)


The leftmost se is a task group, its vruntime reduces from 332464705486 to
-59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
on the tree entities.

Back to reweight_entity():
vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;                      
se->vruntime = avruntime - vlag = -59751717438670;

the new se vruntime -59751717438670 is close to what we printed -59752941080010,
consider that the avg_vruntime() vary.

Then later this leftmost se has changed its load back and forth, and when the load is 2,
the vuntime has reached a dangerous threshold to trigger the s64 overflow in
eligible check:

[  398.011991]cfs_rq:0xe75f7100
              cfs_rq.avg_vruntime:-11875977385353427
              cfs_rq.min_vruntime:810640668779
              cfs_rq.avg_load:96985
              leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)

vruntime_eligible()
{

   key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
   key * avg_load overflow s64...
}

As a result the leftmost one can not be picked, and NULL is returned.

One workaround patch I'm thinking of, if this analysis is in the
right direction, maybe I can have a test later:

thanks,
Chenyu

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e0968fb9ba8..7ab26cdc3487 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	 *	   = V  - vl'
 	 */
 	if (avruntime != se->vruntime) {
+		s64 limit;
+
 		vlag = (s64)(avruntime - se->vruntime);
 		vlag = div_s64(vlag * old_weight, weight);
+		/* TBD: using old weight or new weight? */
+		limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+		vlag = clamp(lag, -limit, limit);
 		se->vruntime = avruntime - vlag;
 	}
 
-- 
2.25.1



Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ