linux-kernel - Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZiEaKOQwiNEglYtS@chenyu5-mobl2>
Date: Thu, 18 Apr 2024 21:03:36 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Xuewen Yan <xuewen.yan94@...il.com>
CC: Peter Zijlstra <peterz@...radead.org>, Abel Wu <wuyun.abel@...edance.com>,
	Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Juri Lelli <juri.lelli@...hat.com>, Tim Chen <tim.c.chen@...el.com>, "Tiwei
 Bie" <tiwei.btw@...group.com>, Honglei Wang <wanghonglei@...ichuxing.com>,
	Aaron Lu <aaron.lu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Yujie Liu
	<yujie.liu@...el.com>, <linux-kernel@...r.kernel.org>, kernel test robot
	<oliver.sang@...el.com>, Tianchen Ding <dtcccc@...ux.alibaba.com>
Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf()
 if no eligible entity is found

Hi Xuewen,

On 2024-04-18 at 10:57:22 +0800, Xuewen Yan wrote:
> Hi Yu
> 
> On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <yu.c.chen@...el.com> wrote:
> >
> > On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > > >
> > > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > > thus:
> > > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > > >      delta * 2677 = 7733399554989275921
> > > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > > negative value into a positive value, thus eligible check fails.
> > > > > >
> > > > > > Indeed.
> > > > >
> > > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > > >
> > > >
> > > > This was the dump of all the entities on the tree, from left to right,
> > >
> > > Oh, my bad, I thought it was the pick path.
> > >
> > > > and also from top down in middle order traverse, when this issue happens:
> > > >
> > > > [  514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > > [  514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > > >                     min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > > >
> > > >
> > > > [  514.536772][ T8390] Traverse rb-tree from left to right
> > > > [  514.537138][ T8390]  se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible  <-- leftmost se
> > > > [  514.537835][ T8390]  se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > [  514.538539][ T8390] Traverse rb-tree from topdown
> > > > [  514.538877][ T8390]  middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible   <-- root se
> > > > [  514.539605][ T8390]  middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > > >
> > > > The tree looks like:
> > > >
> > > >           se (0xec1234e0)
> > > >                   |
> > > >                   |
> > > >                   ----> se (0xec4fcf20)
> > > >
> > > >
> > > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > > vruntime are both 763383370431, which is aligned with
> > > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > > updated correctly, because it is monotonic increasing.
> > >
> > > Right.
> > >
> > > > My guess is that, for some reason, one newly forked se in a newly
> > > > created task group, in the rb-tree has not been picked for a long
> > > > time(maybe not eligible). Its vruntime stopped at the negative
> > > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> > >
> > > I'll have to do the math again, but that's something in the order of not
> > > picking a task in about a day, that would be 'bad' :-)
> > >
> > > Is there any sane way to reproduce this, and how often does it happen?
> >
> > After adding some ftrace in place_entity() and pick_eevdf(), with the
> > help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> > is very small seems to be related to task group's reweight_entity():
> >
> > vlag = (s64)(avruntime - se->vruntime);
> > vlag = div_s64(vlag * old_weight, weight);
> > se->vruntime = avruntime - vlag;
> >
> > The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> > if the new weight is very small, which is very likely, then the vlag
> > could be very large, results in a very small vruntime.
> >
> >
> > The followings are the details why I think above could bring problems:
> >
> > Here is the debug log printed by place_entity():
> >
> >
> > [  397.597268]cfs_rq:0xe75f7100
> >               cfs_rq.avg_vruntime:-1111846207333767
> >               cfs_rq.min_vruntime:810640668779
> >               avg_vruntime():686982466017
> >               curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
> >               leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> > ..
> >
> > [  397.877251]cfs_rq:0xe75f7100
> >               cfs_rq.avg_vruntime:-759390883821798
> >               cfs_rq.min_vruntime:810640668779
> >               avg_vruntime(): 689577229374
> >               curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
> >               leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
> >
> >
> > The leftmost se is a task group, its vruntime reduces from 332464705486 to
> > -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> > on the tree entities.
> >
> > Back to reweight_entity():
> > vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> > vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> > se->vruntime = avruntime - vlag = -59751717438670;
> >
> > the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> > consider that the avg_vruntime() vary.
> >
> > Then later this leftmost se has changed its load back and forth, and when the load is 2,
> > the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> > eligible check:
> >
> > [  398.011991]cfs_rq:0xe75f7100
> >               cfs_rq.avg_vruntime:-11875977385353427
> >               cfs_rq.min_vruntime:810640668779
> >               cfs_rq.avg_load:96985
> >               leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
> >
> > vruntime_eligible()
> > {
> >
> >    key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
> >    key * avg_load overflow s64...
> > }
> >
> > As a result the leftmost one can not be picked, and NULL is returned.
> >
> > One workaround patch I'm thinking of, if this analysis is in the
> > right direction, maybe I can have a test later:
> >
> > thanks,
> > Chenyu
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e0968fb9ba8..7ab26cdc3487 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >          *         = V  - vl'
> >          */
> >         if (avruntime != se->vruntime) {
> > +               s64 limit;
> > +
> >                 vlag = (s64)(avruntime - se->vruntime);
> >                 vlag = div_s64(vlag * old_weight, weight);
> > +               /* TBD: using old weight or new weight? */
> > +               limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > +               vlag = clamp(lag, -limit, limit);
> >                 se->vruntime = avruntime - vlag;
> >         }
> >
> 
> According to previous discussion:
> https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@mail.gmail.com/
> 
> Could this patch avoid this problem?
>

100 cycles has passed with the following change, with the fork() cased removed.
There was no NULL pointer issue observed so far. Meanwhile, I've launched a test
with Tianchen's patch set on top of this one to see if everything goes well.

Can you take a look if the following change make sense, and if yes, could you
send v2 and CC the reporters, so they could also have a try if this helps.

Reported-by: Sergei Trofimovich <slyich@...il.com>
Closes: https://lore.kernel.org/all/ZhuYyrh3mweP_Kd8@nz.home/
Reported-by: Igor Raits <igor@...ddata.com>
Closes: https://lore.kernel.org/all/CA+9S74ih+45M_2TPUY_mPPVDhNvyYfy1J1ftSix+KjiTVxg8nw@mail.gmail.com/
Reported-by: Breno Leitao <leitao@...ian.org>
Reported-by: kernel test robot <oliver.sang@...el.com>
Closes: https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
Reported-by: Yujie Liu <yujie.liu@...el.com>


And in v2, it would be helpful to include the reason why we did this change, how
about something like this:

"
kernel encounters the following error when running workload:

BUG: kernel NULL pointer dereference, address: 0000002c
EIP: set_next_entity (fair.c:?)

which was caused by NULL pointer returned by pick_eevdf().

Further investigation has shown that, the entity_eligible() has an
false-negative issue when the entity's vruntime is far behind the
cfs_rq.min_vruntime that, the (vruntime - cfs_rq->min_vruntime) * load
caused a s64 overflow, thus every entity on the rb-tree is not
eligible, which results in a NULL candidate.

The reason why entity's vruntime is far behind the cfs_rq.min_vruntime
is because during a on_rq task group's update_cfs_group()->reweight_eevdf(),
there is no limit on the new entity's vlag. If the new weight is much
smaller than the old one,

vlag = div_s64(vlag * old_weight, weight)

generates a huge vlag, and results in very small(negative) vruntime.

Thus limit the range of vlag accordingly.

"

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31bca05c3612..9f203012e8f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -696,15 +696,23 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
  *
  * XXX could add max_slice to the augmented data to track this.
  */
+
+static s64 limit_entity_lag(struct sched_entity *se, s64 lag)
+{
+	s64 limit;
+
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	return clamp(lag, -limit, limit);
+}
+
 static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	s64 lag, limit;
+	s64 lag;
 
 	SCHED_WARN_ON(!se->on_rq);
 	lag = avg_vruntime(cfs_rq) - se->vruntime;
 
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
-	se->vlag = clamp(lag, -limit, limit);
+	se->vlag = limit_entity_lag(se, lag);
 }
 
 /*
@@ -3721,6 +3729,7 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	if (avruntime != se->vruntime) {
 		vlag = (s64)(avruntime - se->vruntime);
 		vlag = div_s64(vlag * old_weight, weight);
+		vlag = limit_entity_lag(se, vlag);
 		se->vruntime = avruntime - vlag;
 	}
 
@@ -3768,6 +3777,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 	update_load_set(&se->load, weight);
 
+	if (!se->on_rq)
+		se->vlag = limit_entity_lag(se, se->vlag);
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
-- 
2.25.1



thanks,
Chenyu