linux-kernel - Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAB8ipk-fejQ41Jgk6z52+T6CP+impwbaOAfhA9vG_-FB9BeRyw@mail.gmail.com>
Date: Thu, 18 Apr 2024 10:57:22 +0800
From: Xuewen Yan <xuewen.yan94@...il.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Abel Wu <wuyun.abel@...edance.com>, 
	Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Tim Chen <tim.c.chen@...el.com>, 
	Tiwei Bie <tiwei.btw@...group.com>, Honglei Wang <wanghonglei@...ichuxing.com>, 
	Aaron Lu <aaron.lu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, 
	Yujie Liu <yujie.liu@...el.com>, linux-kernel@...r.kernel.org, 
	kernel test robot <oliver.sang@...el.com>
Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf()
 if no eligible entity is found

Hi Yu

On Thu, Apr 18, 2024 at 2:35 AM Chen Yu <yu.c.chen@...el.com> wrote:
>
> On 2024-04-09 at 11:21:04 +0200, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2024 at 09:11:39PM +0800, Chen Yu wrote:
> > > On 2024-04-08 at 13:58:33 +0200, Peter Zijlstra wrote:
> > > > On Thu, Feb 29, 2024 at 05:00:18PM +0800, Abel Wu wrote:
> > > >
> > > > > > According to the log, vruntime is 18435852013561943404, the
> > > > > > cfs_rq->min_vruntime is 763383370431, the load is 629 + 2048 = 2677,
> > > > > > thus:
> > > > > > s64 delta = (s64)(18435852013561943404 - 763383370431) = -10892823530978643
> > > > > >      delta * 2677 = 7733399554989275921
> > > > > > that is to say, the multiply result overflow the s64, which turns the
> > > > > > negative value into a positive value, thus eligible check fails.
> > > > >
> > > > > Indeed.
> > > >
> > > > From the data presented it looks like min_vruntime is wrong and needs
> > > > update. If you can readily reproduce this, dump the vruntime of all
> > > > tasks on the runqueue and see if min_vruntime is indeed correct.
> > > >
> > >
> > > This was the dump of all the entities on the tree, from left to right,
> >
> > Oh, my bad, I thought it was the pick path.
> >
> > > and also from top down in middle order traverse, when this issue happens:
> > >
> > > [  514.461242][ T8390] cfs_rq avg_vruntime:386638640128 avg_load:2048 cfs_rq->min_vruntime:763383370431
> > > [  514.535935][ T8390] current on_rq se 0xc5851400, deadline:18435852013562231446
> > >                     min_vruntime:18437121115753667698 vruntime:18435852013561943404, load:629
> > >
> > >
> > > [  514.536772][ T8390] Traverse rb-tree from left to right
> > > [  514.537138][ T8390]  se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible  <-- leftmost se
> > > [  514.537835][ T8390]  se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > >
> > > [  514.538539][ T8390] Traverse rb-tree from topdown
> > > [  514.538877][ T8390]  middle se 0xec1234e0 deadline:763384870431 min_vruntime:763383370431 vruntime:763383370431 non-eligible   <-- root se
> > > [  514.539605][ T8390]  middle se 0xec4fcf20 deadline:763762447228 min_vruntime:763760947228 vruntime:763760947228 non-eligible
> > >
> > > The tree looks like:
> > >
> > >           se (0xec1234e0)
> > >                   |
> > >                   |
> > >                   ----> se (0xec4fcf20)
> > >
> > >
> > > The root se 0xec1234e0 is also the leftmost se, its min_vruntime and
> > > vruntime are both 763383370431, which is aligned with
> > > cfs_rq->min_vruntime. It seems that the cfs_rq's min_vruntime gets
> > > updated correctly, because it is monotonic increasing.
> >
> > Right.
> >
> > > My guess is that, for some reason, one newly forked se in a newly
> > > created task group, in the rb-tree has not been picked for a long
> > > time(maybe not eligible). Its vruntime stopped at the negative
> > > value(near (unsigned long)(-(1LL << 20)) for a long time, its vruntime
> > > is long behind the cfs_rq->vruntime, thus the overflow happens.
> >
> > I'll have to do the math again, but that's something in the order of not
> > picking a task in about a day, that would be 'bad' :-)
> >
> > Is there any sane way to reproduce this, and how often does it happen?
>
> After adding some ftrace in place_entity() and pick_eevdf(), with the
> help from Yujie in lkp, the issue was reproduced today. The reason why se's vruntime
> is very small seems to be related to task group's reweight_entity():
>
> vlag = (s64)(avruntime - se->vruntime);
> vlag = div_s64(vlag * old_weight, weight);
> se->vruntime = avruntime - vlag;
>
> The vlag above is not limited by neither 2*se->slice nor TICK_NSEC,
> if the new weight is very small, which is very likely, then the vlag
> could be very large, results in a very small vruntime.
>
>
> The followings are the details why I think above could bring problems:
>
> Here is the debug log printed by place_entity():
>
>
> [  397.597268]cfs_rq:0xe75f7100
>               cfs_rq.avg_vruntime:-1111846207333767
>               cfs_rq.min_vruntime:810640668779
>               avg_vruntime():686982466017
>               curr(0xc59f4f20 rb_producer weight:15 vruntime:1447773196654 sum_exec_ns:187707021870 ctx(0 73)
>               leftmost(0xeacb6e00 vruntime:332464705486 sum_exec_ns:78776125437 load:677)
> ..
>
> [  397.877251]cfs_rq:0xe75f7100
>               cfs_rq.avg_vruntime:-759390883821798
>               cfs_rq.min_vruntime:810640668779
>               avg_vruntime(): 689577229374
>               curr(0xc59f4f20 rb_producer weight:15 vruntime:1453640907998 sum_ns:187792974673 ctx(0 73)
>               leftmost(0xeacb6e00 vruntime:-59752941080010 sum_ns:78776125437 load:4)
>
>
> The leftmost se is a task group, its vruntime reduces from 332464705486 to
> -59752941080010, because its load reduced from 677 to 4 due to update_cfs_group()
> on the tree entities.
>
> Back to reweight_entity():
> vlag = avruntime - se->vruntime = 689577229374 - 332464705486 = 357112523888;
> vlag = vlag * old_weight / weight = 357112523888 * 677 / 4 = 60441294668044;
> se->vruntime = avruntime - vlag = -59751717438670;
>
> the new se vruntime -59751717438670 is close to what we printed -59752941080010,
> consider that the avg_vruntime() vary.
>
> Then later this leftmost se has changed its load back and forth, and when the load is 2,
> the vuntime has reached a dangerous threshold to trigger the s64 overflow in
> eligible check:
>
> [  398.011991]cfs_rq:0xe75f7100
>               cfs_rq.avg_vruntime:-11875977385353427
>               cfs_rq.min_vruntime:810640668779
>               cfs_rq.avg_load:96985
>               leftmost(0xeacb6e00 vruntime:18446623907344963655 load:2)
>
> vruntime_eligible()
> {
>
>    key = se.vruntime - cfs_rq.min_vruntime = -120977005256740;
>    key * avg_load overflow s64...
> }
>
> As a result the leftmost one can not be picked, and NULL is returned.
>
> One workaround patch I'm thinking of, if this analysis is in the
> right direction, maybe I can have a test later:
>
> thanks,
> Chenyu
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e0968fb9ba8..7ab26cdc3487 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3965,8 +3965,13 @@ static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
>          *         = V  - vl'
>          */
>         if (avruntime != se->vruntime) {
> +               s64 limit;
> +
>                 vlag = (s64)(avruntime - se->vruntime);
>                 vlag = div_s64(vlag * old_weight, weight);
> +               /* TBD: using old weight or new weight? */
> +               limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> +               vlag = clamp(lag, -limit, limit);
>                 se->vruntime = avruntime - vlag;
>         }
>

According to previous discussion:
https://lore.kernel.org/all/CAB8ipk9N9verfQp6U9s8+TQgNbA5J0DWkOB1dShf20n0xbx94w@mail.gmail.com/

Could this patch avoid this problem?

BR

> --
> 2.25.1
>
>
>