linux-kernel - Re: [RFC PATCH] sched/fair: update the vruntime to be max vruntime when yield

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230228133111.6i5tlhvthnfljvmf@airbuntu>
Date:   Tue, 28 Feb 2023 13:31:11 +0000
From:   Qais Yousef <qyousef@...alina.io>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Xuewen Yan <xuewen.yan94@...il.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Xuewen Yan <xuewen.yan@...soc.com>, mingo@...hat.com,
        juri.lelli@...hat.com, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com,
        linux-kernel@...r.kernel.org, ke.wang@...soc.com,
        zhaoyang.huang@...soc.com
Subject: Re: [RFC PATCH] sched/fair: update the vruntime to be max vruntime
 when yield

On 02/28/23 10:07, Vincent Guittot wrote:
> On Tue, 28 Feb 2023 at 09:21, Xuewen Yan <xuewen.yan94@...il.com> wrote:
> >
> > Hi Vincent
> >
> > On Tue, Feb 28, 2023 at 3:53 PM Vincent Guittot
> > <vincent.guittot@...aro.org> wrote:
> > >
> > > On Tue, 28 Feb 2023 at 08:42, Xuewen Yan <xuewen.yan94@...il.com> wrote:
> > > >
> > > > Hi
> > > >
> > > > Thanks very much for comments!
> > > >
> > > > On Tue, Feb 28, 2023 at 6:33 AM Qais Yousef <qyousef@...alina.io> wrote:
> > > > >
> > > > > On 02/27/23 16:40, Peter Zijlstra wrote:
> > > > > > On Wed, Feb 22, 2023 at 04:03:14PM +0800, Xuewen Yan wrote:
> > > > > > > When task call the sched_yield, cfs would set the cfs's skip buddy.
> > > > > > > If there is no other task call the sched_yield syscall, the task would
> > > > > > > always be skiped when there are tasks in rq.
> > > > > >
> > > > > > So you have two tasks A) which does sched_yield() and becomes ->skip,
> > > > > > and B) which is while(1). And you're saying that once A does it's thing,
> > > > > > B runs forever and starves A?
> > > > >
> > > > > I read it differently.
> > > > >
> > > > > I understood that there are multiple tasks.
> > > > >
> > > > > If Task A becomes ->skip; then it seems other tasks will continue to be picked
> > > > > instead. Until another task B calls sched_yield() and become ->skip, then Task
> > > > > A is picked but with wrong vruntime causing it to run for multiple ticks (my
> > > > > interpretation of 'always run' below).
> > > > >
> > > > > There are no while(1) task running IIUC.
> > > > >
> > > > > >
> > > > > > > As a result, the task's
> > > > > > > vruntime would not be updated for long time, and the cfs's min_vruntime
> > > > > > > is almost not updated.
> > > > > >
> > > > > > But the condition in pick_next_entity() should ensure that we still pick
> > > > > > ->skip when it becomes too old. Specifically, when it gets more than
> > > > > > wakeup_gran() behind.
> > > > >
> > > > > I am not sure I can see it either. Maybe __pick_first_entity() doesn't return
> > > > > the skipped one, or for some reason vdiff for second is almost always
> > > > > < wakeup_gran()?
> > > > >
> > > > > >
> > > > > > > When this scenario happens, when the yield task had wait for a long time,
> > > > > > > and other tasks run a long time, once there is other task call the sched_yield,
> > > > > > > the cfs's skip_buddy is covered, at this time, the first task can run normally,
> > > > > > > but the task's vruntime is small, as a result, the task would always run,
> > > > > > > because other task's vruntime is big. This would lead to other tasks can not
> > > > > > > run for a long time.
> > > > >
> > > > > The error seems that when Task A finally runs - it consumes more than its fair
> > > > > bit of sched_slice() as it looks it was starved.
> > > > >
> > > > > I think the question is why it was starved? Can you shed some light Xuewen?
> > > > >
> > > > > My attempt to help to clarify :) I have read this just like you.
> > > >
> > > > Thanks for Qais's clarify. And that's exactly what I want to say:)
> > > >
> > > > >
> > > > > FWIW I have seen a report of something similar, but I didn't managed to
> > > > > reproduce and debug (mostly due to ENOBANDWIDTH); and not sure if the details
> > > > > are similar to what Xuewen is seeing. But there was a task starving for
> > > > > multiple ticks - RUNNABLE but never RUNNING in spite of other tasks getting
> > > > > scheduled in instead multiple times. ie: there was a task RUNNING for most of
> > > > > the time, and I could see it preempted by other tasks multiple time, but not by
> > > > > the starving RUNNABLE task that is hung on the rq. It seems to be vruntime
> > > > > related too but speculating here.
> > > >
> > > > Yes, now we met the similar scenario when running a monkey test on the
> > > > android phone.
> > > > There are multiple tasks on cpu, but the runnable task could not be
> > > > got scheduled for a long time,
> > > > there is task running and we could see the task preempted by other
> > > > tasks multiple times.
> > > > Then we dump the tasks, and find the vruntime of each task varies
> > > > greatly, and the task which running call the sched_yield frequently.
> > >
> > > If I'm not wrong you are using cgroups and as a result you can't
> > > compare the vruntime of tasks that belongs to different group, you
> > > must compare the vruntime of entities at the same level. We might have
> > > to look the side because I can't see why the task would not be
> > > schedule if other tasks in the same group move forward their vruntime
> >
> > All the tasks belong to the same cgroup.

Could they move between cpusets though?

> 
> ok.
> I have tried to reproduce your problem but can't see it so far. I'm
> probably missing something.
> 
> With rt-app, I start:
> - 3 tasks A, B, C which are always running
> - 1 task D which always runs but yields every 1ms for 1000 times and
> then stops yielding and always run
> 
> All tasks are pinned on the same cpu in the same cgroup.
> 
> I don't see anything wrong.
> task A, B, C runs their slices
> task D is preempted by others after 1ms for a couple of times when it
> calls yield. Then the yield doesn't have effect and task D runs a few
> consecutive ms although the yield. Then task D restart to be preempted
> by others when it calls yield when its vruntime is close to others
> 
> Once task D stop calling yield, the 4 tasks runs normally

Could vruntime be inflated if a task gets stuck on a little core for a while
(where it'll run slower) then compared to another task running on a bigger core
the vruntime will appear smaller for the latter?


Cheers

--
Qais Yousef