linux-kernel - Re: [PATCH] sched: fix infinity loop in update_blocked

Message-ID: <CAKfTPtA5WBKQ1FMyGqGjn1DdeDbSwT=bTQjD=oW7WMz6iKiS7A@mail.gmail.com>
Date:   Fri, 28 Dec 2018 10:30:07 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Sargun Dhillon <sargun@...gun.me>,
        Xie XiuQi <xiexiuqi@...wei.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>, xiezhipeng1@...wei.com,
        huawei.libin@...wei.com,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Dmitry Adamushko <dmitry.adamushko@...il.com>,
        Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH] sched: fix infinity loop in update_blocked_averages

On Fri, 28 Dec 2018 at 03:02, Tejun Heo <tj@...nel.org> wrote:
>
> On Thu, Dec 27, 2018 at 05:53:52PM -0800, Tejun Heo wrote:
> > Vincent knows that part way better than me but I think the safest way
> > would be doing the optimization removal iff tmp_alone_branch is
> > already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
> > something else only while a branch is being built and deferring
> > optimization removal by an avg update cycle isn't gonna make any
> > difference anyway.

But the lock should not be released while a branch is being built, and
tmp_alone_branch must always point to rq->leaf_cfs_rq_list at the end,
before the lock is released.

I think that there is a bigger problem with commit a9e7f6544b9c and
cfs_rq throttling:
Let's take the example of the following topology: TG2 --> TG1 --> root
1-The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
cfs_rq to leaf_cfs_rq_list, and we are sure to do the whole branch in
one pass because it has never been used and can't be throttled, so
tmp_alone_branch will point to leaf_cfs_rq_list at the end.
2-Then TG1 is throttled
3-and we add TG3 as a new child of TG1.
4-The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
cfs_rq, and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.
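
The four steps above can be sketched with a small userspace model. This
is only an illustration written from the description in this mail, not
the kernel code: the structs are heavily simplified, the helper names
(add_leaf, enqueue_branch, leaf_pos) are hypothetical, and throttling is
modelled simply as the bottom-up walk stopping at a throttled cfs_rq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_insert(struct list_head *n, struct list_head *prev,
                        struct list_head *next)
{
    n->prev = prev;
    n->next = next;
    prev->next = n;
    next->prev = n;
}

/* Hypothetical, heavily simplified cfs_rq and rq. */
struct cfs_rq {
    struct cfs_rq *parent;          /* NULL for the root cfs_rq */
    bool on_list, throttled;
    struct list_head leaf;
};

struct rq {
    struct list_head leaf_cfs_rq_list;
    struct list_head *tmp_alone_branch;
};

/* Model of adding one cfs_rq to the leaf list (three cases). */
static void add_leaf(struct rq *rq, struct cfs_rq *c)
{
    struct list_head *head = &rq->leaf_cfs_rq_list;
    struct list_head *tmp = rq->tmp_alone_branch;

    if (c->on_list)
        return;
    if (c->parent && c->parent->on_list) {
        /* Parent already listed: insert just before it. */
        list_insert(&c->leaf, c->parent->leaf.prev, &c->parent->leaf);
        rq->tmp_alone_branch = head;
    } else if (!c->parent) {
        /* Root: goes to the tail, which completes the branch. */
        list_insert(&c->leaf, head->prev, head);
        rq->tmp_alone_branch = head;
    } else {
        /* Parent not listed yet: start or extend an alone branch. */
        list_insert(&c->leaf, tmp, tmp->next);
        rq->tmp_alone_branch = &c->leaf;
    }
    c->on_list = true;
}

/* Bottom-up enqueue walk that stops at a throttled cfs_rq. */
static void enqueue_branch(struct rq *rq, struct cfs_rq *c)
{
    for (; c && !c->throttled; c = c->parent)
        add_leaf(rq, c);
}

/* Index of c in a forward walk from the list head, or -1. */
static int leaf_pos(struct rq *rq, struct cfs_rq *c)
{
    struct list_head *head = &rq->leaf_cfs_rq_list;
    int i = 0;

    for (struct list_head *p = head->next; p != head; p = p->next, i++)
        if (p == &c->leaf)
            return i;
    return -1;
}

static void demo_steps_1_to_4(void)
{
    struct rq rq;
    struct cfs_rq root = {0}, tg1 = {.parent = &root},
                  tg2 = {.parent = &tg1}, tg3 = {.parent = &tg1};
    struct list_head *head = &rq.leaf_cfs_rq_list;

    head->next = head->prev = head;
    rq.tmp_alone_branch = head;

    enqueue_branch(&rq, &tg2);          /* step 1: whole branch at once */
    assert(leaf_pos(&rq, &tg2) == 0 && leaf_pos(&rq, &tg1) == 1);
    assert(leaf_pos(&rq, &root) == 2 && rq.tmp_alone_branch == head);

    tg1.throttled = true;               /* step 2 */
    enqueue_branch(&rq, &tg3);          /* steps 3 and 4 */
    assert(leaf_pos(&rq, &tg3) == 1 && leaf_pos(&rq, &tg1) == 2);
    assert(rq.tmp_alone_branch == head);
}
```

In this model the walk for TG3 stops at the throttled TG1, but since TG1
is still on the list, TG3 is connected right away and tmp_alone_branch
is reset to the list head.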

With commit a9e7f6544b9c, we can delete a cfs_rq from
rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes null before
step 2 above, TG1 cfs_rq is removed from the list.
Then at step 4, TG3 cfs_rq is added at the beginning of
rq->leaf_cfs_rq_list, but tmp_alone_branch still points to TG3 cfs_rq
when the lock is released, because its throttled parent can't be
enqueued. So tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list
whereas it should.

So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
points to another TG cfs_rq, the next TG cfs_rq to be added will be
linked outside rq->leaf_cfs_rq_list.
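
The failure sequence can be reproduced in a small userspace model. This
is a sketch under assumptions, not the kernel code: the structs and the
helpers (add_leaf, del_leaf, enqueue_branch, reachable) are
hypothetical, TG4 is an extra child of TG1 invented for the example, and
del_leaf only unlinks the node and leaves its own pointers untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_insert(struct list_head *n, struct list_head *prev,
                        struct list_head *next)
{
    n->prev = prev;
    n->next = next;
    prev->next = n;
    next->prev = n;
}

/* Hypothetical, heavily simplified cfs_rq and rq. */
struct cfs_rq {
    struct cfs_rq *parent;          /* NULL for the root cfs_rq */
    bool on_list, throttled;
    struct list_head leaf;
};

struct rq {
    struct list_head leaf_cfs_rq_list;
    struct list_head *tmp_alone_branch;
};

/* Three cases: parent already listed, root, or alone branch. */
static void add_leaf(struct rq *rq, struct cfs_rq *c)
{
    struct list_head *head = &rq->leaf_cfs_rq_list;
    struct list_head *tmp = rq->tmp_alone_branch;

    if (c->on_list)
        return;
    if (c->parent && c->parent->on_list) {
        list_insert(&c->leaf, c->parent->leaf.prev, &c->parent->leaf);
        rq->tmp_alone_branch = head;
    } else if (!c->parent) {
        list_insert(&c->leaf, head->prev, head);
        rq->tmp_alone_branch = head;
    } else {
        list_insert(&c->leaf, tmp, tmp->next);  /* trusts tmp_alone_branch */
        rq->tmp_alone_branch = &c->leaf;
    }
    c->on_list = true;
}

/* Unlink only: the stale node keeps its own next/prev pointers. */
static void del_leaf(struct cfs_rq *c)
{
    c->leaf.prev->next = c->leaf.next;
    c->leaf.next->prev = c->leaf.prev;
    c->on_list = false;
}

/* Bottom-up enqueue walk that stops at a throttled cfs_rq. */
static void enqueue_branch(struct rq *rq, struct cfs_rq *c)
{
    for (; c && !c->throttled; c = c->parent)
        add_leaf(rq, c);
}

/* True if c is found by a forward walk from the list head. */
static bool reachable(struct rq *rq, struct cfs_rq *c)
{
    struct list_head *head = &rq->leaf_cfs_rq_list;

    for (struct list_head *p = head->next; p != head; p = p->next)
        if (p == &c->leaf)
            return true;
    return false;
}

static void demo_stale_tmp_alone_branch(void)
{
    struct rq rq;
    struct cfs_rq root = {0}, tg1 = {.parent = &root},
                  tg2 = {.parent = &tg1}, tg3 = {.parent = &tg1},
                  tg4 = {.parent = &tg1};
    struct list_head *head = &rq.leaf_cfs_rq_list;

    head->next = head->prev = head;
    rq.tmp_alone_branch = head;

    enqueue_branch(&rq, &tg2);      /* step 1: TG2, TG1, root */
    del_leaf(&tg1);                 /* TG1 decayed and removed */
    tg1.throttled = true;           /* step 2 */
    enqueue_branch(&rq, &tg3);      /* step 4: TG3 at the front */
    assert(reachable(&rq, &tg3));
    assert(rq.tmp_alone_branch == &tg3.leaf);   /* not reset to head */

    del_leaf(&tg3);                 /* TG3 removed, pointer now stale */
    enqueue_branch(&rq, &tg4);      /* next add goes after the stale node */
    assert(tg4.on_list && !reachable(&rq, &tg4));
}
```

In the model, TG4 is marked on_list but a forward walk from
rq->leaf_cfs_rq_list never visits it, because it was chained to the
node that tmp_alone_branch was left pointing at.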

In addition, we can break the ordering of the cfs_rq in
rq->leaf_cfs_rq_list, but this ordering is used to update and
propagate the update from leaf down to root.
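
The ordering requirement can be stated as a simple check: in walk
order, no cfs_rq may come after its own parent. The sketch below is only
an illustration; it uses a plain array instead of the real list, and the
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal cfs_rq: only the parent link matters here. */
struct cfs_rq {
    const struct cfs_rq *parent;    /* NULL for the root cfs_rq */
};

/*
 * Return true if no entry in walk[] is preceded by its own parent,
 * i.e. a walk in this order always updates a child before its parent.
 */
static bool children_before_parents(const struct cfs_rq **walk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (walk[j] == walk[i]->parent)
                return false;
    return true;
}

static void demo_ordering(void)
{
    struct cfs_rq root = { NULL }, tg1 = { &root },
                  tg2 = { &tg1 }, tg3 = { &tg1 };
    const struct cfs_rq *good[] = { &tg2, &tg3, &tg1, &root };
    const struct cfs_rq *bad[]  = { &tg1, &tg3, &tg2, &root };

    assert(children_before_parents(good, 4));
    /* TG1 is walked before its children: their updates arrive too late. */
    assert(!children_before_parents(bad, 4));
}
```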

>
> So, something like the following.  Xie, can you see whether the
> following patch resolves the problem?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1907506318a..88b9118b5191 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
>                  * There can be a lot of idle CPU cgroups.  Don't let fully
>                  * decayed cfs_rqs linger on the list.
>                  */
> -               if (cfs_rq_is_decayed(cfs_rq))
> +               if (cfs_rq_is_decayed(cfs_rq) &&
> +                   rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
>                         list_del_leaf_cfs_rq(cfs_rq);

This patch reduces the number of cases, but I don't think it's enough
because it doesn't cover the case of unregister_fair_sched_group().
And we can still break the ordering of the cfs_rq.

>
>                 /* Don't need periodic decay once load/util_avg are null */