linux-kernel - Re: Crash in list_add_leaf_cfs_rq due to bad tmp_alone

Message-ID: <CAKfTPtAytCk4ZTrhkODCdKkMzif7kjWo4y4i3=YQdjT1v=CD7A@mail.gmail.com>
Date:   Mon, 18 Feb 2019 09:04:12 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Gabriel Hartmann <gabriel.hartmann@...il.com>
Cc:     Sargun Dhillon <sargun@...gun.me>,
        LKML <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tejun Heo <tj@...nel.org>,
        Peter Zijlstra <a.p.zijlstra@...llo.nl>,
        Gabriel Hartmann <ghartmann@...flix.com>
Subject: Re: Crash in list_add_leaf_cfs_rq due to bad tmp_alone_branch

Hi Gabriel,

On Sat, 16 Feb 2019 at 00:06, Gabriel Hartmann
<gabriel.hartmann@...il.com> wrote:
>
> Hi Vincent,
>
> Apologies for the slow turnaround on this.  We have now tried both approaches to fixing the bug.  In both cases, for a particularly long-duration, CPU-intensive workload, we are seeing a ~33% slowdown.

This was somewhat expected, because the unused cfs_rq are no longer
removed from the list, but at least the list is correctly ordered with
my patch.
The official version of this patch is here: https://lkml.org/lkml/2019/2/4/121
Since then, more patches have been queued that remove unused cfs_rq
while keeping the list correctly ordered: https://lkml.org/lkml/2019/2/6/499

With these 3 patches, the slowdown should disappear and the list
ordering will stay correct.

Regards,
Vincent

>
> -- Gabriel
>
> On Fri, Jan 25, 2019 at 6:31 AM Vincent Guittot <vincent.guittot@...aro.org> wrote:
>>
>> Hi Sargun,
>>
>> On Mon, 21 Jan 2019 at 15:46, Vincent Guittot
>> <vincent.guittot@...aro.org> wrote:
>> >
>> > Hi Sargun,
>> >
>> > On Friday 18 Jan 2019 at 15:06:28 (+0100), Vincent Guittot wrote:
>> > > On Fri, 18 Jan 2019 at 11:16, Vincent Guittot
>> > > <vincent.guittot@...aro.org> wrote:
>> > > >
>> > > > On Wed, 9 Jan 2019 at 23:43, Sargun Dhillon <sargun@...gun.me> wrote:
>> > > > >
>> > > > > On Wed, Jan 9, 2019 at 2:14 PM Sargun Dhillon <sargun@...gun.me> wrote:
>> > > > > >
>> > > > > > I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
>> > > > > > infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
>> > > > > > and put it on top of 4.19.13. In addition to this, I uninlined
>> > > > > > list_add_leaf_cfs_rq for debugging.
>> > >
>> > > With the fix above applied, the code that manages the leaf_cfs_rq_list
>> > > has been the same since v4.9.
>> > > Have you noticed a similar problem on any older kernel version between
>> > > v4.9 and v4.19? The problem might have been introduced while modifying
>> > > another part of the scheduler, like the sequence for adding/removing
>> > > a cgroup.
>> > >
>> > > Knowing the most recent kernel version without the problem could help
>> > > to narrow it down.
>> > >
>> > > Thanks,
>> > > Vincent
>> > >
>> > > > > >
>> > > > > > This revealed a new bug that we didn't get to because we kept getting
>> > > > > > crashes from the previous issue. When we are running with cgroups that
>> > > > > > are rapidly changing, with CFS bandwidth control, and in addition
>> > > > > > using the cpusets cgroup, we see this crash. Specifically, it seems to
>> > > > > > occur with cgroups that are throttled and we change the allowed
>> > > > > > cpuset.
>> > > >
>> > > > Thanks for the context, I will try to reproduce the problem and
>> > > > understand how we can stop in the middle of walking the
>> > > > sched_entity branch with a parent not already added.
>> > > >
>> > > > How many cgroup levels do you have in your setup?
>> > > >
>> > > > > >
>> > > > >
>> > > > > This patch from Gabriel should fix the problem:
>> > > > >
>> > > > >
>> > > > > [PATCH] sched/fair: Reset tmp_alone_branch on cfs_rq delete
>> > > > >
>> > > > > When a child cfs_rq is added to the leaf cfs_rq list before its parent,
>> > > > > tmp_alone_branch is set to point to the child in preparation for the
>> > > > > parent being added.
>> > > > >
>> > > > > If the child is deleted before the parent is added then tmp_alone_branch
>> > > > > points to a freed cfs_rq. Any future reference to tmp_alone_branch will
>> > > > > result in a use after free.
>> > > >
>> > > > So, the patch below is a temporary fix that helps to recover from the
>> > > > situation where tmp_alone_branch doesn't end up pointing back to
>> > > > rq->leaf_cfs_rq_list.
>> > > > But this situation should not happen in the first place.
>> >
>> > I have been able to reproduce the situation where tmp_alone_branch doesn't
>> > point to rq->leaf_cfs_rq_list after enqueuing a task.
>> >
>> > Can you try the patch below, which ensures that all cfs_rq of a cgroup
>> > branch will be added to the list even if throttled?
>>
>> Did you get a chance to test this patch?
>>
>> Regards,
>> Vincent
>>
>> >
>> > The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
>> > it will walk down to the root the first time a cfs_rq is used, and that we
>> > will finish by adding either a cfs_rq without a parent or a cfs_rq with a
>> > parent that is already on the list. But this is not always true in the
>> > presence of throttling. Because a cfs_rq can be throttled even if it has
>> > never been used, when other CPUs of the cgroup have already used all the
>> > bandwidth, we are not sure to go down to the root and add all cfs_rq to
>> > the list.
>> >
>> > Ensure that all cfs_rq will be added to the list even if they are throttled.
>> >
>> > Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
>> > ---
>> >  kernel/sched/fair.c | 17 +++++++++++++++++
>> >  1 file changed, 17 insertions(+)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 6483834..ae468ab 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>> >         }
>> >  }
>> >
>> > +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
>> > +{
>> > +       struct cfs_rq *cfs_rq;
>> > +
>> > +       for_each_sched_entity(se) {
>> > +               cfs_rq = cfs_rq_of(se);
>> > +               list_add_leaf_cfs_rq(cfs_rq);
>> > +
>> > +               /* If parent is already in the list, we can stop */
>> > +               if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
>> > +                       break;
>> > +       }
>> > +}
>> > +
>> >  /* Iterate through all leaf cfs_rq's on a runqueue: */
>> >  #define for_each_leaf_cfs_rq(rq, cfs_rq) \
>> >         list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
>> > @@ -5177,6 +5191,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> >
>> >         }
>> >
>> > +       /* Ensure that all cfs_rq have been added to the list */
>> > +       list_add_branch_cfs_rq(se, rq);
>> > +
>> >         hrtick_update(rq);
>> >  }
>> >
>> >
>> >
>> > > >
>> > > >
>> > > > >
>> > > > > Signed-off-by: Gabriel Hartmann <gabriel.hartmann@...il.com>
>> > > > > Reported-by: Sargun Dhillon <sargun@...gun.me>
>> > > > > ---
>> > > > >  kernel/sched/fair.c | 5 +++++
>> > > > >  1 file changed, 5 insertions(+)
>> > > > >
>> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > > > > index 7137bc343b4a..0987629cbb76 100644
>> > > > > --- a/kernel/sched/fair.c
>> > > > > +++ b/kernel/sched/fair.c
>> > > > > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>> > > > >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>> > > > >  {
>> > > > >      if (cfs_rq->on_list) {
>> > > > > +        struct rq *rq = rq_of(cfs_rq);
>> > > > > +
>> > > > > +        if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
>> > > > > +            rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
>> > > > > +
>> > > > >          list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
>> > > > >          cfs_rq->on_list = 0;
>> > > > >      }
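
The stale-pointer scenario described in Gabriel's patch can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the toy_* types and functions are hypothetical stand-ins for struct rq, struct cfs_rq and the list helpers, there is no RCU or locking, and the "child added before its parent" path is reduced to the single step the patch description mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

/* Insert 'entry' right after 'pos'. */
static void list_add(struct list_head *entry, struct list_head *pos)
{
	entry->next = pos->next;
	entry->prev = pos;
	pos->next->prev = entry;
	pos->next = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
}

/* Hypothetical, heavily reduced stand-ins for struct rq / struct cfs_rq. */
struct toy_rq {
	struct list_head leaf_cfs_rq_list;
	struct list_head *tmp_alone_branch;
};

struct toy_cfs_rq {
	struct list_head leaf_cfs_rq_list;
	int on_list;
	struct toy_rq *rq;
};

/* Child added while its parent is not yet on the list: the child becomes
 * the start of the pending branch, so tmp_alone_branch points at it. */
static void toy_add_child_without_parent(struct toy_cfs_rq *cfs_rq)
{
	struct toy_rq *rq = cfs_rq->rq;

	list_add(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
	rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
	cfs_rq->on_list = 1;
}

/* Delete; with apply_fix set, mirror the reset proposed in the patch. */
static void toy_del(struct toy_cfs_rq *cfs_rq, int apply_fix)
{
	struct toy_rq *rq = cfs_rq->rq;

	assert(rq != NULL);
	if (!cfs_rq->on_list)
		return;
	if (apply_fix && rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
	list_del(&cfs_rq->leaf_cfs_rq_list);
	cfs_rq->on_list = 0;
}

/* Returns 1 if tmp_alone_branch is back at the list head after the child
 * is deleted, 0 if it still points at the deleted child. */
int toy_demo(int apply_fix)
{
	struct toy_rq rq;
	struct toy_cfs_rq child = { .on_list = 0, .rq = &rq };

	list_init(&rq.leaf_cfs_rq_list);
	rq.tmp_alone_branch = &rq.leaf_cfs_rq_list;

	toy_add_child_without_parent(&child);
	toy_del(&child, apply_fix);

	return rq.tmp_alone_branch == &rq.leaf_cfs_rq_list;
}
```

In this model, toy_demo(0) leaves tmp_alone_branch pointing at the deleted child, while toy_demo(1) brings it back to the list head. As noted earlier in the thread, this only recovers from the bad state; it does not explain why the walk stopped before the parent was added.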