Message-ID: <CAKfTPtAM=kgF7Fz-JKFY+s_k5KFirs-8Bub3s1Eqtq7P0NMa0w@mail.gmail.com>
Date:   Wed, 12 Feb 2020 15:47:30 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Phil Auld <pauld@...hat.com>, Parth Shah <parth@...ux.ibm.com>,
        Valentin Schneider <valentin.schneider@....com>
Subject: Re: [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path

On Wed, 12 Feb 2020 at 14:20, Mel Gorman <mgorman@...e.de> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:48PM +0100, Vincent Guittot wrote:
> > The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> > is split in 2 distinct parts for throttled cfs_rq without any added value
> > but making code less readable.
> >
> > Change the code ordering such that everything related to a cfs_rq
> > (throttled or not) will be done in the same loop.
> >
> > In addition, the same steps ordering is used when updating a cfs_rq:
> > - update_load_avg
> > - update_cfs_group
> > - update *h_nr_running
> >
> > No functional and performance changes are expected and have been noticed
> > during tests.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> > ---
> >  kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
> >  1 file changed, 20 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1a0ce83e835a..a1ea02b5362e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5259,32 +5259,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >               cfs_rq = cfs_rq_of(se);
> >               enqueue_entity(cfs_rq, se, flags);
> >
> > -             /*
> > -              * end evaluation on encountering a throttled cfs_rq
> > -              *
> > -              * note: in the case of encountering a throttled cfs_rq we will
> > -              * post the final h_nr_running increment below.
> > -              */
> > -             if (cfs_rq_throttled(cfs_rq))
> > -                     break;
> >               cfs_rq->h_nr_running++;
> >               cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >
> > +             /* end evaluation on encountering a throttled cfs_rq */
> > +             if (cfs_rq_throttled(cfs_rq))
> > +                     goto enqueue_throttle;
> > +
> >               flags = ENQUEUE_WAKEUP;
> >       }
> >
> >       for_each_sched_entity(se) {
> >               cfs_rq = cfs_rq_of(se);
> > -             cfs_rq->h_nr_running++;
> > -             cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >
> > +             /* end evaluation on encountering a throttled cfs_rq */
> >               if (cfs_rq_throttled(cfs_rq))
> > -                     break;
> > +                     goto enqueue_throttle;
> >
> >               update_load_avg(cfs_rq, se, UPDATE_TG);
> >               update_cfs_group(se);
> > +
> > +             cfs_rq->h_nr_running++;
> > +             cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >       }
> >
> > +enqueue_throttle:
> >       if (!se) {
> >               add_nr_running(rq, 1);
> >               /*
>
> I'm having trouble reconciling the patch with the description and the
> comments explaining the intent behind the code are unhelpful.
>
> There are two loops before and after your patch -- the first dealing with
> sched entities that are not on a runqueue and the second for the remaining
> entities that are. The intent appears to be to update the load averages
> once the entity is active on a runqueue.
>
> I'm not getting why the changelog says everything related to cfs is
> now done in one loop because there are still two. But even if you do
> get throttled, it's not clear why you jump to the !se check given that
> for_each_sched_entity did not complete. What it *does* appear to do is
> have all the h_nr_running related to entities being enqueued updated in
> one loop and all remaining entities stats updated in the other.

Let's take the example of 2 levels in addition to root, so we have:
root->cfs1->cfs2
Now we enqueue a task T1 on cfs2 while cfs1 is throttled. Before the
patch, we have the sequence:

In 1st for_each_sched_entity loop:
  loop 1
    enqueue_entity (T1->se, cfs2) which calls update_load_avg(cfs2)
    cfs2->h_nr_running++;
  loop 2
    enqueue_entity (cfs2->gse, cfs1) which calls update_load_avg(cfs1)
    break because cfs1 is throttled

In 2nd for_each_sched_entity loop:
  loop 1
    cfs1->h_nr_running++
    break because cfs1 is throttled

Using the 2nd loop only to increment h_nr_running of the throttled
cfs_rq is useless; we can do that directly in the 1st loop and skip
the 2nd loop.

With this patch we have:

In 1st for_each_sched_entity loop:
  loop 1
    enqueue_entity (T1->se, cfs2) which calls update_load_avg(cfs2)
    cfs2->h_nr_running++;
  loop 2
    enqueue_entity (cfs2->gse, cfs1) which calls update_load_avg(cfs1)
    cfs1->h_nr_running++
    goto enqueue_throttle because cfs1 is throttled, which skips the
    2nd for_each_sched_entity loop entirely
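
The two sequences above can be replayed in a small userspace sketch.
This is not kernel code: struct toy_cfs_rq, struct toy_se, struct
toy_hier and the toy_* helpers are hypothetical stand-ins that keep only
the throttled flag and h_nr_running, and setting on_rq stands in for
enqueue_entity(). It only covers the root->cfs1->cfs2 case described
here, where nothing is enqueued yet and cfs1 is throttled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct cfs_rq and struct sched_entity (hypothetical). */
struct toy_cfs_rq {
	bool throttled;
	unsigned int h_nr_running;
};

struct toy_se {
	struct toy_cfs_rq *cfs_rq;	/* queue this entity is enqueued on */
	struct toy_se *parent;		/* next entity up the hierarchy */
	bool on_rq;
};

struct toy_hier {
	struct toy_cfs_rq root, cfs1, cfs2;
	struct toy_se t1, gse2, gse1;
};

/* Build root->cfs1->cfs2 with cfs1 throttled and nothing enqueued yet. */
static void toy_hier_init(struct toy_hier *h)
{
	h->root = (struct toy_cfs_rq){ .throttled = false };
	h->cfs1 = (struct toy_cfs_rq){ .throttled = true };
	h->cfs2 = (struct toy_cfs_rq){ .throttled = false };
	h->gse1 = (struct toy_se){ .cfs_rq = &h->root, .parent = NULL };
	h->gse2 = (struct toy_se){ .cfs_rq = &h->cfs1, .parent = &h->gse1 };
	h->t1 = (struct toy_se){ .cfs_rq = &h->cfs2, .parent = &h->gse2 };
}

/* Walk as before the patch; *loop2 counts iterations of the 2nd loop. */
static struct toy_se *toy_enqueue_old(struct toy_se *se, int *loop2)
{
	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		assert(se->cfs_rq);
		se->on_rq = true;	/* stands in for enqueue_entity() */
		if (se->cfs_rq->throttled)
			break;
		se->cfs_rq->h_nr_running++;
	}
	for (; se; se = se->parent) {
		(*loop2)++;
		se->cfs_rq->h_nr_running++;
		if (se->cfs_rq->throttled)
			break;
	}
	return se;	/* non-NULL: the walk stopped at a throttled cfs_rq */
}

/* Walk as in the quoted diff: count first, then check for throttling. */
static struct toy_se *toy_enqueue_new(struct toy_se *se, int *loop2)
{
	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		assert(se->cfs_rq);
		se->on_rq = true;	/* stands in for enqueue_entity() */
		se->cfs_rq->h_nr_running++;
		if (se->cfs_rq->throttled)
			goto enqueue_throttle;
	}
	for (; se; se = se->parent) {
		(*loop2)++;
		if (se->cfs_rq->throttled)
			goto enqueue_throttle;
		se->cfs_rq->h_nr_running++;
	}
enqueue_throttle:
	return se;
}
```

Calling toy_hier_init() on two hierarchies and running one walk on each,
starting from t1, leaves the same h_nr_running values in both; the only
difference in this scenario is that the old walk enters the 2nd loop
once and the new one never does.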

Then the patch also reorders the call to update_load_avg() and the
increment of h_nr_running.

Before the patch, the two for_each_sched_entity loops used a different
order, which is not a problem because there is currently no relation
between the two. But the following patches make PELT use h_nr_running,
so both loops must use the same ordering to prevent updating PELT with
the wrong h_nr_running value.
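
To illustrate why the ordering matters once the load tracking reads
h_nr_running, here is a deliberately simplified sketch. It is an
assumption-laden toy, not PELT: struct pelt_toy and the toy_* functions
are hypothetical, and the "update" just adds elapsed time multiplied by
the current h_nr_running, with no decay. It only shows that incrementing
the counter before the update charges the elapsed period with the new
count instead of the old one.

```c
#include <assert.h>

/* Hypothetical, heavily simplified stand-in for a cfs_rq with load tracking. */
struct pelt_toy {
	unsigned int h_nr_running;
	unsigned long last_update;	/* time of the previous update */
	unsigned long runnable_sum;	/* elapsed time weighted by h_nr_running */
};

/* Account the time since last_update using the current h_nr_running. */
static void toy_update_load_avg(struct pelt_toy *q, unsigned long now)
{
	assert(now >= q->last_update);
	q->runnable_sum += (now - q->last_update) * q->h_nr_running;
	q->last_update = now;
}

/* Order used by the patch: update the tracking, then bump the counter. */
static void toy_enqueue_update_first(struct pelt_toy *q, unsigned long now)
{
	toy_update_load_avg(q, now);
	q->h_nr_running++;
}

/* Opposite order: the past period is charged with the new count. */
static void toy_enqueue_increment_first(struct pelt_toy *q, unsigned long now)
{
	q->h_nr_running++;
	toy_update_load_avg(q, now);
}
```

With one task running since time 0 and a second one enqueued at time 10,
the update-first order accounts 10 units for the elapsed period, while
the increment-first order accounts 20, although only one task was
runnable during that period.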

>
> Following the accounting is tricky. Before the patch, if throttling
> happened then h_nr_running was updated without updating the corresponding
> nr_running counter in rq. They are out of sync until unthrottle_cfs_rq
> is called at the very least. After your patch, the same is true and while
> the accounting appears to be equivalent, it's not clear it's correct and
> I do not find the code any easier to understand after the patch or how
> it's connected to runnable_load_avg which this series is about :(
>
> I think the patch is functionally ok but I am having trouble figuring
> out the motive. Maybe it'll be obvious after I read the rest of the series.
>
> --
> Mel Gorman
> SUSE Labs