linux-kernel - Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCCQ0__kz8UfaSm7qrpOQc47rPheD+KLoyBAbmp_tLP0w@mail.gmail.com>
Date:   Tue, 14 Mar 2023 14:26:25 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Zhang Qiao <zhangqiao22@...wei.com>
Cc:     linux-kernel@...r.kernel.org, mingo@...hat.com,
        juri.lelli@...hat.com, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, rkagan@...zon.de,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

On Tue, 14 Mar 2023 at 12:03, Zhang Qiao <zhangqiao22@...wei.com> wrote:
>
>
>
> 在 2023/3/13 22:23, Vincent Guittot 写道:
> > On Sat, 11 Mar 2023 at 10:57, Zhang Qiao <zhangqiao22@...wei.com> wrote:
> >>
> >>
> >>
> >> 在 2023/3/10 22:29, Vincent Guittot 写道:
> >>> Le jeudi 09 mars 2023 à 16:14:38 (+0100), Vincent Guittot a écrit :
> >>>> On Thu, 9 Mar 2023 at 15:37, Peter Zijlstra <peterz@...radead.org> wrote:
> >>>>>
> >>>>> On Thu, Mar 09, 2023 at 03:28:25PM +0100, Peter Zijlstra wrote:
> >>>>>> On Thu, Mar 09, 2023 at 02:34:05PM +0100, Vincent Guittot wrote:
> >>>>>>
> >>>>>>> Then, even if we don't clear exec_start before migrating and keep
> >>>>>>> current value to be used in place_entity on the new cpu, we can't
> >>>>>>> compare the rq_clock_task(rq_of(cfs_rq)) of 2 different rqs AFAICT
> >>>>>>
> >>>>>> Blergh -- indeed, irq and steal time can skew them between CPUs :/
> >>>>>> I suppose we can fudge that... wait_start (which is basically what we're
> >>>>>> making it do) also does that IIRC.
> >>>>>>
> >>>>>> I really dislike having this placement muck spreadout like proposed.
> >>>>>
> >>>>> Also, I think we might be over-engineering this, we don't care about
> >>>>> accuracy at all, all we really care about is 'long-time'.
> >>>>
> >>>> you mean taking the patch 1/2 that you mentioned here to add a
> >>>> migrated field:
> >>>> https://lore.kernel.org/all/68832dfbb60fda030540b5f4e39c5801942689b1.1648228023.git.tim.c.chen@linux.intel.com/T/#ma5637eb8010f3f4a4abff778af8db705429d003b
> >>>>
> >>>> And assume that the divergence between the rq_clock_task() can be ignored ?
> >>>>
> >>>> That could probably work but we need to replace the (60LL *
> >>>> NSEC_PER_SEC) by ((1ULL << 63) / NICE_0_LOAD) because 60sec divergence
> >>>> would not be unrealistic.
> >>>> and a comment to explain why it's acceptable
> >>>
> >>> Zhang,
> >>>
> >>> Could you try the patch below ?
> >>> This is a rebase/merge/update of:
> >>> -patch 1/2 above and
> >>> -https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/
> >>
> >>
> >> I applyed and tested this patch, and it make hackbench slower.
> >> According to my previous test results. The good result is 82.1(s).
> >> But the result of this patch is 108.725(s).
> >
> > By "the result of this patch is 108.725(s)",  you mean the result of
> > https://lore.kernel.org/lkml/20230209193107.1432770-1-rkagan@amazon.de/
> > alone, don't you ?
>
> No, with your patch, the test results is 108.725(s),

Ok

>
> git diff:
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b1a..93a3909ae4c4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -550,6 +550,7 @@ struct sched_entity {
>         struct rb_node                  run_node;
>         struct list_head                group_node;
>         unsigned int                    on_rq;
> +       unsigned int                    migrated;
>
>         u64                             exec_start;
>         u64                             sum_exec_runtime;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff4dbbae3b10..e60defc39f6e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1057,6 +1057,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         /*
>          * We are starting a new run period:
>          */
> +       se->migrated = 0;
>         se->exec_start = rq_clock_task(rq_of(cfs_rq));
>  }
>
> @@ -4690,9 +4691,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>          * inversed due to s64 overflow.
>          */
>         sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> -       if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> +       if ((s64)sleep_time > (1ULL << 63) / scale_load_down(NICE_0_LOAD) / 2) {
>                 se->vruntime = vruntime;
> -       else
> +       } else
>                 se->vruntime = max_vruntime(se->vruntime, vruntime);
>  }
>
> @@ -7658,8 +7659,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>         se->avg.last_update_time = 0;
>
>         /* We have migrated, no longer consider this task hot */
> -       se->exec_start = 0;
> -
> +       se->migrated = 1;
>         update_scan_period(p, new_cpu);
>  }
>
> @@ -8343,6 +8343,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>
>         if (sysctl_sched_migration_cost == 0)
>                 return 0;
> +       if (p->se.migrated)
> +               return 0;
>
>         delta = rq_clock_task(env->src_rq) - p->se.exec_start;
>
>
>
> >
> >>
> >>
> >>> version1: v6.2
> >>> version2: v6.2 + commit 829c1651e9c4
> >>> version3: v6.2 + commit 829c1651e9c4 + this patch
> >>>
> >>> -------------------------------------------------
> >>>       version1        version2        version3
> >>> test1 81.0            118.1           82.1
> >>> test2 82.1            116.9           80.3
> >>> test3 83.2            103.9           83.3
> >>> avg(s)        82.1            113.0           81.9
> >
> > Ok, it looks like we are back to normal figures

What do those results refer to then ?


> >
> >>>
> >>> -------------------------------------------------
> >>>
> >>> The proposal accepts a divergence of up to 52 days between the 2 rqs.
> >>>
> >>> If this work, we will prepare a proper patch
> >>>
> >>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>> index 63d242164b1a..cb8af0a137f7 100644
> >>> --- a/include/linux/sched.h
> >>> +++ b/include/linux/sched.h
> >>> @@ -550,6 +550,7 @@ struct sched_entity {
> >>>         struct rb_node                  run_node;
> >>>         struct list_head                group_node;
> >>>         unsigned int                    on_rq;
> >>> +       unsigned int                    migrated;
> >>>
> >>>         u64                             exec_start;
> >>>         u64                             sum_exec_runtime;
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 7a1b1f855b96..36acd9598b40 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -1057,6 +1057,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >>>         /*
> >>>          * We are starting a new run period:
> >>>          */
> >>> +       se->migrated = 0;
> >>>         se->exec_start = rq_clock_task(rq_of(cfs_rq));
> >>>  }
> >>>
> >>> @@ -4684,13 +4685,23 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >>>
> >>>         /*
> >>>          * Pull vruntime of the entity being placed to the base level of
> >>> -        * cfs_rq, to prevent boosting it if placed backwards.  If the entity
> >>> -        * slept for a long time, don't even try to compare its vruntime with
> >>> -        * the base as it may be too far off and the comparison may get
> >>> -        * inversed due to s64 overflow.
> >>> +        * cfs_rq, to prevent boosting it if placed backwards.
> >>> +        * However, min_vruntime can advance much faster than real time, with
> >>> +        * the exterme being when an entity with the minimal weight always runs
> >>> +        * on the cfs_rq. If the new entity slept for long, its vruntime
> >>> +        * difference from min_vruntime may overflow s64 and their comparison
> >>> +        * may get inversed, so ignore the entity's original vruntime in that
> >>> +        * case.
> >>> +        * The maximal vruntime speedup is given by the ratio of normal to
> >>> +        * minimal weight: NICE_0_LOAD / MIN_SHARES, so cutting off on the
> >>
> >> why not is `scale_load_down(NICE_0_LOAD) / MIN_SHARES` here ?
> >
> > yes, you are right.
> >
> >>
> >>
> >>> +        * sleep time of 2^63 / NICE_0_LOAD should be safe.
> >>> +        * When placing a migrated waking entity, its exec_start has been set
> >>> +        * from a different rq. In order to take into account a possible
> >>> +        * divergence between new and prev rq's clocks task because of irq and
> >>
> >> This divergence might be larger, it cause `sleep_time` maybe negative.
> >
> > AFAICT, we are safe with ((1ULL << 63) / scale_load_down(NICE_0_LOAD)
> > / 2) as long as the divergence between the 2 rqs clocks task is lower
> > than 2^52nsec. Do you expect a divergence higher than 2^52 nsec
> > (around 52 days)?
> >
> > We can probably keep using (1ULL << 63) / scale_load_down(NICE_0_LOAD)
> > which is already half the max value if needed.
> >
> > the fact that sleep_time can be negative is not a problem as
> > s64)sleep_time > will take care of this.
>
> In my opinion, when comparing signed with unsigned, the compiler converts the signed value to unsigned.
> So, if sleep_time < 0, "(s64)sleep_time > (1ULL << 63) / NICE_0_LOAD / 2" will be true.
>
> >
> >>
> >>> +        * stolen time, we take an additional margin.
> >>>          */
> >>>         sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> >>> -       if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> >>> +       if ((s64)sleep_time > (1ULL << 63) / NICE_0_LOAD / 2)>                 se->vruntime = vruntime;
> >>>         else
> >>>                 se->vruntime = max_vruntime(se->vruntime, vruntime);
> >>> @@ -7658,7 +7669,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> >>>         se->avg.last_update_time = 0;
> >>>
> >>>         /* We have migrated, no longer consider this task hot */
> >>> -       se->exec_start = 0;
> >>> +       se->migrated = 1;
> >>>
> >>>         update_scan_period(p, new_cpu);
> >>>  }
> >>> @@ -8344,6 +8355,9 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
> >>>         if (sysctl_sched_migration_cost == 0)
> >>>                 return 0;
> >>>
> >>> +       if (p->se.migrated)
> >>> +               return 0;
> >>> +
> >>>         delta = rq_clock_task(env->src_rq) - p->se.exec_start;
> >>>
> >>>         return delta < (s64)sysctl_sched_migration_cost;
> >>>
> >>>
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>> .
> >>>
> > .
> >