linux-kernel - Re: [PATCH] sched/fair: favor non-idle group in tick preemption

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CABk29NsbGuMPcA8NJagMfPyHij3864F3DVQ6wpCj3UAo8tbFNQ@mail.gmail.com>
Date:   Fri, 28 Oct 2022 15:40:44 -0700
From:   Josh Don <joshdon@...gle.com>
To:     Chuyi Zhou <zhouchuyi@...edance.com>
Cc:     peterz@...radead.org, juri.lelli@...hat.com, mingo@...hat.com,
        vincent.guittot@...aro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: favor non-idle group in tick preemption

On Thu, Oct 27, 2022 at 8:57 PM Chuyi Zhou <zhouchuyi@...edance.com> wrote:
>
>
>
> 在 2022/10/28 07:34, Josh Don 写道:
> > Hi Chuyi,
> >
> > On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@...edance.com> wrote:
> >>
> >> The non-idle se dominates competition vs the idle se when they
> >> are belong to the same group. We ensure that idle groups would not
> >> preempt non-idle group in wakeup preemption(see check_preempt_wakeup()).
> >> However, this can happen in tick preemption, since check_preempt_tick()
> >> dose not check current/se is idle or not. This patch adds this check.
> >>
> >> Signed-off-by: Chuyi Zhou <zhouchuyi@...edance.com>
> >> ---
> >>   kernel/sched/fair.c | 12 +++++++++++-
> >>   1 file changed, 11 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index e4a0b8bd941c..f3324b8753b3 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4750,6 +4750,7 @@ static void
> >>   check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >>   {
> >>          unsigned long ideal_runtime, delta_exec;
> >> +       int cse_is_idle, pse_is_idle;
> >>          struct sched_entity *se;
> >>          s64 delta;
> >>
> >> @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >>          if (delta < 0)
> >>                  return;
> >>
> >> -       if (delta > ideal_runtime)
> >> +       if (delta > ideal_runtime) {
> >> +               /*
> >> +                * Favor non-idle group even in tick preemption
> >> +                */
> >> +               cse_is_idle = se_is_idle(curr);
> >> +               pse_is_idle = se_is_idle(se);
> >> +               if (unlikely(!cse_is_idle && pse_is_idle))
> >> +                       return;
> >
> Hi Josh, thanks for your reply,
> > This would make it so that we never have tick based preemption of a
> > non-idle entity by an idle entity. That's a recipe for starvation of
> > the idle entity, if the non-idle entity is cpu bound.
> >
> > Beyond that though, I'm not quite sure the issue being solved here.
> > The large difference in weight between the idle and non-idle entity
> > means that the non-idle entity will not be preempted for quite a while
> > due to its ideal_runtime being quite high. The idle entity will
> > quickly be preempted on the next tick it takes due to the smaller
> > value of sysctl_sched_idle_min_granularity.
> >
> Actually, I did some tests and traced this issue. the result shows that
> this can happen with small probability. I also do some benchmarks. The
> result seems it has no performance harm, and we can reduce 2%～3%
> context switch when idle group & non-idle group are present at the same
> time. In addition, for throughput concern, I think we better let
> non-idle entity consume it's ideal_runtime. However, as you said, it may
> cause starvation of the idle entity.....

I don't doubt it improves the performance of the non-idle entity, but
it is never advisable to indefinitely starve threads. That's a recipe
for hard lockups, priority inversion, etc.

Does your non-idle entity have a reasonable number of cpu.shares? You
can push out the round-robin interval by tuning CFS parameters without
completely starving the idle entity.

> There is another question I would like to take this opportunity to
> consult you. In our production environment, in some cases, we want to
> adjust the weight/shares of the idle-cgroup which means we need to
> change the logic of sched_group_set_shares(), and it can increase the
> probability of the above issue. So, for what reasons did you prohibit
> users from changing weights of idle cgroup.

The reason for limiting the control of weight for idle cgroups is to
match the semantics of the per-task SCHED_IDLE api, which gives
SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that
these entities are intended to soak "unused" cpu cycles, and should
give minimal interference to any non-idle thread. However, we don't
have strict priority between idle and non-idle, due to the potential
for starvation issues.

Perhaps you could clarify your use case a bit further. Why do you want
to change the weight? Is it to adjust the competition between two idle
groups, or something else?

>
> Thanks again for your review.
>
> Best,
> Chuyi
> > The wakeup check is useful for latency sensitive non-idle tasks.
> > However, in steady state competition between idle and non-idle, we
> > must allow some amount of round-robin.
> >
> >> +
> >>                  resched_curr(rq_of(cfs_rq));
> >> +       }
> >>   }
> >>
> >>   static void
> >> --
> >> 2.20.1
> >>
> >
> > Best,
> > Josh