linux-kernel - Re: [PATCH] sched/fair: favor non-idle group in tick preemption

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CABk29Nt38d31LnazbcSVB036jUYemC1KRpR2Dn7YgLmXkp-QBQ@mail.gmail.com>
Date:   Mon, 31 Oct 2022 15:44:46 -0700
From:   Josh Don <joshdon@...gle.com>
To:     Chuyi Zhou <zhouchuyi@...edance.com>
Cc:     peterz@...radead.org, juri.lelli@...hat.com, mingo@...hat.com,
        vincent.guittot@...aro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: favor non-idle group in tick preemption

On Mon, Oct 31, 2022 at 1:39 AM Chuyi Zhou <zhouchuyi@...edance.com> wrote:
>
> >> 在 2022/10/28 07:34, Josh Don 写道:
> > The reason for limiting the control of weight for idle cgroups is to
> > match the semantics of the per-task SCHED_IDLE api, which gives
> > SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that
> > these entities are intended to soak "unused" cpu cycles, and should
> > give minimal interference to any non-idle thread. However, we don't
> > have strict priority between idle and non-idle, due to the potential
> > for starvation issues.
> >
> > Perhaps you could clarify your use case a bit further. Why do you want
> > to change the weight? Is it to adjust the competition between two idle
> > groups, or something else?
> >
> Suppose we have two cgroups（idle & non-idle）in /sys/fs/cgroup/cpu.
> Idle cgroup contains some offline service, such as beg data processing;
> non-idle cgroup contains some online service which have
> higher priority to users and are sensitive to latency. We set
> quota/period for idle cgroup which indicates it's *cpu limit*.
> In general, we consider that the idle cgroup's cpu usage
> closer to the limit, the better. However, when the system is busy,
> the idle cgroups can only get little cpu resources with minimum weight.
> To cope with the above situation, we changed the default weight.

I see. So you want the part of SCHED_IDLE that makes the entity highly
preemptible (and avoids preemption of non idle entities), but want to
adjust weight to reach a target cpu split? That seems a bit
counterintuitive to me, since by giving the idle entities higher
weight, you'll end up pushing out the round-robin latency for the
non-idle entities.

Worth noting that SCHED_IDLE is a bit of a CFS hack, but the intended
semantics of it are that these threads soak only "remaining cycles".
This comes with many implications beyond just weight. For example, a
cpu running only SCHED_IDLE entities is considered as "idle" from the
perspective of non-idle entities. If we give these idle entities
meaningful weight, we start to break assumptions there, for example
see sched_idle_cpu() and load balancing.

I wonder if maybe dusting off SCHED_BATCH is a better answer here, for
this type of use case (some amount of throughput "guarantee", but with
preemption properties similar to SCHED_IDLE). Peter, thoughts?

> One more question is, why you think this patch can strave idle entity?
>
>         /*
>          * Ensure that a task that missed wakeup preemption by a
>          * narrow margin doesn't have to wait for a full slice.
>          * This also mitigates buddy induced latencies under load.
>          */
>         se = __pick_first_entity(cfs_rq);
>         delta = curr->vruntime - se->vruntime;
>
>         if (delta < 0)
>                 return;
>
>         if (delta > ideal_runtime)
>                 resched_curr(rq_of(cfs_rq));
>
> se can preempt curr only when
>         curr->vruntime > se->vruntime &&
>                 curr->vruntime - se->vruntime > ideal_runtime
> is true. I think the original purpose is that se doesn't have to wait
> for a full slice, reduce response time if se is latency sensitive.
> This patch just let curr exhaust it's idleal_runtime when se is idle and
> curr is non-idle. Normally se will be choosed by pick_next_entity().
>
> Maybe I missed something ?
> Thanks

No that was my mistake, I accidentally thought this delta was being
applied to the 'if (delta_exec > ideal_runtime) {' above in
check_preempt_tick().

Some weirdness about this change though, is that if there is a
non-idle current entity, and the two next entities on the cfs_rq are
idle and non-idle respectively, we'll now take longer to preempt the
on-cpu non-idle entity, because the non-idle entity on the cfs_rq is
'hidden' by the idle 'first' entity. Wakeup preemption is different
because we're always directly comparing the current entity with the
newly woken entity.

Best,
Josh