Message-ID: <5af26ac9-3bdb-32d2-77a7-6cd8feca97aa@bytedance.com>
Date:   Mon, 31 Oct 2022 16:39:43 +0800
From:   Chuyi Zhou <zhouchuyi@...edance.com>
To:     Josh Don <joshdon@...gle.com>
Cc:     peterz@...radead.org, juri.lelli@...hat.com, mingo@...hat.com,
        vincent.guittot@...aro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: favor non-idle group in tick preemption



On 2022/10/29 06:40, Josh Don wrote:
> On Thu, Oct 27, 2022 at 8:57 PM Chuyi Zhou <zhouchuyi@...edance.com> wrote:
>>
>>
>>
>> On 2022/10/28 07:34, Josh Don wrote:
>>> Hi Chuyi,
>>>
>>> On Thu, Oct 27, 2022 at 1:16 AM Chuyi Zhou <zhouchuyi@...edance.com> wrote:
>>>>
>>>> The non-idle se dominates the competition against the idle se when they
>>>> belong to the same group. We ensure that idle groups do not preempt a
>>>> non-idle group in wakeup preemption (see check_preempt_wakeup()).
>>>> However, this can happen in tick preemption, since check_preempt_tick()
>>>> does not check whether current/se is idle or not. This patch adds this check.
>>>>
>>>> Signed-off-by: Chuyi Zhou <zhouchuyi@...edance.com>
>>>> ---
>>>>    kernel/sched/fair.c | 12 +++++++++++-
>>>>    1 file changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index e4a0b8bd941c..f3324b8753b3 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4750,6 +4750,7 @@ static void
>>>>    check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>    {
>>>>           unsigned long ideal_runtime, delta_exec;
>>>> +       int cse_is_idle, pse_is_idle;
>>>>           struct sched_entity *se;
>>>>           s64 delta;
>>>>
>>>> @@ -4779,8 +4780,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>           if (delta < 0)
>>>>                   return;
>>>>
>>>> -       if (delta > ideal_runtime)
>>>> +       if (delta > ideal_runtime) {
>>>> +               /*
>>>> +                * Favor non-idle group even in tick preemption
>>>> +                */
>>>> +               cse_is_idle = se_is_idle(curr);
>>>> +               pse_is_idle = se_is_idle(se);
>>>> +               if (unlikely(!cse_is_idle && pse_is_idle))
>>>> +                       return;
>>>
>> Hi Josh, thanks for your reply,
>>> This would make it so that we never have tick based preemption of a
>>> non-idle entity by an idle entity. That's a recipe for starvation of
>>> the idle entity, if the non-idle entity is cpu bound.
>>>
>>> Beyond that though, I'm not quite sure what issue is being solved here.
>>> The large difference in weight between the idle and non-idle entity
>>> means that the non-idle entity will not be preempted for quite a while
>>> due to its ideal_runtime being quite high. The idle entity will
>>> quickly be preempted on the next tick it takes due to the smaller
>>> value of sysctl_sched_idle_min_granularity.
>>>
>> Actually, I ran some tests and traced this issue. The results show that
>> this can happen with a small probability. I also ran some benchmarks. The
>> results suggest there is no performance harm, and we can reduce context
>> switches by 2%~3% when an idle group and a non-idle group are present at
>> the same time. In addition, for throughput reasons, I think we had better
>> let the non-idle entity consume its ideal_runtime. However, as you said,
>> it may cause starvation of the idle entity...
> 
> I don't doubt it improves the performance of the non-idle entity, but
> it is never advisable to indefinitely starve threads. That's a recipe
> for hard lockups, priority inversion, etc.
> 
> Does your non-idle entity have a reasonable number of cpu.shares? You
> can push out the round-robin interval by tuning CFS parameters without
> completely starving the idle entity.
> 
>> There is another question I would like to take this opportunity to ask
>> you about. In our production environment, in some cases, we want to
>> adjust the weight/shares of the idle cgroup, which means we need to
>> change the logic of sched_group_set_shares(), and this can increase the
>> probability of the above issue. So, for what reasons did you prohibit
>> users from changing the weights of idle cgroups?
> 
> The reason for limiting the control of weight for idle cgroups is to
> match the semantics of the per-task SCHED_IDLE api, which gives
> SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that
> these entities are intended to soak "unused" cpu cycles, and should
> give minimal interference to any non-idle thread. However, we don't
> have strict priority between idle and non-idle, due to the potential
> for starvation issues.
> 
> Perhaps you could clarify your use case a bit further. Why do you want
> to change the weight? Is it to adjust the competition between two idle
> groups, or something else?
> 
Suppose we have two cgroups (idle & non-idle) in /sys/fs/cgroup/cpu.
The idle cgroup contains some offline services, such as big data
processing; the non-idle cgroup contains some online services which have
higher priority for users and are sensitive to latency. We set
quota/period for the idle cgroup, which indicates its *cpu limit*.
In general, we consider that the closer the idle cgroup's cpu usage is
to the limit, the better. However, when the system is busy, the idle
cgroup can only get few cpu resources with the minimum weight.
To cope with this situation, we changed the default weight.
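
As a rough illustration only: a minimal userspace sketch of that setup,
assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu, that
the cpu.idle control file is available on that hierarchy, and with the
cgroup names, values, and the cg_write() helper all hypothetical (on
cgroup v2 the corresponding knobs would be cpu.max and cpu.weight):

	#include <stdio.h>

	/* Hypothetical helper: write one value into a cgroup control file. */
	static int cg_write(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%s\n", val);
		return fclose(f);
	}

	int main(void)
	{
		/* Idle cgroup: mark it SCHED_IDLE and cap it at 2 CPUs (quota/period). */
		cg_write("/sys/fs/cgroup/cpu/offline/cpu.idle", "1");
		cg_write("/sys/fs/cgroup/cpu/offline/cpu.cfs_period_us", "100000");
		cg_write("/sys/fs/cgroup/cpu/offline/cpu.cfs_quota_us", "200000");
		/*
		 * Raising the weight of an idle cgroup is rejected by the
		 * current kernel; this write is what we would like to be
		 * able to do.
		 */
		cg_write("/sys/fs/cgroup/cpu/offline/cpu.shares", "512");

		/* Non-idle cgroup: latency-sensitive online services, default weight. */
		cg_write("/sys/fs/cgroup/cpu/online/cpu.shares", "1024");
		return 0;
	}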

One more question is, why do you think this patch can starve the idle entity?
	
	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;

	if (delta < 0)
		return;

	if (delta > ideal_runtime)
		resched_curr(rq_of(cfs_rq));

se can preempt curr only when
	curr->vruntime > se->vruntime &&
		curr->vruntime - se->vruntime > ideal_runtime
is true. I think the original purpose is that se doesn't have to wait
for a full slice, reducing response time if se is latency sensitive.
This patch just lets curr exhaust its ideal_runtime when se is idle and
curr is non-idle. Normally se will be chosen by pick_next_entity().
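
Putting the quoted snippet and the proposed check together, the tail of
check_preempt_tick() would look roughly like this (a condensed paraphrase
for discussion, not the exact patch):

	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;

	/* curr still has less vruntime than the leftmost se: no tick preemption. */
	if (delta < 0)
		return;

	if (delta > ideal_runtime) {
		/*
		 * Proposed change: a non-idle curr is allowed to exhaust its
		 * ideal_runtime before an idle se can preempt it on the tick.
		 */
		if (unlikely(!se_is_idle(curr) && se_is_idle(se)))
			return;
		resched_curr(rq_of(cfs_rq));
	}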

Maybe I missed something?
Thanks

>>
>> Thanks again for your review.
>>
>> Best,
>> Chuyi
>>> The wakeup check is useful for latency sensitive non-idle tasks.
>>> However, in steady state competition between idle and non-idle, we
>>> must allow some amount of round-robin.
>>>
>>>> +
>>>>                   resched_curr(rq_of(cfs_rq));
>>>> +       }
>>>>    }
>>>>
>>>>    static void
>>>> --
>>>> 2.20.1
>>>>
>>>
>>> Best,
>>> Josh
