Message-ID: <a903d0dc-1d88-4ae7-ac81-3eed0445654d@linux.alibaba.com>
Date: Thu, 14 Nov 2024 14:36:54 +0800
From: Tianchen Ding <dtcccc@...ux.alibaba.com>
To: 解 咏梅 <xieym_ict@...mail.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] sched/eevdf: Force propagating min_slice of cfs_rq
when a task changing slice
On 2024/11/14 14:06, 解 咏梅 wrote:
> Let's analyze it case by case :P
>
> say cgroup A has 3 tasks: task A, task B, task C
>
> 1) assign task A's slice to 0.1 ms; task B and task C both have the default slice (0.75 ms)
>
> 2) task A is picked by __schedule() as the next task. Because task A is still on the rq,
> the cfs_rq hierarchy doesn't have to change any cfs_rq's min_slice; it has already been reported up to the root cgroup
>
> 3) task A is preempted by another task but is still runnable; it will be requeued on cgroup A's cfs_rq. Similar to case 2
>
> 4) task A leaves the CPU since it's blocked; task A's se will be retained in cgroup A's cfs_rq until it reaches the 0-lag state.
> 4.1 Before 0-lag, I guess it's similar to case 2:
> the logic is based on the cfs_rq's avg_vruntime, and task A is not supposed to be picked as the next task before it reaches the 0-lag state.
> If my understanding is wrong, please correct me. Thanks.
> 4.2 After it reaches the 0-lag state, if it's picked by pick_task_fair, it will ultimately be removed from cgroup A's cfs_rq:
> pick_next_entity() -> dequeue_entities(DEQUEUE_SLEEP | DEQUEUE_DELAYED) -> __dequeue_entity(task A)
> So cgroup A's cfs_rq min_slice will be recalculated, and the cfs_rq hierarchy will update its own min_slice bottom-up.
> 4.3 After it reaches the 0-lag state, it may instead be woken up. Because the current
> __schedule() splits the block/sleep path from the migration path, only the migration
> path calls deactivate. So p->on_rq is still 1, and ttwu_runnable() will simply call
> requeue_delayed_entity() for it. Similar to case 2
>
> I think only case 1 has such a problem.
>
> Regards,
> Yongmei.
>
I think you misunderstood the case. We're not talking about the DELAY_DEQUEUE
feature; we're simply talking about enqueue (waking up) and dequeue (sleeping).
For convenience, let's turn DELAY_DEQUEUE off.
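(On a kernel with the sched debugfs interface, I believe that can be done at
runtime with "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features";
double-check the knob name on your tree.)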
Consider the following cgroup hierarchy on one cpu:
                root_cgroup
                     |
        -------------------------
        |                       |
  cgroup_A(curr)        other_cgroups...
        |
  ---------------
  |             |
any_se(curr)   cgroup_B(runnable)
                      |
               ---------------
               |             |
         task_A(sleep)   task_B(runnable)
Assume task_A has a smaller slice (0.1ms) and all other tasks have the default
slice (0.75ms).
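As an aside, for anyone who wants to reproduce the setup: I believe a per-task
slice like this can be requested from user space on EEVDF kernels, where
sched_attr::sched_runtime of a SCHED_OTHER task is taken as its slice hint.
A minimal, untested sketch; the kernel clamps the value, and the struct layout
follows the sched_setattr(2) man page:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no sched_setattr() wrapper, so define the attr struct locally. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;   /* for SCHED_OTHER under EEVDF: slice, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;
	attr.sched_runtime = 100 * 1000;  /* request a 0.1ms slice, in ns */

	/* pid 0 == calling thread, flags == 0 */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}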
Because task_A is sleeping, it is not actually on the tree.
Now task_A is woken up. It is enqueued to cgroup_B, so the slice of cgroup_B
is updated to 0.1ms. This is OK.
However, since cgroup_B is already on_rq, it cannot be "enqueued" again to
cgroup_A; the code falls through to the bottom half (the second
for_each_sched_entity loop in enqueue_task_fair).
So the slice of cgroup_A is not updated: it is still 0.75ms.
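To make the walk concrete, here is a toy user-space model of that enqueue
path (my own sketch, not the kernel code -- it only mimics the "stop at the
first on_rq ancestor" rule of the first for_each_sched_entity loop):

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins: each node is a sched entity; "parent" names the group
 * whose runqueue it is enqueued on. */
struct node {
	const char *name;
	struct node *parent;
	bool on_rq;
	double slice_ms;     /* this entity's own slice request */
	double min_slice_ms; /* min_slice of this group's runqueue */
};

/* Walk upwards enqueueing, but stop at the first ancestor that is already
 * on_rq -- min_slice above that point is never recomputed. */
static void toy_enqueue(struct node *se)
{
	double slice = se->slice_ms;

	for (; se->parent; se = se->parent) {
		if (se->on_rq)
			break;
		se->on_rq = true;
		if (slice < se->parent->min_slice_ms)
			se->parent->min_slice_ms = slice;
		slice = se->parent->min_slice_ms;
	}
}

int main(void)
{
	struct node root = { "root_cgroup", NULL,  true,  0.75, 0.75 };
	struct node ga   = { "cgroup_A",    &root, true,  0.75, 0.75 };
	struct node gb   = { "cgroup_B",    &ga,   true,  0.75, 0.75 };
	struct node ta   = { "task_A",      &gb,   false, 0.10, 0.10 };

	toy_enqueue(&ta); /* task_A wakes up */

	printf("cgroup_B min_slice: %.2f ms\n", gb.min_slice_ms);
	printf("cgroup_A min_slice: %.2f ms\n", ga.min_slice_ms);
	return 0;
}

It prints 0.10ms for cgroup_B but still 0.75ms for cgroup_A, which is exactly
the stale min_slice described above.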
Thanks.