Date:	Thu, 16 Jun 2016 20:57:36 +0200
From:	Vincent Guittot <vincent.guittot@...aro.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Yuyang Du <yuyang.du@...el.com>, Ingo Molnar <mingo@...nel.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Benjamin Segall <bsegall@...gle.com>,
	Paul Turner <pjt@...gle.com>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Matt Fleming <matt@...eblueprint.co.uk>
Subject: Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice
 when switching to fair or changing task group

On 16 June 2016 at 19:17, Peter Zijlstra <peterz@...radead.org> wrote:
> On Thu, Jun 16, 2016 at 06:30:13PM +0200, Vincent Guittot wrote:
>> On Wednesday 15 Jun 2016 at 17:22:17 (+0200), Peter Zijlstra wrote:
>> > On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
>> > > I still have concerns with this change of behavior that attaches
>> > > the task only when it is enqueued. The task's load average will not
>> > > be decayed between the time we move it into its new group and its
>> > > enqueue. With this change, a task's load can stay high even though
>> > > it has slept for the last couple of seconds. Its load and
>> > > utilization are then no longer accounted anywhere in the meantime,
>> > > just because we have moved the task, which will be enqueued on the
>> > > same rq. A task should always be attached to a cfs_rq, and its
>> > > load/utilization should always be accounted on a cfs_rq and decayed
>> > > for its sleep period.
>> >
>> > OK; so I think I agree with that. Does the below (completely untested,
>> > hasn't even been near a compiler) look reasonable?
>> >
>> > The general idea is to always attach to a cfs_rq -- through
>> > post_init_entity_util_avg(). This covers both the 'new task isn't
>> > attached yet' and the 'was never in the fair class to begin with'
>> > issues.
>>
>> Your patch ensures that a task will be attached to a cfs_rq, and it
>> fixes the issue raised by Yuyang that comes from
>> se->avg.last_update_time being 0 at init.
>
>> During the test, the following message was raised: "BUG: using
>> smp_processor_id() in preemptible [00000000] code: systemd/1", because
>> cfs_rq_util_change() is called from attach_entity_load_avg().
>
> But per commit:
>
>   b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
>
> that should be with rq->lock held!?

Yes, you're right; I forgot to fetch the latest commits.
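
For reference, the shape of that fix: since b7fa30c9cc48, the tail of
wake_up_new_task() takes the rq lock before the attach, so the
smp_processor_id() call inside cfs_rq_util_change() runs non-preemptibly.
A minimal sketch, simplified from the commit rather than the exact
upstream code:

	rq = __task_rq_lock(p, &rf);		/* rq->lock held from here on */
	post_init_entity_util_avg(&p->se);	/* attach runs under rq->lock */
	activate_task(rq, p, 0);
	p->on_rq = TASK_ON_RQ_QUEUED;
	check_preempt_curr(rq, p, WF_FORK);
	task_rq_unlock(rq, p, &rf);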

>
>> With patch [1] for the init on the cfs_rq side, all use cases will be
>> covered regarding the issue linked to a last_update_time set to 0 at
>> init.
>>
>> [1] https://lkml.org/lkml/2016/5/30/508
>
> Yes, I saw that patch. However, it felt strange to me to set
> last_update_time to 1; would that not result in massive aging, because
> now - last_update_time ends up as a giant delta?

Yes, it will, but that's similar to what can also happen when the other
tasks are enqueued in the cfs_rq: cfs_rq->avg.last_update_time can be
really far from now for a cfs_rq that has been idle.
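
To put a rough number on that aging: PELT halves a signal every 32
periods of 1024us, so a delta of a few hundred ms already wipes out most
of the contribution. An illustrative helper only -- the real
decay_load() in kernel/sched/fair.c also handles the remaining partial
periods via the runnable_avg_yN_inv[] table:

	/* Order-of-magnitude sketch of decaying a PELT signal over a
	 * large (now - last_update_time) delta; not the kernel code. */
	static u64 decay_load_sketch(u64 val, u64 periods) /* 1024us each */
	{
		if (periods > 63 * 32)		/* effectively fully decayed */
			return 0;
		return val >> (periods / 32);	/* y^32 == 1/2 */
	}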

>
> By doing attach_entity_load_avg() we actually add the initial values to
> the cfs_rq avg and set last_update_time to the current time. Ensuring
> 'regular' contribution and progress.
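
For reference, roughly what attach_entity_load_avg() does at that point,
simplified from kernel/sched/fair.c of that era (the aging of a stale se
contribution is omitted here):

	se->avg.last_update_time = cfs_rq->avg.last_update_time;
	cfs_rq->avg.load_avg += se->avg.load_avg;
	cfs_rq->avg.load_sum += se->avg.load_sum;
	cfs_rq->avg.util_avg += se->avg.util_avg;
	cfs_rq->avg.util_sum += se->avg.util_sum;
	cfs_rq_util_change(cfs_rq);	/* where smp_processor_id() ends
					 * up being used */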
>
>> > That only leaves a tiny hole in fork() where the task is hashed but
>> > hasn't yet passed through wake_up_new_task(), in which someone can do
>> > a cgroup move on it. That is closed with TASK_NEW and can_attach()
>> > refusing those tasks.
>> >
>>
>> But a new fair task is still detached and attached from/to its
>> task_group via:
>>
>> cgroup_post_fork()-->
>>     ss->fork(child)-->
>>       cpu_cgroup_fork()-->
>>         sched_move_task()-->
>>           task_move_group_fair()
>
> Blergh, I knew I missed something in there..
>
>> cpu_cgroup_can_attach() is not used in this path, and sched_move_task()
>> does the move unconditionally for fair tasks.
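
For context, the fork-time path being described, as it stood before this
change (this is exactly the line the diff below removes):

	/* ->fork() callback of the cpu controller, invoked from
	 * cgroup_post_fork() for every new task; note there is no
	 * can_attach() check anywhere on this path. */
	static void cpu_cgroup_fork(struct task_struct *task)
	{
		sched_move_task(task);
	}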
>>
>> With your patch, we still have the sequence
>>
>> sched_fork()
>>     set_task_cpu()
>> cgroup_post_fork()--> ... --> task_move_group_fair()
>>     detach_task_cfs_rq()
>>     set_task_rq()
>>     attach_task_cfs_rq()
>> wake_up_new_task()
>>     select_task_rq() can select a new cpu
>>     set_task_cpu()
>>         migrate_task_rq_fair() if new_cpu != task_cpu()
>>              remove_load()
>>         __set_task_cpu()
>>     post_init_entity_util_avg()
>>         attach_task_cfs_rq()
>>     activate_task()
>>         enqueue_task()
>>
>> In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(),
>> so we can just call that small part directly instead of
>> sched_move_task(). And the task doesn't really migrate because it is
>> not yet attached, so we need the sequence:
>
>> sched_fork()
>>     __set_task_cpu()
>> cgroup_post_fork()--> ... --> task_move_group_fair()
>>     set_task_rq() to set task group and runqueue
>> wake_up_new_task()
>>     select_task_rq() can select a new cpu
>>     __set_task_cpu()
>>     post_init_entity_util_avg()
>>         attach_task_cfs_rq()
>>     activate_task()
>>         enqueue_task()
>>
>> The patch below, on top of your patch, ensures that we follow the right sequence:
>>
>> ---
>>  kernel/sched/core.c | 60 +++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 40 insertions(+), 20 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 7895689a..a21e3dc 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2373,7 +2373,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>>        * Silence PROVE_RCU.
>>        */
>>       raw_spin_lock_irqsave(&p->pi_lock, flags);
>> -     set_task_cpu(p, cpu);
>> +     __set_task_cpu(p, cpu);
>
> Indeed, this should not be calling migrate; we're setting the CPU for
> the first time.
>
>>       if (p->sched_class->task_fork)
>>               p->sched_class->task_fork(p);
>>       raw_spin_unlock_irqrestore(&p->pi_lock, flags);
>> @@ -2515,7 +2515,7 @@ void wake_up_new_task(struct task_struct *p)
>>        *  - cpus_allowed can change in the fork path
>>        *  - any previously selected cpu might disappear through hotplug
>>        */
>> -     set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> +     __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>
> Similarly here I suppose, since we're not yet properly attached as such.
>
>>  #endif
>>       /* Post initialize new task's util average when its cfs_rq is set */
>>       post_init_entity_util_avg(&p->se);
>
> You were indeed running an 'old' kernel, as this
> post_init_entity_util_avg() call should be _after_ the __task_rq_lock().
>
>> @@ -7715,6 +7715,35 @@ void sched_offline_group(struct task_group *tg)
>>       spin_unlock_irqrestore(&task_group_lock, flags);
>>  }
>>
>> +/* Set task's runqueue and group
>> + *     In case of a move between group, we update src and dst group
>> + *     thanks to sched_class->task_move_group. Otherwise, we just need to set
>> + *     runqueue and group pointers. The task will be attached to the runqueue
>> + *     during its wake up.
>
> Broken comment style.
>
>> + */
>> +static void sched_set_group(struct task_struct *tsk, bool move)
>> +{
>> +     struct task_group *tg;
>> +
>> +     /*
>> +      * All callers are synchronized by task_rq_lock(); we do not use RCU
>> +      * which is pointless here. Thus, we pass "true" to task_css_check()
>> +      * to prevent lockdep warnings.
>> +      */
>> +     tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
>> +                       struct task_group, css);
>> +     tg = autogroup_task_group(tsk, tg);
>> +     tsk->sched_task_group = tg;
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +     if (move && tsk->sched_class->task_move_group)
>> +             tsk->sched_class->task_move_group(tsk);
>> +     else
>> +#endif
>> +             set_task_rq(tsk, task_cpu(tsk));
>> +
>> +}
>> +
>>  /* change task's runqueue when it moves between groups.
>>   *   The caller of this function should have put the task in its new group
>>   *   by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
>> @@ -7722,7 +7751,6 @@ void sched_offline_group(struct task_group *tg)
>>   */
>>  void sched_move_task(struct task_struct *tsk)
>>  {
>> -     struct task_group *tg;
>>       int queued, running;
>>       struct rq_flags rf;
>>       struct rq *rq;
>> @@ -7737,22 +7765,7 @@ void sched_move_task(struct task_struct *tsk)
>>       if (unlikely(running))
>>               put_prev_task(rq, tsk);
>>
>> +     sched_set_group(tsk, true);
>>
>>       if (unlikely(running))
>>               tsk->sched_class->set_curr_task(rq);
>> @@ -8182,7 +8195,14 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>>
>>  static void cpu_cgroup_fork(struct task_struct *task)
>>  {
>> -     sched_move_task(task);
>> +     struct rq_flags rf;
>> +     struct rq *rq;
>> +
>> +     rq = task_rq_lock(task, &rf);
>> +
>> +     sched_set_group(task, false);
>> +
>> +     task_rq_unlock(rq, task, &rf);
>>  }
>
> Hmm, yeah, I think you're right.
>
> Let me fold that.
