Message-ID: <20140716154614.GP26542@e103034-lin>
Date: Wed, 16 Jul 2014 16:46:14 +0100
From: Morten Rasmussen <morten.rasmussen@....com>
To: Yuyang Du <yuyang.du@...el.com>
Cc: "mingo@...hat.com" <mingo@...hat.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"pjt@...gle.com" <pjt@...gle.com>,
"bsegall@...gle.com" <bsegall@...gle.com>,
"arjan.van.de.ven@...el.com" <arjan.van.de.ven@...el.com>,
"len.brown@...el.com" <len.brown@...el.com>,
"rafael.j.wysocki@...el.com" <rafael.j.wysocki@...el.com>,
"alan.cox@...el.com" <alan.cox@...el.com>,
"mark.gross@...el.com" <mark.gross@...el.com>,
"fengguang.wu@...el.com" <fengguang.wu@...el.com>,
"umgwanakikbuti@...il.com" <umgwanakikbuti@...il.com>
Subject: Re: [PATCH 2/2 v3] sched: Rewrite per entity runnable load average tracking
On Wed, Jul 16, 2014 at 02:50:47AM +0100, Yuyang Du wrote:
[...]
> +/*
> + * Update the load_avg of the cfs_rq along with its own se. They should be
> + * kept in sync: the group se's load_avg is used for the task_h_load calc,
> + * and the group cfs_rq's load_avg is used for task_h_load (and the
> + * update_cfs_shares calc).
> + */
> +static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> {
> - long old_contrib = se->avg.load_avg_contrib;
> + int decayed;
>
> - if (entity_is_task(se)) {
> - __update_task_entity_contrib(se);
> - } else {
> - __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
> - __update_group_entity_contrib(se);
> + if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> + long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> + cfs_rq->avg.load_avg = subtract_until_zero(cfs_rq->avg.load_avg, r);
> + r *= LOAD_AVG_MAX;
> + cfs_rq->avg.load_sum = subtract_until_zero(cfs_rq->avg.load_sum, r);
> }
>
> - return se->avg.load_avg_contrib - old_contrib;
> -}
> + decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
> +#ifndef CONFIG_64BIT
> + if (cfs_rq->avg.last_update_time != cfs_rq->load_last_update_time_copy)
> + sa_q->last_update_time_copy = sa_q->last_update_time;
This doesn't build on 32 bit. You need:
- sa_q->last_update_time_copy = sa_q->last_update_time;
+ cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
to make it build. But I'm not convinced that this synchronization is
right.
First let me say that I'm not an expert on synchronization. But it seems
to me that there is nothing preventing reordering of the store in
__update_load_avg(), which sets cfs_rq->avg.last_update_time, and the
store of cfs_rq->load_last_update_time_copy above: there is no smp_wmb()
between them for the smp_rmb() on the read side to pair with.
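I would have expected the write side to look something like this
(untested, just to illustrate the ordering I have in mind):

	decayed = __update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
#ifndef CONFIG_64BIT
	/*
	 * Order the store of avg.last_update_time (done in
	 * __update_load_avg() above) before the store of the copy, so
	 * that the smp_rmb() in the reader has this smp_wmb() to pair
	 * with.
	 */
	smp_wmb();
	cfs_rq->load_last_update_time_copy = cfs_rq->avg.last_update_time;
#endif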
> +#endif
>
> -static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> - long load_contrib)
> -{
> - if (likely(load_contrib < cfs_rq->blocked_load_avg))
> - cfs_rq->blocked_load_avg -= load_contrib;
> - else
> - cfs_rq->blocked_load_avg = 0;
> + return decayed;
> }
[...]
> @@ -4551,18 +4381,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> {
> struct sched_entity *se = &p->se;
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + u64 last_update_time;
>
> /*
> - * Load tracking: accumulate removed load so that it can be processed
> - * when we next update owning cfs_rq under rq->lock. Tasks contribute
> - * to blocked load iff they have a positive decay-count. It can never
> - * be negative here since on-rq tasks have decay-count == 0.
> + * The task on its old CPU catches up with its old cfs_rq, and subtracts
> + * itself from that cfs_rq (the task must be off the queue now).
> */
> - if (se->avg.decay_count) {
> - se->avg.decay_count = -__synchronize_entity_decay(se);
> - atomic_long_add(se->avg.load_avg_contrib,
> - &cfs_rq->removed_load);
> - }
> +#ifndef CONFIG_64BIT
> + u64 last_update_time_copy;
> +
> + do {
> + last_update_time_copy = cfs_rq->load_last_update_time_copy;
> + smp_rmb();
> + last_update_time = cfs_rq->avg.last_update_time;
> + } while (last_update_time != last_update_time_copy);
Here is the matching read side of the synchronization code. Since
reordering might happen on the write side (I believe), I'm not convinced
that this works.
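FWIW, the copy + retry loop is essentially an open-coded seqcount. Using
the existing seqcount API would give you the right barriers by
construction. Rough sketch (untested, field name made up):

	/* in struct cfs_rq, replacing load_last_update_time_copy: */
	seqcount_t		load_update_seq;

	/* write side, under rq->lock: */
	write_seqcount_begin(&cfs_rq->load_update_seq);
	cfs_rq->avg.last_update_time = now;
	write_seqcount_end(&cfs_rq->load_update_seq);

	/* lockless read side: */
	unsigned int seq;
	do {
		seq = read_seqcount_begin(&cfs_rq->load_update_seq);
		last_update_time = cfs_rq->avg.last_update_time;
	} while (read_seqcount_retry(&cfs_rq->load_update_seq, seq));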
Morten
> +#else
> + last_update_time = cfs_rq->avg.last_update_time;
> +#endif
> + __update_load_avg(last_update_time, &se->avg, 0);
> + atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
> +
> + /*
> + * We are supposed to update the task to "current" time, then it's up to
> + * date and ready to go to the new CPU/cfs_rq. But we have difficulty in
> + * getting what the current time is, so simply throw away the out-of-date
> + * time. This will result in the wakee task being less decayed, but giving
> + * the wakee more load does not sound bad.
> + */
> + se->avg.last_update_time = 0;
>
> /* We have migrated, no longer consider this task hot */
> se->exec_start = 0;
> @@ -5399,36 +5245,6 @@ next:
> }