linux-kernel - Re: [PATCH] sched/fair: initialize throttle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5762E090.4000706@yandex-team.ru>
Date:	Thu, 16 Jun 2016 20:23:28 +0300
From:	Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:	bsegall@...gle.com
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	stable@...r.kernel.org
Subject: Re: [PATCH] sched/fair: initialize throttle_count for new task-groups
 lazily

On 16.06.2016 20:03, bsegall@...gle.com wrote:
> Konstantin Khlebnikov <khlebnikov@...dex-team.ru> writes:
>
>> Cgroup created inside throttled group must inherit current throttle_count.
>> Broken throttle_count allows to nominate throttled entries as a next buddy,
>> later this leads to null pointer dereference in pick_next_task_fair().
>>
>> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
>> allows to skip locking all rq at group creation. Lazy approach also allows
>> to skip full sub-tree scan at throttling hierarchy (not in this patch).
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
>> Cc: Stable <stable@...r.kernel.org> # v3.2+
>> ---
>>   kernel/sched/fair.c  |   19 +++++++++++++++++++
>>   kernel/sched/sched.h |    2 +-
>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 218f8e83db73..fe809fe169d2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>   	if (!cfs_bandwidth_used())
>>   		return;
>>
>> +	/* synchronize hierarchical throttle counter */
>> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
>> +		struct rq *rq = rq_of(cfs_rq);
>> +		struct cfs_rq *pcfs_rq;
>> +		struct task_group *tg;
>> +
>> +		cfs_rq->throttle_uptodate = 1;
>> +		/* get closest uptodate node because leaves goes first */
>> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
>> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
>> +			if (pcfs_rq->throttle_uptodate)
>> +				break;
>> +		}
>> +		if (tg) {
>> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
>> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
>> +		}
>> +	}
>> +
>
> Checking just in enqueue is not sufficient - throttled_lb_pair can check
> against a cfs_rq that has never been enqueued (and possibly other
> paths).

Looks like this is minor problem: in worst case load-balancer will migrate
task into throttled hierarchy. And this could happens only once for each
newly created group.

>
> It might also make sense to go ahead and initialize all the cfs_rqs we
> skipped over to avoid some n^2 pathological behavior. You could also use
> throttle_count == -1 or < 0. (We had our own version of this patch that
> I guess we forgot to push?)

n^2 shouldn't be a problem while this happens only once for each group.

throttle_count == -1 could be overwritten when parent throttles/unthrottles
before initialization. We could set it to INT_MIN/2 and check <0 but this
will hide possible bugs here. One more int in the same cacheline shouldn't
add noticeable overhead.

I've also added this into our kernel to catch such problems without crash.
Probably it's worth to add into upstream because stale buddy is a real pain)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se)
         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                 return;

-       for_each_sched_entity(se)
+       for_each_sched_entity(se) {
+               if (WARN_ON_ONCE(!se->on_rq))
+                       return;
                 cfs_rq_of(se)->last = se;
+       }
  }

  static void set_next_buddy(struct sched_entity *se)
@@ -5515,8 +5518,11 @@ static void set_next_buddy(struct sched_entity *se)
         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                 return;

-       for_each_sched_entity(se)
+       for_each_sched_entity(se) {
+               if (WARN_ON_ONCE(!se->on_rq))
+                       return;
                 cfs_rq_of(se)->next = se;
+       }
  }

  static void set_skip_buddy(struct sched_entity *se)


>
>
>>   	/* an active group must be handled by the update_curr()->put() path */
>>   	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>>   		return;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 72f1f3087b04..7cbeb92a1cb9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -437,7 +437,7 @@ struct cfs_rq {
>>
>>   	u64 throttled_clock, throttled_clock_task;
>>   	u64 throttled_clock_task_time;
>> -	int throttled, throttle_count;
>> +	int throttled, throttle_count, throttle_uptodate;
>>   	struct list_head throttled_list;
>>   #endif /* CONFIG_CFS_BANDWIDTH */
>>   #endif /* CONFIG_FAIR_GROUP_SCHED */


-- 
Konstantin