linux-kernel - Re: [patch 13/15] sched: expire slack quota using generation counters

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BANLkTimOP_EtyxEZ9pOEpgJPqEEic21eCA@mail.gmail.com>
Date:	Wed, 6 Apr 2011 00:22:29 -0700
From:	Paul Turner <pjt@...gle.com>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc:	linux-kernel@...r.kernel.org,
	Bharata B Rao <bharata@...ux.vnet.ibm.com>,
	Dhaval Giani <dhaval.giani@...il.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
	Srivatsa Vaddagiri <vatsa@...ibm.com>,
	Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>, Pavel Emelyanov <xemul@...nvz.org>
Subject: Re: [patch 13/15] sched: expire slack quota using generation counters

On Tue, Apr 5, 2011 at 6:28 AM, Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:
> On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>
> Argh, this patch is terrible for the reason that it changes the whole
> accounting just introduced and me having to re-open all the previous
> patches to look up hth stuff worked before.

I didn't think it was too drastic -- the introduction of the
generation is more incremental.  However I agree it does cause
unnecessary churn within the accounting functions across the series.
Given that expiring quota is a requirement, this can be streamlined by
introducing some of these notions earlier in the series as opposed to
bootstrapping them at the end here -- will clean it up.

>
>> @@ -436,8 +438,10 @@ void init_cfs_bandwidth(struct cfs_bandw
>>       raw_spin_lock_init(&cfs_b->lock);
>>       cfs_b->quota = cfs_b->runtime = quota;
>>       cfs_b->period = ns_to_ktime(period);
>> +     cfs_b->quota_generation = 0;
>>       INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>>
>> +
>>       hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>       cfs_b->period_timer.function = sched_cfs_period_timer;
>
> We're in desperate need of more whitespace there? :-)

8< 8<

>
>> @@ -9333,6 +9337,8 @@ static int tg_set_cfs_bandwidth(struct t
>>       raw_spin_lock_irq(&cfs_b->lock);
>>       cfs_b->period = ns_to_ktime(period);
>>       cfs_b->runtime = cfs_b->quota = quota;
>> +
>> +     cfs_bump_quota_generation(cfs_b);
>>       raw_spin_unlock_irq(&cfs_b->lock);
>>
>>       for_each_possible_cpu(i) {
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -1331,11 +1331,25 @@ static void check_cfs_rq_quota(struct cf
>>       resched_task(rq_of(cfs_rq)->curr);
>>  }
>>
>> +static void cfs_bump_quota_generation(struct cfs_bandwidth *cfs_b)
>> +{
>> +     cfs_b->quota_generation++;
>> +     smp_mb();
>> +}
>
> Memory barriers come in pairs and with a comment, you fail on both
> counts.
>

scratches head... erm right, I meant to pair this with a read barrier
on querying the generation but balked when I realized that still
yields an lfence on x86 since I didn't want to introduce that to the
update_curr() path.

While we can probably do away with the barrier completely (it's not
critical that we line them up perfectly with the new generation), I've
been thinking about this one and I think I have something a little
nicer that also reduces the shared cache hits.

We can take advantage of the fact that sched_clocks are already
synchronized within 2 jiffies and store the quota's expiration,
instead of a generation, when we refresh.

This effectively yields a fairly simple control flow (we can use
rq->clock since we're always paired with update_rq_clock operations):
a) our rq->clock < expiration always implies quota is valid

Obviously if our cpu clock is ahead of the one that issued the quota,
our quota is still valid since the real deadline is even further
behind
Even if our cpu's clock is behind the max 1.99 jiffies the amount of
time that the stale quota can remain valid is basically already within
our potential margin of error since for a long running process we
check on each tick edge anyway.

 b) our rq->clock > expiration

Again there's two cases, if our cpu clock is behind (or equal) then
the deadline has indeed passed and the quota is expired.  This can be
confirmed by comparing the global deadline with our local one (the
global expiration will have advanced with quota refresh for this to be
true).

We can also catch that our cpu is potentially ahead -- by the fact
that our rq->clock > expiration but that the global expiration has not
yet advanced.  In this case we recognize that our quota is still valid
and extend our local expiration time by either the maximum margin of
error or some fraction there of (say 1 jiffy) which is guaranteed to
push us back in case a) above.  Again this is within our existing
margin of error due to entity_tick() alignment.

This ends up looking a lot simpler and avoids much of the pressure on
the global variable since we need to compare against it in the case
where our clock passes expiration, once a quota period (as the
extension will put us in case a where we know we don't need to
consider it).

This ends up simpler than the generation muck and can be introduced
cleanly earlier in the series, avoiding the churn mentioned above.

Make sense?

>> +
>> +static inline int cfs_rq_quota_current(struct cfs_rq *cfs_rq)
>> +{
>> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>> +
>> +     return cfs_rq->quota_generation == cfs_b->quota_generation;
>> +}
>> +
>>  static void request_cfs_rq_quota(struct cfs_rq *cfs_rq)
>>  {
>>       struct task_group *tg = cfs_rq->tg;
>>       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>>       u64 amount = 0, min_amount;
>> +     int generation;
>
> Not initialized,
>
>>       min_amount = sched_cfs_bandwidth_slice() + (-cfs_rq->quota_remaining);
>>
>> @@ -1347,10 +1361,18 @@ static void request_cfs_rq_quota(struct
>>               } else {
>>                       amount = min_amount;
>>               }
>> +             generation = cfs_b->quota_generation;
>>               raw_spin_unlock(&cfs_b->lock);
>>       }
>
> and since there's an if there, one can fail it, leaving generation
> uninitialized,
>
>>
>> +     /* a deficit should be carried forwards, surplus should be dropped */
>> +
>> +     if (generation != cfs_rq->quota_generation &&
>> +         cfs_rq->quota_remaining > 0)
>> +             cfs_rq->quota_remaining = 0;
>> +
>>       cfs_rq->quota_remaining += amount;
>> +     cfs_rq->quota_generation = generation;
>>  }
>
> Resulting in uninitialized usage right there.
>

Truth
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/