Message-ID: <00140e95-0fe2-1ce4-1433-a3211f9da20c@joelfernandes.org>
Date: Fri, 30 Sep 2022 13:34:49 -0400
From: Joel Fernandes <joel@...lfernandes.org>
To: Qais Yousef <qais.yousef@....com>
Cc: Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
Steven Rostedt <rostedt@...dmis.org>, juri.lelli@...hat.com,
vincent.guittot@...aro.org,
Youssef Esmat <youssefesmat@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Thomas Gleixner <tglx@...utronix.de>, bristot@...hat.com,
clark.williams@...il.com, bigeasy@...utronix.de,
"Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: Sum of weights idea for CFS PI
On 9/30/2022 9:49 AM, Qais Yousef wrote:
> Hi Joel
>
> I'm interested in the topic, if I can be CCed in any future discussions I'd
> appreciate it :)
Yes, surely! Will do :)
> On 09/29/22 16:38, Joel Fernandes wrote:
>> Hi Peter, all,
>>
>> Just following up on the idea Peter suggested at LPC22 about using a sum of weights
>> to solve the CFS priority inversion issues via priority inheritance. I am not
>> sure if a straightforward summation of the weights of dependencies in the
>> chain is sufficient (or whether it may cause too much unfairness).
>>
>> I think it will work if all the tasks on the CPU are at 100% utilization:
>>
>> Say if you have 4 tasks (A, B, C, D) running and each one has equal
>> weight (W) except for A which has twice the weight (2W).
>> So the CPU bandwidth distribution is (assuming all are running):
>> A: 2 / 5
>> B, C, D: 1 / 5
>>
>> Say that out of the 4 tasks, 3 are part of a classical priority
>> inversion scenario (A, B and C).
>>
>> Say now A blocks on a lock whose owner C is running. However, now that
>> A has blocked, B gets 1/3 bandwidth, whereas it should have been
>> limited to 1/5. To remedy this, say you give C a weight of 2W. B then gets 1/4
>> bandwidth - still not fair, since B is eating away CPU bandwidth, causing the
>> priority inversion we want to remedy.
>>
>> The correct bandwidth distribution should be (B and D should be unchanged):
>> B = 1/5
>> D = 1/5
>>
>> C = 3/5
>>
>> This means that C's weight should be 3W, and B and D should be W each
>> as before. So indeed, C's new weight is its original weight PLUS the
>> weight of A - that's needed to keep the CPU usage of the other
>> tasks (B, D) in check, so that C makes forward progress on behalf of A and the
>> other tasks don't eat into the CPU utilization.
>>
>> However, I think this will kinda fall apart if A is asleep 50% of the time
>> (assume the sleep is because of I/O and unrelated to the PI chain).
>>
>> Because now, if all were running (and assuming no PI dependencies), with A running
>> 50% of the time, the bandwidth of each of B, C and D would be divided into 2 components:
>>
>> a. when A is running, it would be as above.
>> b. but if A was sleeping, B, C, and D would get 1/3.
>>
>> So on average, B, C and D get: (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
>> or 1/5 bandwidth.
>
> The average metric is an interesting one. It can be confusing to reason about, too.
>
> I think we have 3 events to take into account here, not 2:
>
> a. when A is running and NOT blocked on C.
> b. when A is running and BLOCKED on C.
> c. A is sleeping.
>
> This means A, B, C and D's shares will be:
>
> A , B , C , D
> a. 2/5, 1/5, 1/5, 1/5
> b. - , 3/5, 1/5, 1/5
Here did you mean:
b. -, 1/5, 3/5, 1/5 ?
A blocked on C means C should get A's weight (on top of its own).
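To make sure we're reading the rows the same way, here is the back-of-envelope
I'm using (purely illustrative, not real scheduler code: share = weight / sum of
runnable weights, and when A is blocked on C, C runs with A's weight added to
its own):

#include <stdio.h>

/* Hypothetical helper, just to make the three rows explicit. */
static void shares(const char *ev, double a, double b, double c, double d)
{
	double sum = a + b + c + d;

	printf("%s  A=%.2f B=%.2f C=%.2f D=%.2f\n",
	       ev, a / sum, b / sum, c / sum, d / sum);
}

int main(void)
{
	shares("a. A running, not blocked:", 2, 1, 1, 1); /* 2/5, 1/5, 1/5, 1/5 */
	shares("b. A blocked on C (boost):", 0, 1, 3, 1); /*  - , 1/5, 3/5, 1/5 */
	shares("c. A sleeping            :", 0, 1, 1, 1); /*  - , 1/3, 1/3, 1/3 */
	return 0;
}

(A's column simply prints as 0.00 in rows b and c, since it is not runnable there.)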
> c. - , 1/3, 1/3, 1/3
>
> Since A is sleeping for 50%, I don't think we can assume equal distribution for
> the 3 events (can't just divide by 3).
Oh yeah, I had not gotten to _that_ part of the story yet at this point of the
email; I bring it up later below, where I say "But now say A happens to block".
>
> I believe we can assume that
>
> a. occurs 25% of the time
> b. occurs 25% of the time
> c. occurs 50% of the time
>
> I *think* this should provide something more representative.
Yes, possibly. My basic idea was to try to *not* change the distribution of
B based on whether or not A blocks on C. In my view, B's
distribution should not change just because A and C have a lock dependency,
because otherwise C can get too much or too little time. If C gets too much
time, then it's hurting B. If C gets too little time, then it's hurting A.
>> But now say A happens to block on a lock that C is holding. You would boost C to
>> weight 3W, which gives it 3/5 (or 18/30) as we saw above - more than what
>> C should actually get.
>>
>> C should get (8/30 + 6/30 = 14/30) AFAICS.
>>
>> Hopefully one can see that a straight summation of weights is not enough. It
>> needs to be something like:
>>
>> C's new weight = C's original weight + (A's weight) * (A's utilization)
>>
>> Or something like that; otherwise the inherited weight may be too much to properly solve it.
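To put rough numbers next to the above - this is just me restating the
arithmetic from earlier in this email, nothing new and not a proposed
implementation:

#include <stdio.h>

int main(void)
{
	double a_util = 0.5;				/* A asleep 50% of the time */

	double bcd_avg = (1.0 / 5 + 1.0 / 3) / 2;	/* 8/30 for B, C, D each */
	double a_avg   = (2.0 / 5) * a_util;		/* 6/30 for A */

	double c_target   = bcd_avg + a_avg;		/* 14/30: what C should get */
	double c_straight = 3.0 / 5;			/* 18/30: straight 3W boost */

	/* Utilization-scaled inheritance: W + 2W * 0.5 = 2W instead of 3W. */
	double c_new_weight = 1.0 + 2.0 * a_util;

	printf("B/C/D avg = %.3f, A avg = %.3f\n", bcd_avg, a_avg);
	printf("C target = %.3f vs straight-sum boost = %.3f\n",
	       c_target, c_straight);
	printf("C's boosted weight with utilization scaling = %.1fW\n",
	       c_new_weight);
	return 0;
}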
>>
>> Any thoughts on this? You mentioned you had some notes on this and/or proxy
>> execution - could you share those?
>
> I assume we'll be using the rt-mutex inheritance property to handle this? If this
> was discussed during a talk, I'd appreciate a link to that.
Yes, that's the idea, but I am also aware of 'other' kinds of dependencies in
userspace that cannot benefit from a kernel-only boost.
Consider a bounded-buffer producer/consumer: the producer is A and the
consumer is C, with B being a random mid-priority task. Once the bounded
buffer gets full, A waits on a signal from C that it has made room in the
buffer - and C needs to run in the first place for that to happen - before A
can queue its next payload into the buffer. However, trouble-maker B is
determined to eat away C's time and create a prio inversion between itself
and A. This pattern should also generalize to a worker pool consuming work.
In this case there is no lock involved, yet you have a dependency. But I don't
mean to sound depressing; just because there are cases like this does not
mean we should not solve the lock-based ones. When I looked at Android, I saw
that it uses futexes directly from the Android Runtime code instead of going
through pthread. So perhaps that can be trivially converted to FUTEX_LOCK_PI,
and then what we do in the kernel will JustWork(Tm)?
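For reference, my rough mental model of the userspace side of such a
conversion is below. This is only a hand-wavy sketch of the protocol described
in futex(2) - error handling, timeouts and the robust/requeue details are all
glossed over, and it is not what the ART/bionic code actually looks like:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static long sys_futex(uint32_t *uaddr, int op, uint32_t val)
{
	/* glibc has no futex() wrapper, so invoke the syscall directly. */
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void pi_lock(uint32_t *futex)
{
	uint32_t expected = 0;

	/* Fast path: 0 -> our TID; uncontended, the kernel never sees it. */
	if (__atomic_compare_exchange_n(futex, &expected, (uint32_t)gettid(),
					0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return;

	/* Slow path: the kernel records us as a waiter and PI-boosts the owner. */
	sys_futex(futex, FUTEX_LOCK_PI, 0);
}

static void pi_unlock(uint32_t *futex)
{
	uint32_t expected = (uint32_t)gettid();

	/* Fast path: our TID -> 0; nobody was waiting. */
	if (__atomic_compare_exchange_n(futex, &expected, 0,
					0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))
		return;

	/* Slow path: FUTEX_WAITERS is set, let the kernel hand over ownership. */
	sys_futex(futex, FUTEX_UNLOCK_PI, 0);
}

The nice part is that on the contended path the kernel already knows the
owner's TID from the futex word, so whatever inheritance scheme we settle on
(sum of weights or otherwise) has the dependency information it needs.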
>
> In the past, at the OSPM conference, we brought up an issue with performance
> inversion where a task running on a smaller (slower, to be more generic) CPU is
> holding the lock and causing massive delays for waiters. This is an artefact of
> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
> CPUs.
>
> Proxy execution seems to be the nicest solution to all of these problems, but
> it's a long way away. I'm interested to learn how this inheritance will be
> implemented, and whether there are any userspace conversion issues, i.e.: do
> we need to convert all locks to rt-mutex locks?
I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
reasonable to me.
Other thoughts?
thanks,
- Joel