linux-kernel - Re: Sum of weights idea for CFS PI

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bb28d85a-c50f-a25f-aeb4-672eecb75b55@joelfernandes.org>
Date:   Tue, 4 Oct 2022 16:27:57 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Qais Yousef <qais.yousef@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>, juri.lelli@...hat.com,
        vincent.guittot@...aro.org,
        Youssef Esmat <youssefesmat@...gle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Thomas Gleixner <tglx@...utronix.de>, bristot@...hat.com,
        clark.williams@...il.com, bigeasy@...utronix.de,
        "Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: Sum of weights idea for CFS PI



On 10/3/2022 12:14 PM, Qais Yousef wrote:
> On 09/30/22 13:34, Joel Fernandes wrote:
>>
>>
>> On 9/30/2022 9:49 AM, Qais Yousef wrote:
>>> Hi Joel
>>>
>>> I'm interested in the topic, if I can be CCed in any future discussions I'd
>>> appreciate it :)
>>
>> Yes, surely! Will do :)
>>
>>> On 09/29/22 16:38, Joel Fernandes wrote:
>>>> Hi Peter, all,
>>>>
>>>> Just following-up about the idea Peter suggested at LPC22 about sum of weights
>>>> to solve the CFS priority inversion issues using priority inheritance. I am not
>>>> sure if a straight forward summation of the weights of dependencies in the
>>>> chain, is sufficient (or may cause too much unfairness).
>>>>
>>>> I think it will work if all the tasks on CPU are 100% in utilization:
>>>>
>>>> Say if you have 4 tasks (A, B, C, D) running and each one has equal
>>>> weight (W) except for A which has twice the weight (2W).
>>>> So the CPU bandwidth distribution is (assuming all are running):
>>>> A:   2 / 5
>>>> B, C. D:  1 / 5
>>>>
>>>> Say out of the 4 tasks, 3 of them are a part of a classical priority
>>>> inversion scenario (A, B and C).
>>>>
>>>> Say now A blocks on a lock and that lock's owner C is running, however now
>>>> because A has blocked, B gets 1/3 bandwidth, where as it should have been
>>>> limited to 1/5. To remedy this, say you give C a weight of 2W. B gets 1/4
>>>> bandwidth - still not fair since B is eating away CPU bandwidth causing the
>>>> priority inversion we want to remedy.
>>>>
>>>> The correct bandwidth distribution should be (B and D should be unchanged):
>>>> B = 1/5
>>>> D = 1/5
>>>>
>>>> C = 3/5
>>>>
>>>> This means that C's weight should be 3W , and B and D should be W each
>>>> as before. So indeed, C's new weight is its original weight PLUS the
>>>> weight of the A - that's needed to keep the CPU usage of the other
>>>> tasks (B, D) in check so that C makes forward progress on behalf of A and the
>>>> other tasks don't eat into the CPU utilization.
>>>>
>>>> However, I think this will kinda fall apart if A is asleep 50% of the time
>>>> (assume the sleep is because of I/O and unrelated to the PI chain).
>>>>
>>>> Because now if all were running (and assume no PI dependencies), with A being
>>>> 50%, the bandwidth of B, C and D each would be divided into 2 components:
>>>>
>>>> a.  when A is running, it would be as above.
>>>> b.  but if A was sleeping, B, C, and D would get 1/3.
>>>>
>>>> So on average, B, C and D get:  (1/3 + 1/5) / 2 = 8/30. This gives A about 6/30
>>>> or 1/5 bandwidth.
>>>
>>> The average metric is interesting one. It can be confusing to reason about too.
>>>
>>> I think we have 3 events to take into account here, not 2:
>>>
>>> a. when A is running and NOT blocked on C.
>>> b. when A is running and BLOCKED on C.
>>> c. A is sleeping.
>>>
>>> This means A, B, C and D's shares will be:
>>>
>>>     A ,  B ,  C ,  D
>>> a. 2/5, 1/5, 1/5, 1/5
>>> b. -  , 3/5, 1/5, 1/5
>>
>> Here did you mean:
>> b.  -, 1/5, 3/5, 1/5 ?
> 
> Yes! Sorry a typo.
> 
>>
>> A blocked on C means, C should get A's weight (on top of its own).
>>
>>> c. -  , 1/3, 1/3, 1/3
>>>
>>> Since A is sleeping for 50%, I don't think we can assume equal distribution for
>>> the 3 events (can't just divide by 3).
>>
>> Oh yeah, I did not get to _that_ part of the story yet at this point of the
>> email, I brought it up later below where I say "But now say A happen to block"..
> 
> Hmm, but that's my point, I think you need to take _that_ part here or we're
> not be doing an apple to apple comparison.
> 
>>
>>>
>>> I believe we can assume that
>>>
>>> a. occurs 25% of the time
>>> b. occurs 25% of the time
>>> c. occurs 50% of the time
>>>
>>> I *think* this should provide something more representative.
>>
>> Yes possible. My basic idea was I was trying to *not* change the distribution of
>> B based on whether A blocks on C, or whether it does not. In my view, B's
>> distribution should not change just because A and C have a lock dependency,
>> because otherwise C can get too much or too little time. If C gets too much
>> time, then its hurting B. If C gets too little time, then its hurting A.
> 
> Maybe I am getting confused. But AFAICT you're treating
> 
> 
> 	a.  when A is running, it would be as above.
> 	b.  but if A was sleeping, B, C, and D would get 1/3.
> 
> similar to
> 
> 	a.  when A is running *and blocked on C for all its runtime*
> 	b.  but if A was sleeping, B, C, and D would get 1/3.


I am treating the following the same:

 	a.  when A is running, it would be as above.
 	b.  but if A was sleeping, B, C, and D would get 1/3.

 similar to

 	a.  when A is running *and blocked on C for all its runtime*
		^^ -- in this case, B and D should not have their distributions
		      changed at all because they are not participating in the
		      lock acquire and release. So they should neither be hurt
		      any more, nor be boosted. They should simply stay same [1]

 	b.  but if A was sleeping, B, C, and D would get 1/3.


[1] Why? Consider 3 tasks in the all-RT case, A high, B medium and C low prio.

If all are running 100% and A does not block on C, B is blocked by A
indefinitely. So the prio of A and B are inverted. We seek to rectify this, that
is we need make changes such that, B is returned back to the blocked state. We
do this by boosting C.

In other words, the prio inheritance will cause B's distribution to not be
changed (it was supposed to be blocked before and it is now going to be blocked
state again).

CFS should not behave any differently, B's distribution should not be changed
before/after the priority inhertiance of A by C. That's just my opinion - and
that's how I calculated to distribution. With that mind, could you go back to
seeing if my math was originally correct or did I mess something up?

I do think though that Youssef's point of not worrying about being too accurate
is reasonable if the critical sections are short lived but I'm not sure.

> 
> I don't think this is valid. If A is blocked on C for 50% of the time, and
> sleeping for 50% of the time, when did it get blocked/unblocked?
> 
> This will have an impact on the average share for C and skew it, no?
> 
> Unless I missed something, the average share of C being (3/5 + 1/3) is an
> impossible state. You need to consider the portion of time when C runs as 1/5,
> when A is actually not blocked on anything, too.
> 
> Hmm actually I just re-read your statement below and you just say 3/5 (18/30)
> is too much. You didn't consider the average. I'll leave the above in hope to
> help me understand what am I missing and where I went wrong :-)
> 
> Generally IMHO looking at the average will not help. I think if the share
> values make sense in each state individually (and I believe they are), that
> would be enough. AFAICS, B and D are still taking the right amount of time when
> C inherits the bandwidth. And C by definition will run longer when A is blocked
> on it for the whole duration of this blocked time.

I was degenerating the case where A sleeps (say I/O) vs A blocks, to simplify
the math, and then taking average of that. I think that's reasonable?

> 
>>
>>>> But now say A happen to block on a lock that C is holding. You would boost C to
>>>> weight 3W which gives it 3/5 (or 18/30) as we saw above, which is more than what
>>>> C should actually get.
>>>>
>>>> C should get (8/30 + 6/30 = 14/30) AFAICS.
>>>>
>>>> Hopefully one can see that a straight summation of weights is not enough. It
>>>> needs to be something like:
>>>>
>>>> C's new weight = C's original weight + (A's weight) * (A's utilization)
>>>>
>>>> Or something, otherwise the inherited weight may be too much to properly solve it.
>>>>
>>>> Any thoughts on this? You mentioned you had some notes on this and/or proxy
>>>> execution, could you share it?
>>>
>>> I assume we'll be using rt-mutex inheritance property to handle this? If this
>>> was discussed during a talk, I'd appreciate a link to that.
>>
>> Yes that's the idea but I am also aware that 'other' kinds of dependencies in
>> userspace that cannot benefit from a kernel-only boost.
>>
>> So if we consider a bounded-buffer producer/consumer. We can consider the
>> producer as A, and the consumer as C, with B being a random mid-priority task.
>> Once the bounded buffer gets full, A is waiting on a signal from C that it
>> filled the buffer for which C needs to run in the first place to queue its
>> payload into the buffer. However, trouble-maker B is determined to eat away's
>> C's time and develop a prio inversion between itself and A. This pattern should
>> also generalize to a worker pool consuming work.
> 
> Will proxy-execution be able to handle these other cases? I think it is tied to
> rt-mutex to be able to detect the dependency chain too. Looks a generic
> extension of the problem that all potential solutions will need to deal with.

Yeah, in theory if the producer and consumer can "mark" themselves as
potentially have someone waiting for them, then the kernel has that info and any
implementation may capitalize on that. I am not sure if this is already possible
by some command in the futex API.

>>> In the past in OSPM conference we brought up an issue with performance
>>> inversion where a task running on a smaller (slower to be more generic) CPU is
>>> holding the lock and causing massive delays for waiters. This is an artefact of
>>> DVFS. For HMP, there's an additional cause due to the unequal capacities of the
>>> CPUs.
>>>
>>> Proxy execution seems to be the nice solution to all of these problems, but
>>> it's a long way away. I'm interested to learn how this inheritance will be
>>> implemented. And whether there are any userspace conversion issues. i.e: do
>>> we need to convert all locks to rt-mutex locks?
>>
>> I am not an expert on FUTEX_LOCK_PI and this could be a good time for tglx to
>> weigh in, but I think converting all userspace locks to use FUTEX_LOCK_PI sounds
>> reasonable to me.
> 
> I realized whether we need to worry about in-kernel kthreads users too..

Possibly.

> pthreads requires users to specify if they want PTHREAD_PRIO_{INHERIT, PROTECT}
> when initializing their mutex.

PTHREAD_PRIO_PROTECT seems like an interesting approach! Thanks for mentioning.
> AFAICS from glibc, PTHREAD_PRIO_INHERIT maps to FUTEX_LOCK_PI.

Ok.
> But PTHREAD_PRIO_PROTOTECT uses a different algorithm that is implemented in
> glibc. This makes me think how much this makes sense in linux as for CFS
> priorities are transalted into weights and not actual priorites like in RT :-/
> 
> I need to dig more..

Ah same here, more reading/digging to do ;).


- Joel