Date:	Fri, 22 Aug 2008 12:08:00 -0400
From:	Gregory Haskins <ghaskins@...ell.com>
To:	Esben Nielsen <nielsen.esben@...glemail.com>
CC:	mingo@...e.hu, paulmck@...ux.vnet.ibm.com, peterz@...radead.org,
	tglx@...utronix.de, rostedt@...dmis.org,
	linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
	gregory.haskins@...il.com, David.Holmes@....com, jkacur@...il.com
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Gregory Haskins wrote:
> Hi Esben,
>   Thank you for the review.  Comments inline.
>
> Esben Nielsen wrote:
>   
>> Disclaimer: I am no longer actively involved, and I must admit I might
>> have lost track of much of what has been going on since I contributed
>> to the PI system 2 years ago. But I allow myself to comment anyway.
>>
>> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <ghaskins@...ell.com> wrote:
>>   
>>     
>>> The kernel currently addresses priority-inversion through priority-
>>> inheritance.  However, all of the priority-inheritance logic is
>>> integrated into the Real-Time Mutex infrastructure.  This causes a few
>>> problems:
>>>
>>>  1) This tightly coupled relationship makes it difficult to extend to
>>>    other areas of the kernel (for instance, pi-aware wait-queues may
>>>    be desirable).
>>>  2) Enhancing the rtmutex infrastructure becomes challenging because
>>>    there is no separation between the locking code and the pi-code.
>>>
>>> This patch aims to rectify these shortcomings by designing a stand-alone
>>> pi framework which can then be used to replace the rtmutex-specific
>>> version.  The goal of this framework is to provide similar functionality
>>> to the existing subsystem, but with a sole focus on PI and the
>>> relationships between the objects that can boost priority and the
>>> objects that get boosted.
>>>     
>>>       
>> This is really a good idea. When I had time (2 years ago) to actively
>> work on these problems, I also came to the conclusion that PI should be
>> more general than just the rtmutex. Preemptive RCU was the example
>> which drove it.
>>
>> But I do disagree that general objects should get boosted: the end
>> targets are always tasks. The objects might be boosted as intermediate
>> steps, but priority in the end only applies to tasks.
>>   
>>     
> Actually I fully agree with you here.  It's probably just poor wording on
> my part, but this is exactly what happens.  We may "boost" arbitrary
> objects on the way to boosting a task...but the intermediate objects are
> just there to help find our way to the proper tasks.  Ultimately,
> everything ends up at the scheduler ;)
>
>   
>> I also have a few comments to the actual design:
>>
>>   
>>     
>>> ....
>>> +
>>> +Multiple sinks per Node:
>>> +
>>> +We allow multiple sinks to be associated with a node.  This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. "task->pi_blocked_on").  The reason we added the ability to attach more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain; these peripheral sinks are informally called "leaf sinks".
>>> +
>>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se.  Rather, they act as endpoints for a priority boost.  Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information.  However, there may be any number of leaf-sinks along a chain as well.  Each one will act on its localized priority in its own implementation-specific way.  For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
>>>     
>>>       
>> This is bad from an RT point of view: you have a hard time determining
>> the number of sinks per node. An rw-lock could have an arbitrary number
>> of readers (it is supposed to, really). Therefore you have no chance of
>> knowing how long the boost/deboost operation will take. And you also
>> don't know for how long the boosted tasks stay boosted. If there can be
>> an arbitrary number of such tasks, you can no longer be deterministic.
>>   
>>     
>
> While you may have a valid concern about what rwlocks can do to
> determinism, note that we already had PI-enabled rwlocks before my
> patch, so I am neither increasing nor decreasing determinism in this
> regard.  That being said, Steven Rostedt (author of the pi-rwlocks,
> CC'd) has facilities to manage this (such as limiting the number of
> readers to num_online_cpus) which this design would retain.  Long story
> short, I do not believe I have made anything worse here, so this is a
> different discussion if you are still concerned.
>
>
>   
>>   
>>     
>>> ...
>>> +
>>> +#define MAX_PI_DEPENDENCIES 5
>>>     
>>>       
>> WHAT??? There is a finite lock depth defined. I know we did that
>> originally but it wasn't hardcoded (as far as I remember) and
>> it was certainly not as low as 5.
>>   
>>     
>
> Note that this is simply in reference to how many direct sinks you can
> link to a node, not how long the resulting chain can grow.  The chain
> depth is actually completely unconstrained by the design.  I chose "5"
> here because typically we need 1 sink for the next link in the chain,
> and 1 sink for local notifications.  The other 3 are there for headroom
> (we often hit 3-4 as we transition between nodes: add one node -> delete
> another, etc.).
>   

To clarify what I meant here:  If you think of a normal linked-list node
having a single "next" pointer, this implementation is like each node
having up to 5 "next" pointers.  However, typically only 1-2 are used,
and all but one will usually point to a "leaf" node, meaning it does not
form a chain but terminates processing locally.  Typically there will be
only one link to something that forms a chain with other nodes.  I did
this because I realized the pattern (boost/deboost/update) was similar
whether the node was a leaf or a chain-link, so I unified both behind
the single pi_sink interface.
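
To make that a bit more concrete, the shape I have in mind looks roughly
like this (a simplified, illustrative sketch only; these are not the exact
definitions from the patch, and the ops signatures are trimmed down):

struct pi_sink;

struct pi_sink_ops {
	/* one interface serves both leaf sinks and chain-link sinks */
	int (*boost)(struct pi_sink *sink, int prio);
	int (*deboost)(struct pi_sink *sink, int prio);
	int (*update)(struct pi_sink *sink);
};

struct pi_sink {
	const struct pi_sink_ops *ops;
};

#define MAX_PI_DEPENDENCIES 5	/* as in the patch */

struct pi_node {
	struct pi_sink   sink;	/* a node is itself a sink to its sources */
	/* the "next pointers": usually only 1-2 in use, all but one a leaf */
	struct pi_sink  *sinks[MAX_PI_DEPENDENCIES];
	int              prio;	/* highest priority currently seen from the sources */
};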

That being understood, note that as with any linked-list, the nodes can
still have an arbitrary chaining depth (and I will fix this to be
iterative instead of recursive, as previously mentioned).
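
For v5 the chain walk itself would then become a plain loop instead of
recursing through sink->update(), roughly along these lines (again just a
sketch: pi_node_reprio() and pi_node_next_in_chain() are stand-ins for
whatever the final helpers end up being, and the node->lock/sinkref->lock
handling from the patch is omitted here):

/* stand-in helpers; the real patch would provide these */
static void pi_node_reprio(struct pi_node *node);
static struct pi_node *pi_node_next_in_chain(struct pi_node *node);

static void pi_chain_update(struct pi_node *node)
{
	while (node) {
		/*
		 * Recompute this node's priority from its sources and let
		 * the local leaf sinks act on it (e.g. reprioritize and
		 * reschedule a task).  Bounded work per node.
		 */
		pi_node_reprio(node);

		/*
		 * Follow the single chain-link sink, if any.  No recursion,
		 * so a long (possibly userspace-driven) chain cannot blow
		 * the kernel stack; the cost is linear in the chain length.
		 */
		node = pi_node_next_in_chain(node);
	}
}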

> You are not the first to comment about this, however, so it makes me
> realize it is not very clear ;)  I will comment the code better.
>
>
>   
>> Remember: PI is used by the userspace futexes as well!
>>   
>>     
>
> Yes, and on a slight tangent from your point, this is incidentally a
> problem in the current design that will require me to respin at least a
> v5.  My current design uses recursion against the sink->update()
> methods, which Peter Zijlstra pointed out would blow up with large
> userspace chains.  My next version will forgo the recursion in favor of
> an iterative method more reminiscent of the original design.
>
>   
>>   
>>     
>>> ....
>>> +/*
>>> + * _pi_node_update - update the chain
>>> + *
>>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>>> + * that need to propagate up the chain.  This is a step-wise process where we
>>> + * have to be careful about locking and preemption.  By trying MAX_PI_DEPs
>>> + * times, we guarantee that this update routine is an effective barrier...
>>> + * all modifications made prior to the call to this barrier will have completed.
>>> + *
>>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>>> + * form a graph of arbitrary structure.  While the graph should technically
>>> + * never close on itself barring any bugs, we still want to protect against
>>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>>> + * from detecting this potential).  To do this, we employ a dual-locking
>>> + * scheme where we can carefully control the order.  That is: node->lock
>>> + * protects most of the node's internal state, but it will never be held
>>> + * across a chain update.  sinkref->lock, on the other hand, can be held
>>> + * across a boost/deboost, and also guarantees proper execution order. Also
>>> + * note that no locks are held across a sink->update.
>>> + */
>>> +static int
>>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>>> +{
>>> +       struct pi_node    *node = node_of(sink);
>>> +       struct pi_sinkref *sinkref;
>>> +       unsigned long      iflags;
>>> +       int                count = 0;
>>> +       int                i;
>>> +       int                pprio;
>>> +       struct updater     updaters[MAX_PI_DEPENDENCIES];
>>> +
>>> +       spin_lock_irqsave(&node->lock, iflags);
>>> +
>>> +       pprio = node->prio;
>>> +
>>> +       if (!plist_head_empty(&node->srcs))
>>> +               node->prio = plist_first(&node->srcs)->prio;
>>> +       else
>>> +               node->prio = MAX_PRIO;
>>> +
>>> +       list_for_each_entry(sinkref, &node->sinks, list) {
>>> +               /*
>>> +                * If the priority is changing, or if this is a
>>> +                * BOOST/DEBOOST, we consider this sink "stale"
>>> +                */
>>> +               if (pprio != node->prio
>>> +                   || sinkref->state != pi_state_boosted) {
>>> +                       struct updater *iter = &updaters[count++];
>>>     
>>>       
>> What prevents count from overrun?
>>   
>>     
>
> The node->sinks list will never have more than MAX_PI_DEPs in it, by design.
>
>   
>>   
>>     
>>> +
>>> +                       BUG_ON(!atomic_read(&sinkref->sink->refs));
>>> +                       _pi_sink_get(sinkref);
>>> +
>>> +                       iter->update  = 1;
>>> +                       iter->sinkref = sinkref;
>>> +                       iter->sink     = sinkref->sink;
>>> +               }
>>> +       }
>>> +
>>> +       spin_unlock(&node->lock);
>>> +
>>> +       for (i = 0; i < count; ++i) {
>>> +               struct updater    *iter = &updaters[i];
>>> +               unsigned int       lflags = PI_FLAG_DEFER_UPDATE;
>>> +               struct pi_sink    *sink;
>>> +
>>> +               sinkref = iter->sinkref;
>>> +               sink = iter->sink;
>>> +
>>> +               spin_lock(&sinkref->lock);
>>> +
>>> +               switch (sinkref->state) {
>>> +               case pi_state_boost:
>>> +                       sinkref->state = pi_state_boosted;
>>> +                       /* Fall through */
>>> +               case pi_state_boosted:
>>> +                       sink->ops->boost(sink, &sinkref->src, lflags);
>>> +                       break;
>>> +               case pi_state_deboost:
>>> +                       sink->ops->deboost(sink, &sinkref->src, lflags);
>>> +                       sinkref->state = pi_state_free;
>>> +
>>> +                       /*
>>> +                        * drop the ref that we took when the sinkref
>>> +                        * was allocated.  We still hold a ref from
>>> +                        * above.
>>> +                        */
>>> +                       _pi_sink_put_all(node, sinkref);
>>> +                       break;
>>> +               case pi_state_free:
>>> +                       iter->update = 0;
>>> +                       break;
>>> +               default:
>>> +                       panic("illegal sinkref type: %d", sinkref->state);
>>> +               }
>>> +
>>> +               spin_unlock(&sinkref->lock);
>>> +
>>> +               /*
>>> +                * We will drop the sinkref reference while still holding the
>>> +                * preempt/irqs off so that the memory is returned synchronously
>>> +                * to the system.
>>> +                */
>>> +               _pi_sink_put_local(node, sinkref);
>>> +       }
>>> +
>>> +       local_irq_restore(iflags);
>>>     
>>>       
>> Yack! You keep interrupts off while doing the chain.
>>     
>
> Actually, not quite.  The first pass (with interrupts off) simply sets
> the new priority value at each local element (limited to 5, typically
> 1-2).  Short and sweet.  It's the "update" that happens next (with
> interrupts/preemption enabled) that updates the chain.
>
>
>   
>>  I think my main
>> contribution to the PI system 2 years ago was to make this preemptible,
>> i.e. there were points in the loop where interrupts and preemption
>> were turned on.
>>   
>>     
>
> I agree this is important, but I think you will see with further review
> that this is in fact what I do too.
>
>   
>> Remember: it goes into user space again. An evil user could craft an
>> application with a very long lock depth and keep higher-priority
>> real-time tasks from running for an arbitrarily long time (if no limit
>> on the lock depth is set; and setting a limit is bad too, because it
>> will be too low in some cases).
>>
>> But as I said, I have had no time to watch what has actually been going
>> on in the kernel for the last 2 years or so. Said defects might have
>> crept in via other contributors already :-(
>>
>> Esben
>>   
>>     
>
> Esben,
>   Your review and insight are very much appreciated.  I will be sure to
> address the concerns mentioned above and CC you on the next release.
>
> Thanks again,
> -Greg
>
>
>   


