Message-ID: <48AEE460.6030802@novell.com>
Date: Fri, 22 Aug 2008 12:08:00 -0400
From: Gregory Haskins <ghaskins@...ell.com>
To: Esben Nielsen <nielsen.esben@...glemail.com>
CC: mingo@...e.hu, paulmck@...ux.vnet.ibm.com, peterz@...radead.org,
tglx@...utronix.de, rostedt@...dmis.org,
linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
gregory.haskins@...il.com, David.Holmes@....com, jkacur@...il.com
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface
Gregory Haskins wrote:
> Hi Esben,
> Thank you for the review. Comments inline.
>
> Esben Nielsen wrote:
>
>> Disclaimer: I am no longer actively involved, and I must admit I may
>> have lost track of much of what has been going on since I contributed
>> to the PI system 2 years ago. But I allow myself to comment anyway.
>>
>> On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <ghaskins@...ell.com> wrote:
>>
>>
>>> The kernel currently addresses priority-inversion through priority-
>>> inheritance. However, all of the priority-inheritance logic is
>>> integrated into the Real-Time Mutex infrastructure. This causes a few
>>> problems:
>>>
>>> 1) This tightly coupled relationship makes it difficult to extend to
>>> other areas of the kernel (for instance, pi-aware wait-queues may
>>> be desirable).
>>> 2) Enhancing the rtmutex infrastructure becomes challenging because
>>> there is no separation between the locking code and the pi-code.
>>>
>>> This patch aims to rectify these shortcomings by designing a stand-alone
>>> pi framework which can then be used to replace the rtmutex-specific
>>> version. The goal of this framework is to provide similar functionality
>>> to the existing subsystem, but with sole focus on PI and the
>>> relationships between objects that can boost priority, and the objects
>>> that get boosted.
>>>
>>>
>> This is really a good idea. When I had time (2 years ago) to actively
>> work on these problems, I also came to the conclusion that PI should
>> be more general than just the rtmutex. Preemptive RCU was the example
>> which drove it.
>>
>> But I do disagree that general objects should get boosted: the end
>> targets are always tasks. The objects might be boosted as intermediate
>> steps, but priority in the end only applies to tasks.
>>
>>
> Actually I fully agree with you here. It's probably just poor wording
> on my part, but this is exactly what happens. We may "boost" arbitrary
> objects on the way to boosting a task...but the intermediate objects
> are just there to help find our way to the proper tasks. Ultimately,
> everything ends up at the scheduler ;)
>
>
>> I also have a few comments to the actual design:
>>
>>
>>
>>> ....
>>> +
>>> +Multiple sinks per Node:
>>> +
>>> +We allow multiple sinks to be associated with a node. This is a slight departure from the previous implementation, which had the notion of only a single sink (i.e. "task->pi_blocked_on"). The reason we added the ability to attach more than one sink was not to change the default chaining model (i.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain; these extra sinks are informally called "leaf sinks".
>>> +
>>> +Leaf-sinks are boostable objects that do not perpetuate a chain per se. Rather, they act as endpoints of a priority boost. Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information. However, there may be any number of leaf-sinks along a chain as well. Each one will act on its localized priority in its own implementation-specific way. For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary, whereas an rwlock leaf-sink may boost a list of reader-owners.
>>>
>>>
>> This is bad from an RT point of view: you have a hard time determining
>> the number of sinks per node. An rw-lock could have an arbitrary
>> number of readers (and is supposed to, really). Therefore you have no
>> chance of knowing how long the boost/deboost operation will take, nor
>> do you know for how long the boosted tasks stay boosted. If there can
>> be an arbitrary number of such tasks you can no longer be
>> deterministic.
>>
>>
>
> While you may have a valid concern about what rwlocks can do to
> determinism, note that we already had PI-enabled rwlocks before my
> patch, so I am neither increasing nor decreasing determinism in this
> regard. That being said, Steven Rostedt (author of the pi-rwlocks,
> CC'd) has facilities to manage this (such as limiting the number of
> readers to num_online_cpus) which this design would retain. Long story
> short, I do not believe I have made anything worse here, so this is a
> different discussion if you are still concerned.
>
>
>
>>
>>
>>> ...
>>> +
>>> +#define MAX_PI_DEPENDENCIES 5
>>>
>>>
>> WHAT??? There is a finite lock depth defined. I know we did that
>> originally but it wasn't hardcoded (as far as I remember) and
>> it was certainly not as low as 5.
>>
>>
>
> Note that this is simply in reference to how many direct sinks you can
> link to a node, not how long the resulting chain can grow. The chain
> depth is actually completely unconstrained by the design. I chose "5"
> here because typically we need 1 sink for the next link in the chain,
> and 1 sink for local notifications. The other 3 are there for headroom
> (we often hit 3-4 as we transition between nodes: add one node -> delete
> another, etc.).
>
To clarify what I meant here: If you think of a normal linked-list node
having a single "next" pointer, this implementation is like each node
having up to 5 "next" pointers. However, typically only 1-2 are used,
and all but one will usually point to a "leaf" node, meaning it does not
form a chain but terminates processing locally. Typically there will be
only one link to something that forms a chain with other nodes. I did
this because I realized the pattern (boost/deboost/update) was similar
whether the node was a leaf or a chain-link, so I unified both behind
the single pi_sink interface.
That being understood, note that as with any linked-list, the nodes can
still have an arbitrary chaining depth (and I will fix this to be
iterative instead of recursive, as previously mentioned).
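
To make that concrete, here is a rough structural sketch (illustrative
only: it abbreviates the declarations in the patch rather than
reproducing them, and the ops-struct name and member layout are
approximations):

#include <linux/list.h>
#include <linux/plist.h>
#include <linux/spinlock.h>

struct pi_source;                  /* a priority fed into a node */
struct pi_sink;                    /* anything that can receive a boost */

struct pi_sink_ops {
	/* a leaf sink acts on the priority locally (e.g. reschedules a
	 * task); a chain-link sink forwards it to the next node */
	int (*boost)(struct pi_sink *sink, struct pi_source *src,
		     unsigned int flags);
	int (*deboost)(struct pi_sink *sink, struct pi_source *src,
		       unsigned int flags);
	int (*update)(struct pi_sink *sink, unsigned int flags);
};

struct pi_node {
	spinlock_t        lock;   /* local state only; never held across
				   * a chain update */
	int               prio;   /* current effective priority */
	struct plist_head srcs;   /* who is boosting this node */
	struct list_head  sinks;  /* at most MAX_PI_DEPENDENCIES sinkrefs:
				   * usually one chain-link plus leaves */
};

So the "5" bounds the fan-out of a single node's sinks list; chain depth
comes from one of those sinks pointing at another node, and that remains
unbounded.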
> You are not the first to comment about this, however, so it makes me
> realize it is not very clear ;) I will comment the code better.
>
>
>
>> Remember: PI is used by the user-space futexes as well!
>>
>>
>
> Yes, and on a slight tangent from your point, this incidentally is
> actually a problem in the design such that I need to respin at least a
> v5. My current design uses recursion against the sink->update()
> methods, which Peter Zijlstra pointed out would blow up with large
> userspace chains. My next version will forgo the recursion in favor of
> an iterative method more reminiscent of the original design.
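
To give a flavor of what I have in mind for v5 (purely illustrative; the
__pi_node_reevaluate() helper is a made-up name and does not exist in
the posted patch):

/*
 * Sketch of an iterative chain update: instead of sink->update()
 * recursing into the next node, the caller loops node by node, so
 * stack usage stays constant no matter how deep a (possibly
 * userspace-constructed) chain gets.
 */

/* hypothetical helper: recompute one node, notify its leaf sinks, and
 * return the next node in the chain (or NULL to terminate) */
extern struct pi_node *__pi_node_reevaluate(struct pi_node *node,
					    unsigned int flags);

static void pi_chain_update(struct pi_node *node, unsigned int flags)
{
	while (node)
		node = __pi_node_reevaluate(node, flags);
}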
>
>
>>
>>
>>> ....
>>> +/*
>>> + * _pi_node_update - update the chain
>>> + *
>>> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
>>> + * that need to propagate up the chain. This is a step-wise process where we
>>> + * have to be careful about locking and preemption. By trying MAX_PI_DEPs
>>> + * times, we guarantee that this update routine is an effective barrier...
>>> + * all modifications made prior to the call to this barrier will have completed.
>>> + *
>>> + * Deadlock avoidance: This node may participate in a chain of nodes which
>>> + * form a graph of arbitrary structure. While the graph should technically
>>> + * never close on itself barring any bugs, we still want to protect against
>>> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
>>> + * from detecting this potential). To do this, we employ a dual-locking
>>> + * scheme where we can carefully control the order. That is: node->lock
>>> + * protects most of the node's internal state, but it will never be held
>>> + * across a chain update. sinkref->lock, on the other hand, can be held
>>> + * across a boost/deboost, and also guarantees proper execution order. Also
>>> + * note that no locks are held across a sink->update.
>>> + */
>>> +static int
>>> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
>>> +{
>>> + struct pi_node *node = node_of(sink);
>>> + struct pi_sinkref *sinkref;
>>> + unsigned long iflags;
>>> + int count = 0;
>>> + int i;
>>> + int pprio;
>>> + struct updater updaters[MAX_PI_DEPENDENCIES];
>>> +
>>> + spin_lock_irqsave(&node->lock, iflags);
>>> +
>>> + pprio = node->prio;
>>> +
>>> + if (!plist_head_empty(&node->srcs))
>>> + node->prio = plist_first(&node->srcs)->prio;
>>> + else
>>> + node->prio = MAX_PRIO;
>>> +
>>> + list_for_each_entry(sinkref, &node->sinks, list) {
>>> + /*
>>> + * If the priority is changing, or if this is a
>>> + * BOOST/DEBOOST, we consider this sink "stale"
>>> + */
>>> + if (pprio != node->prio
>>> + || sinkref->state != pi_state_boosted) {
>>> + struct updater *iter = &updaters[count++];
>>>
>>>
>> What prevents count from overrunning?
>>
>>
>
> The node->sinks list will never have more than MAX_PI_DEPs in it, by design.
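
Just to illustrate the invariant (this helper is not in the patch, and
the nr_sinks counter is hypothetical), the idea is that the add-sink
path enforces the bound, so the on-stack updaters[] array in
_pi_node_update() can never overflow:

/* caller holds node->lock */
static int _pi_node_add_sink(struct pi_node *node, struct pi_sinkref *sinkref)
{
	/* nr_sinks is a hypothetical per-node count of attached sinks */
	if (WARN_ON(node->nr_sinks >= MAX_PI_DEPENDENCIES))
		return -ENOSPC;

	node->nr_sinks++;
	list_add_tail(&sinkref->list, &node->sinks);
	return 0;
}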
>
>
>>
>>
>>> +
>>> + BUG_ON(!atomic_read(&sinkref->sink->refs));
>>> + _pi_sink_get(sinkref);
>>> +
>>> + iter->update = 1;
>>> + iter->sinkref = sinkref;
>>> + iter->sink = sinkref->sink;
>>> + }
>>> + }
>>> +
>>> + spin_unlock(&node->lock);
>>> +
>>> + for (i = 0; i < count; ++i) {
>>> + struct updater *iter = &updaters[i];
>>> + unsigned int lflags = PI_FLAG_DEFER_UPDATE;
>>> + struct pi_sink *sink;
>>> +
>>> + sinkref = iter->sinkref;
>>> + sink = iter->sink;
>>> +
>>> + spin_lock(&sinkref->lock);
>>> +
>>> + switch (sinkref->state) {
>>> + case pi_state_boost:
>>> + sinkref->state = pi_state_boosted;
>>> + /* Fall through */
>>> + case pi_state_boosted:
>>> + sink->ops->boost(sink, &sinkref->src, lflags);
>>> + break;
>>> + case pi_state_deboost:
>>> + sink->ops->deboost(sink, &sinkref->src, lflags);
>>> + sinkref->state = pi_state_free;
>>> +
>>> + /*
>>> + * drop the ref that we took when the sinkref
>>> + * was allocated. We still hold a ref from
>>> + * above.
>>> + */
>>> + _pi_sink_put_all(node, sinkref);
>>> + break;
>>> + case pi_state_free:
>>> + iter->update = 0;
>>> + break;
>>> + default:
>>> + panic("illegal sinkref type: %d", sinkref->state);
>>> + }
>>> +
>>> + spin_unlock(&sinkref->lock);
>>> +
>>> + /*
>>> + * We will drop the sinkref reference while still holding the
>>> + * preempt/irqs off so that the memory is returned synchronously
>>> + * to the system.
>>> + */
>>> + _pi_sink_put_local(node, sinkref);
>>> + }
>>> +
>>> + local_irq_restore(iflags);
>>>
>>>
>> Yack! You keep interrupts off while doing the chain.
>>
>
> Actually, not quite. The first pass (with interrupts off) simply sets
> the new priority value at each local element (limited to 5, typically
> 1-2). Short and sweet. It's the "update" that happens next (with
> interrupts/preemption enabled) that updates the chain.
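
Schematically, the split looks like this (a sketch of the pattern only,
not the literal patch code; the __pi_* helpers are made-up names):

/* hypothetical helpers standing in for the two halves of the real code */
extern void __pi_set_local_prio(struct pi_node *node); /* bounded, <= 5 sinks */
extern void __pi_propagate(struct pi_node *node);      /* walks the chain */

static void pi_update(struct pi_node *node)
{
	unsigned long flags;

	/*
	 * Phase 1: interrupts off.  Constant-time work: record the new
	 * priority at each local sink (at most MAX_PI_DEPENDENCIES of
	 * them), deferring propagation via PI_FLAG_DEFER_UPDATE.
	 */
	local_irq_save(flags);
	__pi_set_local_prio(node);
	local_irq_restore(flags);

	/*
	 * Phase 2: interrupts and preemption enabled again.  The deferred
	 * sink->ops->update() work runs here and may walk arbitrarily far
	 * down the chain without stretching the irq-off window.
	 */
	__pi_propagate(node);
}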
>
>
>
>> I think my main contribution to the PI system 2 years ago was to make
>> this preemptible, i.e. there were points in the loop where interrupts
>> and preemption were turned on.
>>
>>
>
> I agree this is important, but I think you will see with further review
> that this is in fact what I do too.
>
>
>> Remember: it goes into user space again. An evil user could craft an
>> application with a very long lock depth and keep higher-priority
>> real-time tasks from running for an arbitrarily long time if no limit
>> on the lock depth is set (and setting a limit is also bad, because it
>> will be too low in some cases).
>>
>> But as I said, I have had no time to watch what has actually been
>> going on in the kernel for roughly the last 2 years. The said defects
>> might have crept in via other contributors already :-(
>>
>> Esben
>>
>>
>
> Esben,
> Your review and insight are very much appreciated. I will be sure to
> address the concerns mentioned above and CC you on the next release.
>
> Thanks again,
> -Greg
>
>
>