linux-kernel - Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Message-ID: <59929f7a0808220555r5a0b972cl5db047f74d7cabec@mail.gmail.com>
Date:	Fri, 22 Aug 2008 14:55:30 +0200
From:	"Esben Nielsen" <nielsen.esben@...glemail.com>
To:	"Gregory Haskins" <ghaskins@...ell.com>
Cc:	mingo@...e.hu, paulmck@...ux.vnet.ibm.com, peterz@...radead.org,
	tglx@...utronix.de, rostedt@...dmis.org,
	linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
	gregory.haskins@...il.com, David.Holmes@....com, jkacur@...il.com
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

Disclaimer: I am no longer actively involved, and I must admit I may
have missed much of what has been going on since I contributed to the
PI system 2 years ago. But I allow myself to comment anyway.

On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <ghaskins@...ell.com> wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritance.  However, all of the priority-inheritance logic is
> integrated into the Real-Time Mutex infrastructure.  This causes a few
> problems:
>
>  1) This tightly coupled relationship makes it difficult to extend to
>    other areas of the kernel (for instance, pi-aware wait-queues may
>    be desirable).
>  2) Enhancing the rtmutex infrastructure becomes challenging because
>    there is no separation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version.  The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.

This is really a good idea. When I had time (2 years ago) to actively
work on these problems, I also came to the conclusion that PI should be
more general than just the rtmutex. Preemptive RCU was the example
which drove it.

But I do disagree that general objects should get boosted: The end
targets are always tasks. The objects might be boosted as intermediate
steps, but in the end priority only applies to tasks.

I also have a few comments to the actual design:

> ....
> +
> +Multiple sinks per Node:
> +
> +We allow multiple sinks to be associated with a node.  This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on").  The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
> +
> +Leaf-sinks are boostable objects that do not perpetuate a chain per se.  Rather, they act as endpoints to a priority boosting.  Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information.  However, there may be any number of leaf-sinks along a chain as well.  Each one will act on its localized priority in its own implementation specific way.  For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary.  Whereas an rwlock leaf-sink may boost a list of reader-owners.

This is bad from an RT point of view: You have a hard time determining
the number of sinks per node. An rw-lock could have an arbitrary
number of readers (it is really supposed to). Therefore you have no
chance of knowing how long the boost/deboost operation will take. And
you also don't know for how long the boosted tasks stay boosted. If
there can be an arbitrary number of such tasks you can no longer be
deterministic.
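
To make the cost argument concrete, here is a minimal userspace sketch
(not code from the patch). It assumes a hypothetical reader_sketch type
and boost_readers() helper, and uses the kernel convention that a lower
number means a higher priority. The point is only that the work done in
one boost grows with the number of reader-owners:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task that owns a read lock. */
struct reader_sketch {
	int prio;	/* lower value = higher priority */
};

/*
 * Boost every reader-owner to at least 'prio'.
 * Returns the number of readers visited, which is the cost of the call.
 */
static size_t boost_readers(struct reader_sketch *readers, size_t nreaders,
			    int prio)
{
	size_t visited = 0;

	for (size_t i = 0; i < nreaders; i++) {
		if (prio < readers[i].prio)
			readers[i].prio = prio;	/* raise this reader */
		visited++;
	}

	assert(visited == nreaders);	/* cost is linear in the reader count */
	return visited;
}
```

With no upper bound on nreaders there is no upper bound on the time
spent in the loop, which is the determinism problem described above.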

> ...
> +
> +#define MAX_PI_DEPENDENCIES 5


WHAT??? There is a finite lock depth defined. I know we did that
originally, but it wasn't hardcoded (as far as I remember) and
it was certainly not as low as 5.

Remember: PI is used by the user space futexes as well!

> ....
> +/*
> + * _pi_node_update - update the chain
> + *
> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> + * that need to propagate up the chain.  This is a step-wise process where we
> + * have to be careful about locking and preemption.  By trying MAX_PI_DEPs
> + * times, we guarantee that this update routine is an effective barrier...
> + * all modifications made prior to the call to this barrier will have completed.
> + *
> + * Deadlock avoidance: This node may participate in a chain of nodes which
> + * form a graph of arbitrary structure.  While the graph should technically
> + * never close on itself barring any bugs, we still want to protect against
> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> + * from detecting this potential).  To do this, we employ a dual-locking
> + * scheme where we can carefully control the order.  That is: node->lock
> + * protects most of the node's internal state, but it will never be held
> + * across a chain update.  sinkref->lock, on the other hand, can be held
> + * across a boost/deboost, and also guarantees proper execution order. Also
> + * note that no locks are held across an sink->update.
> + */
> +static int
> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> +{
> +       struct pi_node    *node = node_of(sink);
> +       struct pi_sinkref *sinkref;
> +       unsigned long      iflags;
> +       int                count = 0;
> +       int                i;
> +       int                pprio;
> +       struct updater     updaters[MAX_PI_DEPENDENCIES];
> +
> +       spin_lock_irqsave(&node->lock, iflags);
> +
> +       pprio = node->prio;
> +
> +       if (!plist_head_empty(&node->srcs))
> +               node->prio = plist_first(&node->srcs)->prio;
> +       else
> +               node->prio = MAX_PRIO;
> +
> +       list_for_each_entry(sinkref, &node->sinks, list) {
> +               /*
> +                * If the priority is changing, or if this is a
> +                * BOOST/DEBOOST, we consider this sink "stale"
> +                */
> +               if (pprio != node->prio
> +                   || sinkref->state != pi_state_boosted) {
> +                       struct updater *iter = &updaters[count++];

What prevents count from overrunning updaters[]?
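
For illustration, here is a minimal userspace sketch of one way to
bound such a collection loop. It is not the patch's code:
sinkref_sketch and collect_stale() are hypothetical names, and only
MAX_PI_DEPENDENCIES and its value come from the quoted patch. What the
caller should do with entries that do not fit (retry, fail, grow the
array) is a design decision this sketch leaves open:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PI_DEPENDENCIES 5	/* value taken from the quoted patch */

/* Hypothetical stand-in for struct pi_sinkref; only the field we need. */
struct sinkref_sketch {
	int stale;
};

/*
 * Store pointers to stale entries in out[], writing at most 'cap' of them.
 * Returns the number stored; *dropped receives the number of stale
 * entries that did not fit.
 */
static size_t collect_stale(struct sinkref_sketch *refs, size_t nrefs,
			    struct sinkref_sketch **out, size_t cap,
			    size_t *dropped)
{
	size_t count = 0;

	*dropped = 0;
	for (size_t i = 0; i < nrefs; i++) {
		if (!refs[i].stale)
			continue;
		if (count == cap) {
			(*dropped)++;	/* array is full: record, do not write */
			continue;
		}
		out[count++] = &refs[i];
	}

	assert(count <= cap);	/* the invariant the quoted loop lacks */
	return count;
}
```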

> +
> +                       BUG_ON(!atomic_read(&sinkref->sink->refs));
> +                       _pi_sink_get(sinkref);
> +
> +                       iter->update  = 1;
> +                       iter->sinkref = sinkref;
> +                       iter->sink     = sinkref->sink;
> +               }
> +       }
> +
> +       spin_unlock(&node->lock);
> +
> +       for (i = 0; i < count; ++i) {
> +               struct updater    *iter = &updaters[i];
> +               unsigned int       lflags = PI_FLAG_DEFER_UPDATE;
> +               struct pi_sink    *sink;
> +
> +               sinkref = iter->sinkref;
> +               sink = iter->sink;
> +
> +               spin_lock(&sinkref->lock);
> +
> +               switch (sinkref->state) {
> +               case pi_state_boost:
> +                       sinkref->state = pi_state_boosted;
> +                       /* Fall through */
> +               case pi_state_boosted:
> +                       sink->ops->boost(sink, &sinkref->src, lflags);
> +                       break;
> +               case pi_state_deboost:
> +                       sink->ops->deboost(sink, &sinkref->src, lflags);
> +                       sinkref->state = pi_state_free;
> +
> +                       /*
> +                        * drop the ref that we took when the sinkref
> +                        * was allocated.  We still hold a ref from
> +                        * above.
> +                        */
> +                       _pi_sink_put_all(node, sinkref);
> +                       break;
> +               case pi_state_free:
> +                       iter->update = 0;
> +                       break;
> +               default:
> +                       panic("illegal sinkref type: %d", sinkref->state);
> +               }
> +
> +               spin_unlock(&sinkref->lock);
> +
> +               /*
> +                * We will drop the sinkref reference while still holding the
> +                * preempt/irqs off so that the memory is returned synchronously
> +                * to the system.
> +                */
> +               _pi_sink_put_local(node, sinkref);
> +       }
> +
> +       local_irq_restore(iflags);

Yack! You keep interrupts off while doing the chain. I think my main
contribution to the PI system 2 years ago was to do this preemptively.
I.e. there were points in the loop where interrupts and preemption
were turned on.

Remember: It goes into user space again. An evil user could craft an
application with a very long lock depth and keep higher priority real
time tasks from running for an arbitrarily long time (unless a limit
on the lock depth is set, which is itself bad because it will be too
low in some cases).
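
As a rough illustration of the preemptible approach, here is a
userspace sketch (not the rtmutex code and not the patch). The
chain_node and walk_stats types and boost_chain() are hypothetical, the
"locked" flag merely stands in for a per-node spinlock, and a lower
number means a higher priority. Each step takes and releases one lock,
so between steps nothing is held and interrupts and preemption could be
enabled:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chain element, e.g. a task blocked on the next one. */
struct chain_node {
	struct chain_node *next;
	int prio;	/* lower value = higher priority */
	int locked;	/* stand-in for a per-node spinlock */
};

struct walk_stats {
	int steps;		/* nodes whose priority was raised */
	int max_locks_held;	/* most locks held at any one time */
};

/* Propagate 'prio' down the chain, one node at a time. */
static void boost_chain(struct chain_node *start, int prio,
			struct walk_stats *st)
{
	int held = 0;

	for (struct chain_node *n = start; n != NULL; n = n->next) {
		int changed;

		n->locked = 1;		/* lock only this node */
		held++;
		if (held > st->max_locks_held)
			st->max_locks_held = held;

		changed = prio < n->prio;
		if (changed)
			n->prio = prio;

		n->locked = 0;		/* unlock before moving on */
		held--;

		if (!changed)
			break;		/* already high enough: stop here */
		st->steps++;

		/* Preemption point: no lock is held at this position. */
		assert(held == 0);
	}
}
```

A real implementation has to revalidate the chain after each such
point, since it may have changed while nothing was locked; the sketch
omits that.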

But as I said, I have had no time to watch what has actually been going
on in the kernel for roughly the last 2 years. The said defects might
already have crept in through other contributors :-(

Esben