linux-kernel - Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAB0TPYEKq-=c5rOX955Nnt8zr0vRUfsP22XNHnvf8CoKje=kCw@mail.gmail.com>
Date:   Thu, 16 Nov 2017 14:03:13 +0100
From:   Martijn Coenen <maco@...roid.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Greg KH <gregkh@...uxfoundation.org>,
        John Stultz <john.stultz@...aro.org>,
        Todd Kjos <tkjos@...gle.com>,
        "Arve Hj??nnev??g" <arve@...roid.com>,
        Sherry Yang <sherryy@...roid.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Amit Pundir <amit.pundir@...aro.org>,
        LKML <linux-kernel@...r.kernel.org>, devel@...verdev.osuosl.org,
        Martijn Coenen <maco@...gle.com>
Subject: Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

On Thu, Nov 16, 2017 at 12:27 PM, Peter Zijlstra <peterz@...radead.org> wrote:
>> On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra <peterz@...radead.org> wrote:
>> >> + * 1) binder supports a "minimum node priority", meaning that all transactions
>> >> + *    into a node must run at this priority at a minimum. This means that the
>> >> + *    desired priority for handling a transaction is not necessarily equal to
>> >> + *    the priority of the caller.
>> >
>> > Isn't that the same as running everything in that node (whatever that
>> > is) at that given prio?
>>
>> Not for synchronous transactions; in that case it's max(min_node_prio,
>> caller_prio).
>
> But that's exactly what you get!? How would it ever get below
> min_node_prio? PI only ever (temporarily) raises prio, it never lowers
> it.

Ah I guess we have different definitions of inheritance then. This
behavior is also related to binder's threadpool - all those threads
run at a default prio of 120 (nice 0), and call into the kernel to
wait for work. If somebody then makes a call with a lower prio (>120),
we do change the binder thread that handles that call to use the same
low prio as well. Why run at nice 0 if the caller runs at nice 10?

>
>> One reason min_node_prio exists is that for certain
>> critical nodes, we don't want calls into it to run at a very low
>> priority, because the implementation of said calls may take locks that
>> are contended. I realize one possible solution to that is to use PI
>> mutexes in userspace, but Android doesn't support that today.
>
> Fix that?

That's probably easier said than done :) I don't know why we don't
support them, I'll ask around.

>
> I'm confused -- again. Why not run those threads at the min prio to
> begin with?

This also gets back to binder's threadpool; they are generic threads
in a process waiting for work at nice 0. The prio they get depends on
the work they get, and in case of synchronous transactions, who calls
into them. If we had dedicated threads for each node, we could indeed
run them at the min_prio of that node by default. But this is not how
binder works today.

> Also, this doesn't seem related to PI.

Yes, in case of asynchronous transactions it's not PI at all.

> I'm not understanding; are you saying that some app does an async
> request for sensor data, then later this sensor triggers and we need to
> route the data back to the app?

The way this typically works is that an interested app creates a
binder node to receive sensor data on (basically a callback). The app
then passes this node into the Sensor HAL and says "call me back when
you have relevant sensor data". When the Sensor HAL has data, it makes
calls on this node with the relevant data. Because the receiving app
process has a few generic binder threads that may receive any type of
data, not just sensor data, we want to raise the priority only when
handling sensor data, and restore it to the default when the thread is
done.

>
> But above you said that async requests do not boost the priority; so
> then I would expect this sensor's data to land on regular min-prio
> thread for dispatch.

As you found out later in this e-mail, this doesn't work because of
binder's current threadpool design.

> *blink* what? Binder has its own threads? Its not just an RPC/ORB?

Yeah :-)

> That smells like fail; if you need to raise your prio it means you are
> low(er) prio and could at that point be subject to inversion.

True - it does expose a race condition, but it's still better than not
raising the priority at all. I'm looking into selecting a busy thread
and boost that. Though in general we don't run into this very often by
sizing the threadpool large enough.

>
> Assuming no idle threads in the pool, and all 'busy' at lower priority.
> A medium prio something is running, keeping these pool thingies from
> asking for more work. A high prio request comes in.. nobody around to
> service. You loose, game over.

Agree, we're trying to fix that.


> Yeah, that sounds really dodgy. I've been a _long_ time since I read the
> RT CORBA stuff, so I'm not sure if there's a sensible solution to the
> problem, but it sounds really really dodgy.
>
> Regular RPCs are synchronous; if A asks B to do something, A blocks. B
> can then not turn around and ask A something. That is called a deadlock.

In case of binder, this actually works, we call it "recursive calls";
the same thread in A that called into B can receive an incoming call
from B and handle that. Of course you still need to be careful when
that happens - eg if you hold a lock when making the first outbound
call, you better make sure you can't take that same lock on any
inbound call that can result from it. "Don't hold locks over
synchronous binder calls" is a well-adapted Android design paradigm.


> Alternatively, if you consider A and B to be services/nodes and you get
> service A to call into B and then have B call back into A, you need to
> guarantee there are sufficient tasks/threads to complete the entire
> transaction _ahead_of_time_. That is, when task 1 of A does the request
> into B and blocks, it needs to have reserved another thread (2) of A to
> be idle and waiting for B's request.

See above - we're guaranteed to run on the same thread that made the
original call.

> So I really don't understand how all this works.. how can 'important'
> device threads not already be of the right priority. All this fixing up
> after the fact just reeks of fail.

This gets back to the binder threadpool design. That actually
pre-dates my involvement with binder, but there are probably good
reasons for it being the way it is. For example, some Android
processes host as much as 100 binder nodes, some having different prio
requirements; having one or more threads dedicated to each of those
individual nodes seems quite wasteful.

>
>> What parts of the system aren't designed for RT?
>
> None of it as far I can tell. The shared thread pools, the quick fix
> priority, the nested transaction stuff, those all reek of utter fail and
> confusion.

Ack. FWIW I don't think it's really fail and confusion - I believe
these were conscious design choices for binder and at the time I doubt
that RT was even considered. Like every system binder has its warts,
but it has worked very well for Android. The same holds for this
patchset - it has very measurable improvements on Android system
performance, and I don't think the code or behavior is incorrect. It's
major weakness is that it has to work with the existing threadpool
paradigm and therefore depends on some internal scheduler APIs.

> Have tasks at appropriate priorities a priory wherever and whenever
> possible. PI is a (least bad) fix for when two threads interact on a
> shared resource. The best fix is taking out the shared resource.
>
> You don't have a shared resource in most of the scenarios described
> here.
>
> Use dedicated IPC channels, and avoid mixing priority msgs on a channel.
> If you have to, make sure the channels are priority ordered.
>
> The primary rule for systems is simplicity, RT systems more so than
> others. What you're describing is far too complicated and riddled with
> holes.
>
> Also, have a look at RT CORBA, there's a really nice open-source one
> here:
>
>   http://www.cs.wustl.edu/~schmidt/TAO.html
>
> That group also has a whole bunch of research papers on the subject. As
> said, its been many years since I used any of that so I'm bound to not
> know all the new shiny stuff ;-)
>
> (As a bonus, its written in a semi sane language, as opposed to most of
> Android :-)

Thanks for all the feedback. I'll do some more reading and talking
internally. As far as this series goes: since all the things we talked
about which really only affect Android, is your main concern with
merging this the use of some internal scheduler APIs? In general I'd
really like to keep the upstream binder driver as close as possible to
what we use in Android, and until we have an alternative for PI (if we
can agree on something and build it), this is a pretty significant
delta.