Date:   Fri, 18 Dec 2020 07:49:33 +0100
From:   Jürgen Groß <>
To:     Andrew Cooper <>,
Cc:     Boris Ostrovsky <>,
        Stefano Stabellini <>
Subject: Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

On 17.12.20 19:25, Andrew Cooper wrote:
> On 16/12/2020 08:21, Jürgen Groß wrote:
>> On 15.12.20 21:59, Andrew Cooper wrote:
>>> On 15/12/2020 11:10, Juergen Gross wrote:
>>>> In case a process waits for any Xenstore action in the xenbus driver
>>>> it should be interruptible by signals.
>>>> Signed-off-by: Juergen Gross <>
>>>> ---
>>>> V2:
>>>> - don't special case SIGKILL as libxenstore is handling -EINTR fine
>>>> ---
>>>>    drivers/xen/xenbus/xenbus_xs.c | 9 ++++++++-
>>>>    1 file changed, 8 insertions(+), 1 deletion(-)
>>>> diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
>>>> index 3a06eb699f33..17c8f8a155fd 100644
>>>> --- a/drivers/xen/xenbus/xenbus_xs.c
>>>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>>>> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>>>>      static void *read_reply(struct xb_req_data *req)
>>>>    {
>>>> +    int ret;
>>>> +
>>>>        do {
>>>> -        wait_event(req->wq, test_reply(req));
>>>> +        ret = wait_event_interruptible(req->wq, test_reply(req));
>>>> +
>>>> +        if (ret == -ERESTARTSYS && signal_pending(current)) {
>>>> +            req->msg.type = XS_ERROR;
>>>> +            return ERR_PTR(-EINTR);
>>>> +        }
>>> So now I can talk fully about the situations which lead to this, I think
>>> there is a bit more complexity.
>>> It turns out there are a number of issues related to running a Xen
>>> system with no xenstored.
>>> 1) If a xenstore-write occurs during startup before init-xenstore-domain
>>> runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
>>> reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
>>> tell the dom0 kernel that xenstored is in dom1.  This effectively
>>> deadlocks the system.
>> This should be easy to solve: any request to /dev/xen/xenbus should
>> block upfront in case xenstored isn't up yet (could e.g. wait
>> interruptible until xenstored_ready is non-zero).
> I'm not sure that that would fix the problem.  The problem is that
> setting the ring details via /dev/xen/xenbus_backend blocks, which
> prevents us launching the xenstored stubdomain, which prevents the
> earlier xenbus write being completed.

But _why_ is it blocking? Digging through the code, I think it blocks
in xs_suspend() due to the normal xenstore request being pending. If
that request doesn't reach the state that causes blocking in
xs_suspend(), all is fine.

> So long as /dev/xen/xenbus_backend doesn't block, there's no problem
> with other /dev/xen/xenbus activity being pending briefly.
> Looking at the current logic, I'm not completely convinced.  Even
> finding a filled-in evtchn/gfn doesn't mean that xenstored is actually
> ready.

No, but the deadlock is not going to happen anymore (famous last words).

> There are 3 possible cases.
> 1) PV guest, and details in start_info
> 2) HVM guest, and details in HVM_PARAMs
> 3) No details (expected for dom0).  Something in userspace must provide
> details at a later point.
> So the setup phases go from nothing, to having ring details, to finding
> the ring working.
> I think it would be prudent to try reading a key between having details
> and declaring the xenstored_ready.  Any activity, even XS_ERROR,
> indicates that the other end of the ring is listening.

Yes. But I really think xs_suspend() is the problematic case. And
it will be called _before_ xenstored_ready is set.

>>> 2) If xenstore-watch is running when xenstored dies, it spins at 100%
>>> cpu usage making no system calls at all.  This is caused by bad error
>>> handling from xs_watch(), and attempting to debug found:
>> Can you expand on "bad error handling from xs_watch()", please?
> do_watch() has
>      for ( ... ) { // defaults to an infinite loop
>          vec = xs_read_watch();
>          if (vec == NULL)
>              continue;
>          ...
>      }
> My next plan was to experiment with break instead of continue, which
> I'll get to at some point.

I'd rather put a sleep() in. Otherwise you might break some use cases.

>>> 3) (this issue).  If anyone starts xenstore-watch with no xenstored
>>> running at all, it blocks in D in the kernel.
>> Should be handled with solution for 1).
>>> The cause is the special handling for watch/unwatch commands which,
>>> instead of just queuing up the data for xenstore, explicitly waits for
>>> an OK for registering the watch.  This causes a write() system call to
>>> block waiting for a non-existent entity to reply.
>>> So while this patch does resolve the major usability issue I found (I
>>> can't even SIGINT and get my terminal back), I think there are issues.
>>> The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
>>> require special handling.  The main kernel thread for processing
>>> incoming data from xenstored does need to know how to associate each
>>> async XS_WATCH_EVENT to the caller who watched the path.
>>> Therefore, depending on when this cancellation hits, we might be in any
>>> of the following states:
>>> 1) the watch is queued in the kernel, but not even sent to xenstored yet
>>> 2) the watch is queued in the xenstored ring, but not acted upon
>>> 3) the watch is queued in the xenstored ring, and the xenstored has seen
>>> it but not replied yet
>>> 4) the watch has been processed, but the XS_WATCH reply hasn't been
>>> received yet
>>> 5) the watch has been processed, and the XS_WATCH reply received
>>> State 5 (and, to a lesser extent, 4) is the normal success path when xenstored has
>>> acted upon the request, and the internal kernel infrastructure is set up
>>> appropriately to handle XS_WATCH_EVENTs.
>>> States 1 and 2 can be very common if there is no xenstored (or at least,
>>> it hasn't started up yet).  In reality, there is either no xenstored, or
>>> it is up and running (and for a period of time during system startup,
>>> these cases occur in sequence).
>> Yes, this is the reason we can't just reject a user request if xenstored
>> hasn't been detected yet: it could be just starting.
> Right, and I'm not suggesting that we'd want to reject accesses while
> xenstored is starting up.
>>> As soon as the XS_WATCH event has been written into the xenstored ring,
>>> it is not safe to cancel.  You've committed to xenstored processing the
>>> request (if it is up).
>> I'm not sure this is true. Cancelling it might result in a stale watch
>> in xenstored, but there shouldn't be a problem related to that. In case
>> that watch fires the event will normally be discarded by the kernel as
>> no matching watch is found in the kernel's data. In case a new watch
>> has been setup with the same struct xenbus_watch address (which is used
>> as the token), then this new watch might fire without the node of the
>> new watch having changed, but spurious watch events are defined to be
>> okay (OTOH the path in the event might look strange to the handler).
> Watches are a quota'd resource in (at least some) xenstored
> configurations.  Losing track of the registration is a resource leak,
> even if the kernel can filter and discard the unexpected watch events.

Hmm, true.

The correct way to handle it, then, would be to mark the request so
that a late answer is not simply dropped, but instead triggers an unwatch.

A similar handling would be required for a transaction start.

>>> If xenstored is actually up and running, its fine and necessary to
>>> block.  The request will be processed in due course (timing subject to
>>> the client and server load).  If xenstored isn't up, blocking isn't ok.
>>> Therefore, I think we need to distinguish "not yet on the ring" from "on
>>> the ring", as our distinction as to whether cancelling is safe, and
>>> ensure we don't queue anything on the ring before we're sure xenstored
>>> has started up.
>>> Does this make sense?
>> Basically, yes.
> Great.  If I get any time, I'll try to look into some fixes along the
> above lines.

I won't work on those for the coming 3 weeks, so go ahead. :-)

