[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2414c191-ff55-e446-b555-c9d0ccca6b93@citrix.com>
Date: Thu, 17 Dec 2020 18:25:52 +0000
From: Andrew Cooper <andrew.cooper3@...rix.com>
To: Jürgen Groß <jgross@...e.com>,
<xen-devel@...ts.xenproject.org>, <linux-kernel@...r.kernel.org>
CC: Boris Ostrovsky <boris.ostrovsky@...cle.com>,
Stefano Stabellini <sstabellini@...nel.org>
Subject: Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible
On 16/12/2020 08:21, Jürgen Groß wrote:
> On 15.12.20 21:59, Andrew Cooper wrote:
>> On 15/12/2020 11:10, Juergen Gross wrote:
>>> In case a process waits for any Xenstore action in the xenbus driver
>>> it should be interruptible by signals.
>>>
>>> Signed-off-by: Juergen Gross <jgross@...e.com>
>>> ---
>>> V2:
>>> - don't special case SIGKILL as libxenstore is handling -EINTR fine
>>> ---
>>> drivers/xen/xenbus/xenbus_xs.c | 9 ++++++++-
>>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/xenbus/xenbus_xs.c
>>> b/drivers/xen/xenbus/xenbus_xs.c
>>> index 3a06eb699f33..17c8f8a155fd 100644
>>> --- a/drivers/xen/xenbus/xenbus_xs.c
>>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>>> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>>> static void *read_reply(struct xb_req_data *req)
>>> {
>>> + int ret;
>>> +
>>> do {
>>> - wait_event(req->wq, test_reply(req));
>>> + ret = wait_event_interruptible(req->wq, test_reply(req));
>>> +
>>> + if (ret == -ERESTARTSYS && signal_pending(current)) {
>>> + req->msg.type = XS_ERROR;
>>> + return ERR_PTR(-EINTR);
>>> + }
>>
>> So now I can talk fully about the situations which lead to this, I think
>> there is a bit more complexity.
>>
>> It turns out there are a number of issues related to running a Xen
>> system with no xenstored.
>>
>> 1) If a xenstore-write occurs during startup before init-xenstore-domain
>> runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
>> reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
>> tell the dom0 kernel that xenstored is in dom1. This effectively
>> deadlocks the system.
>
> This should be easy to solve: any request to /dev/xen/xenbus should
> block upfront in case xenstored isn't up yet (could e.g. wait
> interruptible until xenstored_ready is non-zero).
I'm not sure that that would fix the problem. The problem is that
setting the ring details via /dev/xen/xenbus_backend blocks, which
prevents us launching the xenstored stubdomain, which prevents the
earlier xenbus write being completed.
So long as /dev/xen/xenbus_backend doesn't block, there's no problem
with other /dev/xen/xenbus activity being pending briefly.
Looking at the current logic, I'm not completely convinced. Even
finding a filled-in evtchn/gfn doesn't mean that xenstored is actually
ready.
There are 3 possible cases.
1) PV guest, and details in start_info
2) HVM guest, and details in HVM_PARAMs
3) No details (expected for dom0). Something in userspace must provide
details at a later point.
So the setup phases go from nothing, to having ring details, to finding
the ring working.
I think it would be prudent to try reading a key between having details
and declaring the xenstored_ready. Any activity, even XS_ERROR,
indicates that the other end of the ring is listening.
>
>> 2) If xenstore-watch is running when xenstored dies, it spins at 100%
>> cpu usage making no system calls at all. This is caused by bad error
>> handling from xs_watch(), and attempting to debug found:
>
> Can you expand on "bad error handling from xs_watch()", please?
do_watch() has
for ( ... ) { // defaults to an infinite loop
vec = xs_read_watch();
if (vec == NULL)
continue;
...
}
My next plan was to experiment with break instead of continue, which
I'll get to at some point.
>
>>
>> 3) (this issue). If anyone starts xenstore-watch with no xenstored
>> running at all, it blocks in D in the kernel.
>
> Should be handled with solution for 1).
>
>>
>> The cause is the special handling for watch/unwatch commands which,
>> instead of just queuing up the data for xenstore, explicitly waits for
>> an OK for registering the watch. This causes a write() system call to
>> block waiting for a non-existent entity to reply.
>>
>> So while this patch does resolve the major usability issue I found (I
>> can't even SIGINT and get my terminal back), I think there are issues.
>>
>> The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
>> require special handling. The main kernel thread for processing
>> incoming data from xenstored does need to know how to associate each
>> async XS_WATCH_EVENT to the caller who watched the path.
>>
>> Therefore, depending on when this cancellation hits, we might be in any
>> of the following states:
>>
>> 1) the watch is queued in the kernel, but not even sent to xenstored yet
>> 2) the watch is queued in the xenstored ring, but not acted upon
>> 3) the watch is queued in the xenstored ring, and the xenstored has seen
>> it but not replied yet
>> 4) the watch has been processed, but the XS_WATCH reply hasn't been
>> received yet
>> 5) the watch has been processed, and the XS_WATCH reply received
>>
>> State 5 (and a little bit) is the normal success path when xenstored has
>> acted upon the request, and the internal kernel infrastructure is set up
>> appropriately to handle XS_WATCH_EVENTs.
>>
>> States 1 and 2 can be very common if there is no xenstored (or at least,
>> it hasn't started up yet). In reality, there is either no xenstored, or
>> it is up and running (and for a period of time during system startup,
>> these cases occur in sequence).
>
> Yes. this is the reason we can't just reject a user request if xenstored
> hasn't been detected yet: it could be just starting.
Right, and I'm not suggesting that we'd want to reject accesses while
xenstored is starting up.
>
>>
>> As soon as the XS_WATCH event has been written into the xenstored ring,
>> it is not safe to cancel. You've committed to xenstored processing the
>> request (if it is up).
>
> I'm not sure this is true. Cancelling it might result in a stale watch
> in xenstored, but there shouldn't be a problem related to that. In case
> that watch fires the event will normally be discarded by the kernel as
> no matching watch is found in the kernel's data. In case a new watch
> has been setup with the same struct xenbus_watch address (which is used
> as the token), then this new watch might fire without the node of the
> new watch having changed, but spurious watch events are defined to be
> okay (OTOH the path in the event might look strange to the handler).
Watches are a quota'd resource in (at least some) xenstored
configurations. Losing track of the registration is a resource leak,
even if the kernel can filter and discard the unexpected watch events.
>> If xenstored is actually up and running, its fine and necessary to
>> block. The request will be processed in due course (timing subject to
>> the client and server load). If xenstored isn't up, blocking isn't ok.
>>
>> Therefore, I think we need to distinguish "not yet on the ring" from "on
>> the ring", as our distinction as to whether cancelling is safe, and
>> ensure we don't queue anything on the ring before we're sure xenstored
>> has started up.
>>
>> Does this make sense?
>
> Basically, yes.
Great. If I get any time, I'll try to look into some fixes along the
above lines.
~Andrew
Powered by blists - more mailing lists