Message-ID: <eeb62129-d9fc-2155-0e0f-aff1fbb33fbc@suse.com>
Date: Sun, 7 Feb 2021 13:58:20 +0100
From: Jürgen Groß <jgross@...e.com>
To: Julien Grall <julien@....org>, xen-devel@...ts.xenproject.org,
linux-kernel@...r.kernel.org, linux-block@...r.kernel.org,
netdev@...r.kernel.org, linux-scsi@...r.kernel.org
Cc: Boris Ostrovsky <boris.ostrovsky@...cle.com>,
Stefano Stabellini <sstabellini@...nel.org>,
stable@...r.kernel.org,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
Roger Pau Monné <roger.pau@...rix.com>,
Jens Axboe <axboe@...nel.dk>, Wei Liu <wei.liu@...nel.org>,
Paul Durrant <paul@....org>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids
On 06.02.21 19:46, Julien Grall wrote:
> Hi Juergen,
>
> On 06/02/2021 10:49, Juergen Gross wrote:
>> The first three patches are fixes for XSA-332. They avoid WARN splats
>> and a performance issue with interdomain events.
>
> Thanks for helping to figure out the problem. Unfortunately, I still
> reliably see the WARN splat with the latest Linux master (1e0d27fce010) +
> your first 3 patches.
>
> I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L
> events ABI.
>
> After some debugging, I think I have an idea of what went wrong. The
> problem happens when the event is initially bound from vCPU0 to a
> different vCPU.
>
> From the comment in xen_rebind_evtchn_to_cpu(), we are masking the
> event to prevent it being delivered on an unexpected vCPU. However, I
> believe the following can happen:
>
> vCPU0                 | vCPU1
>                       |
>                       | Call xen_rebind_evtchn_to_cpu()
> receive event X       |
>                       | mask event X
>                       | bind to vCPU1
> <vCPU descheduled>    | unmask event X
>                       |
>                       | receive event X
>                       |
>                       | handle_edge_irq(X)
> handle_edge_irq(X)    |  -> handle_irq_event()
>                       |   -> set IRQD_IN_PROGRESS
>  -> set IRQS_PENDING  |
>                       |   -> evtchn_interrupt()
>                       |   -> clear IRQD_IN_PROGRESS
>                       |  -> IRQS_PENDING is set
>                       |  -> handle_irq_event()
>                       |   -> evtchn_interrupt()
>                       |     -> WARN()
>                       |
>
> All the lateeoi handlers expect ONESHOT semantics, and
> evtchn_interrupt() doesn't tolerate any deviation.
>
> I think the problem was introduced by 7f874a0447a9 ("xen/events: fix
> lateeoi irq acknowledgment") because the interrupt was disabled
> previously. Therefore we wouldn't do another iteration in
> handle_edge_irq().

I think you picked the wrong commit for blaming, as this is just
the last patch of the three patches you were testing.

> Aside from the handlers, I think it may impact the deferred EOI
> mitigation, because in theory a 3rd vCPU could join the party (let's say
> vCPU A migrates the event from vCPU B to vCPU C). So info->{eoi_cpu,
> irq_epoch, eoi_time} could possibly get mangled?
>
> For a fix, we may want to consider holding evtchn_rwlock with write
> permission. Although, I am not 100% sure this is going to prevent
> everything.

It will make things worse, as it would violate the locking hierarchy
(xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).

At first glance I think we'll need a 3rd masking state ("temporarily
masked") in the second patch in order to avoid a race with lateeoi.

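
One possible way to picture such a state, as a rough userspace sketch and not the actual patch: track why an event is masked as a set of reason bits, and only drop the real mask once no reason is left. All names below (mask_reason, evt_mask, do_mask(), do_unmask()) are hypothetical and chosen for illustration only, and hw_masked merely stands in for the real mask bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reason bits; the names are illustrative assumptions. */
enum mask_reason {
    MASK_EXPLICIT    = 1u << 0,  /* masked on explicit request */
    MASK_TEMPORARY   = 1u << 1,  /* masked briefly, e.g. while rebinding */
    MASK_EOI_PENDING = 1u << 2,  /* masked until the late EOI arrives */
};

struct evt_mask {
    unsigned int reasons;  /* OR of the mask_reason bits currently active */
    bool hw_masked;        /* stand-in for the real event channel mask bit */
};

/* Add a reason; set the real mask only on the first reason. */
static void do_mask(struct evt_mask *m, enum mask_reason r)
{
    assert(r != 0);
    if (!m->reasons)
        m->hw_masked = true;
    m->reasons |= (unsigned int)r;
}

/* Drop a reason; clear the real mask only when no reason remains. */
static void do_unmask(struct evt_mask *m, enum mask_reason r)
{
    m->reasons &= ~(unsigned int)r;
    if (!m->reasons)
        m->hw_masked = false;
}
```

With this shape, ending the temporary mask during a rebind cannot unmask an event that is still waiting for its late EOI, because the EOI reason bit is still set. A real implementation would also need locking or atomic updates, which are left out here.
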
In order to avoid the race you outlined above we need an "event is being
handled" indicator checked via test_and_set() semantics in
handle_irq_for_port() and reset only when calling clear_evtchn().

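
A minimal sketch of that indicator idea, written as plain C11 instead of kernel code: deliver_event() stands in for handle_irq_for_port(), finish_event() stands in for the path that calls clear_evtchn(), and struct evt_state is a hypothetical container, not the kernel's per-IRQ info structure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-event state, for illustration only. */
struct evt_state {
    atomic_flag being_handled;  /* the "event is being handled" indicator */
    int handled;                /* how often the handler really ran */
};

/*
 * Stand-in for handle_irq_for_port(): run the handler only if no other
 * CPU is already handling this event. Returns true if the handler ran.
 */
static bool deliver_event(struct evt_state *e)
{
    assert(e != NULL);
    /* test_and_set returns the old value: true means already in flight. */
    if (atomic_flag_test_and_set(&e->being_handled))
        return false;
    e->handled++;  /* only the CPU that set the flag gets here */
    return true;
}

/* Stand-in for the clear_evtchn() path: the only place the flag is reset. */
static void finish_event(struct evt_state *e)
{
    assert(e != NULL);
    atomic_flag_clear(&e->being_handled);
}
```

In the scenario from your diagram, vCPU1 would win the test_and_set, and the stale delivery on vCPU0 would be dropped instead of queueing a second handler run. A new delivery is accepted only after finish_event() has run.
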
> Does my write-up make sense to you?

Yes. What about my reply? ;-)


Juergen