linux-kernel - Re: Interrupt for port 19, but apparently not enabled; per-user 000000004af23acc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4d8a7ba7-a9f6-2999-8750-bfe2b85f064e@suse.com>
Date:   Tue, 22 Jun 2021 13:04:14 +0200
From:   Juergen Gross <jgross@...e.com>
To:     Julien Grall <julien@....org>
Cc:     "xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>,
        linux-kernel@...r.kernel.org, mheyne@...zon.de
Subject: Re: Interrupt for port 19, but apparently not enabled; per-user
 000000004af23acc

On 22.06.21 12:24, Julien Grall wrote:
> Hi Juergen,
> 
> As discussed on IRC yesterday, we noticed a couple of splat in 5.13-rc6 

> (and stable 5.4) in the evtchn driver:
> 
> [    7.581000] ------------[ cut here ]------------
> [    7.581899] Interrupt for port 19, but apparently not 
enabled; 
> per-user 000000004af23acc
> [    7.583401] WARNING: CPU: 0 PID: 467 at 
> /home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:169 
> evtchn_interrupt+0xd5/0x100
> [    7.585583] Modules linked in:
> [    7.586188] CPU: 0 PID: 467 Comm: xenstore-read Not tainted 
> 5.13.0-rc6 #240
> [    7.587462] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [    7.589462] RIP: e030:evtchn_interrupt+0xd5/0x100
> [    7.590361] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 
be 1d 00 00 00 
> e8 5f 72 c4 ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 03 5f 82 e8 6b 2d 96 

> ff <0f> 0b e9 4d ff ff ff 41 0f b6 f4 48 c7 c7 80 da a2 82 e8 f0
> [    7.593662] RSP: e02b:ffffc90040003e60 EFLAGS: 00010082
> [    7.594636] RAX: 0000000000000000 RBX: ffff888102328c00 RCX: 
> 0000000000000027
> [    7.595924] RDX: 0000000000000000 RSI: ffff88817fe18ad0 RDI: 
> ffff88817fe18ad8
> [    7.597216] RBP: ffff888108ef8140 R08: 0000000000000000 R09: 
> 0000000000000001
> [    7.598522] R10: 0000000000000000 R11: 7075727265746e49 R12: 
> 0000000000000000
> [    7.599810] R13: ffffc90040003ec4 R14: ffff8881001b8000 R15: 
> ffff888109b36f80
> [    7.601113] FS:  0000000000000000(0000) GS:ffff88817fe00000(0000) 
> knlGS:0000000000000000
> [    7.602570] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    7.603700] CR2: 00007f15b390e368 CR3: 000000010bb04000 CR4: 
> 0000000000050660
> [    7.604993] Call Trace:
> [    7.605501]  <IRQ>
> [    7.605929]  __handle_irq_event_percpu+0x4c/0x330
> [    7.606817]  handle_irq_event_percpu+0x32/0xa0
> [    7.607670]  handle_irq_event+0x3a/0x60
> [    7.608416]  handle_edge_irq+0x9b/0x1f0
> [    7.609154]  generic_handle_irq+0x4f/0x60
> [    7.609918]  __evtchn_fifo_handle_events+0x195/0x3a0
> [    7.610864]  __xen_evtchn_do_upcall+0x66/0xb0
> [    7.611693]  __xen_pv_evtchn_do_upcall+0x1d/0x30
> [    7.612582]  xen_pv_evtchn_do_upcall+0x9d/0xc0
> [    7.613439]  </IRQ>
> [    7.613882]  exc_xen_hypervisor_callback+0x8/0x10
> 
> This is quite similar to the problem I reported a few months ago (see 
> [1]) but this time this is happening with fifo rather than 2L.
> 
> I haven't been able to reproduced it reliably so far. But looking at the 
> code, I think I have found another potential race after commit
> 
> commit b6622798bc50b625a1e62f82c7190df40c1f5b21
> Author: Juergen Gross <jgross@...e.com>
> Date:   Sat Mar 6 17:18:33 2021 +0100
>     xen/events: avoid handling the same event on two cpus at the same time
>     When changing the cpu affinity of an event it can happen today that
>     (with some unlucky timing) the same event will be handled 
on the old
>     and the new cpu at the same time.
>     Avoid that by adding an "event active" flag to the per-event data and
>     call the handler only if this flag isn't set.
>     Cc: stable@...r.kernel.org
>     Reported-by: Julien Grall <julien@....org>
>     Signed-off-by: Juergen Gross <jgross@...e.com>
>     Reviewed-by: Julien Grall <jgrall@...zon.com>
>     Link: https://lore.kernel.org/r/20210306161833.4552-4-jgross@suse.com
>     Signed-off-by: Boris Ostrovsky <boris.ostrovsky@...cle.com>
> 
> The evtchn driver will use the lateeoi handlers. So the code to ack 
> looks like:
> 
> do_mask(..., EVT_MASK_REASON_EOI_PENDING)
> smp_store_release(&info->is_active, 0);
> clear_evtchn(info->evtchn);
> 
> The code to handle an interrupts look like:
> 
> clear_link(...)
> if ( evtchn_fifo_is_pending(port) && !evtchn_fifo_is_mask()) {
>    if (xchg_acquire(&info->is_active, 1)
>      return;
>    generic_handle_irq();
> }
> 
> After changing the affinity, an interrupt may be received once on the 
> previous vCPU. So, I think the following can happen:
> 
> vCPU0                             | vCPU1
>                    |
>   Receive event              |
>                    | change affinity to vCPU1
>   clear_link()              |
>                        |
>                 /* The interrupt is re-raised */
>                    | receive event
>                      |
>                    | /* The interrupt is not masked */
>   info->is_active = 1          |
>   do_mask(...)              |
>   info->is_active = 0          |
>                    | info->is_active = 1
>   clear_evtchn(...)               |
>                                   | do_mask(...)
>                                   | info->is_active = 0
>                    | clear_evtchn(...)
> 
> Does this look plausible to you?

Yes, it does.

Thanks for the analysis.

So I guess for lateeoi events we need to clear is_active only in
xen_irq_lateeoi()? At a first glance this should fix the issue.

What do you think?


Juergen

Download attachment "OpenPGP_0xB0DE9DD628BF132F.asc" of type "application/pgp-keys" (3092 bytes)

Download attachment "OpenPGP_signature" of type "application/pgp-signature" (496 bytes)