lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZEsiYbi8dorXTI5t@yujie-X299>
Date:   Fri, 28 Apr 2023 09:33:21 +0800
From:   Yujie Liu <yujie.liu@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     Shanker Donthineni <sdonthineni@...dia.com>,
        <oe-lkp@...ts.linux.dev>, <lkp@...el.com>,
        <linux-kernel@...r.kernel.org>, Marc Zyngier <maz@...nel.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        "Michael Walle" <michael@...le.cc>,
        Vikram Sethi <vsethi@...dia.com>
Subject: Re: [PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors
 management

Hi Thomas,

On Wed, Apr 26, 2023 at 02:08:54PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 25 2023 at 11:16, kernel test robot wrote:
> > kernel test robot noticed "WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical" on:
> >
> > commit: 13eb5c4e7d2fb860d3dc5f63d910e3acf78dfd28 ("[PATCH v3 3/3] genirq: Use the maple tree for IRQ descriptors management")
> > url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230410-235853
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 6f3ee0e22b4c62f44b8fa3c8de6e369a4d112a75
> > patch link: https://lore.kernel.org/all/20230410155721.3720991-4-sdonthineni@nvidia.com/
> > patch subject: [PATCH v3 3/3] genirq: Use the maple tree for IRQ
> > descriptors management
> 
> This happens during CPU hot-unplug.
> 
> [  206.930774][  T228] block/008 => sdb2 (do IO while hotplugging CPUs)            
> [  206.935757][ T2086] run blktests block/008 at 2023-04-22 16:27:25
> [  207.199359][ T2086] smpboot: CPU 2 is now offline
> 
> [  207.468574][   T30] WARNING: CPU: 3 PID: 30 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical+0x40/0x44
> [  207.568426][   T30] CPU: 3 PID: 30 Comm: migration/3 Tainted: G S          E      6.2.0-rc4-00051-g13eb5c4e7d2f #1
> [  207.588372][   T30] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0xf5/0x138
> [  207.596649][   T30] EIP: default_send_IPI_mask_logical+0x40/0x44
> 
> This warns because fixup_irqs() sends an IPI to an offline CPU. In this
> case to CPU3 which just cleared its online bit and is about to vanish:
> 
> [  207.622147][   T30] EAX: 00000008 EBX: 00000002 ECX: fffffffc EDX: 00000022
> 
> EAX contains the target and ECX the inverted online mask. That's
> probably the ata2 interrupt as that later detects a timeout:
> 
> [  238.826212][  T174] ata2.00: exception Emask 0x0 SAct 0x3c00000 SErr 0x0 action 0x6 frozen
> [  238.834522][  T174] ata2.00: failed command: READ FPDMA QUEUED
> [  238.840378][  T174] ata2.00: cmd 60/08:b0:90:3e:90/00:00:25:00:00/40 tag 22 ncq dma 4096 in
> [  238.840378][  T174]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> 
> Which means that migrating the interrupt away from the outgoing CPU3
> failed for yet to understand reasons.
> 
> The patch in question is changing the interrupt descriptor storage and
> with that also the iterator function. But I can't spot anything wrong
> right now.
> 
> But what I can spot is this:
> 
> [    0.000000][    T0] Linux version 6.2.0-rc4-00051-g13eb5c4e7d2f
> 
> IOW, that test is based on some random upstream version, which lacks
> about 30 commits to maple_tree, where 12 of them have 'fix' in the
> commit subject.
> 
> Can you please retest this on v6.3 and report back when the problem
> persists?

Thanks for your help looking into this problem.

The problem persists when tested on v6.3, but not 100% reproducible.
We ran the test on v6.3 and v6.3+patch each for 20 runs. There are 9
failed runs on v6.3+patch, while v6.3 is all clean. Full dmesg is
attached.

=========================================================================================
compiler/disk/kconfig/rootfs/tbox_group/test/testcase:
  gcc-11/1SSD/i386-debian-10.3-func/debian-11.1-i386-20220923.cgz/lkp-skl-d06/block-group-00/blktests

commit:
  v6.3
  32c58fc685e5c ("genirq: Use the maple tree for IRQ descriptors management")

            v6.3 32c58fc685e5cd6b5947a5f8e9a
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :20          45%           9:20    dmesg.EIP:default_send_IPI_mask_logical
           :20          45%           9:20    dmesg.WARNING:at_arch/x86/kernel/apic/ipi.c:#default_send_IPI_mask_logical
           :20          45%           9:20    blktests.block/008.fail
           :20          45%           9:20    blktests.block/012.fail


[  214.484584][  T235] block/008 => sdb2 (do IO while hotplugging CPUs)
[  214.484586][  T235]
[  214.489534][ T2125] run blktests block/008 at 2023-04-27 10:06:38
[  214.749748][ T2125] smpboot: CPU 3 is now offline
[  215.009645][   T28] ------------[ cut here ]------------
[ 215.014948][ T28] WARNING: CPU: 2 PID: 28 at arch/x86/kernel/apic/ipi.c:299 default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[  215.025527][   T28] Modules linked in: loop(E) dm_multipath(E) dm_mod(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) sd_mod(E) t10_pi(E) coretemp(E) crc64_rocksoft_generic(E) kvm_intel(E) crc64_rocksoft(E) crc64(E) kvm(E) irqbypass(E) ipmi_devintf(E) ipmi_msghandler(E) crc32_pclmul(E) i915(E) drm_buddy(E) crc32c_intel(E) tpm_infineon(E) drm_display_helper(E) aesni_intel(E) crypto_simd(E) mei_wdt(E) cec(E) tpm_tis(E) cryptd(E) ahci(E) drm_kms_helper(E) psmouse(E) tpm_tis_core(E) xhci_pci(E) mei_me(E) i2c_i801(E) rapl(E) syscopyarea(E) wmi_bmof(E) xhci_hcd(E) usbcore(E) intel_cstate(E) libahci(E) evdev(E) serio_raw(E) i2c_smbus(E) sysfillrect(E) sysimgblt(E) intel_uncore(E) tpm(E) video(E) libata(E) mei(E) intel_pch_thermal(E) ttm(E) ie31200_edac(E) usb_common(E) rng_core(E) wmi(E) acpi_pad(E) button(E) fuse(E) drm(E) configfs(E) autofs4(E) [last unloaded: null_blk(E)]
[  215.112194][   T28] CPU: 2 PID: 28 Comm: migration/2 Tainted: G S          E      6.3.0-00003-g32c58fc685e5 #1
[  215.122171][   T28] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS N51 Ver. 01.63 10/05/2017
[ 215.131716][ T28] Stopper: multi_cpu_stop+0x0/0xf4 <- stop_machine_cpuslocked (kernel/stop_machine.c:429 kernel/stop_machine.c:469 kernel/stop_machine.c:619)
[ 215.139966][ T28] EIP: default_send_IPI_mask_logical (arch/x86/kernel/apic/ipi.c:299 (discriminator 1))
[ 215.145961][ T28] Code: 2b c2 f7 d1 85 c1 75 22 b9 00 08 00 00 e8 cc fb ff ff 80 e7 02 74 01 fb 8b 5d fc c9 c3 8d 74 26 00 90 c3 8d b4 26 00 00 00 00 <0f> 0b eb da 3e 8d 74 26 00 a1 64 b8 2b c2 f6 c4 02 75 0d 31 c0 c3
All code
========
   0:   2b c2                   sub    %edx,%eax
   2:   f7 d1                   not    %ecx
   4:   85 c1                   test   %eax,%ecx
   6:   75 22                   jne    0x2a
   8:   b9 00 08 00 00          mov    $0x800,%ecx
   d:   e8 cc fb ff ff          call   0xfffffffffffffbde
  12:   80 e7 02                and    $0x2,%bh
  15:   74 01                   je     0x18
  17:   fb                      sti
  18:   8b 5d fc                mov    -0x4(%rbp),%ebx
  1b:   c9                      leave
  1c:   c3                      ret
  1d:   8d 74 26 00             lea    0x0(%rsi,%riz,1),%esi
  21:   90                      nop
  22:   c3                      ret
  23:   8d b4 26 00 00 00 00    lea    0x0(%rsi,%riz,1),%esi
  2a:*  0f 0b                   ud2             <-- trapping instruction
  2c:   eb da                   jmp    0x8
  2e:   3e 8d 74 26 00          ds lea 0x0(%rsi,%riz,1),%esi
  33:   a1 64 b8 2b c2 f6 c4    movabs 0x7502c4f6c22bb864,%eax
  3a:   02 75
  3c:   0d                      .byte 0xd
  3d:   31 c0                   xor    %eax,%eax
  3f:   c3                      ret

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   eb da                   jmp    0xffffffffffffffde
   4:   3e 8d 74 26 00          ds lea 0x0(%rsi,%riz,1),%esi
   9:   a1 64 b8 2b c2 f6 c4    movabs 0x7502c4f6c22bb864,%eax
  10:   02 75
  12:   0d                      .byte 0xd
  13:   31 c0                   xor    %eax,%eax
  15:   c3                      ret
[  215.165363][   T28] EAX: 00000004 EBX: 00000002 ECX: fffffffc EDX: 00000023
[  215.172308][   T28] ESI: 00000023 EDI: c1045a80 EBP: c30fdeb8 ESP: c30fdeb4
[  215.179248][   T28] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010002
[  215.186717][   T28] CR0: 80050033 CR2: a95b2f4c CR3: 02400000 CR4: 003506f0
[  215.193662][   T28] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  215.200601][   T28] DR6: fffe0ff0 DR7: 00000400
[  215.205126][   T28] Call Trace:
[ 215.208266][ T28] default_send_IPI_single (arch/x86/kernel/apic/ipi.c:228)
[ 215.213396][ T28] apic_retrigger_irq (arch/x86/kernel/apic/vector.c:880)
[ 215.218095][ T28] irq_chip_retrigger_hierarchy (kernel/irq/chip.c:1473)
[ 215.223658][ T28] fixup_irqs (arch/x86/kernel/irq.c:371)
[ 215.227753][ T28] cpu_disable_common (arch/x86/kernel/smpboot.c:1663)
[ 215.232451][ T28] native_cpu_disable (arch/x86/kernel/smpboot.c:1694)
[ 215.237150][ T28] take_cpu_down (kernel/cpu.c:1035)
[ 215.241415][ T28] multi_cpu_stop (kernel/stop_machine.c:240)
[ 215.245766][ T28] cpu_stopper_thread (kernel/stop_machine.c:512)
[ 215.250549][ T28] ? stop_machine_yield+0x4/0x4
[ 215.255246][ T28] smpboot_thread_fn (kernel/smpboot.c:164 (discriminator 4))
[ 215.259943][ T28] kthread (kernel/kthread.c:376)
[ 215.263688][ T28] ? sort_range (kernel/smpboot.c:107)
[ 215.267858][ T28] ? kthread_complete_and_exit (kernel/kthread.c:331)
[ 215.273334][ T28] ret_from_fork (arch/x86/entry/entry_32.S:770)
[  215.277599][   T28] ---[ end trace 0000000000000000 ]---

--
Best Regards,
Yujie

Download attachment "dmesg.xz" of type "application/x-xz" (40276 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ