Message-ID: <87ldrujhr5.fsf@gmail.com>
Date: Mon, 21 Apr 2025 09:19:50 +0530
From: Ritesh Harjani (IBM) <riteshh@...ux.ibm.com>
To: Misbah Anjum N <misanjum@...ux.ibm.com>, linuxppc-dev@...ts.ozlabs.org,
linux-mm@...ck.org
Cc: maddy@...ux.ibm.com, mpe@...erman.id.au, npiggin@...il.com,
christophe.leroy@...roup.eu, naveen@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [BUG][powerpc] OOPs: Kernel access of bad area during zram swap write - kswapd0 crash
++ linux-mm
Misbah Anjum N <misanjum@...ux.ibm.com> writes:
> Bug Description:
> When running Avocado-VT based functional tests on a KVM guest, the system
> encounters a kernel panic and crash during memory reclaim activity when
> zram is actively used for swap. The crash occurs in the kswapd0 kernel
> thread during what appears to be a write operation to zram.
>
>
> Steps to Reproduce:
> 1. Compile the upstream kernel on the LPAR
> 2. Compile Qemu and Libvirt for the KVM guest
> 3. Run functional tests on the KVM guest using the Avocado-VT regression
>    bucket:
>    a. Clone: git clone https://github.com/lop-devops/tests.git
>    b. Setup: python3 avocado-setup.py --bootstrap --enable-kvm --install-deps
>    c. Add the guest image in folder: tests/data/avocado-vt/images/
>    d. Run: python3 avocado-setup.py --run-suite guest_regression \
>            --guest-os <Guest-name> --only-filter 'virtio_scsi virtio_net qcow2' \
>            --no-download
>
> The bug is reproducible when the Avocado-VT regression bucket is executed,
> which consists of a series of functional tp-libvirt tests performed on the
> KVM guest in the following order: cpu, memory, network, storage and
> hotplug (disk, change media, libvirt_mem), etc.
> During execution, the system crashes in the test:
> io-github-autotest-libvirt.libvirt_mem.positive_test.mem_basic.cold_plug_discard
> Note: This does not appear to be caused by a single test, but by
> cumulative operations during the test sequence.
>
>
> Environment Details:
> Kernel: 6.15.0-rc1-g521d54901f98
> Reproducible with: 6.15.0-rc2-gf3a2e2a79c9d
Looks like the issue is happening on 6.15-rc2. Did a git bisect reveal a
faulty commit?
> Platform: IBM POWER10 LPAR (ppc64le)
> Distro: Fedora42
> RAM: 64GB
> CPUs: 80
> Qemu: 9.2.93 (v10.0.0-rc3-10-g8bdd3a0308)
> Libvirt: 11.3.0
>
>
> System Memory State:
> # free -mh
>                total        used        free      shared  buff/cache   available
> Mem:            61Gi       3.0Gi        25Gi        11Mi        33Gi        58Gi
> Swap:          8.0Gi          0B       8.0Gi
>
> # zramctl
> NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
> /dev/zram0 lzo-rle         8G  64K  222B  128K         [SWAP]
>
> # swapon --show
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition   8G   0B  100
>
>
> Call Trace:
> [180060.602200] BUG: Unable to handle kernel data access on read at 0xc00800000a1b0000
> [180060.602219] Faulting instruction address: 0xc000000000175670
> [180060.602224] Oops: Kernel access of bad area, sig: 11 [#1]
> [180060.602227] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [180060.602232] Modules linked in: dm_thin_pool dm_persistent_data vmw_vsock_virtio_transport_common vsock zram xfs dm_service_time sd_mod
> [180060.602345] CPU: 68 UID: 0 PID: 465 Comm: kswapd0 Kdump: loaded Not tainted 6.15.0-rc1-g521d54901f98 #1 VOLUNTARY
> [180060.602351] Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.21 (NH1060_078) hv:phyp pSeries
> [180060.602355] NIP: c000000000175670 LR: c0000000006d96b4 CTR: 01fffffffffffc05
> [180060.602358] REGS: c0000000a5a56da0 TRAP: 0300 Not tainted (6.15.0-rc1-g521d54901f98)
> [180060.602362] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 44042880 XER: 20040001
> [180060.602370] CFAR: c0000000001756c8 DAR: c00800000a1b0000 DSISR: 40000000 IRQMASK: 0
<...>
>
>
> Crash Utility Output:
> # crash /home/kvmci/linux/vmlinux vmcore
> crash 8.0.6-4.fc42
>
>       KERNEL: /home/kvmci/linux/vmlinux  [TAINTED]
>     DUMPFILE: vmcore  [PARTIAL DUMP]
>         CPUS: 80
>         DATE: Wed Dec 31 18:00:00 CST 1969
>       UPTIME: 2 days, 02:01:00
> LOAD AVERAGE: 0.72, 0.66, 0.64
>        TASKS: 1249
>     NODENAME: ***
>      RELEASE: 6.15.0-rc1-g521d54901f98
>      VERSION: #1 SMP Wed Apr 9 05:13:03 CDT 2025
>      MACHINE: ppc64le  (3450 Mhz)
>       MEMORY: 64 GB
>        PANIC: "Oops: Kernel access of bad area, sig: 11 [#1]" (check log for details)
>          PID: 465
>      COMMAND: "kswapd0"
>         TASK: c000000006067d80  [THREAD_INFO: c000000006067d80]
>          CPU: 68
>        STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 465 TASK: c000000006067d80 CPU: 68 COMMAND: "kswapd0"
> R0: 000000000e000000 R1: c0000000a5a57040 R2: c0000000017a8100
> R3: c000000d34cefd00 R4: c00800000a1affe8 R5: fffffffffffffffa
> R6: 01ffffffffffffff R7: 03ffffff2cb33000 R8: 0000000080000000
> R9: 0000000000000010 R10: 0000000000000020 R11: 0000000000000030
> R12: 0000000000000040 R13: c000000ffde34300 R14: 0000000000000050
> R15: 0000000000000060 R16: 0000000000000070 R17: 5deadbeef0000122
> R18: 0000000101124ad9 R19: 0000000000018028 R20: 0000000000c01400
> R21: c0000000a5a574b0 R22: 0000000000010000 R23: 0000000000000051
> R24: c000000002be02f8 R25: c00c000003ef7380 R26: c00800000a190000
> R27: c0000001054dbbe0 R28: c00c0000034d3340 R29: 00000000000002d8
> R30: fffffffffffffffa R31: c0000001054dbbb0
> NIP: c000000000175670 MSR: 8000000002009033 OR3: c0000000001756c8
> CTR: 01fffffffffffc05 LR: c0000000006d96b4 XER: 0000000020040001
> CCR: 0000000044042880 MQ: 0000000000000000 DAR: c00800000a1b0000
> DSISR: 0000000040000000 Syscall Result: 0000000000000000
> [NIP : memcpy_power7+1648]
> [LR : zs_obj_write+548]
> #0 [c0000000a5a56c50] crash_kexec at c00000000037f268
> #1 [c0000000a5a56c80] oops_end at c00000000002b678
> #2 [c0000000a5a56d00] __bad_page_fault at c00000000014f348
> #3 [c0000000a5a56d70] data_access_common_virt at c000000000008be0
> #4 [c0000000a5a57040] memcpy_power7 at c000000000175274
> #5 [c0000000a5a57140] zs_obj_write at c0000000006d96b4
Looks like the new zsmalloc object mapping API is being called here, which
was merged in rc1? But let's first confirm via git bisect, unless someone
from linux-mm who knows the zsmalloc subsystem better can point out what
could be going wrong -- see also the sketch after the backtrace below.
> #6 [c0000000a5a571b0] zram_write_page at c008000009174a50 [zram]
> #7 [c0000000a5a57260] zram_bio_write at c008000009174ff4 [zram]
> #8 [c0000000a5a57310] __submit_bio at c0000000009323ac
> #9 [c0000000a5a573a0] __submit_bio_noacct at c000000000932614
> #10 [c0000000a5a57410] submit_bio_wait at c000000000926d34
> #11 [c0000000a5a57480] swap_writepage_bdev_sync at c00000000065ab5c
> #12 [c0000000a5a57540] swap_writepage at c00000000065b90c
> #13 [c0000000a5a57570] shmem_writepage at c0000000005ada30
> #14 [c0000000a5a57630] pageout at c00000000059d700
> #15 [c0000000a5a57850] shrink_folio_list at c00000000059eafc
> #16 [c0000000a5a57aa0] shrink_inactive_list at c0000000005a0b38
> #17 [c0000000a5a57b70] shrink_lruvec at c0000000005a12a0
> #18 [c0000000a5a57c80] shrink_node_memcgs at c0000000005a16f4
> #19 [c0000000a5a57d00] shrink_node at c0000000005a183c
> #20 [c0000000a5a57d80] balance_pgdat at c0000000005a24e0
> #21 [c0000000a5a57ef0] kswapd at c0000000005a2b50
> #22 [c0000000a5a57f80] kthread at c00000000026ea68
> #23 [c0000000a5a57fe0] start_kernel_thread at c00000000000df98
>
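FWIW, frame #5 above is zs_obj_write() copying into the zspage mapping,
and the faulting address (DAR c00800000a1b0000) sits exactly 2 * 64K past
R26 (c00800000a190000), i.e. just past what could be a two-page mapping.
One way a memcpy could end up there is a bad length in the path that
writes an object spanning two pages. Below is a completely untested
userspace sketch of that two-chunk split -- this is not the actual
mm/zsmalloc.c code, and names like first_chunk/second_chunk are mine --
just to show the failure shape to look for:

#include <stdio.h>
#include <stddef.h>

#define ZS_PAGE_SIZE 65536UL    /* 64K pages, as on this LPAR */

/*
 * An object at offset 'off' in its zspage crosses into the next page,
 * so 'mem_len' bytes get copied as first_chunk + second_chunk.
 */
static void split(size_t off, size_t mem_len)
{
        size_t first_chunk = ZS_PAGE_SIZE - off;        /* room left in page 0 */
        size_t second_chunk = mem_len - first_chunk;    /* remainder in page 1 */

        /*
         * If mem_len ever ends up smaller than first_chunk (say, a short
         * compressed buffer while the split is derived from something
         * else, like the class size), second_chunk underflows to a huge
         * size_t and the second memcpy runs straight off the end of the
         * mapped pages.
         */
        printf("off=%zu len=%zu -> chunks %zu + %zu\n",
               off, mem_len, first_chunk, second_chunk);
}

int main(void)
{
        split(65000, 2000);     /* sane spanning write: 536 + 1464 */
        split(65000, 200);      /* short write: 536 + underflow */
        return 0;
}

Again, just a hypothesis sketch, not a claim about the actual code --
a bisect should tell us more.
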
-ritesh