Message-ID: <710c48c9-406d-e4c5-a394-10501b951316@samsung.com>
Date: Wed, 13 Apr 2022 16:03:28 +0200
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: Peter Xu <peterx@...hat.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Cc: Mike Kravetz <mike.kravetz@...cle.com>,
Nadav Amit <nadav.amit@...il.com>,
Matthew Wilcox <willy@...radead.org>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
David Hildenbrand <david@...hat.com>,
Hugh Dickins <hughd@...gle.com>,
Jerome Glisse <jglisse@...hat.com>,
"Kirill A . Shutemov" <kirill@...temov.name>,
Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Alistair Popple <apopple@...dia.com>
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()
Hi,
On 05.04.2022 03:48, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes. For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte. One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte. pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <peterx@...hat.com>
This patch landed in today's linux-next (next-20220413) as commit fa6009949163
("mm: check against orig_pte for finish_fault()"). Unfortunately, it
causes serious system instability on some 32-bit ARM machines. I've
observed it on all tested boards (various Samsung Exynos based boards,
Raspberry Pi 3b and 4b, and even QEMU's 32-bit virt machine) when the
kernel was compiled from multi_v7_defconfig.
Here is a crash log from QEMU's ARM 32bit virt machine:
8<--- cut here ---
Unable to handle kernel paging request at virtual address e093263c
[e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#1] SMP ARM
Modules linked in:
CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted 5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>] lr : [<c0478144>] psr: 40000013
sp : e0931df8 ip : e0931e54 fp : c26a8000
r10: 00000081 r9 : c2230880 r8 : 00000000
r7 : 00000081 r6 : beffffed r5 : c267f000 r4 : c2230880
r3 : 00000000 r2 : 00000000 r1 : 00000040 r0 : e0931e3c
Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 4020406a DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0930000 allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c2230880 pointer offset 0
Register r5 information: slab vm_area_struct start c267f000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c2230880 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c26a8000 pointer offset 0 size 168
Register r12 information: 2-page vmalloc region starting at 0xe0930000 allocated at kernel_clone+0x8c/0x3a8
Process kworker/u4:0 (pid: 37, stack limit = 0x(ptrval))
Stack: (0xe0931df8 to 0xe0932000)
...
---[ end trace 0000000000000000 ]---
CAN device driver interface
bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
e1000e: Intel(R) PRO/1000 Network Driver
e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
igb: Intel(R) Gigabit Ethernet Network Driver
igb: Copyright (c) 2007-2014 Intel Corporation.
pegasus: Pegasus/Pegasus II USB Ethernet driver
usbcore: registered new interface driver pegasus
usbcore: registered new interface driver asix
usbcore: registered new interface driver ax88179_178a
usbcore: registered new interface driver cdc_ether
usbcore: registered new interface driver smsc75xx
usbcore: registered new interface driver smsc95xx
usbcore: registered new interface driver net1080
usbcore: registered new interface driver cdc_subset
usbcore: registered new interface driver zaurus
usbcore: registered new interface driver cdc_ncm
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci-pci: EHCI PCI platform driver
ehci-platform: EHCI generic platform driver
ehci-omap: OMAP-EHCI Host Controller driver
ehci-orion: EHCI orion driver
SPEAr-ehci: EHCI SPEAr driver
ehci-st: EHCI STMicroelectronics driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci-pci: OHCI PCI platform driver
ohci-platform: OHCI generic platform driver
SPEAr-ohci: OHCI SPEAr driver
ohci-st: OHCI STMicroelectronics driver
usbcore: registered new interface driver usb-storage
rtc-pl031 9010000.pl031: registered as rtc0
rtc-pl031 9010000.pl031: setting system clock to 2022-04-13T13:49:19 UTC (1649857759)
i2c_dev: i2c /dev entries driver
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
Synopsys Designware Multimedia Card Interface Driver
sdhci-pltfm: SDHCI platform and OF driver helper
ledtrig-cpu: registered to indicate activity on CPUs
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
NET: Registered PF_INET6 protocol family
Segment Routing with IPv6
In-situ OAM (IOAM) with IPv6
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
NET: Registered PF_PACKET protocol family
can: controller area network core
NET: Registered PF_CAN protocol family
can: raw protocol
can: broadcast manager protocol
can: netlink gateway - max_hops=1
Key type dns_resolver registered
ThumbEE CPU extension supported.
Registering SWP/SWPB emulation handler
Loading compiled-in X.509 certificates
input: gpio-keys as /devices/platform/gpio-keys/input/input0
uart-pl011 9000000.pl011: no DMA platform data
EXT4-fs (vda): mounted filesystem with ordered data mode. Quota mode: disabled.
VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
devtmpfs: mounted
Freeing unused kernel image (initmem) memory: 2048K
Run /sbin/init as init process
with arguments:
/sbin/init
with environment:
HOME=/
TERM=linux
8<--- cut here ---
Unable to handle kernel paging request at virtual address e082662c
[e082662c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G D 5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>] lr : [<c0478144>] psr: 40000013
sp : e0825de8 ip : e0825e44 fp : c213e000
r10: 00000081 r9 : c20e0000 r8 : 00000000
r7 : 00000081 r6 : befffff1 r5 : c2695000 r4 : c20e0000
r3 : 00000000 r2 : 00000000 r1 : 00000040 r0 : e0825e2c
Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 4020406a DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0824000 allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c20e0000 pointer offset 0
Register r5 information: slab vm_area_struct start c2695000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c20e0000 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c213e000 pointer offset 0 size 168
Register r12 information: 2-page vmalloc region starting at 0xe0824000 allocated at kernel_clone+0x8c/0x3a8
Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
Stack: (0xe0825de8 to 0xe0826000)
...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
CPU1: stopping
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x40/0x4c
dump_stack_lvl from do_handle_IPI+0x2c4/0x2fc
do_handle_IPI from ipi_handler+0x18/0x20
ipi_handler from handle_percpu_devid_irq+0x8c/0x1e0
handle_percpu_devid_irq from generic_handle_domain_irq+0x40/0x84
generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
gic_handle_irq from generic_handle_arch_irq+0x34/0x44
generic_handle_arch_irq from call_with_stack+0x18/0x20
call_with_stack from __irq_svc+0x98/0xb0
Exception stack(0xe0869f50 to 0xe0869f98)
9f40: 00009ddc 00000000 00000001 c031be20
9f60: c20e5d80 c1b48f20 c1904d10 c1904d6c c183e9e8 c1b47971 00000000 00000000
9f80: c1904e24 e0869fa0 c0307b74 c0307b78 60000113 ffffffff
__irq_svc from arch_cpu_idle+0x38/0x3c
arch_cpu_idle from default_idle_call+0x3c/0xb8
default_idle_call from do_idle+0x1f8/0x298
do_idle from cpu_startup_entry+0x18/0x1c
cpu_startup_entry from 0x40301780
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
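
A quick arithmetic check on the two oopses above, offered as a sketch rather than a diagnosis: in both reports the faulting address equals r0 + 0x800, and it lies past the upper bound printed in the "Stack:" line. That would be consistent with the classic ARM 2-level page table layout, where the hardware PTE is stored 2048 bytes after the Linux PTE, although the log alone does not prove it. The struct and helper names below are made up for illustration; only the constants are copied from the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the two oops reports in the log above. */
struct oops_sample {
    uint32_t fault_addr; /* "Unable to handle kernel paging request at ..." */
    uint32_t r0;         /* r0 at the time of the fault */
    uint32_t stack_end;  /* upper bound from the "Stack: (.. to ..)" line */
};

static const struct oops_sample oops1 = { 0xe093263cu, 0xe0931e3cu, 0xe0932000u };
static const struct oops_sample oops2 = { 0xe082662cu, 0xe0825e2cu, 0xe0826000u };

/* Distance between the faulting address and the pointer held in r0. */
static uint32_t fault_offset_from_r0(const struct oops_sample *s)
{
    assert(s->fault_addr >= s->r0);
    return s->fault_addr - s->r0;
}

/* True if the faulting address is at or beyond the end of the stack range. */
static bool fault_is_past_stack(const struct oops_sample *s)
{
    return s->fault_addr >= s->stack_end;
}
```

Calling fault_offset_from_r0() on either sample gives the same 0x800 offset, and fault_is_past_stack() holds for both, so the store in cpu_ca15_set_pte_ext appears to target memory just beyond the task stack that r0 points into.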
> ---
> mm/memory.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> vmf->address, &vmf->ptl);
> ret = 0;
> /* Re-check under ptl */
> - if (likely(pte_none(*vmf->pte)))
> + if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> do_set_pte(vmf, page, vmf->address);
> else
> ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> * concurrent faults and from rmap lookups.
> */
> vmf->pte = NULL;
> + /*
> + * Always initialize orig_pte. This matches with below
> + * code to have orig_pte to be the none pte if pte==NULL.
> + * This makes the rest code to be always safe to reference
> + * it, e.g. in finish_fault() we'll detect pte changes.
> + */
> + pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
> } else {
> /*
> * If a huge pmd materialized under us just retry later. Use
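
To illustrate the re-check that the quoted hunk changes, here is a small userspace model, not kernel code. toy_pte_t, TOY_PTE_NONE, and the toy_* helpers are hypothetical stand-ins for pte_t, an architecture's none encoding, pte_none(), pte_same(), and pte_clear(). The model assumes an architecture whose none pte is not all zeros, which is the case the commit message calls out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy page table entry; a stand-in for the kernel's pte_t. */
typedef struct { uint32_t val; } toy_pte_t;

/* Hypothetical non-zero encoding of an empty ("none") entry. */
#define TOY_PTE_NONE 0x00000400u

static bool toy_pte_none(toy_pte_t pte) { return pte.val == TOY_PTE_NONE; }
static bool toy_pte_same(toy_pte_t a, toy_pte_t b) { return a.val == b.val; }

/* Set an entry to the none encoding, as pte_clear() is meant to do. */
static void toy_pte_clear(toy_pte_t *ptep)
{
    assert(ptep);
    ptep->val = TOY_PTE_NONE;
}

/* Old re-check: install the page only if the slot is still empty. */
static bool toy_recheck_old(toy_pte_t cur)
{
    return toy_pte_none(cur);
}

/* New re-check: install the page only if the slot is unchanged
 * since the snapshot taken when the fault was first examined. */
static bool toy_recheck_new(toy_pte_t cur, toy_pte_t orig)
{
    return toy_pte_same(cur, orig);
}
```

In this model a zero-initialized snapshot never matches an empty slot, so the new re-check would always fail unless the snapshot is first set with toy_pte_clear(). A non-none entry such as a marker passes the new re-check when it is unchanged, which the old check could not allow.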
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland