linux-kernel - [Question] UCE on pud-sized hugepage leads to kernel panic on lts5.10

Message-ID: <2b4d03bc-2b6e-45b0-655a-58b66672187e@huawei.com>
Date:   Wed, 14 Dec 2022 09:33:10 +0800
From:   mawupeng <mawupeng1@...wei.com>
To:     <naoya.horiguchi@....com>
CC:     <mawupeng1@...wei.com>, <catalin.marinas@....com>,
        <gregkh@...uxfoundation.org>, <akpm@...ux-foundation.org>,
        <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: [Question] UCE on pud-sized hugepage leads to kernel panic on lts5.10

On the current arm64 stable 5.10 kernel (v5.10.158), if an uncorrectable
error (UCE) happens on a pud-sized hugepage, the kernel panics, because the
memory failure code has been unable to handle this kind of error since
commit 31286a8484a8 ("mm: hwpoison: disable memory error handling on 1GB
hugepage").
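
For readers wondering why a "pud-sized" hugepage is the same thing as the
"1GB hugepage" named in the commit title: below is a minimal user-space
sketch of the arm64 page-table level arithmetic. It assumes a 4K granule
(PAGE_SHIFT = 12), which is not stated in this mail, and the helper names
level_shift(), level_size() and print_block_sizes() are made up for
illustration; they are not kernel functions.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Shift of the region mapped by one entry at page-table level `level`
 * (0 = PGD, 1 = PUD, 2 = PMD, 3 = PTE) for a granule of 2^page_shift
 * bytes. Each level resolves (page_shift - 3) address bits because a
 * descriptor is 8 bytes, so a table page holds 2^(page_shift - 3) entries.
 * Hypothetical helper, modelled on the arm64 level-shift arithmetic.
 */
static inline unsigned int level_shift(unsigned int page_shift,
                                       unsigned int level)
{
    return (page_shift - 3) * (4 - level) + 3;
}

/* Size in bytes of the block mapped by a single entry at `level`. */
static inline uint64_t level_size(unsigned int page_shift,
                                  unsigned int level)
{
    return (uint64_t)1 << level_shift(page_shift, level);
}

/* Print the PMD and PUD block sizes for the given granule. */
static inline void print_block_sizes(unsigned int page_shift)
{
    printf("PMD block: %llu MiB\n",
           (unsigned long long)(level_size(page_shift, 2) >> 20));
    printf("PUD block: %llu MiB\n",
           (unsigned long long)(level_size(page_shift, 1) >> 20));
}
```

Under the 4K assumption, level_shift(12, 2) is 21 (a 2MB PMD block) and
level_shift(12, 1) is 30 (a 1GB PUD block), so a pud-sized hugetlb page is
a 1GB hugepage.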

The latest kernel (v6.0) can handle this UCE since commit 6f4614886baa ("mm,
hwpoison: enable memory error handling on 1GB hugepage"). We are trying to
backport this patchset to stable 5.10, but too many other patches would
have to be backported with it because of the large differences between 5.10
and 6.0. The full patch list is given at the end of this mail.

We do not think backporting all of these patches is feasible for stable
5.10. Is there a better way to fix this problem?

The kernel panic call trace:

  Kernel panic - not syncing: Fatal hardware error!
  CPU: 0 PID: 15 Comm: kworker/0:1 Not tainted 5.10.158_stable_5_10 #1
  Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.26.01 06/14/2019
  Workqueue: kacpi_notify acpi_os_execute_deferred
  Call trace:
   dump_backtrace+0x0/0x1ec
   show_stack+0x20/0x30
   dump_stack+0xd0/0x128
   panic+0x154/0x36c
   __raw_spin_lock_irqsave.constprop.0+0x0/0xb0
   ghes_proc+0x148/0x200
   ghes_notify_hed+0x58/0xd4
   blocking_notifier_call_chain+0x74/0xb0
   acpi_hed_notify+0x28/0x3c
   acpi_device_notify+0x24/0x30
   acpi_ev_notify_dispatch+0x68/0x78
   acpi_os_execute_deferred+0x24/0x3c
   process_one_work+0x1d4/0x4b0
   worker_thread+0x180/0x430
   kthread+0x118/0x120
   ret_from_fork+0x10/0x18
  SMP: stopping secondary CPUs
  Kernel Offset: 0x4ed64eb80000 from 0xffff800010000000
  PHYS_OFFSET: 0xffffd24300000000
  CPU features: 0x00000002,62208a38
  Memory Limit: none
  Rebooting in 30 seconds..

Our backport list (bug fixes not included):

  mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page
  mm,hwpoison: take free pages off the buddy freelists
  mm,hwpoison: drop unneeded pcplist draining
  mm,hwpoison: refactor get_any_page
  mm,hwpoison: disable pcplists before grabbing a refcount
  mm,hwpoison: remove drain_all_pages from shake_page
  hugetlb: use page.private for hugetlb specific page flags
  hugetlb: convert page_huge_active() HPageMigratable flag
  hugetlb: convert PageHugeTemporary() to HPageTemporary flag
  hugetlb: convert PageHugeFreed to HPageFreed flag
  mm,hwpoison: fix race with hugetlb page allocation
  mm: hugetlb: gather discrete indexes of tail page
  hugetlb: create remove_hugetlb_page() to separate functionality
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm/hwpoison: disable pcp for page_handle_poison()
  mm/hwpoison: mf_mutex for soft offline and unpoison
  mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE
  mm/hwpoison: fix unpoison_memory()
  mm/memory-failure.c: fix race with changing page compound again
  mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
  mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  mm/memory-failure.c: simplify num_poisoned_pages_dec
  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  mm, hwpoison: make __page_handle_poison returns int
  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage