Message-ID: <fee21b3f-fe4d-480e-9ad6-cac8ba46055f@windriver.com>
Date: Wed, 10 Dec 2025 15:31:24 +0800
From: Jianpeng Chang <jianpeng.chang.cn@...driver.com>
To: Anshuman Khandual <anshuman.khandual@....com>, catalin.marinas@....com,
will@...nel.org, ying.huang@...ux.alibaba.com, ardb@...nel.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v3 PATCH] arm64: mm: Fix kexec failure after pte_mkwrite_novma() change
On 12/4/25 4:16 PM, Chang, Jianpeng (CN) wrote:
>
>
> On 12/4/2025 4:07 PM, Anshuman Khandual wrote:
>>
>> On 04/12/25 11:57 AM, Jianpeng Chang wrote:
>>> Commit 143937ca51cc ("arm64, mm: avoid always making PTE dirty in
>>> pte_mkwrite()") modified pte_mkwrite_novma() to only clear PTE_RDONLY
>>> when the page is already dirty (PTE_DIRTY is set). While this optimization
>>> prevents unnecessary dirty page marking in normal memory management paths,
>>> it breaks kexec on some platforms like NXP LS1043.
>>
>> Why is this problem only applicable to NXP LS1043? Or is that the only
>> platform on which you have observed the issue, although it is
>> problematic elsewhere as well?
>
> Not only 1043. I found it on the NXP LS1043, and I have both NXP LS1043
> and LS1046 boards available. They both have this issue.
Hi Anshuman,
Just following up on my previous response from a week ago. Are there any
updates?
I borrowed an IMX8 board, which differs from the LS1043 in that it is
based on Cortex-A72 + Cortex-A53, and ran the same test. When
reproducing the issue on LS1043, I loaded the same Image as the first
kernel with:
kexec -l /boot/Image --reuse-cmdline
However, I couldn't reproduce it on IMX8 initially; the second kernel
booted normally. Here are the differences between IMX8 and LS1043:
root@...-ls1043:~# cat /proc/iomem | grep Kernel
81000000-824effff : Kernel code
82700000-82a8ffff : Kernel data
root@...-ls1043:~# kexec -l /boot/Image --reuse-cmdline -d 2>&1 | grep segment
image_arm64_load: kernel_segment: 0000000080000000
arm64_load_other_segments:730: purgatory sink: 0x0
nr_segments = 3
segment[0].buf = 0xffff9c5a2010
segment[0].bufsz = 0x194e200
segment[0].mem = 0x80000000
segment[0].memsz = 0x1a90000
segment[1].buf = 0xaaaad60cc180
segment[1].bufsz = 0xfa81
segment[1].mem = 0x81a90000
segment[1].memsz = 0x10000
segment[2].buf = 0xaaaad60dc1c0
segment[2].bufsz = 0x3660
segment[2].mem = 0x81aa0000
segment[2].memsz = 0x4000
root@...-imx8:~# cat /proc/iomem | grep Kernel
8a0000000-8a19bffff : Kernel code
8a1c20000-8a1f7ffff : Kernel data
root@...-imx8:~# kexec -l /boot/Image --reuse-cmdline -d 2>&1 | grep segment
image_arm64_load: kernel_segment: 0000000080200000
arm64_load_other_segments:730: purgatory sink: 0x0
nr_segments = 3
segment[0].buf = 0xffff990da010
segment[0].bufsz = 0x1ea2200
segment[0].mem = 0x80200000
segment[0].memsz = 0x1f80000
segment[1].buf = 0xffff99084010
segment[1].bufsz = 0x29839
segment[1].mem = 0x82180000
segment[1].memsz = 0x2a000
segment[2].buf = 0xaaaab0cdbc10
segment[2].bufsz = 0x3680
segment[2].mem = 0x821aa000
segment[2].memsz = 0x4000
From the logs, on LS1043, the second kernel's segments happen to overlap
with the kernel code pages, which are read-only. I was able to reproduce
the same issue on IMX8 by forcing the overlap:
kexec -l /boot/Image --reuse-cmdline --mem-min=0x898000000 --mem-max=0x8a1000000
root@...-imx8:~# kexec -l /boot/Image --reuse-cmdline --mem-min=0x898000000 --mem-max=0x8a1000000 -d 2>&1 | grep segment
image_arm64_load: kernel_segment: 0000000898000000
arm64_load_other_segments:730: purgatory sink: 0x0
nr_segments = 3
segment[0].buf = 0xffff95e0a010
segment[0].bufsz = 0x1ea2200
segment[0].mem = 0x898000000 overlap
segment[0].memsz = 0x1f80000
segment[1].buf = 0xffff95db4010
segment[1].bufsz = 0x29839
segment[1].mem = 0x899f80000
segment[1].memsz = 0x2a000
segment[2].buf = 0xaaaac05fbc10
segment[2].bufsz = 0x3680
segment[2].mem = 0x899faa000
segment[2].memsz = 0x4000
This explains why we haven't seen similar reports: the issue depends on
the memory layout. However, I still prefer this fix because it is
universal and works regardless of memory layout or kexec-tools address
selection. We cannot expect kexec-tools to always find the "right"
memory location, and fundamentally, we expect this temporary page table
to be writable.
Please let me know if you need any additional information or clarification.
Thanks,
Jianpeng
>
>>
>>>
>>> The issue occurs in the kexec code path:
>>> 1. machine_kexec_post_load() calls trans_pgd_create_copy() to create a
>>> writable copy of the linear mapping
>>> 2. _copy_pte() calls pte_mkwrite_novma() to ensure all pages in the copy
>>> are writable for the new kernel image copying
>>> 3. With the new logic, clean pages (without PTE_DIRTY) remain read-only
>>> 4. When kexec tries to copy the new kernel image through the linear
>>> mapping, it fails on read-only pages, causing the system to hang
>>> after "Bye!"
>>>
>>> The same issue affects hibernation which uses the same trans_pgd code
>>> path.
>>>
>>> Fix this by marking pages dirty with pte_mkdirty() in _copy_pte(), which
>>> ensures pte_mkwrite_novma() clears PTE_RDONLY for both kexec and
>>> hibernation, making all pages in the temporary mapping writable regardless
>>> of their dirty state. This preserves the original commit's optimization
>>> for normal memory management while fixing the kexec/hibernation regression.
>>>
>>> Using pte_mkdirty() causes redundant bit operations when the page is
>>> already writable (redundant PTE_RDONLY clearing), but this is acceptable
>>> since it's not a hot path and only affects kexec/hibernation scenarios.
>>>
>>> Fixes: 143937ca51cc ("arm64, mm: avoid always making PTE dirty in pte_mkwrite()")
>>> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@...driver.com>
>>> Reviewed-by: Huang Ying <ying.huang@...ux.alibaba.com>
>>> ---
>>> v3:
>>> - Add a description of pte_mkdirty() to the commit message
>>> - Note the redundant bit operations in the commit message
>>> - Fix the comments following the suggestions
>>> v2: https://lore.kernel.org/all/20251202022707.2720933-1-jianpeng.chang.cn@...driver.com/
>>> - Use pte_mkwrite_novma(pte_mkdirty(pte)) instead of manual bit
>>> manipulation
>>> - Updated comments to clarify pte_mkwrite_novma() alone cannot be used
>>> v1: https://lore.kernel.org/all/20251127034350.3600454-1-jianpeng.chang.cn@...driver.com/
>>>
>>> arch/arm64/mm/trans_pgd.c | 17 +++++++++++++++--
>>> 1 file changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
>>> index 18543b603c77..766883780d2a 100644
>>> --- a/arch/arm64/mm/trans_pgd.c
>>> +++ b/arch/arm64/mm/trans_pgd.c
>>> @@ -40,8 +40,14 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
>>> * Resume will overwrite areas that may be marked
>>> * read only (code, rodata). Clear the RDONLY bit from
>>> * the temporary mappings we use during restore.
>>> + *
>>> + * For both kexec and hibernation, writable accesses are required
>>> + * for all pages in the linear map to copy over new kernel image.
>>> + * Hence mark these pages dirty first via pte_mkdirty() to ensure
>>> + * pte_mkwrite_novma() subsequently clears PTE_RDONLY - providing
>>> + * required write access for the pages.
>>> */
>>> - __set_pte(dst_ptep, pte_mkwrite_novma(pte));
>>> + __set_pte(dst_ptep, pte_mkwrite_novma(pte_mkdirty(pte)));
>>> } else if (!pte_none(pte)) {
>>> /*
>>> * debug_pagealloc will removed the PTE_VALID bit if
>>> @@ -57,7 +63,14 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
>>> */
>>> BUG_ON(!pfn_valid(pte_pfn(pte)));
>>>
>>> - __set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte)));
>>> + /*
>>> + * For both kexec and hibernation, writable accesses are required
>>> + * for all pages in the linear map to copy over new kernel image.
>>> + * Hence mark these pages dirty first via pte_mkdirty() to ensure
>>> + * pte_mkwrite_novma() subsequently clears PTE_RDONLY - providing
>>> + * required write access for the pages.
>>> + */
>>> + __set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte_mkdirty(pte))));
>>> }
>>> }
>>>
>>
>