Message-ID: <jzqh5j4w4w23xuigqj5bggbmx2hgte4u5tvbss3hqi3vjeodhl@rnmirwt6biol>
Date: Tue, 20 Aug 2024 13:13:26 +0300
From: "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>
To: "Huang, Kai" <kai.huang@...el.com>
Cc: "Huang, Ying" <ying.huang@...el.com>,
"ardb@...nel.org" <ardb@...nel.org>, "luto@...nel.org" <luto@...nel.org>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"tzimmermann@...e.de" <tzimmermann@...e.de>, "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"mingo@...hat.com" <mingo@...hat.com>, "bhe@...hat.com" <bhe@...hat.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "hpa@...or.com" <hpa@...or.com>,
"peterz@...radead.org" <peterz@...radead.org>, "bp@...en8.de" <bp@...en8.de>,
"rafael@...nel.org" <rafael@...nel.org>, "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCHv3 3/4] x86/64/kexec: Map original relocate_kernel() in
init_transition_pgtable()
On Mon, Aug 19, 2024 at 12:39:23PM +0000, Huang, Kai wrote:
> On Mon, 2024-08-19 at 14:57 +0300, kirill.shutemov@...ux.intel.com wrote:
> > On Mon, Aug 19, 2024 at 11:16:52AM +0000, Huang, Kai wrote:
> > > On Mon, 2024-08-19 at 10:08 +0300, Kirill A. Shutemov wrote:
> > > > The init_transition_pgtable() function sets up transitional page tables.
> > > > It ensures that the relocate_kernel() function is present in the
> > > > identity mapping at the same location as in the kernel page tables.
> > > > relocate_kernel() switches to the identity mapping, and the function
> > > > must be present at the same location in the virtual address space before
> > > > and after switching page tables.
> > > >
> > > > init_transition_pgtable() maps a copy of relocate_kernel() in
> > > > image->control_code_page at the relocate_kernel() virtual address, but
> > > > the original physical address of relocate_kernel() would also work.
> > > >
> > > > It is safe to use original relocate_kernel() physical address cannot be
> > > > overwritten until swap_pages() is called, and the relocate_kernel()
> > > > virtual address will not be used by then.
> > > >
> > > > Map the original relocate_kernel() at the relocate_kernel() virtual
> > > > address in the identity mapping. It is preparation to replace the
> > > > init_transition_pgtable() implementation with a call to
> > > > kernel_ident_mapping_init().
> > > >
> > > > Note that while relocate_kernel() switches to the identity mapping, it
> > > > does not flush global TLB entries (CR4.PGE is not cleared). This means
> > > > that in most cases, the kernel still runs relocate_kernel() from the
> > > > original physical address before the change.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> > > > ---
> > > > arch/x86/kernel/machine_kexec_64.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > > > index 9c9ac606893e..645690e81c2d 100644
> > > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > > @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
> > > > pte_t *pte;
> > > >
> > > > vaddr = (unsigned long)relocate_kernel;
> > > > - paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
> > > > + paddr = __pa(relocate_kernel);
> > > > pgd += pgd_index(vaddr);
> > > > if (!pgd_present(*pgd)) {
> > > > p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
> > >
> > >
> > > IIUC, this breaks KEXEC_JUMP (image->preserve_context is true).
> > >
> > > The relocate_kernel() first saves couple of regs and some other data like PA
> > > of swap page to the control page. Note here the VA_CONTROL_PAGE is used to
> > > access the control page, so those data are saved to the control page.
> > >
> > > SYM_CODE_START_NOALIGN(relocate_kernel)
> > > UNWIND_HINT_END_OF_STACK
> > > ANNOTATE_NOENDBR
> > > /*
> > > * %rdi indirection_page
> > > * %rsi page_list
> > > * %rdx start address
> > > * %rcx preserve_context
> > > * %r8 bare_metal
> > > */
> > >
> > > ...
> > >
> > > movq PTR(VA_CONTROL_PAGE)(%rsi), %r11
> > > movq %rsp, RSP(%r11)
> > > movq %cr0, %rax
> > > movq %rax, CR0(%r11)
> > > movq %cr3, %rax
> > > movq %rax, CR3(%r11)
> > > movq %cr4, %rax
> > > movq %rax, CR4(%r11)
> > >
> > > ...
> > >
> > > /*
> > > * get physical address of control page now
> > > * this is impossible after page table switch
> > > */
> > > movq PTR(PA_CONTROL_PAGE)(%rsi), %r8
> > >
> > > /* get physical address of page table now too */
> > > movq PTR(PA_TABLE_PAGE)(%rsi), %r9
> > >
> > > /* get physical address of swap page now */
> > > movq PTR(PA_SWAP_PAGE)(%rsi), %r10
> > >
> > > /* save some information for jumping back */
> > > movq %r9, CP_PA_TABLE_PAGE(%r11)
> > > movq %r10, CP_PA_SWAP_PAGE(%r11)
> > > movq %rdi, CP_PA_BACKUP_PAGES_MAP(%r11)
> > >
> > > ...
> > >
> > > And after jumping back from the second kernel, relocate_kernel() tries to
> > > restore the saved data:
> > >
> > > ...
> > >
> > > /* get the re-entry point of the peer system */
> > > movq 0(%rsp), %rbp
> > > leaq relocate_kernel(%rip), %r8 <--------- (*)
> > > movq CP_PA_SWAP_PAGE(%r8), %r10
> > > movq CP_PA_BACKUP_PAGES_MAP(%r8), %rdi
> > > movq CP_PA_TABLE_PAGE(%r8), %rax
> > > movq %rax, %cr3
> > > lea PAGE_SIZE(%r8), %rsp
> > > call swap_pages
> > > movq $virtual_mapped, %rax
> > > pushq %rax
> > > ANNOTATE_UNRET_SAFE
> > > ret
> > > int3
> > > SYM_CODE_END(identity_mapped)
> > >
> > > Note the above code (*) uses the VA of relocate_kernel() to access the control
> > > page. IIUC, that means if we map VA of relocate_kernel() to the original PA
> > > where the code relocate_kernel() resides, then the above code will never be
> > > able to read those data back since they were saved to the control page.
> > >
> > > Did I miss anything?
> >
> > Note that relocate_kernel() usage at (*) is inside identity_mapped(). We
> > run from identity mapping there. Nothing changed to identity mapping
> > around relocate_kernel(), only top mapping (at __START_KERNEL_map) is
> > affected.
>
> Yes, but before this patch the VA of relocate_kernel() is mapped to the copied
> one, which resides in the control page:
>
> control_page = page_address(image->control_code_page) + PAGE_SIZE;
> __memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
>
> page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
> page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
>
> So the (*) can actually access to the control page IIUC.
>
> Now if we change to map VA of relocate_kernel() to the original one, then (*)
> won't be able to access the control page.

No, it will still be able to access the control page.

We call relocate_kernel() in normal kernel text (within
__START_KERNEL_map).

relocate_kernel() switches to the identity mapping; the VA is still the
same.

relocate_kernel() then jumps to identity_mapped() in the control page:

	/*
	 * get physical address of control page now
	 * this is impossible after page table switch
	 */
	movq	PTR(PA_CONTROL_PAGE)(%rsi), %r8

	...

	/* jump to identity mapped page */
	addq	$(identity_mapped - relocate_kernel), %r8
	pushq	%r8
	ANNOTATE_UNRET_SAFE
	ret

The ADDQ computes the address of identity_mapped() in the control page.

identity_mapped() finds the start of the control page from the *relative*
position of relocate_kernel() to the current RIP in the control page:

	leaq	relocate_kernel(%rip), %r8

It looks like this in my kernel binary:

	lea	-0xfa(%rip),%r8

Whatever PA is mapped at the normal kernel text VA of relocate_kernel() has
zero effect on the calculation.

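The arithmetic above can be modelled in a few lines of user-space C. This is only an illustrative sketch, not kernel code: the function names and IDENTITY_MAPPED_OFF are hypothetical, and LEA_NEXT_OFF reuses the 0xfa displacement from my build, which will differ in other builds.

```c
#include <stdint.h>

/* Hypothetical offset of identity_mapped from relocate_kernel. */
#define IDENTITY_MAPPED_OFF 0x80ull
/* Offset of the instruction following the LEA, relative to
 * relocate_kernel. 0xfa is taken from one particular build. */
#define LEA_NEXT_OFF 0xfaull

/* Target of the "ret" at the end of relocate_kernel():
 * %r8 = PA_CONTROL_PAGE + (identity_mapped - relocate_kernel). */
static uint64_t jump_target(uint64_t control_page_pa)
{
	return control_page_pa + IDENTITY_MAPPED_OFF;
}

/* Model of "leaq relocate_kernel(%rip), %r8": the RIP of the next
 * instruction plus a displacement fixed at link time. */
static uint64_t lea_relocate_kernel(uint64_t next_rip)
{
	int64_t disp = -(int64_t)LEA_NEXT_OFF;

	return next_rip + (uint64_t)disp;
}

/* Control page base as computed inside identity_mapped().
 * kernel_text_backing_pa stands for whatever PA is mapped at the
 * kernel text VA of relocate_kernel(). It is accepted only to make
 * explicit that the calculation never reads it. */
static uint64_t control_page_base(uint64_t control_page_pa,
				  uint64_t kernel_text_backing_pa)
{
	uint64_t next_rip = control_page_pa + LEA_NEXT_OFF;

	(void)kernel_text_backing_pa;
	return lea_relocate_kernel(next_rip);
}
```

Calling control_page_base() with the same control page PA and two different values for kernel_text_backing_pa (the copy in the control page before the patch, the original text after it) returns the control page PA both times, because only the current RIP and the link-time displacement enter the sum.
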
Does it make sense?
--
Kiryl Shutsemau / Kirill A. Shutemov