[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20210524083239.GA2872@MiWiFi-R3L-srv>
Date: Mon, 24 May 2021 16:32:39 +0800
From: Baoquan He <bhe@...hat.com>
To: lijiang <lijiang@...hat.com>, andy@...radead.org
Cc: linux-kernel@...r.kernel.org, linux-efi@...r.kernel.org,
platform-driver-x86@...r.kernel.org, x86@...nel.org,
ardb@...nel.org, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dvhart@...radead.org, kexec@...ts.infradead.org,
hpa@...or.com, Dave Young <dyoung@...hat.com>
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the
crashkernel option is specified
Hi Lianbo,
On 05/24/21 at 11:00am, lijiang wrote:
> Also add mail lists and more people in the cc list.
>
> Thanks.
> Lianbo
>
>
> On Fri, May 21, 2021 at 8:36 PM lijiang <lijiang@...hat.com> wrote:
>
> > Hi, Baoquan, Andy
> > Sorry for the late reply.
> >
> > On Tue, Apr 13, 2021 at 5:45 PM Baoquan He <bhe@...hat.com> wrote:
> >
> >> On 04/12/21 at 08:24am, Andy Lutomirski wrote:
> >> > On Mon, Apr 12, 2021 at 2:52 AM Baoquan He <bhe@...hat.com> wrote:
> >> > >
> >> > > On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> >> > > >
> >> > > >
> >> > > > > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@...hat.com> wrote:
> >> > > > >
> >> > > > > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> >> > > > >> Why don't we do this unconditionally? At the very best we gain
> >> half a megabyte of memory (except the trampoline, which has to live there,
> >> but it is only a few kilobytes.)
> >> > > > >
> >> > > > > This is a great suggestion, thanks. I think we can fix it in this
> >> way to
> >> > > > > make code simpler. Then the specific caring of real mode in
> >> > > > > efi_free_boot_services() can be removed too.
> >> > > > >
> >> > > >
> >> > > > This whole situation makes me think that the code is buggy before
> >> and buggy after.
> >> > > >
> >> > > > The issue here (I think) is that various pieces of code want to
> >> reserve specific pieces of otherwise-available low memory for their own
> >> nefarious uses. I don’t know *why* crash kernel needs this, but that
> >> doesn’t matter too much.
> >> > >
> >> > > Kdump kernel also need go through real mode code path during bootup.
> >> It
> >> > > is not different than normal kernel except that it skips the firmware
> >> > > resetting. So kdump kernel needs low 1M as system RAM just as normal
> >> > > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> >> > > to avoid any later kernel or driver data reside in this area.
> >> Otherwise,
> >> > > we need dump the content of this area to vmcore. As we know, when
> >> crash
> >> > > happened, the old memory of 1st kernel should be untouched until
> >> vmcore
> >> > > dumping read out its content. Meanwhile, kdump kernel need reuse low
> >> 1M.
> >> > > In the past, we used a back up region to copy out the low 1M area, and
> >> > > map the back up region into the low 1M area in vmcore elf file. In
> >> > > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the
> >> crashkernel
> >> > > option is specified"), we changed to lock the whole low 1M to avoid
> >> > > writting any kernel data into, like this we can skip this area when
> >> > > dumping vmcore.
> >> > >
> >> > > Above is why we try to memblock reserve the whole low 1M. We don't
> >> want
> >> > > to use it, just don't want anyone to use it in 1st kernel.
> >> > >
> >> > > >
> >> > > > I propose that the right solution is to give low-memory-reserving
> >> code paths two chances to do what they need: once at the very beginning and
> >> once after EFI boot services are freed.
> >> > > >
> >> > > > Alternatively, just reserve *all* otherwise unused sub 1M memory up
> >> front, then release it right after releasing boot services, and then invoke
> >> the special cases exactly once.
> >> > >
> >> > > I am not sure if I got both suggested ways clearly. They look a little
> >> > > complicated in our case. As I explained at above, we want the whole
> >> low
> >> > > 1M locked up, not one piece or some pieces of it.
> >> >
> >> > My second suggestion is probably the better one. Here it is,
> >> concretely:
> >> >
> >> > The early (pre-free_efi_boot_services) code just reserves all
> >> > available sub-1M memory unconditionally, but it specially marks it as
> >> > reserved-but-available-later. We stop allocating the trampoline page
> >> > at this stage.
> >> >
> >> > In free_efi_boot_services, instead of *freeing* the sub-1M memory, we
> >> > stick it in the pile of reserved memory created in the early step.
> >> > This may involve splitting a block, kind of like the current
> >> > trampoline late allocation works.
> >> >
> >> > Then, *after* free_efi_boot_services(), we run a single block of code
> >> > that lets everything that wants sub-1M code claim some. This means
> >> > that the trampoline gets allocated and, if crashkernel wants to claim
> >> > everything else, it can. After that, everything still unclaimed gets
> >> > freed.
> >>
> >> void __init setup_arch(char **cmdline_p)
> >> {
> >> ...
> >> efi_reserve_boot_services();
> >> e820__memblock_alloc_reserved_mpc_new();
> >> #ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
> >> setup_bios_corruption_check();
> >> #endif
> >> reserve_real_mode();
> >>
> >>
> >> trim_platform_memory_ranges();
> >> trim_low_memory_range();
> >> ...
> >> }
> >>
> >> After efi_reserve_boot_services(), there are several function calling to
> >> require memory reservation under low 1M.
> >>
> >>
> >> asmlinkage __visible void __init __no_sanitize_address
> >> start_kernel(void)
> >>
> >> {
> >> ...
> >> setup_arch(&command_line);
> >> ...
> >> mm_init();
> >> --> mem_init();
> >> -->memblock_free_all();
> >>
> >> ...
> >> #ifdef CONFIG_X86
> >> if (efi_enabled(EFI_RUNTIME_SERVICES))
> >> efi_enter_virtual_mode();
> >> -->efi_free_boot_services();
> >> -->memblock_free_late();
> >> #endif
> >> ...
> >> }
> >>
> >> So from the code flow, we can see that buddy allocator is built in
> >> mm_init() which puts all memory from memblock.memory excluding
> >> memblock.reserved into buddy. And much later, we call
> >> efi_free_boot_services() to release those reserved efi boot memory into
> >> buddy too.
> >>
> >> Are you suggesting we should do the memory reservation from low 1M
> >> after efi_free_boot_services()? To require memory pages from buddy for
> >> them? Please help point out my misunderstanding if have any.
> >>
> >> With my understanding, in non-efi case, we have done the memory
> >> reservation with memblock_reserve(), e.g
> >> e820__memblock_alloc_reserved_mpc_new, reserve_real_mode() are calling
> >> to do. Just efi_reserve|free_boot_services() break them when efi is
> >>
> >
> > Yes. But Andy also suggested to reserve all available sub-1M memory
> > **unconditionally**
> > in the early code.
> >
> > enabled. We can do them again in efi_free_boot_services() just like the
> >> real_mode reservation does.
> >>
> >
> > Do you mean to call the memblock_reserve() in
> > the efi_free_boot_services()? Or anything else?
> > Would you mind sharing more details about this?
Hmm, maybe no. There's no chance to call memblock_reserve() after
memblock_free_late().
I suggested making change in efi_free_boot_services() because the
similar problem has been encountered by Andy himself and fixed in
efi_free_boot_services(). Please check below commit for reference.
The patch described the phenemenon, while explained why in code
comment.
commit 5bc653b7318217c54244a14f248f1f07abe0a865
Author: Andy Lutomirski <luto@...nel.org>
Date: Wed Aug 10 02:29:17 2016 -0700
x86/efi: Allocate a trampoline if needed in efi_free_boot_services()
So we could handle it in the same place by extending the area to the
whole low-1M unconditionally. This is 1st way of three I can think of.
The other two are:
2) Move efi_reserve_boot_services() down to be above init_mem_mapping().
until init_mem_mapping(), we stop reserving memory from low-1M with
memblock_reserve(). The change is like below:
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 72920af0b3c0..93b2106d2050 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1067,7 +1067,6 @@ void __init setup_arch(char **cmdline_p)
* The EFI specification says that boot service code won't be
* called after ExitBootServices(). This is, in fact, a lie.
*/
- efi_reserve_boot_services();
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
@@ -1090,6 +1089,8 @@ void __init setup_arch(char **cmdline_p)
*/
trim_snb_memory();
+ efi_reserve_boot_services();
+
init_mem_mapping();
idt_setup_early_pf();
3) Do it in efi_reserve_boot_services() as you are doing in this patch.
I prefer the 1st and 2nd way.
And by the way, the original code of commit 5bc653b731821("x86/efi: Allocate
a trampoline if needed in efi_free_boot_services()") could be buggy. I
remember you ever pasted the boot log of system where you reproduced
issue and tested the tested patch, there are three separate pages from
low-1M reserved by boot services. If any of them falls into real mode
area, it will break the fix of commit 5bc653b731821.
Thanks
Baoquan
> >
> > diff --git a/arch/x86/platform/efi/quirks.c
> > b/arch/x86/platform/efi/quirks.c
> > index 7850111008a8..d02f12a60457 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -453,6 +453,8 @@ void __init efi_free_boot_services(void)
> > memblock_free_late(start, size);
> > }
> >
> > + memblock_reserve(0, 1<<20);
> > +
> > if (!num_entries)
> > return;
> >
> > Thanks.
> > Lianbo
> >
> >
> > Thanks
> >> Baoquan
> >>
> >
Powered by blists - more mailing lists