lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210412095231.GC4282@MiWiFi-R3L-srv>
Date:   Mon, 12 Apr 2021 17:52:31 +0800
From:   Baoquan He <bhe@...hat.com>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     "H. Peter Anvin" <hpa@...or.com>,
        Lianbo Jiang <lijiang@...hat.com>,
        linux-kernel@...r.kernel.org, linux-efi@...r.kernel.org,
        platform-driver-x86@...r.kernel.org, x86@...nel.org,
        ardb@...nel.org, tglx@...utronix.de, mingo@...hat.com,
        bp@...en8.de, dvhart@...radead.org, andy@...radead.org,
        kexec@...ts.infradead.org, dyoung@...hat.com
Subject: Re: [PATCH] x86/efi: Do not release sub-1MB memory regions when the
 crashkernel option is specified

On 04/11/21 at 06:49pm, Andy Lutomirski wrote:
> 
> 
> > On Apr 11, 2021, at 6:14 PM, Baoquan He <bhe@...hat.com> wrote:
> > 
> > On 04/09/21 at 07:59pm, H. Peter Anvin wrote:
> >> Why don't we do this unconditionally? At the very best we gain half a megabyte of memory (except the trampoline, which has to live there, but it is only a few kilobytes.)
> > 
> > This is a great suggestion, thanks. I think we can fix it in this way to
> > make code simpler. Then the specific caring of real mode in
> > efi_free_boot_services() can be removed too.
> > 
> 
> This whole situation makes me think that the code is buggy before and buggy after.
> 
> The issue here (I think) is that various pieces of code want to reserve specific pieces of otherwise-available low memory for their own nefarious uses. I don’t know *why* crash kernel needs this, but that doesn’t matter too much.

Kdump kernel also need go through real mode code path during bootup. It
is not different than normal kernel except that it skips the firmware
resetting. So kdump kernel needs low 1M as system RAM just as normal
kernel does. Here we reserve the whole low 1M with memblock_reserve()
to avoid any later kernel or driver data reside in this area. Otherwise,
we need dump the content of this area to vmcore. As we know, when crash
happened, the old memory of 1st kernel should be untouched until vmcore
dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
In the past, we used a back up region to copy out the low 1M area, and
map the back up region into the low 1M area in vmcore elf file. In
6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
option is specified"), we changed to lock the whole low 1M to avoid
writting any kernel data into, like this we can skip this area when
dumping vmcore.

Above is why we try to memblock reserve the whole low 1M. We don't want
to use it, just don't want anyone to use it in 1st kernel.

> 
> I propose that the right solution is to give low-memory-reserving code paths two chances to do what they need: once at the very beginning and once after EFI boot services are freed.
> 
> Alternatively, just reserve *all* otherwise unused sub 1M memory up front, then release it right after releasing boot services, and then invoke the special cases exactly once.

I am not sure if I got both suggested ways clearly. They look a little
complicated in our case. As I explained at above, we want the whole low
1M locked up, not one piece or some pieces of it.

> 
> In either case, the result is that the crashkernel mess gets unified with the trampoline mess.  One way the result is called twice and needs to be more careful, and the other way it’s called only once.
> 
> Just skipping freeing boot services seems wrong.  It doesn’t unmap boot services, and skipping that is incorrect, I think. And it seems to result in a bogus memory map in which the system thinks that some crashkernel memory is EFI memory instead.

I like hpa's thought to lock the whole low 1M unconditionally since only
a few KB except of trampoline area is there. Rethinking about it, doing
it in can_free_region() may be risky because efi memory region could
cross the 1M boundary, e.g [640K, 100M] with type of
EFI_BOOT_SERVICES_CODE|EFI_BOOT_SERVICES_DATA, it could cause loss of memory.
Just a wild guess, not very sure if the 1M boundary corssing can really
happen. efi_reserve_boot_services() won't split regions.

If moving efi_reserve_boot_services() after reserve_real_mode() is not
accepted, maybe we can call efi_mem_reserve(0, 1M) just as
efi_esrt_init() has done.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ