[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160721081438.GA26531@gmail.com>
Date: Thu, 21 Jul 2016 10:14:38 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Andy Lutomirski <luto@...nel.org>
Cc: "H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
Mario Limonciello <mario_limonciello@...l.com>,
Matthew Garrett <mjg59@...f.ucam.org>,
linux-kernel@...r.kernel.org, Kees Cook <keescook@...omium.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation
code
* Andy Lutomirski <luto@...nel.org> wrote:
> Under some conditions, my Dell XPS 13 9350 puts the EBDA at 0x2c000
> but reports the lowmem cutoff as 0. The old code reserves
> everything above 0x2c000 and I can't boot [1].
> [1] This only breaks boot in practice when some other firmware or
> GRUB oddity that I don't fully understand kicks in causing the
> memory below 0x2c000 to be unusable.
Exactly why can't Linux boot if *more* memory is reserved? Is it perhaps the SMP
trampoline that cannot be allocated? Or does this shift some firmware memory
corruption pattern that makes it more lethal than it already is?
Without more information we don't really know the root cause here and I'm hesitant
to touch such really old quirks without first understanding what's going on on
your machine.
Is the boot failure deterministic - if yes, could you try to dig a bit more into
this?
My guess it's the SMP trampoline, and I think we should robustify that in a
different way: lets put it aside very early as a reservation (possibly in this
very function), to guarantee that we have a below 1MB buffer for the SMP
trampoline. This would be a lot more robust ...
> --- a/arch/x86/kernel/ebda.c
> +++ b/arch/x86/kernel/ebda.c
> @@ -62,10 +62,12 @@ void __init reserve_ebda_region(void)
> if (lowmem < INSANE_CUTOFF)
> lowmem = LOWMEM_CAP;
>
> - /* Use the lower of the lowmem and EBDA markers as the cutoff */
> - lowmem = min(lowmem, ebda_addr);
> lowmem = min(lowmem, LOWMEM_CAP); /* Absolute cap */
>
> /* reserve all memory between lowmem and the 1MB mark */
> memblock_reserve(lowmem, 0x100000 - lowmem);
> +
> + /* if the EBDA is in lowmem, reserve it separately. */
> + if (ebda_addr < lowmem)
> + memblock_reserve(ebda_addr, 4096);
Ugh, this code is a mess, and your fix does not help make it better:
- Firstly, could we please organize this code so that we first calculate ebda_addr
and then lowmem - not this interleaved mess of first lowmem, then ebda_addr,
then lowmem tweaks...
- Also, could we please rename things to make it more obvious what's going on, and
then apply changes on top? Here's a list of problems:
- 'lowmem' here means 'super low mem' - i.e. 16-bit addressable memory. In other
parts of the x86 code 'lowmem' means 32-bit addressable memory... This makes it
super confusing to read.
- It does not help at all that we have various memory range markers, half of which
are 'start of range', half of which are 'end of range' - but this crucial
property is not obvious in the naming at all ... gave me a headache trying to
understand all this.
- Also, the 'ebda_addr' name sucks: it highlights that it's an address (which is
obvious, all values here are addresses!), while it does not highlight that it's
the start of the EBDA region ...
- 'BIOS_LOWMEM_KILOBYTES' says a lot of things, except that this is the only value
that is a pointer to a value, not a memory range address.
- The function name itself is a misnomer: it says 'reserve_ebda_region()' while
its main purpose is to reserve all the firmware ROM typically between 640K and
1MB, while the 'EBDA' part is only a small part of that ... It should be renamed
to something like reserve_bios_regions(). (The plural is intentional, to signal
that this is potentially about multiple memory regions.)
- Likewise, the paravirt quirk flag name 'ebda_search' is misleading as well: this
too should be about whether to reserve firmware areas in the paravirt case.
Something like ::reserve_bios_regions would work for me.
- In fact thinking about this as 'end of RAM' is confusing: what this function
*really* wants to reserve is firmware data and code areas! Once the thinking is
inverted from a mixed 'ram' and 'reserved firmware area' notion to a pure
'reserved area' notion everything becomes a lot clearer.
So something like this, names ordered in (probable) address order:
BIOS_RAM_SIZE_KB_PTR // was: BIOS_LOWMEM_KILOBYTES
BIOS_RAM_END_MIN // was: INSANE_CUTOFF
ebda_start // was: ebda_addr
bios_ram_end // was: lowmem
BIOS_RAM_END_MAX // was: LOWMEM_CAP
would work for me. Also the comments could need some fixing, refreshing and
general harmonization.
Blah - the patch below does this all (untested). Just look how much more readable
it all became!
In fact I believe we should try another patch on top of this and massively
simplify the whole EBDA mess:
- Find a suitable address for the SMP trampoline and remember it for the
SMP boot code.
- Reserve *ALL* of the first 2MB of RAM.
- (are there any other pieces of kernel code that require memory below 1MB/2MB?)
The thing is, x86 CPUs are already treating the first 1MB of RAM in a special way:
for example 2MB TLBs get split up into 4K TLBs regardless of what the pagetable or
the MTRRs say. So it's a substandard piece of RAM already that we want to skip for
performance reasons.
Furthermore we had various problems in the past where firmware corrupted the first
1MB of RAM.
So could we please just get rid of this class of problems and reserve it all?
The only thing we need from there is a suitable place for the SMP trampoline. We
should try to allocate that somewhere in the middle of the (suspected...)
available RAM range based on BIOS_RAM_SIZE_KB_PTR (no EBDA quirking at this
point), where the probability of firmware interference is the lowest.
If the EBDA pointer happens to point to the exact same location we should bump the
trampoline address down so that it's below the EBDA. (and should generate a
low-key KERN_INFO printk that we've done so.)
This strategy should solve your boot problem too I believe.
Thanks,
Ingo
Powered by blists - more mailing lists