linux-kernel - Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

Message-ID: <049b2e0f-00b2-4704-8868-1569a006a134@amazon.com>
Date: Mon, 17 Jun 2024 23:01:12 +0200
From: Alexander Graf <graf@...zon.com>
To: Steven Rostedt <rostedt@...dmis.org>
CC: <linux-kernel@...r.kernel.org>, <linux-trace-kernel@...r.kernel.org>,
	Masami Hiramatsu <mhiramat@...nel.org>, Mark Rutland <mark.rutland@....com>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Andrew Morton
	<akpm@...ux-foundation.org>, Vincent Donnefort <vdonnefort@...gle.com>, "Joel
 Fernandes" <joel@...lfernandes.org>, Daniel Bristot de Oliveira
	<bristot@...hat.com>, Ingo Molnar <mingo@...nel.org>, Peter Zijlstra
	<peterz@...radead.org>, <suleiman@...gle.com>, Thomas Gleixner
	<tglx@...utronix.de>, Vineeth Pillai <vineeth@...byteword.org>, Youssef Esmat
	<youssefesmat@...gle.com>, Beau Belgrave <beaub@...ux.microsoft.com>,
	"Baoquan He" <bhe@...hat.com>, Borislav Petkov <bp@...en8.de>, "Paul E.
 McKenney" <paulmck@...nel.org>, David Howells <dhowells@...hat.com>, Mike
 Rapoport <rppt@...nel.org>, Ard Biesheuvel <ardb@...nel.org>
Subject: Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named
 memory at boot up

[resend because Thunderbird decided to send the previous version as HTML :(]


On 17.06.24 22:40, Steven Rostedt wrote:
> On Mon, 17 Jun 2024 09:07:29 +0200
> Alexander Graf<graf@...zon.com>  wrote:
>
>> Hey Steve,
>>
>>
>> I believe we're talking about 2 different things :). Let me rephrase a
>> bit and make a concrete example.
>>
>> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
>> line option. The kernel now comes up and allocates a random chunk of
>> memory that - by (admittedly good) chance - may be at the same physical
>> location as before. Let's assume it deemed 0x1000000 as a good offset.
> Note, it's not random. It picks from the top of available memory every
> time. But things can mess with it (see below).
>
>> Let's now assume you're running on a UEFI system. There, you always have
>> non-volatile storage available to you even in the pre-boot phase. That
>> means the kernel could create a UEFI variable that says "12M:4096:trace
>> -> 0x1000000". The pre-boot phase takes all these UEFI variables and
>> marks them as reserved. When you finally reach your command line parsing
>> logic for reserve_mem=, you can flip all reservations that were not on
>> the command line back to normal memory.
>>
>> That way you have pretty much guaranteed persistent memory regions, even
>> with KASLR changing your memory layout across boots.
>>
>> The nice thing is that the above is an extension of what you've already
>> built: Systems with UEFI simply get better guarantees that their regions
>> persist.
> This could be an added feature, but it is very architecture specific,
> and would likely need architecture specific updates.


It definitely would be an added feature, yes. But one that allows you to 
ensure persistence a lot more safely :).
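
To make the "12M:4096:trace -> 0x1000000" idea above concrete, here is a minimal userspace sketch of the bookkeeping it implies. It is not kernel or UEFI code: struct resv_record, parse_reserve_mem() and record_matches() are hypothetical names, and the size:align:name layout is taken only from the command line example quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record pairing a reserve_mem= request with its address. */
struct resv_record {
	uint64_t size;
	uint64_t align;
	char name[16];
	uint64_t phys;	/* filled in once an address has been chosen */
};

/* Parse a number with an optional K/M/G suffix and advance *s past it. */
static uint64_t parse_size(const char **s)
{
	char *end;
	uint64_t v = strtoull(*s, &end, 0);

	switch (*end) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; end++; break;
	default: break;
	}
	*s = end;
	return v;
}

/* Parse "size:align:name". Returns 0 on success, -1 on malformed input. */
int parse_reserve_mem(const char *spec, struct resv_record *r)
{
	const char *p = spec;

	assert(spec != NULL && r != NULL);
	r->size = parse_size(&p);
	if (r->size == 0 || *p++ != ':')
		return -1;
	r->align = parse_size(&p);
	if (*p++ != ':')
		return -1;
	if (*p == '\0' || strlen(p) >= sizeof(r->name))
		return -1;
	strcpy(r->name, p);
	r->phys = 0;
	return 0;
}

/* A saved record is only reusable if the request itself is unchanged. */
int record_matches(const struct resv_record *saved,
		   const struct resv_record *req)
{
	return saved->size == req->size && saved->align == req->align &&
	       strcmp(saved->name, req->name) == 0;
}
```

In this sketch a saved record whose request no longer appears on the command line would simply fail record_matches(), which corresponds to flipping that reservation back to normal memory.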


>>>>> Requirement:
>>>>>
>>>>> Need a way to reserve memory that will be at a consistent location for
>>>>> every boot, if the kernel and system are the same. Does not need to work
>>>>> if rebooting to a different kernel, or if the system can change the
>>>>> memory layout between boots.
>>>>>
>>>>> The reserved memory cannot be a hard-coded address, as the same kernel /
>>>>> command line needs to run on several different machines. The picked memory
>>>>> reservation just needs to be the same for a given machine, but may be
>>>> With KASLR enabled, doesn't this approach break too often to be
>>>> reliable enough for the data you want to extract?
>>>>
>>>> Picking up the idea above, with a persistent variable we could even make
>>>> KASLR avoid that reserved pstore region in its search for a viable KASLR
>>>> offset.
>>> I think I was hit by it once in all my testing. For our use case, the
>>> few times it fails to map are not going to affect what we need this for
>>> at all.
>> Once is pretty good. Do you know why? Also once out of how many runs? Is
>> the randomness source not as random as it should be, or is the number of
>> bits for KASLR maybe so small on your target architecture that the odds of
>> hitting anything become low? Do these same constraints hold true outside
>> of your testing environment?
> So I just ran it a hundred times in a loop. I added a patch to print
> the location of "_text". The loop was this:
>
>    for i in `seq 100`; do
>          ssh root@...iantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
>          sleep 25
>    done
>
> It searches dmesg for my added printk as well as the print of where the
> ring buffer was loaded in physical memory.
>
> It takes about 15 seconds to reboot, so I waited 25. The results are
> attached. I found that it was consistent 76 times, which means 1 out of
> 4 it's not. Funny enough, it broke whenever it loaded the kernel below
> 0x100000000. And then it would be off by a little.
>
> It was consistently at:
>
>    0x27d000000
>
> And when it failed, it was at 0x27ce00000.
>
> Note, when I used the e820 tables to do this, I never saw a failure. My
> assumption is that when it is below 0x100000000, something else gets
> allocated causing this to get pushed down.


Thinking about it again: What if you run the allocation super early (see
arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to
allocating only from the top, your allocations are effectively independent
of the kernel version, because none of the kernel code has run yet. They
are also definitely independent of KASLR, because you're running
deterministically before the KASLR offset even gets chosen.
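
As a toy illustration of why "allocate only from the top, before anything else" is deterministic, and why an earlier allocation shifts the result, consider the sketch below. It is plain userspace C, not memblock code; struct range and alloc_top_down() are hypothetical names. The values suggested afterwards (a free range ending at 0x27dc00000 and a 2 MiB earlier allocation) are assumptions picked only because they reproduce the two addresses reported above; they are not a diagnosis of what actually happened on that machine.

```c
#include <assert.h>
#include <stdint.h>

/* One free physical range, [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Carve size bytes off the top of the free range, aligned down to align
 * (a power of two). Returns the base address, or 0 if it does not fit.
 * On success the range shrinks so that later calls land below this one.
 */
uint64_t alloc_top_down(struct range *free, uint64_t size, uint64_t align)
{
	uint64_t base;

	assert(align != 0 && (align & (align - 1)) == 0);
	if (size == 0 || free->end < free->start ||
	    free->end - free->start < size)
		return 0;

	base = (free->end - size) & ~(align - 1);
	if (base < free->start)
		return 0;

	free->end = base;
	return base;
}
```

With a free range of [0x100000000, 0x27dc00000), a 12M request aligned to 4096 lands at 0x27d000000 if it goes first. If a 2 MiB allocation is carved off the top beforehand, the same request lands at 0x27ce00000 instead. The result depends only on the memory map and on what was allocated earlier, which is the argument for doing the reservation before any other code has had a chance to allocate.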

> As this code relies on memblock_phys_alloc() being consistent, if
> something gets allocated before it differently depending on where the
> kernel is, it can also move the location. A plugin to UEFI would mean
> that it would need to reserve the memory, and the code here will need
> to know where it is. We could always make the function reserve_mem()
> global and weak so that architectures can override it.


Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new
type of memblock with the respective reservations. Later, you call
memblock_find_in_range_node() instead of memblock_phys_alloc() and pass
in flags saying that you want to allocate only from the new
MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots
through the handle_mem_options() path above. In fact, if the BIOS way
works fine, we don't even need UEFI variables: Just as allocations
will be identical during BIOS execution, they should stay identical
across UEFI launches.
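
The flag-restricted lookup described above can be modelled roughly as follows. This is a userspace toy and deliberately does not reproduce the real memblock data structures or the signature of memblock_find_in_range_node(); struct toy_region, toy_find() and TOY_RESERVE_MEM are hypothetical stand-ins for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a region flag such as MEMBLOCK_RESERVE_MEM. */
enum toy_flags {
	TOY_NONE	= 0,
	TOY_RESERVE_MEM	= 1 << 0,
};

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Return the highest aligned base where size bytes fit inside a region
 * that carries every flag bit in want, or 0 if no region qualifies.
 */
uint64_t toy_find(const struct toy_region *r, size_t n, uint64_t size,
		  uint64_t align, unsigned int want)
{
	uint64_t best = 0;
	size_t i;

	assert(align != 0 && (align & (align - 1)) == 0);
	for (i = 0; i < n; i++) {
		uint64_t cand;

		if ((r[i].flags & want) != want || r[i].size < size)
			continue;
		cand = (r[i].base + r[i].size - size) & ~(align - 1);
		if (cand >= r[i].base && cand > best)
			best = cand;
	}
	return best;
}
```

If an early boot stage has marked a 12M region at 0x1000000 with the special flag, a lookup restricted to that flag returns 0x1000000 no matter what the rest of the memory map looks like, whereas an unrestricted lookup picks whatever currently sits at the top.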

As a cherry on top, kexec also works seamlessly with the special memblock
approach because kexec (at least on x86) hands memblocks as-is to the
next kernel. So the new kernel will also automatically use the same
ranges for its allocations.


Alex



