linux-kernel - Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6C1D0560-6D77-4733-9B8D-5184935AEC62@zytor.com>
Date:   Fri, 30 Dec 2022 11:13:32 -0800
From:   "H. Peter Anvin" <hpa@...or.com>
To:     "Jason A. Donenfeld" <Jason@...c4.com>
CC:     pbonzini@...hat.com, ebiggers@...nel.org, x86@...nel.org,
        linux-kernel@...r.kernel.org, qemu-devel@...gnu.org,
        ardb@...nel.org, kraxel@...hat.com, bp@...en8.de, philmd@...aro.org
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On December 30, 2022 7:59:30 AM PST, "Jason A. Donenfeld" <Jason@...c4.com> wrote:
>Hi,
>
>On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
>> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@...c4.com> wrote:
>> >Hi,
>> >
>> >Read this message in a fixed width text editor with a lot of columns.
>> >
>> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> >> Glad you asked.
>> >> 
>> >> So the kernel load addresses are parameterized in the kernel image
>> >> setup header. One of the things that are so parameterized are the size
>> >> and possible realignment of the kernel image in memory.
>> >> 
>> >> I'm very confused where you are getting the 64 MB number from. There
>> >> should not be any such limitation.
>> >
>> >Currently, QEMU appends it to the kernel image, not to the initramfs as
>> >you suggest below. So, that winds up looking, currently, like:
>> >
>> >          kernel image            setup_data
>> >   |--------------------------||----------------|
>> >0x100000                  0x100000+l1     0x100000+l1+l2
>> >
>> >The problem is that this decompresses to 0x1000000 (one more zero). So
>> >if l1 is > (0x1000000-0x100000), then this winds up looking like:
>> >
>> >          kernel image            setup_data
>> >   |--------------------------||----------------|
>> >0x100000                  0x100000+l1     0x100000+l1+l2
>> >
>> >                                 d e c o m p r e s s e d   k e r n e l
>> >		     |-------------------------------------------------------------|
>> >                0x1000000                                                     0x1000000+l3 
>> >
>> >The decompressed kernel seemingly overwriting the compressed kernel
>> >image isn't a problem, because that gets relocated to a higher address
>> >early on in the boot process. setup_data, however, stays in the same
>> >place, since those links are self referential and nothing fixes them up.
>> >So the decompressed kernel clobbers it.
>> >
>> >The solution in this commit adds a bunch of padding between the kernel
>> >image and setup_data to avoid this. That looks like this:
>> >
>> >          kernel image                            padding                               setup_data
>> >   |--------------------------||---------------------------------------------------||----------------|
>> >0x100000                  0x100000+l1                                         0x1000000+l3      0x1000000+l3+l2
>> >
>> >                                 d e c o m p r e s s e d   k e r n e l
>> >		     |-------------------------------------------------------------|
>> >                0x1000000                                                     0x1000000+l3 
>> >
>> >This way, the decompressed kernel doesn't clobber setup_data.
>> >
>> >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>> >then the bootloader crashes when trying to dereference setup_data's
>> >->len param at the end of initialize_identity_maps() in ident_map_64.c.
>> >I don't know why it does this. If I could remove the 62 megabyte
>> >restriction, then I could keep with this technique and all would be
>> >well.
>> >
>> >> In general, setup_data should be able to go anywhere the initrd can
>> >> go, and so is subject to the same address cap (896 MB for old kernels,
>> >> 4 GB on newer ones; this address too is enumerated in the header.)
>> >
>> >It would be theoretically possible to attach it to the initrd image
>> >instead of to the kernel image. As a last resort, I guess I can look
>> >into doing that. However, that's going to require some serious rework
>> >and plumbing of a lot of different components. So if I can make it work
>> >as is, that'd be ideal. However, I need to figure out this weird 62 meg
>> >limitation.
>> >
>> >Any ideas on that?
>> >
>> >Jason
>> 
>> As far as a crash... that sounds like a big and a pretty serious one at that.
>> 
>> Could you let me know what kernel you are using and how *exactly* you are booting it?
>
>I'll attach a .config file. Apply the patch at the top of this thread to
>qemu, except make one modification:
>
>diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>index 628fd2b2e9..a61ee23e13 100644
>--- a/hw/i386/x86.c
>+++ b/hw/i386/x86.c
>@@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms,
> 
>             /* The early stage can't address past around 64 MB from the original
>              * mapping, so just give up in that case. */
>-            if (padded_size < 62 * 1024 * 1024)
>+            if (true || padded_size < 62 * 1024 * 1024)
>                 kernel_size = padded_size;
>             else {
>                 fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
>
>Then build qemu. Run it with `-kernel bzImage`, based on the kernel
>built with the .config I attached.
>
>You'll see that the CPU triple faults when hitting this line:
>
>        sd = (struct setup_data *)boot_params->hdr.setup_data;
>        while (sd) {
>                unsigned long sd_addr = (unsigned long)sd;
>
>                kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len);  <----
>                sd = (struct setup_data *)sd->next;
>        }
>
>, because it dereferences *sd. This does not happen if the decompressed
>size of the kernel is < 62 megs.
>
>So that's the "big and pretty serious" bug that might be worthy of
>investigation.
>
>Jason

No kidding. Dereferencing data *before you map it* is generally frowned upon.

This needs to be split into to making calls.

*Facepalm*