linux-kernel - Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHmME9rqCpEHAdJhKNACVx2a_o6LNS4Hy5ff6pX6o5oeJCRZOg@mail.gmail.com>
Date:   Wed, 8 Feb 2023 15:09:27 -0300
From:   "Jason A. Donenfeld" <Jason@...c4.com>
To:     Nathan Chancellor <nathan@...nel.org>,
        "Michael S. Tsirkin" <mst@...hat.com>
Cc:     pbonzini@...hat.com, ebiggers@...nel.org, x86@...nel.org,
        linux-kernel@...r.kernel.org, qemu-devel@...gnu.org,
        ardb@...nel.org, kraxel@...hat.com, hpa@...or.com, bp@...en8.de,
        philmd@...aro.org
Subject: Re: [PATCH qemu v3] x86: don't let decompressed kernel image clobber setup_data

On Wed, Feb 8, 2023 at 2:54 PM Jason A. Donenfeld <Jason@...c4.com> wrote:
>
> Hi Nathan (and MST),
>
> On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@...nel.org> wrote:
> >
> > Hi Jason,
> >
> > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > > The setup_data links are appended to the compressed kernel image. Since
> > > the kernel image is typically loaded at 0x100000, setup_data lives at
> > > `0x100000 + compressed_size`, which does not get relocated during the
> > > kernel's boot process.
> > >
> > > The kernel typically decompresses the image starting at address
> > > 0x1000000 (note: there's one more zero there than the compressed image
> > > above). This usually is fine for most kernels.
> > >
> > > However, if the compressed image is actually quite large, then
> > > setup_data will live at a `0x100000 + compressed_size` that extends into
> > > the decompressed zone at 0x1000000. In other words, if compressed_size
> > > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > > clobber setup_data, resulting in crashes.
> > >
> > > Visually, what happens now is that QEMU appends setup_data to the kernel
> > > image:
> > >
> > >           kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                  0x100000+l1     0x100000+l1+l2
> > >
> > > The problem is that this decompresses to 0x1000000 (one more zero). So
> > > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> > >
> > >           kernel image            setup_data
> > >    |--------------------------||----------------|
> > > 0x100000                  0x100000+l1     0x100000+l1+l2
> > >
> > >                                  d e c o m p r e s s e d   k e r n e l
> > >                      |-------------------------------------------------------------|
> > >                 0x1000000                                                     0x1000000+l3
> > >
> > > The decompressed kernel seemingly overwriting the compressed kernel
> > > image isn't a problem, because that gets relocated to a higher address
> > > early on in the boot process, at the end of startup_64. setup_data,
> > > however, stays in the same place, since those links are self referential
> > > and nothing fixes them up.  So the decompressed kernel clobbers it.
> > >
> > > Fix this by appending setup_data to the cmdline blob rather than the
> > > kernel image blob, which remains at a lower address that won't get
> > > clobbered.
> > >
> > > This could have been done by overwriting the initrd blob instead, but
> > > that poses big difficulties, such as no longer being able to use memory
> > > mapped files for initrd, hurting performance, and, more importantly, the
> > > initrd address calculation is hard coded in qboot, and it always grows
> > > down rather than up, which means lots of brittle semantics would have to
> > > be changed around, incurring more complexity. In contrast, using cmdline
> > > is simple and doesn't interfere with anything.
> > >
> > > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > > after the fact. So this hack is updated to account for this appending,
> > > by reserving some bytes.
> > >
> > > Cc: x86@...nel.org
> > > Cc: Philippe Mathieu-Daudé <philmd@...aro.org>
> > > Cc: H. Peter Anvin <hpa@...or.com>
> > > Cc: Borislav Petkov <bp@...en8.de>
> > > Cc: Eric Biggers <ebiggers@...nel.org>
> > > Signed-off-by: Jason A. Donenfeld <Jason@...c4.com>
> >
> > I apologize if this has already been reported/fixed already (I did a
> > brief search on lore.kernel.org) or if my terminology is not as precise
> > as it could be, I am a little out of my element here :)
> >
> > After this change as commit eac7a7791b ("x86: don't let decompressed
> > kernel image clobber setup_data") in QEMU master, I am no longer able to
> > boot a kernel directly through OVMF using '-append' + '-initrd' +
> > '-kernel'. Instead, systemd-boot tries to start the distribution's
> > kernel, which fails with:
> >
> >   Error registering initrd: Already started
> >
> > This can be reproduced with just a defconfig Linux kernel (I used
> > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> > boot testing [1], and OVMF. Prior to this change, the kernel would start
> > up but after, I am dumped into the UEFI shell (as there is no
> > bootloader).
> >
> > The QEMU command I used was:
> >
> > $ qemu-system-x86_64 \
> >     -kernel arch/x86_64/boot/bzImage \
> >     -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> >     -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> >     -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
>
> Oh no... Without jumping into it, at first glance, I have absolutely
> no idea. I suppose I could start debugging it and probably come up
> with a solution, but...
>
> @mst - I'm beginning to think that this whole setup_data route is
> cursed. This is accumulating hacks within hacks within hacks. What
> would you think if I just send a patch *removing* all use of
> setup_data (the rng seed and the dtb thing), and then we can gradually
> add that back with an actual overarching design. For example, it'd
> probably make sense to have a separate fwcfg file for setup_data
> rather than trying to mangle and existing one, etc. This way, we
> unbreak the tree, and let the new approach be reviewed more
> reasonably.

Sent as https://lore.kernel.org/qemu-devel/20230208180835.234638-1-Jason@zx2c4.com/