linux-kernel - Re: [Xen-devel] HVMLite / PVHv2

Message-ID: <20160414205619.GR1990@wotan.suse.de>
Date:	Thu, 14 Apr 2016 22:56:19 +0200
From:	"Luis R. Rodriguez" <mcgrof@...nel.org>
To:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc:	"Luis R. Rodriguez" <mcgrof@...nel.org>,
	Juergen Gross <jgross@...e.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Michael Chang <MChang@...e.com>, linux-kernel@...r.kernel.org,
	Jim Fehlig <jfehlig@...e.com>, Jan Beulich <JBeulich@...e.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Daniel Kiper <daniel.kiper@...cle.com>, x86@...nel.org,
	Vojtěch Pavlík <vojtech@...e.cz>,
	Gary Lin <GLin@...e.com>, xen-devel@...ts.xenproject.org,
	Jeffrey Cheung <JCheung@...e.com>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>,
	joeyli <jlee@...e.com>, Borislav Petkov <bp@...en8.de>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Charles Arndol <carnold@...e.com>,
	Andrew Cooper <andrew.cooper3@...rix.com>,
	Julien Grall <julien.grall@....com>,
	Andy Lutomirski <luto@...capital.net>,
	David Vrabel <david.vrabel@...rix.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Roger Pau Monné <roger.pau@...rix.com>,
	Josh Triplett <josh@...htriplett.org>,
	Kees Cook <keescook@...omium.org>,
	Vitaly Kuznetsov <vkuznets@...hat.com>
Subject: Re: [Xen-devel] HVMLite / PVHv2 - using x86 EFI boot entry

On Thu, Apr 14, 2016 at 03:56:53PM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 14, 2016 at 08:40:48PM +0200, Luis R. Rodriguez wrote:
> > On Wed, Apr 13, 2016 at 09:01:32PM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Apr 14, 2016 at 12:23:17AM +0200, Luis R. Rodriguez wrote:
> > > > VGA code will be dead code for HVMlite for sure as the design doc
> > > > says it will not run VGA, the ACPI flag will be set but the check
> > > > for that is not yet on Linux. That means the VGA Linux code will
> > > > be there but we have no way to ensure it will not run nor that
> > > > anything will muck with it.
> > > 
> > > <shrugs> The worst it will do is try to read non-existent registers.
> > 
> > Really ?
> > 
> > Is that your position on all other possible dead code that may have been
> > possible on old Xen PV guests as well ?
> 
> This is not just with Xen - it is with other device drivers that are being
> invoked on baremetal and are not present in hardware anymore.

Indeed, however virtualization makes this issue much more prominent.

> > As I hinted, after thinking about this for a while I realized that dead code is
> > likely present on bare metal as well even without virtualization, specially if
> 
> Yes!
> > you build large single kernels to support a wide array of features which only
> > late at run time can be determined. Virtualization and the pvops design just
> > makes this issue much more prominent. If there are other areas of code exposed
> > that actually may run, but we are not sure may run, I figured some other folks
> > with a bit more security-conscious minds might even simply take the position
> > it may be a security risk to leave that code exposed. So to take a position
> > that 'the worst it will do is try to read non-existent registers' -- seems
> > rather shortsighted here.
> 
> Security conscious people trim their CONFIG.

Not all Linux distributions want to do this, the more binaries the
higher the cost to test / vet.

> > Anyway for more details on thoughts on this refer to this wiki:
> > 
> > http://kernelnewbies.org/KernelProjects/kernel-sandboxing
> > 
> > Since this is now getting off topic please send me your feedback on another
> > thread for the non-virtualization aspects of this if that interests you. My
> > point here was rather to highlight the importance of clear semantics due to
> > virtualization in light of possible dead code.
> 
> Thank you.
> > 
> > > The VGA code should be able to handle failures like that and
> > > not initialize itself when the hardware is dead (or non-existent).
> > 
> > That's right, it's through ACPI_FADT_NO_VGA and since it's part of the HVMLite
> > design doc we want HVMlite design to address ACPI_FADT_NO_VGA properly.  I've
> > paved the way for this to be done cleanly and easily now, but that code should
> > be in place before HVMLite code gets merged.
> > 
> > Does domU for old Xen PV also set ACPI_FADT_NO_VGA as well ?  Should it ?
> 
> It does not. Not sure - it seems to have worked fine for the last ten
> years?

Maybe HVMLite will need it enabled then too, just for bug parity.

> > > > To be clear -- dead code concerns still exist even without
> > > > virtualization solutions, its just that with virtualization
> > > > this stuff comes up more and there has been no proactive
> > > > measures to address this. The question of semantics here is
> > > > to see to what extent we need earlier boot code annotations
> > > > to ensure we address semantics proactively.
> > > 
> > > I think what you mean by dead code is another word for
> > > hardware test coverage?
> > 
> > No, no, its very different given that with virtualization the scope of possible
> > dead code is significant and at run time you are certain a huge portion of code
> > should *never ever* run. So for instance we know once we boot bare metal none
> > of the Xen stuff should ever run, likewise on Xen dom0 we know none of the KVM
> > > / bare-metal only stuff should ever run, when on Xen domU, none of the Xen
> 
> What is this 'bare metal only stuff' you speak of? On Xen dom0 most of
> the baremetal code is running.

A lot, not all. In the past folks added stubs (used to be paravirt_enabled()
checks) to some code, but we are simply not sure of other possible conflicts.
This is a known unknown if you will.

> In fact that is how the device drivers work. Or are you talking about low
> level baremetal code? If so, then PVH/HVMLite does that - it skips pvops so
> that it can run this 'low-level baremetal code'

Are you telling me that HVMLite has no dead code issues ?

> > domU-only stuff should ever run.
> 
> You forgot KVM guest support on baremetal. That shouldn't run either.

Glad you bring that up, yes, that is correct. I'm being just as cautious with
Xen as with KVM on their possible dead-code issues, however KVM's dead-code
concerns should be smaller given, as you note, the boot path.

It doesn't mean dead-code concerns do not exist for KVM... or other
virtualization solutions.

> > > > > The entrance point in Linux "proper" is startup_32 or startup_64 - the same
> > > > > path that EFI uses.
> > > > > 
> > > > > If you were to draw this (very simplified):
> > > > > 
> > > > > a)- GRUB2 ---------------------\ (creates an bootparam structure)
> > > > >                                 \
> > > > >                                  +---- startup_32 or startup_64
> > > > > b) EFI -> Linux EFI stub -------/
> > > > >        (creates bootparm)      /
> > > > > c) GRUB2-EFI  -> Linux EFI----/
> > > > >                stub         /
> > > > > d) HVMLite ----------------/
> > > > >       (creates bootparm)
> > > > 
> > > > b) and d) might be able to share paths there...
> > > 
> > > No idea. You would have to look in the assembler code to
> > > figure that out.
> > 
> > And that's a pain, I get it.
> > 
> > I spotted one place already -- will note to Boris. I think Matt may have more
> > ideas ;)
> > 
> > > > d) still has its own entry, it does more than create boot params.
> > > 
> > > d) purpose is to create boot params.
> > 
> > OK good to know that's the only thing we acknowledge it *should* do.
> 
> And b), c) purpose is for that too - amongst providing a mechanism
> to call in EFI firmware.

Sure.

> And I realized that early baremetal boot option also ends up calling C during
> its startup (see main in arch/x86/boot/main.c) amongst then switching
> different modes.

Sure.

> > >  It may do more as nobody likes to muck in assembler and make bootparams from
> > >  within assembler.
> > 
> > OK -- it does do more and that's where we'd like to avoid duplication if
> > possible and yet-another-entry (TM).
> 
> It does more? EFI stub entry does more than the GRUB2 entry.
> 
> If you have some patches to trim the code duplication within
> those boot paths- please post it.

Sure.

> > > > > (I am not sure about the c) - I would have to look in source to
> > > > > be sure). There is also LILO in this, but I am not even sure if
> > > > > it works anymore.
> > > > > 
> > > > > 
> > > > > What you have is that every entry point creates the bootparams
> > > > > and ends up calling startup_X. The startup_64 then hit the rest
> > > > > of the kernel. The startup_X code is the one that would setup
> > > > > the basic pagetables, segments, etc.
> > > > 
> > > > Sure.. a full diagram should include both sides and how when using
> > > > a custom entry one runs the risk of skipping a lot of code setup.
> > > 
> > > But it does not skip a lot of code setup. It starts exactly
> > > at the same code startup that _all_ bootstraping code start at.
> > 
> > Its a fair point.
> > 
> > > > There is that and as others have pointed out how certain guests types
> > > > are assumed to not have certain peripherals, and we have no idea
> > > > to ensure certain old legacy code may not ever run or be accessed
> > > > by drivers.
> > > 
> > > Ok, but that is not at code setup. That is later - when device
> > > drivers are initialized. This is no different than booting on
> > > some hardware with missing functionality. ACPI, PCI and
> > > PnP are set there to help OSes discover this.
> > 
> > To a certain extent this is true, but there may be things which are missing still.
> 
> Like?

That's the thing, I had a list of things to look out for and then things
I ran across over code inspection. We need more work to be sure we're
really well covered.

Are you *sure* we have no dead code concerns with HVMLite ?
If there are dead code concerns are you sure there might not
be differences between KVM and HVMLite ? Should cpuid be used to
address differences ? Will that enable to distinguish between
hybrid versions of HVMLite ? Are we sure ?

> > We really have no idea what the full list of those things are.
> 
> Ok, it sounds like you have some homework.

We all do.

> > It may be that things may have been running for ages without notice of an issue
> > or that only under certain situations will certain issues or bugs trigger a
> > failure. For instance, just yesterday I was Cc'd on a brand-spanking new legacy
> > conflict [0], caused by upstream commit 8c058b0b9c34d8c ("x86/irq: Probe for
> > PIC presence before allocating descs for legacy IRQs") merged on v4.4 where
> > some new code used nr_legacy_irqs() -- one proposed solution seems to be that
> > for Xen code NR_IRQS_LEGACY should be used instead is as it lacks PCI [1] and
> > another was to peg the legacy requirements as a quirk on the new x86 platform
> > legacy quirk stuff [2]. Are other uses of nr_legacy_irqs() correct ? Are
> > we sure ?
> 
> And how is this example related to 'early bootup' path?
> 
> It is not.

For early boot code -- it is not. HVMLite is not merged, and PVH was never
completed.. so how are you sure we won't have any issues there ?

> It is in fact related to PV codepaths - which PVH/HVMLite and HVM guests
> do not exercise.

Agreed.

> > [0] http://lkml.kernel.org/r/570F90DF.1020508@oracle.com
> > [1] https://lkml.org/lkml/2016/4/14/532
> > [2] http://lkml.kernel.org/r/1460592286-300-1-git-send-email-mcgrof@kernel.org
> > 
> > > > > > How we address semantics then is *very* important to me.
> > > > > 
> > > > > Which semantics? How the CPU is going to be at startup_X ? Or
> > > > > how the CPU is going to be when EFI firmware invokes the EFI stub?
> > > > > Or when GRUB2 loads Linux?
> > > > 
> > > > What hypervisor kicked me and what guest type I am.
> > > 
> > > cpuid software flags have that - and that semantics has been 
> > > there for eons.
> > 
> > We cannot use cpuid early in asm code, I'm looking for something we
> 
> ?! Why!?

What existing code uses it? If there is nothing you are still certain
it should work ? Would that work for old PV guest as well BTW ?

> > can even use on asm early in boot code, on x86 the best option we
> > have is the boot_params, but I've even have had issues with that
> > early in code, as I can only access it after load_idt() where I
> > described my effort to unify Xen PV and x86_64 init paths [3].
> 
> Well, Xen PV skips x86_64_start_kernel..

Yes, and in doing so often times people skip adding Xen PV specific
code, as was the case with Kasan.

> > [3] http://lkml.kernel.org/r/CAB=NE6VTCRCazcNpCdJ7pN1eD3=x_fcGOdH37MzVpxkKEN5esw@mail.gmail.com
> > 
> > > > Let me elaborate more below.
> > > > 
> > > > > That (those bootloaders) is clearly defined. The URL I provided
> > > > > mentions the HVMLite one. The Documentation/x86/boot.txt mentions
> > > > > what the semantics are to be expected when providing a bootstrap
> > > > > (which is what HVMLite stub code in Linux would write against -
> > > > > and what EFI stub code had been written against too).
> > > > > > 
> > > > > > > > I'll elaborate on this but first let's clarify why a new entry is used for
> > > > > > > > HVMlite to start of with:
> > > > > > > > 
> > > > > > > >   1) Xen ABI has historically not wanted to set up the boot params for Linux
> > > > > > > >      guests, instead it insists on letting the Linux kernel Xen boot stubs fill
> > > > > > > >      that out for it. This sticking point means it has implicated a boot stub.
> > > > > > > 
> > > > > > > 
> > > > > > > Which is b/c it has to be OS agnostic. It has nothing to do 'not wanting'.
> > > > > > 
> > > > > > It can still be OS agnostic and pass on type and custom data pointer.
> > > > > 
> > > > > Sure. It has that (it MUST otherwise how else would you pass data).
> > > > > It is documented as well http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,xen.h.html#incontents_startofday
> > > > > (see " Start of day structure passed to PVH guests in %ebx.")
> > > > 
> > > > The design doc begs for a custom OS entry point though.
> > > 
> > > That is what the ELF Note has.
> > 
> > Right, but I'm saying that its rather silly to be adding entry points if
> > all we want the code to do is copy the boot params for us. The design
> > doc requires a new entry, and likewise you'd need yet-another-entry if
> > HVMLite is thrown out the window and come back 5 years later after new
> > hardware solutions are in place and need to redesign HVMLite. Kind of
> 
> Why would you need to redesign HVMLite based on hardware solutions?

That's what happened to Xen PV, right ? Are we sure 5 years from now we won't
have any new hardware virtualization features that will just obsolete HVMLite?

> The entrance point and the CPU state are pretty well known - it is akin
> to what GRUB2 bootloader path is (protected mode).
> > where we are with PVH today. Likewise if other paravirtualization
> > developers want to support Linux and want to copy your strategy they'd
> > add yet-another-entry-point as well.
> > 
> > This is dumb.
> 
> You saying the EFI entry point is dumb? That instead the EFI
> firmware should understand Linux bootparams and booted that?

EFI is a standard. Xen is not. And since we are not talking about legacy
hardware in the future, EFI seems like a sensible option to consider for an
entry point. Specially given that it may mean that we can ultimately also help
unify more entry points on Linux in general. I'd prefer to consider using
EFI configuration tables instead of extending the x86 boot protocol.

> > > > If we had a single 'type' and 'custom data' passed to the kernel that
> > > > should suffice for the default Linux entry point to just pivot off
> > > > of that and do what it needs without more entry points. Once.
> > > 
> > > And what about ramdisk? What about multiple ramdisks?
> > > What about command line? All of that is what bootparams
> > > tries to unify on Linux. But 'bootparams' is unique to Linux,
> > > it does not exist on FreeBSD. Hence some stub code to transplant
> > > OS-agnostic simple data to OS-specific is necessary.
> > 
> > If we had a Xen ABI option where *all* that I'm asking is you pass
> > first:
> > 
> >   a) hypervisor type
> 
> Why can't you use cpuid.

I'll evaluate that.

> >   b) custom data pointer
> 
> What is this custom data pointer you speak of?

For Xen this is the xen_start_info, the structure in which Xen stuffs
a copy of its version of what we need to fill the boot_params.

> > We'd be able to avoid adding *any* entry point and just address
> > the requirements as I noted with pre / post stubs for the type.
> 
> But you need some entry point to call into Linux. Are you
> suggesting to use the existing ones? No, the existing one
> wouldn't understand this.

If we used the boot_params, yes it would be possible.

> > This would require an x86 boot protocol bump, but all the issues
> > creeping up randomly I think that's worth putting on the table now.
> 
> Aaaah, so you are saying expand the bootparams. In other words
> make Xen ABI call into Linux using the bootparams structure, similar
> to how GRUB2 does it.
> 
> How is that OS agnostic?

That's an issue, I understand. EFI is OS agnostic though.

> > And maybe we don't want it to be hypervisor specific, perhaps there are other
> > *needs* for custom pre-post startup_32()/startup_64() stubs.
> 
> Multiboot?

Can you elaborate?

> > To avoid extending boot_params further I figured perhaps we can look
> > at EFI as another option instead. If we are going to drop all legacy
> 
> But EFI support is _huge_.

I get that sense now. Perhaps we should explore to what extent that is
really the case at the Hackathon.

> > PV support from the kernel (not the hypervisor) and require hardware
> > virtualization 5 years from now on the Linux kernel, it doesn't seem
> > to me far fetched to at the very least consider using an EFI entry
> > instead, specially since all it does is set boot params and we can
> > re-use this for HVMLite too.
> 
> But to make that work you have to emulate EFI firmware in the
> hypervisor. Is that work you are signing up for?

I'll do what is needed, as I have done before. If EFI is on the long
term roadmap for ARM perhaps there are a few birds to knock with one
stone here. If there is also interest to support other OSes through
EFI standard means this also should help make that easier.

  Luis