linux-kernel - Re: [Xen-devel] HVMLite / PVHv2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160414195652.GC20459@localhost.localdomain>
Date:	Thu, 14 Apr 2016 15:56:53 -0400
From:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
To:	"Luis R. Rodriguez" <mcgrof@...nel.org>
Cc:	Juergen Gross <jgross@...e.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Michael Chang <MChang@...e.com>, linux-kernel@...r.kernel.org,
	Jim Fehlig <jfehlig@...e.com>, Jan Beulich <JBeulich@...e.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Daniel Kiper <daniel.kiper@...cle.com>, x86@...nel.org,
	Vojtěch Pavlík <vojtech@...e.cz>,
	Gary Lin <GLin@...e.com>, xen-devel@...ts.xenproject.org,
	Jeffrey Cheung <JCheung@...e.com>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>,
	joeyli <jlee@...e.com>, Borislav Petkov <bp@...en8.de>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	Charles Arndol <carnold@...e.com>,
	Andrew Cooper <andrew.cooper3@...rix.com>,
	Julien Grall <julien.grall@....com>,
	Andy Lutomirski <luto@...capital.net>,
	David Vrabel <david.vrabel@...rix.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Roger Pau Monné <roger.pau@...rix.com>,
	Josh Triplett <josh@...htriplett.org>,
	Kees Cook <keescook@...omium.org>,
	Vitaly Kuznetsov <vkuznets@...hat.com>
Subject: Re: [Xen-devel] HVMLite / PVHv2 - using x86 EFI boot entry

On Thu, Apr 14, 2016 at 08:40:48PM +0200, Luis R. Rodriguez wrote:
> On Wed, Apr 13, 2016 at 09:01:32PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Thu, Apr 14, 2016 at 12:23:17AM +0200, Luis R. Rodriguez wrote:
> > > On Wed, Apr 13, 2016 at 05:08:01PM -0400, Konrad Rzeszutek Wilk wrote:
> > > > On Wed, Apr 13, 2016 at 10:40:55PM +0200, Luis R. Rodriguez wrote:
> > > > > On Wed, Apr 13, 2016 at 02:56:29PM -0400, Konrad Rzeszutek Wilk wrote:
> > > > > > On Wed, Apr 13, 2016 at 08:29:51PM +0200, Luis R. Rodriguez wrote:
> > > > > 
> > > > > and we also want to address dead code issues which pvops simply folded
> > > > > under the rug. The dead code concerns may exist still for hvmlite, so
> > > > > unless someone is willing to make a bold claim there is none, its
> > > > > something to consider.
> > > > 
> > > > What is this dead code you speak of?
> > > 
> > > Kasan is dead code to Xen. If you boot x86 Xen with Kasan enabled
> > 
> > For Xen PV guests,
> 
> That's right. For 5 years this will be a bomb. That went unnoticed
> and I feel I have to pull hair now to try to get folks to fix this.

Sometimes you have to roll up your sleeves and do the work yourself.
> 
> How many other issues will go in which will explode during this 5 year
> time line? How can we proactively address a solution to this now so we
> avoid this in future ?
> 
> Do you believe me now its a real issue?

I never said otherwise. What I was confused was that you grouped
this with HVMLite - which would not have a problem like this.

> 
> Fortunately I have a proactive solution for pvops now in my pipeline that
> should help avoid us having to blow more things up on Xen but also that should
> cause no headaches on behalf of x86 developers. But, reason I have been so
> engaged on HVMLite design review is I want to ensure we take the lessons
> learned from pvops and avoid this for an architecture that will be def-facto
> Xen on Linux 5 years from now.
> 
> Not bringing this up or addressing this now for HVMLite / PVH2 would simply be
> silly, and since it wasn't addressed in pvops I obviously have to ensure I
> convince enough people it was a real issue and ensure that we have enough
> semantics available to address it.

Kasan came after pvops, so of course it was not addressed in pvops.

> 
> Part of the semantics question, which has made my quest hard, was the use of
> semantics for virtualization for code on early boot and later in boot has been
> rather sloppy, so we have recently needed to address some of these gaps. Some
> of these discussions have however been productive, as I'll explain to George
> soon regarding his DT questions.  The discussion is not over though and we need
> to ensure that if we need semantics for HVMLite we'll have them available in
> *clean* way. One of the things where early semantics and design to address
> these issue help in a proactive manner is to address a clean boot entry -- and
> that's also why I've been so pedantic over review of the new HVMlite boot
> entry.

I must have missed your review of the patches. Sorry!
> 
> > > Xen explodes. Quick question, will Kasan not explode with HVMLite ?
> > 
> > .. but for HVMLite of Xen HVM guest Kasan will run.
> 
> Are you sure? Should that mean that Xen HVM should be fine as well.  Does that
> work? Are we sure?

Yes, and yes.
> 
> > > MTRR used to be dead code concern but since we have vetted most of that code
> > > now we are pretty certain that code should never run now.
> > > 
> > > KASLR may be -- not sure as I  haven't vetted that, but from
> > > what I have loosely heard maybe.
> > > 
> > > VGA code will be dead code for HVMlite for sure as the design doc
> > > says it will not run VGA, the ACPI flag will be set but the check
> > > for that is not yet on Linux. That means the VGA Linux code will
> > > be there but we have no way to ensure it will not run nor that
> > > anything will muck with it.
> > 
> > <shrugs> The worst it will do is try to read non-existent registers.
> 
> Really ?
> 
> Is that your position on all other possible dead code that may have been
> possible on old Xen PV guests as well ?

This is not just with Xen - it with other device drivers that are being
invoked on baremetal and are not present in hardware anymore.
> 
> As I hinted, after thinking about this for a while I realized that dead code is
> likely present on bare metal as well even without virtualization, specially if

Yes!
> you build large single kernels to support a wide array of features which only
> late at run time can be determined. Virtualization and the pvops design just
> makes this issue much more prominent. If there are other areas of code exposed
> that actually may run, but we are not sure may run, I figured some other folks
> with a bit more security conscience minds might even simply take the position
> it may be a security risk to leave that code exposed. So to take a position
> that 'the worst it will do is try to read non-existent registers' -- seems
> rather shortsighted here.

Security conscious people trim their CONFIG.
>  
> Anyway for more details on thoughts on this refer to the this wiki:
> 
> http://kernelnewbies.org/KernelProjects/kernel-sandboxing
> 
> Since this is now getting off topic please send me your feedback on another
> thread for the non-virtualization aspects of this if that interests you. My
> point here was rather to highlight the importance of clear semantics due to
> virtualization in light of possible dead code.

Thank you.
> 
> > The VGA code should be able to handle failures like that and
> > not initialize itself when the hardware is dead (or non-existent).
> 
> That's right, its through ACPI_FADT_NO_VGA and since its part of the HVMLite
> design doc we want HVMlite design to address ACPI_FADT_NO_VGA properly.  I've
> paved the way for this to be done cleanly and easily now, but that code should
> be in place before HVMLite code gets merged.
> 
> Does domU for old Xen PV also set ACPI_FADT_NO_VGA as well ?  Should it ?

It does not. Not sure - it seems to have worked fine for the last ten
years?
> 
> > > To be clear -- dead code concerns still exist even without
> > > virtualization solutions, its just that with virtualization
> > > this stuff comes up more and there has been no proactive
> > > measures to address this. The question of semantics here is
> > > to see to what extent we need earlier boot code annotations
> > > to ensure we address semantics proactively.
> > 
> > I think what you mean by dead code is another word for
> > hardware test coverage?
> 
> No, no, its very different given that with virtualization the scope of possible
> dead code is significant and at run time you are certain a huge portion of code
> should *never ever* run. So for instance we know once we boot bare metal none
> of the Xen stuff should ever run, likewise on Xen dom0 we know none of the KVM
> / bare-metal only stuff should never run, when on Xen domU, none of the Xen

What is this 'bare metal only stuff' you speak of? On Xen dom0 most of
the baremetal code is running. In fact that is how the device drivers
work. Or are you talking about low level baremetal code? If so, then
PVH/HVMLite does that - it skips pvops so that it can run this
'low-level baremetal code'

> domU-only stuff should ever run.

You forgot KVM guest support on baremetal. That shouldn't run either.

> 
> > > > The entrace point in Linux "proper" is startup_32 or startup_64 - the same
> > > > path that EFI uses.
> > > > 
> > > > If you were to draw this (very simplified):
> > > > 
> > > > a)- GRUB2 ---------------------\ (creates an bootparam structure)
> > > >                                 \
> > > >                                  +---- startup_32 or startup_64
> > > > b) EFI -> Linux EFI stub -------/
> > > >        (creates bootparm)      /
> > > > c) GRUB2-EFI  -> Linux EFI----/
> > > >                stub         /
> > > > d) HVMLite ----------------/
> > > >       (creates bootparm)
> > > 
> > > b) and d) might be able to share paths there...
> > 
> > No idea. You would have to look in the assembler code to
> > figure that out.
> 
> And that's a pain, I get it.
> 
> I spotted one place already -- will note to Boris. I think Matt may have more
> ideas ;)
> 
> > > d) still has its own entry, it does more than create boot params.
> > 
> > d) purpose is to create boot params.
> 
> OK good to know that's the only thing we acknowledge it *should* do.

And b), c) purpose is for that too - amongts providing an mechanism
to call in EFI firmware.

And I realized that early baremetal boot option also ends up calling C during
its startup (see main in arch/x86/boot/main.c) amongst then switching
different modes.

> 
> >  It may do more as nobody likes to muck in assembler and make bootparams from
> >  within assembler.
> 
> OK -- it does do more and that's where we'd like to avoid duplication if
> possible and yet-another-entry (TM).

It does more? EFI stub entry does more than the GRUB2 entry.

If you have some patches to trim the code duplication within
those boot paths- please post it.
> 
> > > > (I am not sure about the c) - I would have to look in source to
> > > > be source). There is also LILO in this, but I am not even sure if
> > > > works anymore.
> > > > 
> > > > 
> > > > What you have is that every entry point creates the bootparams
> > > > and ends up calling startup_X. The startup_64 then hit the rest
> > > > of the kernel. The startp_X code is the one that would setup
> > > > the basic pagetables, segments, etc.
> > > 
> > > Sure.. a full diagram should include both sides and how when using
> > > a custom entry one runs the risk of skipping a lot of code setup.
> > 
> > But it does not skip a lot of code setup. It starts exactly
> > at the same code startup that _all_ bootstraping code start at.
> 
> Its a fair point.
> 
> > > There is that and as others have pointed out how certain guests types
> > > are assumed to not have certain peripherals, and we have no idea
> > > to ensure certain old legacy code may not ever run or be accessed
> > > by drivers.
> > 
> > Ok, but that is not at code setup. That is later - when device
> > drivers are initialized. This no different than booting on
> > some hardware with missing functionality. ACPI, PCI and PnP
> > PnP are set there to help OSes discover this.
> 
> To a certain extent this is true, but there may things which are missing still.

Like?
> 
> We really have no idea what the full list of those things are.

Ok, it sounds like you have some homework.
> 
> It may be that things may have been running for ages without notice of an issue
> or that only under certain situations will certain issues or bugs trigger a
> failure. For instance, just yesterday I was Cc'd on a brand-spanking new legacy
> conflict [0], caused by upstream commit 8c058b0b9c34d8c ("x86/irq: Probe for
> PIC presence before allocating descs for legacy IRQs") merged on v4.4 where
> some new code used nr_legacy_irqs() -- one proposed solution seems to be that
> for Xen code NR_IRQS_LEGACY should be used instead is as it lacks PCI [1] and
> another was to peg the legacy requirements as a quirk on the new x86 platform
> legacy quirk stuff [2]. Are other uses of nr_legacy_irqs() correct ? Are
> we sure ?

And how is this example related to 'early bootup' path?

It is not.

It is in fact related to PV codepaths - which PVH/HVMLite and HVM guests
do not exercise.
> 
> [0] http://lkml.kernel.org/r/570F90DF.1020508@oracle.com
> [1] https://lkml.org/lkml/2016/4/14/532
> [2] http://lkml.kernel.org/r/1460592286-300-1-git-send-email-mcgrof@kernel.org
> 
> > > > > How we address semantics then is *very* important to me.
> > > > 
> > > > Which semantics? How the CPU is going to be at startup_X ? Or
> > > > how the CPU is going to be when EFI firmware invokes the EFI stub?
> > > > Or when GRUB2 loads Linux?
> > > 
> > > What hypervisor kicked me and what guest type I am.
> > 
> > cpuid software flags have that - and that semantics has been 
> > there for eons.
> 
> We cannot use cpuid early in asm code, I'm looking for something we

?! Why!?
> can even use on asm early in boot code, on x86 the best option we
> have is the boot_params, but I've even have had issues with that
> early in code, as I can only access it after load_idt() where I
> described my effort to unify Xen PV and x86_64 init paths [3].

Well, Xen PV skips x86_64_start_kernel..
> 
> [3] http://lkml.kernel.org/r/CAB=NE6VTCRCazcNpCdJ7pN1eD3=x_fcGOdH37MzVpxkKEN5esw@mail.gmail.com
> 
> > > Let me elaborate more below.
> > > 
> > > > That (those bootloaders) is clearly defined. The URL I provided
> > > > mentions the HVMLite one. The Documentation/x86/boot.c mentions
> > > > what the semantics are to expected when providing an bootstrap
> > > > (which is what HVMLitel stub code in Linux would write against -
> > > > and what EFI stub code had been written against too).
> > > > > 
> > > > > > > I'll elaborate on this but first let's clarify why a new entry is used for
> > > > > > > HVMlite to start of with:
> > > > > > > 
> > > > > > >   1) Xen ABI has historically not wanted to set up the boot params for Linux
> > > > > > >      guests, instead it insists on letting the Linux kernel Xen boot stubs fill
> > > > > > >      that out for it. This sticking point means it has implicated a boot stub.
> > > > > > 
> > > > > > 
> > > > > > Which is b/c it has to be OS agnostic. It has nothing to do 'not wanting'.
> > > > > 
> > > > > It can still be OS agnostic and pass on type and custom data pointer.
> > > > 
> > > > Sure. It has that (it MUST otherwise how else would you pass data).
> > > > It is documented as well http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,xen.h.html#incontents_startofday
> > > > (see " Start of day structure passed to PVH guests in %ebx.")
> > > 
> > > The design doc begs for a custom OS entry point though.
> > 
> > That is what the ELF Note has.
> 
> Right, but I'm saying that its rather silly to be adding entry points if
> all we want the code to do is copy the boot params for us. The design
> doc requires a new entry, and likewise you'd need yet-another-entry if
> HVMLite is thrown out the window and come back 5 years later after new
> hardware solutions are in place and need to redesign HVMLite. Kind of

Why would you need to redesign HVMLite based on hardware solutions?
The entrace point and the CPU state are pretty well known - it is akin
to what GRUB2 bootloader path is (protected mode).
> where we are with PVH today. Likewise if other paravirtualization
> developers want to support Linux and want to copy your strategy they'd
> add yet-another-entry-point as well.
> 
> This is dumb.

You saying the EFI entry point is dumb? That instead the EFI
firmware should understand Linux bootparams and booted that?

> 
> > > If we had a single 'type' and 'custom data' passed to the kernel that
> > > should suffice for the default Linux entry point to just pivot off
> > > of that and do what it needs without more entry points. Once.
> > 
> > And what about ramdisk? What about multiple ramdisks?
> > What about command line? All of that is what bootparams
> > tries to unify on Linux. But 'bootparams' is unique to Linux,
> > it does not exist on FreeBSD. Hence some stub code to transplant
> > OS-agnostic simple data to OS-specific is neccessary.
> 
> If we had a Xen ABI option where *all* that I'm asking is you pass
> first:
> 
>   a) hypervisor type

Why can't you use cpuid.
>   b) custom data pointer

What is this custom data pointer you speak of?
> 
> We'd be able to avoid adding *any* entry point and just address
> the requirements as I noted with pre / post stubs for the type.

But you need some entry point to call into Linux. Are you
suggesting to use the existing ones? No, the existing one
wouldn't understand this.

> This would require an x86 boot protocol bump, but all the issues
> creeping up randomly I think that's worth putting on the table now.

Aaaah, so you are saying expand the bootparams. In other words
make Xen ABI call into Linux using the bootparams structure, similar
to how GRUB2 does it.

How is that OS agnostic?
> 
> And maybe we don't want it to be hypervisor specific, perhaps there are other
> *needs* for custom pre-post startup_32()/startup_64() stubs.

Multiboot?
> 
> To avoid extending boot_params further I figured perhaps we can look
> at EFI as another option instead. If we are going to drop all legacy

But EFI support is _huge_.
> PV support from the kernel (not the hypervisor) and require hardware
> virtualization 5 years from now on the Linux kernel, it doesn't seem
> to me far fetched to at the very least consider using an EFI entry
> instead, specially since all it does is set boot params and we can
> make re-use this for HVMLite too.

But to make that work you have to emulate EFI firmware in the
hypervisor. Is that work you are signing up for?
> 
>   Luis