Message-ID: <20170324211016.GG9755@char.us.oracle.com>
Date:   Fri, 24 Mar 2017 17:10:16 -0400
From:   Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
To:     Dan Streetman <dan.streetman@...onical.com>
Cc:     Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        David Vrabel <david.vrabel@...rix.com>,
        Juergen Gross <jgross@...e.com>,
        xen-devel@...ts.xenproject.org, linux-kernel@...r.kernel.org
Subject: Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's
 initial state to number of existing RAM pages"

On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote:
> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky
> <boris.ostrovsky@...cle.com> wrote:
> >
> >
> > On 03/22/2017 05:16 PM, Dan Streetman wrote:
> >>
> >> I have a question about a problem introduced by this commit:
> >> c275a57f5ec3056f732843b11659d892235faff7
> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
> >>
> >> It changed the xen balloon current_pages calculation to start with the
> >> number of physical pages in the system, instead of max_pfn.  Since
> >> get_num_physpages() does not include holes, it's always less than the
> >> e820 map's max_pfn.
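
For reference, the change in question is approximately the following hunk in
balloon_init() in drivers/xen/balloon.c; this is paraphrased from memory, not
a verbatim quote of commit c275a57f5ec3:

    -	balloon_stats.current_pages = xen_pv_domain()
    -		? min(xen_start_info->nr_pages - xen_released_pages, max_pfn)
    -		: max_pfn;
    +	balloon_stats.current_pages = xen_pv_domain()
    +		? min(xen_start_info->nr_pages - xen_released_pages, max_pfn)
    +		: get_num_physpages();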
> >>
> >> However, the problem that commit introduced is: if the hypervisor sets
> >> the balloon target equal to the e820 map's max_pfn, then the balloon
> >> target will *always* be higher than the initial current pages.  Even if
> >> the hypervisor sets the target to (e820 max_pfn - holes), the balloon
> >> target will still be higher than the current pages whenever the OS adds
> >> any holes of its own.  This is the situation, for example, for Amazon
> >> AWS instances.  The result is that the xen balloon will always
> >> immediately hotplug some memory at boot, but then make only (max_pfn -
> >> get_num_physpages()) pages of it available to the system.
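
To make the arithmetic concrete (the numbers below are invented purely for
illustration):

    e820 max_pfn              = 0x240000 pages   (9 GiB of address space)
    MMIO hole                 = 0x040000 pages   (1 GiB)
    get_num_physpages()       = 0x200000 pages   (8 GiB of actual RAM)

    initial current_pages     = 0x200000         (after commit c275a57f5ec3)
    toolstack balloon target  = 0x240000         (set to the e820 max_pfn)
    target - current_pages    = 0x040000 pages   -> ~1 GiB is immediately
                                                    balloon-hotplugged above
                                                    the e820 max_pfn at boot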
> >>
> >> This balloon-hotplugged memory can cause problems if the hypervisor
> >> wasn't expecting it: the guest's physical page addresses now exceed the
> >> e820 map's max_pfn because of the balloon-hotplugged pages, and if the
> >> hypervisor isn't prepared for pt-device DMA to/from those physical
> >> pages above the e820 max_pfn, requests fail.  For example:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
> >>
> >> The additional small amount of balloon memory can cause other problems
> >> as well, for example:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
> >>
> >> Anyway, I'd like to ask, was the original commit added because
> >> hypervisors are supposed to set their balloon target to the guest
> >> system's number of phys pages (max_pfn - holes)?  The mailing list
> >> discussion and commit description seem to indicate that.
> >
> >
> >
> > IIRC the problem that this was trying to fix was that since max_pfn includes
> > holes, upon booting we'd immediately balloon down by the (typically, MMIO)
> > hole size.
> >
> > If you boot a guest with ~4+GB memory you should see this.
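
Reusing the illustrative numbers from the note above: before the commit, the
initial current_pages was max_pfn = 0x240000 pages, while the toolstack target
reflected only the RAM actually assigned (0x200000 pages), so the guest would
balloon out the 0x40000-page (1 GiB) difference right after boot.  A guest
with roughly 4 GiB or more of memory has a sizable MMIO hole below 4 GiB,
which is why the effect is easy to see there.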
> >
> >
> >> However, I'm
> >> not sure how that is possible, because the kernel reserves its own
> >> holes regardless of any predefined holes in the e820 map; for
> >> example, the kernel reserves 64k (by default) at phys addr 0 (the
> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW).  So
> >> the hypervisor really has no way to know what the "right" target to
> >> specify is; unless it knows the exact guest OS, kernel version, and
> >> kernel config values, it will never be able to correctly specify its
> >> target to be exactly (e820 max_pfn - all holes).
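
For reference, the low-memory reservation mentioned above lives in
arch/x86/kernel/setup.c in kernels of this era; the shape below is quoted
from memory and may not match the source exactly:

    /* arch/x86/kernel/setup.c (approximate) */
    static unsigned long reserve_low = CONFIG_X86_RESERVE_LOW << 10;

    static void __init trim_low_memory_range(void)
    {
    	/* Reserve the first CONFIG_X86_RESERVE_LOW KiB (64 KiB by default)
    	 * of physical memory, independent of the holes in the e820 map. */
    	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
    }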
> >>
> >> Should this commit be reverted?  Should the xen balloon target be
> >> adjusted based on kernel-added e820 holes?
> >
> >
> > I think the second one, but shouldn't current_pages be updated rather than
> > the target? The latter is set by Xen (by the toolstack, usually via xenstore).
> >
> > Also, the bugs above (at least one of them) talk about NVMe and I wonder
> > whether the memory that they add is of RAM type --- I believe it has its own
> > type and so perhaps that introduces additional inconsistencies. AWS may have
> > added their own support for that, which we don't have upstream yet.
> 
> The type of memory doesn't have anything to do with it.
> 
> The problem with NVMe is that it's a passthrough device, so the guest talks
> directly to the NVMe controller and does DMA with it.  But the
> hypervisor does swiotlb translation between the guest physical memory,

Um, the hypervisor does not have SWIOTLB support, only IOMMU support.

> and the host physical memory, so that the NVMe device can correctly
> DMA to the right memory in the host.
> 
> However, the hypervisor only has the guest's physical memory up to the
> max e820 pfn mapped; it didn't expect the balloon driver to hotplug
> any additional memory above the e820 max pfn, so when the NVMe driver
> in the guest tries to tell the NVMe controller to DMA to that
> balloon-hotplugged memory, the hypervisor fails the NVMe request,

But when the memory hotplug happens, hypercalls are made to
raise the max pfn.

> because it can't do the guest-to-host phys mem mapping, since the
> guest phys address is outside the expected max range.
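
Conceptually, the failure described above amounts to a bounds check like the
following on the hypervisor/IOMMU side; this is illustrative pseudocode, not
actual Xen source, and all of the names in it are made up:

    /* Illustrative pseudocode only -- not actual Xen code. */
    if (gfn > guest_e820_max_gfn)        /* balloon-hotplugged pages land here */
    	fail_dma_mapping_request();      /* NVMe request is rejected           */
    else
    	map_gfn_for_device_dma(gfn);     /* pt-device DMA works as expected    */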
> 
> 
> 
> >
> > -boris
> >
> >
> >
> >> Should something else be
> >> done?
> >>
> >> For context, Amazon Linux has simply disabled Xen ballooning
> >> completely.  Likewise, we're planning to disable Xen ballooning in the
> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
> >> Ubuntu kernels).  However, if reverting this patch makes sense in a
> >> bigger context (i.e. Xen users besides AWS), that would allow more
> >> Ubuntu kernels to work correctly in AWS instances.
> >>
> >
