linux-kernel - Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOZ2QJNq_oHsk=HcK3wc-DLt78sTRydUw2VYx-A0vLbJ64bGNQ@mail.gmail.com>
Date:   Fri, 24 Mar 2017 17:26:40 -0400
From:   Dan Streetman <dan.streetman@...onical.com>
To:     Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc:     Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        David Vrabel <david.vrabel@...rix.com>,
        Juergen Gross <jgross@...e.com>,
        xen-devel@...ts.xenproject.org, linux-kernel@...r.kernel.org
Subject: Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's
 initial state to number of existing RAM pages"

On Fri, Mar 24, 2017 at 5:10 PM, Konrad Rzeszutek Wilk
<konrad.wilk@...cle.com> wrote:
> On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote:
>> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky
>> <boris.ostrovsky@...cle.com> wrote:
>> >
>> >
>> > On 03/22/2017 05:16 PM, Dan Streetman wrote:
>> >>
>> >> I have a question about a problem introduced by this commit:
>> >> c275a57f5ec3056f732843b11659d892235faff7
>> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
>> >>
>> >> It changed the xen balloon current_pages calculation to start with the
>> >> number of physical pages in the system, instead of max_pfn.  Since
>> >> get_num_physpages() does not include holes, it's always less than the
>> >> e820 map's max_pfn.
>> >>
>> >> However, the problem that commit introduced is, if the hypervisor sets
>> >> the balloon target to equal to the e820 map's max_pfn, then the
>> >> balloon target will *always* be higher than the initial current pages.
>> >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
>> >> the OS adds any holes, the balloon target will be higher than the
>> >> current pages.  This is the situation, for example, for Amazon AWS
>> >> instances.  The result is, the xen balloon will always immediately
>> >> hotplug some memory at boot, but then make only (max_pfn -
>> >> get_num_physpages()) available to the system.
>> >>
>> >> This balloon-hotplugged memory can cause problems, if the hypervisor
>> >> wasn't expecting it; specifically, the system's physical page
>> >> addresses now will exceed the e820 map's max_pfn, due to the
>> >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
>> >> DMA to/from those physical pages above the e820 max_pfn, it causes
>> >> problems.  For example:
>> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
>> >>
>> >> The additional small amount of balloon memory can cause other problems
>> >> as well, for example:
>> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
>> >>
>> >> Anyway, I'd like to ask, was the original commit added because
>> >> hypervisors are supposed to set their balloon target to the guest
>> >> system's number of phys pages (max_pfn - holes)?  The mailing list
>> >> discussion and commit description seem to indicate that.
>> >
>> >
>> >
>> > IIRC the problem that this was trying to fix was that since max_pfn includes
>> > holes, upon booting we'd immediately balloon down by the (typically, MMIO)
>> > hole size.
>> >
>> > If you boot a guest with ~4+GB memory you should see this.
>> >
>> >
>> >> However I'm
>> >> not sure how that is possible, because the kernel reserves its own
>> >> holes, regardless of any predefined holes in the e820 map; for
>> >> example, the kernel reserves 64k (by default) at phys addr 0 (the
>> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW).  So
>> >> the hypervisor really has no way to know what the "right" target to
>> >> specify is; unless it knows the exact guest OS and kernel version, and
>> >> kernel config values, it will never be able to correctly specify its
>> >> target to be exactly (e820 max_pfn - all holes).
>> >>
>> >> Should this commit be reverted?  Should the xen balloon target be
>> >> adjusted based on kernel-added e820 holes?
>> >
>> >
>> > I think the second one but shouldn't current_pages be updated, and not the
>> > target? The latter is set by Xen (toolstack, via xenstore usually).
>> >
>> > Also, the bugs above (at least one of them) talk about NVMe and I wonder
>> > whether the memory that they add is of RAM type --- I believe it has its own
>> > type and so perhaps that introduces additional inconsistencies. AWS may have
>> > added their own support for that, which we don't have upstream yet.
>>
>> The type of memory doesn't have anything to do with it.
>>
>> The problem with NVMe is it's a passthrough device, so the guest talks
>> directly to the NVMe controller and does DMA with it.  But the
>> hypervisor does swiotlb translation between the guest physical memory,
>
> Um, the hypervisor does not have SWIOTLB support, only IOMMU support.

heh, well I have no special insight into Amazon's hypervisor, so I
have no idea what underlying memory remapping mechanism it uses :)

>
>> and the host physical memory, so that the NVMe device can correctly
>> DMA to the right memory in the host.
>>
>> However, the hypervisor only has the guest's physical memory up to the
>> max e820 pfn mapped; it didn't expect the balloon driver to hotplug
>> any additional memory above the e820 max pfn, so when the NVMe driver
>> in the guest tries to tell the NVMe controller to DMA to that
>> balloon-hotplugged memory, the hypervisor fails the NVMe request,
>
> But when the memory hotplug happens the hypercalls are done to
> raise the max pfn.

well...all I can say is it rejects DMA above the e820 range.  so this
very well may be a hypervisor bug, where it should add the balloon
memory region to whatever does the NVMe passthrough device iommu
mapping.

I think we can all agree that the *ideal* situation would be, for the
balloon driver to not immediately hotplug memory so it can add 11 more
pages, so maybe I just need to figure out why the balloon driver
thinks it needs 11 more pages, and fix that.

>
>> because it can't do the guest-to-host phys mem mapping, since the
>> guest phys address is outside the expected max range.
>>
>>
>>
>> >
>> > -boris
>> >
>> >
>> >
>> >> Should something else be
>> >> done?
>> >>
>> >> For context, Amazon Linux has simply disabled Xen ballooning
>> >> completely.  Likewise, we're planning to disable Xen ballooning in the
>> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
>> >> Ubuntu kernels).  However, if reverting this patch makes sense in a
>> >> bigger context (i.e. Xen users besides AWS), that would allow more
>> >> Ubuntu kernels to work correctly in AWS instances.
>> >>
>> >