Message-ID: <d1b3e5cb-a96b-bb93-71f2-55fb82dd5e49@redhat.com>
Date: Wed, 23 May 2018 20:27:48 +0200
From: David Hildenbrand <david@...hat.com>
To: linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org,
Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Cornelia Huck <cohuck@...hat.com>,
Dan Williams <dan.j.williams@...el.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Halil Pasic <pasic@...ux.ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
Jason Wang <jasowang@...hat.com>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Len Brown <lenb@...nel.org>,
Martin Schwidefsky <schwidefsky@...ibm.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Michal Hocko <mhocko@...e.com>,
Pavel Tatashin <pasha.tatashin@...cle.com>,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Stefan Hajnoczi <stefanha@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Vlastimil Babka <vbabka@...e.cz>, KVM <kvm@...r.kernel.org>,
"virtualization@...ts.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>,
"virtio-dev@...ts.oasis-open.org" <virtio-dev@...ts.oasis-open.org>,
"qemu-devel@...gnu.org" <qemu-devel@...gnu.org>,
qemu-s390x <qemu-s390x@...gnu.org>
Subject: Re: [PATCH RFCv2 0/4] virtio-mem: paravirtualized memory
On 23.05.2018 20:24, David Hildenbrand wrote:
> This is the Linux driver side of virtio-mem. Compared to the QEMU side,
> it is in a pretty complete and clean state.
>
> virtio-mem is a paravirtualized mechanism for adding/removing memory to/from
> a VM. We can do this at a 4MB granularity right now. In Linux, all
> memory is added to ZONE_NORMAL, so unplugging cannot be guaranteed -
> but it will be more likely to succeed than unplugging 128MB+ chunks.
> We might implement some optimizations in that area in the future that will
> make memory unplug more reliable.
>
> For now, this is an easy way to give a VM access to more memory and
> eventually to remove some memory again. I am testing it on x86 and
> s390x (under QEMU TCG so far only).
>
> This is the follow-up to [1], but the concept, user interface and
> virtio protocol have been changed heavily. I am only including the important
> parts in this cover letter (because otherwise nobody will read it). Please
> feel free to ask in case there are any questions.
>
> This series is based on [4] and shows how it is being used. It contains
> further information. Also have a look at the description of patch nr 4 in
> this series.
>
> This work is the result of the initial idea of Andrea Arcangeli to
> host-enforce guest access to memory inflated in virtio-balloon using
> userfaultfd, which turned out to be problematic to implement. That's how
> I came up with virtio-mem.
>
> --------------------------------------------------------------------------
> 1. High level concept
> --------------------------------------------------------------------------
>
> Each virtio-mem device owns a memory region in the physical address space.
> The guest is allowed to plug and online up to 'requested_size' of memory.
> It will not be allowed to plug more than that size. Unplugged memory will
> be protected by configurable mechanisms (e.g. random discard, userfaultfd
> protection, etc.). virtio-mem is designed in such a way that a guest may
> never assume it can even read unplugged memory. This is a big difference
> from classical balloon drivers.
>
> The usable memory region might grow over time, so not all parts of the
> device memory region might be usable from the start. This is an
> optimization to allow a smarter implementation in the hypervisor (reducing
> the size of dirty bitmaps, memory regions, ...).
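>
> To make this concrete, here is a rough sketch of what the device config
> space could look like (the field names are illustrative assumptions; the
> authoritative layout is in include/uapi/linux/virtio_mem.h from patch nr 4):
>
>   /* Illustrative sketch only - see patch nr 4 for the real layout. */
>   struct virtio_mem_config {
>           /* start address and maximum size of the device memory region */
>           __u64 addr;
>           __u64 region_size;
>           /* currently usable part of the region (may grow over time) */
>           __u64 usable_region_size;
>           /* plug/unplug granularity (4MB right now) */
>           __u64 block_size;
>           /* how much memory the hypervisor requests the guest to plug */
>           __u64 requested_size;
>           /* how much memory the guest currently has plugged */
>           __u64 plugged_size;
>           /* NUMA node this device belongs to */
>           __u16 node_id;
>   };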
>
> When the device driver starts up, it will query 'requested_size' and start
> to add memory to the system. This memory is not indicated e.g. via ACPI,
> so unmodified systems will not silently try to use unplugged memory that
> they are not supposed to touch.
>
> Updates on the 'requested_size' indicate hypervisor requests to plug or
> unplug memory.
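>
> In driver terms this boils down to something like the following sketch
> (not the code from patch nr 4; virtio_mem_plug()/virtio_mem_unplug() are
> hypothetical helpers and the state/field names are assumptions):
>
>   /* Sketch: react to a config update from the hypervisor. */
>   static void virtio_mem_config_changed(struct virtio_device *vdev)
>   {
>           struct virtio_mem *vm = vdev->priv;
>           u64 requested;
>
>           virtio_cread(vdev, struct virtio_mem_config, requested_size,
>                        &requested);
>
>           if (requested > vm->plugged_size)
>                   /* plug (and online) more memory, block by block */
>                   virtio_mem_plug(vm, requested - vm->plugged_size);
>           else if (requested < vm->plugged_size)
>                   /* try to unplug memory - best effort, may fail */
>                   virtio_mem_unplug(vm, vm->plugged_size - requested);
>   }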
>
> As each virtio-mem device can belong to a NUMA node, we can easily
> plug/unplug memory on a NUMA basis. And of course, we can have several
> independent virtio-mem devices for a VM.
>
> The idea is *not* to add new virtio-mem devices when hotplugging memory,
> the idea is to resize (grow/shrink) virtio-mem devices.
>
> --------------------------------------------------------------------------
> 2. Benefits
> --------------------------------------------------------------------------
>
> Guest side:
> - Increase memory usable by Linux in 4MB steps (vs. section size like 128MB
> on x86 or 2GB on e.g. some arm, if I'm not mistaken)
> - Remove struct pages once all 4MB chunks of a section are offline (in
> contrast to all balloon drivers where this never happens)
> - Don't fragment memory, while still being able to unplug smaller chunks
> than ordinary DIMM sizes.
> - Memory hotplug support for architectures that have no proper interface
> (e.g. s390x misses the external notification part) or where e.g. QEMU/Linux
> support is complicated to implement.
> - Automatic management of onlining/offlining in the device driver -
> no manual interaction from an admin/tool necessary.
>
> QEMU side:
> - Resizing (plug/unplug) has a single interface - in contrast to a mixture
> of ACPI and virtio-balloon. See the example below.
> - Migration works out of the box - no need to specify new DIMMs or new
> sizes on the migration target. It simply works.
> - We can resize in arbitrary steps and sizes (in contrast to e.g. ACPI,
> where we have to know upfront in which granularity we later on want to
> remove memory or even how much memory we eventually want to add to our
> guest)
> - One interface to rule them (architectures) all :)
>
> --------------------------------------------------------------------------
> 3. Reboot handling
> --------------------------------------------------------------------------
>
> After a reboot, all memory is unplugged. This allows the hypervisor
> to see if support for virtio-mem is available in the freshly booted system.
> This way we could charge only for the actually "plugged" memory size. And
> it avoids having to sense for plugged memory in the guest.
>
> E.g. on every size change of a virtio-mem device, we can notify management
> layers. So we can track how much memory a VM has plugged.
>
> --------------------------------------------------------------------------
> 4. Example
> --------------------------------------------------------------------------
>
> (not including resizable memory regions on the QEMU side yet, so don't
> focus on that part - it will consume a lot of memory right now for e.g.
> dirty bitmaps and memory slot tracking data)
>
> Start QEMU with two virtio-mem devices that provide little memory initially.
> $ qemu-system-x86_64 -m 4G,maxmem=504G \
>     -smp sockets=2,cores=2 \
>     [...]
>     -object memory-backend-ram,id=mem0,size=256G \
>     -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,size=4160M \
>     -object memory-backend-ram,id=mem1,size=256G \
>     -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,size=3G
>
> Query the configuration ('size' tells us the guest driver is active):
> (qemu) info memory-devices
> info memory-devices
> Memory device [virtio-mem]: "vm0"
> phys-addr: 0x140000000
> node: 0
> requested-size: 4362076160
> size: 4362076160
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem0
> Memory device [virtio-mem]: "vm1"
> phys-addr: 0x4140000000
> node: 1
> requested-size: 3221225472
> size: 3221225472
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem1
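>
> (For reference: 4362076160 bytes = 4160 MiB and 3221225472 bytes = 3 GiB,
> matching the size= parameters on the command line; max-size 274877906944 =
> 256 GiB corresponds to the memdev size, and block-size 4194304 = 4 MiB is
> the plug/unplug granularity.)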
>
> Change the size of a virtio-mem device:
> (qemu) memory-device-resize vm0 40960
> memory-device-resize vm0 40960
> ...
> (qemu) info memory-devices
> info memory-devices
> Memory device [virtio-mem]: "vm0"
> phys-addr: 0x140000000
> node: 0
> requested-size: 42949672960
> size: 42949672960
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem0
> ...
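>
> (The resize argument is apparently given in MiB in this prototype:
> 40960 MiB = 42949672960 bytes, which is exactly the new requested-size.)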
>
> Try to unplug memory (KASAN active in the guest - a lot of memory wasted):
> (qemu) memory-device-resize vm0 1024
> memory-device-resize vm0 1024
> ...
> (qemu) info memory-devices
> info memory-devices
> Memory device [virtio-mem]: "vm0"
> phys-addr: 0x140000000
> node: 0
> requested-size: 1073741824
> size: 6169821184
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem0
> ...
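>
> (Note how requested-size dropped to 1073741824 bytes = 1 GiB, but size only
> went down to 6169821184 bytes = 5884 MiB: unplugging is best effort, and
> with KASAN consuming a lot of memory not all blocks could be unplugged.)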
>
> For now I am only sharing the Linux driver side. The current code can be
> found at [2]. The QEMU side is still heavily WIP, the current QEMU
> prototype can be found at [3].
>
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html
> [2] https://github.com/davidhildenbrand/linux/tree/virtio-mem
> [3] https://github.com/davidhildenbrand/qemu/tree/virtio-mem
> [4] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1698014.html
>
> David Hildenbrand (4):
>   ACPI: NUMA: export pxm_to_node
>   s390: mm: support removal of memory
>   s390: numa: implement memory_add_physaddr_to_nid()
>   virtio-mem: paravirtualized memory
>
> arch/s390/mm/init.c             |   18 +-
> arch/s390/numa/numa.c           |   12 +
> drivers/acpi/numa.c             |    1 +
> drivers/virtio/Kconfig          |   15 +
> drivers/virtio/Makefile         |    1 +
> drivers/virtio/virtio_mem.c     | 1040 +++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_ids.h |    1 +
> include/uapi/linux/virtio_mem.h |  134 ++++
> 8 files changed, 1216 insertions(+), 6 deletions(-)
> create mode 100644 drivers/virtio/virtio_mem.c
> create mode 100644 include/uapi/linux/virtio_mem.h
>
cc-ing some further mailing lists
--
Thanks,
David / dhildenb