Date:   Tue, 8 Jun 2021 12:12:09 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Oscar Salvador <osalvador@...e.de>
Cc:     linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Jason Wang <jasowang@...hat.com>,
        Marek Kedzierski <mkedzier@...hat.com>,
        Hui Zhu <teawater@...il.com>,
        Pankaj Gupta <pankaj.gupta.linux@...il.com>,
        Wei Yang <richard.weiyang@...ux.alibaba.com>,
        Michal Hocko <mhocko@...nel.org>,
        Dan Williams <dan.j.williams@...el.com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mike Rapoport <rppt@...nel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Len Brown <lenb@...nel.org>,
        Pavel Tatashin <pasha.tatashin@...een.com>,
        virtualization@...ts.linux-foundation.org, linux-mm@...ck.org,
        linux-acpi@...r.kernel.org
Subject: Re: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy
 and memory groups

On 08.06.21 11:42, Oscar Salvador wrote:
> On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
>> Hi,
>>
>> this series aims at improving in-kernel auto-online support. It tackles the
>> fundamental problems that:
> 
> Hi David,
> 
> the idea sounds good to me, and I like that this series takes away part of the
> responsibility from the user to know where the memory should go.
> I think the kernel is a much better fit for that as it has all the required
> information to balance things.
> 
> I also glanced over the series and besides some things here and there the
> whole approach looks sane.
> I plan to have a closer look into it in a few days; just some high-level questions
> for the time being:

Hi Oscar,

> 
>>   1) We can create zone imbalances when onlining all memory blindly to
>>      ZONE_MOVABLE, in the worst case crashing the system. We have to know
>>      upfront how much memory we are going to hotplug such that we can
>>      safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
>>      via "online_movable". This is far from practical and only applicable in
>>      limited setups -- like inside VMs under the RHV/oVirt hypervisor which
>>      will never hotplug more than 3 times the boot memory (and that
>>      limitation is only in place due to this Linux limitation).
> 
> Could you give more insight into the problems created by zone imbalances (e.g.:
> a lot of movable memory and little kernel memory)?

I just updated memory-hotplug.rst exactly for that purpose :)

https://lkml.kernel.org/r/20210525102604.8770-1-david@redhat.com

It also gives safe zone ratios and "usually well known" values. I can 
link it in the next cover letter.
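
(If it helps while that doc is in flight: the current per-zone balance can 
already be eyeballed on a running system via standard procfs, nothing 
specific to this series:)

   # "managed" pages per zone and node; compare ZONE_MOVABLE against the
   # kernel zones (DMA/DMA32/Normal) to estimate the current MOVABLE:KERNEL ratio
   grep -E '^Node|managed' /proc/zoneinfo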

> 
>>   2) We see more setups that implement dynamic VM resizing, hot(un)plugging
>>      memory to resize VM memory. In these setups, we might hotplug a lot of
>>      memory, but it might happen in various small steps in both directions
>>      (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
>>      primary driver of this upstream right now, performing such dynamic
>>      resizing NUMA-aware via multiple virtio-mem devices.
>>
>>      Onlining all hotplugged memory to ZONE_NORMAL means we basically have
>>      no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
>>      easily run into zone imbalances when growing a VM. We want a mixture,
>>      and we want as much memory as reasonable/configured in ZONE_MOVABLE.
>>
>>   3) Memory devices consist of 1..X memory block devices, however, the
>>      kernel doesn't really track the relationship. Consequently, also user
>>      space has no idea. We want to make per-device decisions. As one
>>      example, for memory hotunplug it doesn't make sense to use a mixture of
>>      zones within a single DIMM: we want all MOVABLE if possible, otherwise
>>      all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
>>      getting hotunplugged. As another example, virtio-mem operates on
>>      individual units that span 1..X memory blocks. Similar to a DIMM, we
>>      want a unit to either be all MOVABLE or !MOVABLE. Further, we want
>>      as much memory of a virtio-mem device to be MOVABLE as possible.
> 
> So, a virtio-mem unit could be seen as a DIMM, right?

It's a bit more complicated. Each individual unit (e.g., a 128 MiB 
memory block) is the smallest granularity in which we can add/remove 
memory of that device. So such a unit is somewhat like a DIMM. However, 
all "units" of the device can interact -- it's a single memory device.


> 
>>   4) We want memory onlining to be done right from the kernel while adding
>>      memory; for example, this is required for fast memory hotplug for
>>      drivers that add individual memory blocks, like virtio-mem. We want a
>>      way to configure a policy in the kernel and avoid implementing advanced
>>      policies in user space.
> 
> "we want memory onlining to be done right from the kernel while adding memory"
> 
> Is that not always the case when a driver adds memory? The user has no interaction
> with that, right?

Well, with auto-onlining in the kernel disabled, user space has to do 
the onlining -- for example via udev rules right now in major distributions.
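
(Just for illustration, those distro rules are typically a one-liner along 
these lines -- the exact rules file name varies by distribution:)

   # online every memory block as soon as it is hotplugged
   SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"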

But there are also users that always want to online manually in user 
space to select a zone. Most prominently standby memory on s390x, but 
also in some cases dax/kmem memory. But these two are really corner 
cases. In general, we want hotplugged memory to be onlined immediately.
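
(For those users the manual path is the per-block "state" file in sysfs, 
picking the zone explicitly; "memory100" is just an arbitrary example block:)

   cat /sys/devices/system/memory/memory100/valid_zones
   # pick the zone explicitly when onlining:
   echo online_movable > /sys/devices/system/memory/memory100/state
   # (or "online_kernel" / plain "online" instead)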

> 
>> The auto-onlining support we have in the kernel is not sufficient. All we
>> have is a) online everything movable (online_movable) b) online everything
>> !movable (online_kernel) c) keep zones contiguous (online). This series
>> allows configuring c) to mean instead "online movable if possible according
>> to the configuration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
>> onlining policy.
>>
>> This series does 3 things:
>>
>>    1) Introduces the "auto-movable" online policy that initially operates on
>>       individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
>>       to make a decision whether a memory block will be onlined to
>>       ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
>>       memory does not allow for more MOVABLE memory (details in the
>>       patches). CMA memory is treated like MOVABLE memory.
> 
> How would a user know which ratio is sane? Could we add some info in the
> documentation that sets some "basic" rules?

Again, this currently resides in the memory-hotplug.rst overhaul.
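
To give one concrete data point from this cover letter already: with 
auto_movable_ratio=301 and 4 GiB of boot (KERNEL) memory, at most 
4 GiB * 301% ~= 12 GiB may be onlined to ZONE_MOVABLE -- roughly 3x the 
boot memory -- which is why only the first three 4 GiB DIMMs in the 
example further down end up Movable and the rest go to ZONE_NORMAL.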

> 
>>    2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
>>       groups and uses group information to make decisions in the
>>       "auto-movable" online policy accross memory blocks of a single memory
>>       device (modeled as memory group).
> 
> So, the distinction being that a DIMM cannot grow larger but we can add more
> memory to a virtio-mem unit? I feel I am missing some insight here.

Right, the relevant patch contains more info.

You either plug or unplug a DIMM (or a NUMA node which spans multiple 
DIMMs) -- both are ACPI memory devices that span multiple physical 
regions. You cannot unplug parts of a DIMM or grow it. "Static" is also 
how the ACPI code treats it: it adds and removes all memory of the 
memory device in one go.

virtio-mem behaves differently, as it's a single physical memory region 
in which we dynamically add or remove memory. The granularity in which 
we add/remove memory from Linux is a "unit". In the simplest case, it's 
just a single memory block (e.g., 128 MiB). So it's a memory device that 
can grow/shrink in the given unit -- "dynamic".

> 
>>    3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
>>       allowing ZONE_NORMAL memory within a dynamic memory group to allow for
>>       more ZONE_MOVABLE memory within the same memory group. The target use
>>       case is dynamic VM resizing using virtio-mem.
> 
> Sorry, I got lost in this one. Care to explain a bit more?

The virtio-mem example below should make this a bit clearer (in 
addition to the relevant patch), especially in contrast to static memory 
devices like DIMMs. The key is that a single virtio-mem device is a 
"dynamic memory group" in which memory can get added/removed dynamically 
in a given unit granularity. We want to special-case that type of device 
so that as much memory of a virtio-mem device as possible (and as 
configured) is MOVABLE.

> 
>> The target usage will be:
>>
>>    1) Linux boots with "mhp_default_online_type=offline"
>>
>>    2) User space (e.g., systemd unit) configures memory onlining (according
>>       to a config file and system properties), for example:
>>       * Setting memory_hotplug.online_policy=auto-movable
>>       * Setting memory_hotplug.auto_movable_ratio=301
>>       * Setting memory_hotplug.auto_movable_numa_aware=true
> 
> I think we would need to document those in order to let the user know what
> is best for them, e.g.: when do we want to enable auto_movable_numa_aware, etc.

Yes, as mentioned, a memory-hotplug.rst update will follow once 
the overhaul is done. The respective patch contains more information.
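
Spelled out, using the parameter names from this series (whether they also 
end up runtime-writable under /sys/module/memory_hotplug/parameters/ is an 
assumption on my side):

   # kernel command line (all on one line):
   mhp_default_online_type=offline memory_hotplug.online_policy=auto-movable
       memory_hotplug.auto_movable_ratio=301 memory_hotplug.auto_movable_numa_aware=true

   # or from the systemd unit at runtime, if exposed as writable module parameters:
   echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
   echo 301 > /sys/module/memory_hotplug/parameters/auto_movable_ratio
   echo 1 > /sys/module/memory_hotplug/parameters/auto_movable_numa_aware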

> 
>> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
>> 301% results in the following layout:
>> 	Memory block 1-15:    DMA32   (early)
>> 	Memory block 32-47:   Normal  (early)
>> 	Memory block 48-79:   Movable (DIMM 0)
>> 	Memory block 80-111:  Movable (DIMM 1)
>> 	Memory block 112-143: Movable (DIMM 2)
>> 	Memory block 144-175: Normal  (DIMM 3)
>> 	Memory block 176-207: Normal  (DIMM 4)
>> 	... all Normal
>> 	(-> hotplugged Normal memory does not allow for more Movable memory)
> 
> Uhm, I am sorry for being dense here:
> 
> On x86_64, 4 GB = 32 sections (of 128 MB each). Why do the memory blocks span from #1 to #47?

Sorry, it's actually "Memory block 0-15", which gives us 0-15 and 32-47 
== 32 memory blocks corresponding to boot memory. Note that the absent 
memory blocks 16-31 should correspond to the PCI hole.
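
In other words, with 128 MiB memory blocks: blocks 0-15 cover 0-2 GiB, 
blocks 16-31 would cover 2-4 GiB (the PCI hole, so no memory blocks exist 
there), and blocks 32-47 cover 4-6 GiB -- 32 present blocks * 128 MiB 
= 4 GiB of boot memory.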


Thanks Oscar!

-- 
Thanks,

David / dhildenb
