lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 29 Jul 2022 14:25:05 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>
Cc:     linux-mm@...ck.org, akpm@...ux-foundation.org,
        Wei Xu <weixugc@...gle.com>, Yang Shi <shy828301@...il.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Tim C Chen <tim.c.chen@...el.com>,
        Michal Hocko <mhocko@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Hesham Almatary <hesham.almatary@...wei.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        Alistair Popple <apopple@...dia.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Johannes Weiner <hannes@...xchg.org>, jvgediya.oss@...il.com,
        Jagdish Gediya <jvgediya@...ux.ibm.com>
Subject: Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers

"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPU into the highest tier, and builds
> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
> based on the distances between nodes.
>
> This current memory tier kernel implementation needs to be improved for several
> important use cases,
>
> The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high performance
> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
> should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
> the next lower tier.
>
> With current kernel higher tier node can only be demoted to nodes with shortest
> distance on the next lower tier as defined by the demotion path, not any other
> node from any lower tier. This strict, demotion order does not work in all use
> cases (e.g. some use cases may want to allow cross-socket demotion to another
> node in the same demotion tier as a fallback when the preferred demotion node is
> out of space), This demotion order is also inconsistent with the page allocation
> fallback order when all the nodes in a higher tier are out of space: The page
> allocation can fall back to any node from any lower tier, whereas the demotion
> order doesn't allow that.
>
> This patch series address the above by defining memory tiers explicitly.
>
> Linux kernel presents memory devices as NUMA nodes and each memory device is of
> a specific type. The memory type of a device is represented by its abstract
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> This patch configures the range/chunk size to be 128. The default DRAM
> abstract distance is 512. We can have 4 memory tiers below the default DRAM
> abstract distance which cover the range 0 - 127, 127 - 255, 256- 383, 384 - 511.
> Slower memory devices like persistent memory will have abstract distance below
> the default DRAM level and hence will be placed in these 4 lower tiers.

For abstract distance, the lower value means higher performance, higher
value means lower performance.  So the abstract distance of PMEM should
be smaller than that of DRAM.

> A kernel parameter is provided to override the default memory tier.

Forget to delete?

Best Regards,
Huang, Ying

> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> Signed-off-by: Jagdish Gediya <jvgediya@...ux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@...ux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  17 ++++++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..8d7884b7a3f0
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +/*
> + * Each tier cover a abstrace distance chunk size of 128
> + */
> +#define MEMTIER_CHUNK_BITS	7
> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +/* leave one tier below this slow pmem */
> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
> +
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..01cfd514c192
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	/* hierarchy of memory tiers */
> +	struct list_head list;
> +	/* list of all memory types part of this tier */
> +	struct list_head memory_types;
> +	/*
> +	 * start value of abstract distance. memory tier maps
> +	 * an abstract distance  range,
> +	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
> +	 */
> +	int adistance_start;
> +};
> +
> +struct memory_dev_type {
> +	/* list of memory types that are are part of same tier as this type */
> +	struct list_head tier_sibiling;
> +	/* abstract distance for this specific memory type */
> +	int adistance;
> +	/* Nodes of same abstract distance */
> +	nodemask_t nodes;
> +	struct memory_tier *memtier;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +struct memory_dev_type *node_memory_types[MAX_NUMNODES];
> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +static struct memory_dev_type default_dram_type  = {
> +	.adistance = MEMTIER_ADISTANCE_DRAM,
> +	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
> +};
> +
> +static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
> +{
> +	bool found_slot = false;
> +	struct memory_tier *memtier, *new_memtier;
> +	int adistance = memtype->adistance;
> +	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	/*
> +	 * If the memtype is already part of a memory tier,
> +	 * just return that.
> +	 */
> +	if (memtype->memtier)
> +		return memtype->memtier;
> +
> +	adistance = round_down(adistance, memtier_adistance_chunk_size);
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		if (adistance == memtier->adistance_start) {
> +			memtype->memtier = memtier;
> +			list_add(&memtype->tier_sibiling, &memtier->memory_types);
> +			return memtier;
> +		} else if (adistance < memtier->adistance_start) {
> +			found_slot = true;
> +			break;
> +		}
> +	}
> +
> +	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!new_memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	new_memtier->adistance_start = adistance;
> +	INIT_LIST_HEAD(&new_memtier->list);
> +	INIT_LIST_HEAD(&new_memtier->memory_types);
> +	if (found_slot)
> +		list_add_tail(&new_memtier->list, &memtier->list);
> +	else
> +		list_add_tail(&new_memtier->list, &memory_tiers);
> +	memtype->memtier = new_memtier;
> +	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
> +	return new_memtier;
> +}
> +
> +static int __init memory_tier_init(void)
> +{
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +	/* CPU only nodes are not part of memory tiers. */
> +	default_dram_type.nodes = node_states[N_MEMORY];
> +
> +	memtier = find_create_memory_tier(&default_dram_type);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ