linux-kernel - Re: [PATCH v2] hugetlb: Add hugetlb.*.numa

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4374d9ca-fc13-1dad-14eb-5e3e47953620@oracle.com>
Date:   Wed, 27 Oct 2021 16:35:58 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Mina Almasry <almasrymina@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Oscar Salvador <osalvador@...e.de>,
        Michal Hocko <mhocko@...e.com>,
        Muchun Song <songmuchun@...edance.com>,
        David Rientjes <rientjes@...gle.com>,
        Jue Wang <juew@...gle.com>, Yang Yao <ygyao@...gle.com>,
        Joanna Li <joannali@...gle.com>,
        Cannon Matthews <cannonmatthews@...gle.com>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] hugetlb: Add hugetlb.*.numa_stat file

On 10/20/21 12:09 PM, Mina Almasry wrote:
> For hugetlb backed jobs/VMs it's critical to understand the numa
> information for the memory backing these jobs to deliver optimal
> performance.
> 
> Currently this techinically can be queried from /proc/self/numa_maps, but

                 technically

> there are significant issues with that. Namely:
> 1. Memory can be mapped on unmapped.
> 2. numa_maps are per process and need to be aggregaged across all

                                              aggregated

>    proceses in the cgroup. For shared memory this is more involved as

     processes

>    the userspace needs to make sure it doesn't double count shared
>    mappings.
> 3. I believe querying numa_maps needs to hold the mmap_lock which adds
>    to the contention on this lock.
> 
> For these reasons I propose simply adding hugetlb.*.numa_stat file,
> which shows the numa information of the cgroup similarly to
> memory.numa_stat.
> 
> On cgroup-v2:
>    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat

I know that cgroup can be mounted anywhere, but I think the conventional
locations are mentioned in the man page.  Why not use them?

v2 '/sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat'

>    total=2097152 N0=2097152 N1=0
> 
> On cgroup-v1:
>    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat

v1 '/sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat'

>    total=2097152 N0=2097152 N1=0
>    hierarichal_total=2097152 N0=2097152 N1=0
> 
> This patch was tested manually by allocating hugetlb memory and querying
> the hugetlb.*.numa_stat file of the cgroup and its parents.

I have no objections to adding this functionality, and do not see any
blocking issues in hugetlb code.  However, it would be GREAT if someone
more familiar/experienced with cgroups would comment.  My cgroup
experience is very limited.

> 
> Cc: Mike Kravetz <mike.kravetz@...cle.com>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Shuah Khan <shuah@...nel.org>
> Cc: Miaohe Lin <linmiaohe@...wei.com>
> Cc: Oscar Salvador <osalvador@...e.de>
> Cc: Michal Hocko <mhocko@...e.com>
> Cc: Muchun Song <songmuchun@...edance.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: Jue Wang <juew@...gle.com>
> Cc: Yang Yao <ygyao@...gle.com>
> Cc: Joanna Li <joannali@...gle.com>
> Cc: Cannon Matthews <cannonmatthews@...gle.com>
> Cc: linux-mm@...ck.org
> Cc: linux-kernel@...r.kernel.org
> 
> Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> 
> ---
> 
> Changes in v2:
> - Fix warning Reported-by: kernel test robot <lkp@...el.com>
> ---
>  .../admin-guide/cgroup-v1/hugetlb.rst         |  4 +
>  Documentation/admin-guide/cgroup-v2.rst       |  7 ++
>  include/linux/hugetlb.h                       |  4 +-
>  include/linux/hugetlb_cgroup.h                |  7 ++
>  mm/hugetlb_cgroup.c                           | 93 +++++++++++++++++--
>  .../testing/selftests/vm/write_to_hugetlbfs.c |  9 +-
>  6 files changed, 113 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> index 338f2c7d7a1c..0fa724d82abb 100644
> --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> @@ -29,12 +29,14 @@ Brief summary of control files::
>   hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
>   hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
>   hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> + hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup
> 
>  For a system supporting three hugepage sizes (64k, 32M and 1G), the control
>  files include::
> 
>    hugetlb.1GB.limit_in_bytes
>    hugetlb.1GB.max_usage_in_bytes
> +  hugetlb.1GB.numa_stat
>    hugetlb.1GB.usage_in_bytes
>    hugetlb.1GB.failcnt
>    hugetlb.1GB.rsvd.limit_in_bytes
> @@ -43,6 +45,7 @@ files include::
>    hugetlb.1GB.rsvd.failcnt
>    hugetlb.64KB.limit_in_bytes
>    hugetlb.64KB.max_usage_in_bytes
> +  hugetlb.64KB.numa_stat
>    hugetlb.64KB.usage_in_bytes
>    hugetlb.64KB.failcnt
>    hugetlb.64KB.rsvd.limit_in_bytes
> @@ -51,6 +54,7 @@ files include::
>    hugetlb.64KB.rsvd.failcnt
>    hugetlb.32MB.limit_in_bytes
>    hugetlb.32MB.max_usage_in_bytes
> +  hugetlb.32MB.numa_stat
>    hugetlb.32MB.usage_in_bytes
>    hugetlb.32MB.failcnt
>    hugetlb.32MB.rsvd.limit_in_bytes
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 4d8c27eca96b..8ba0d6aadd2c 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2252,6 +2252,13 @@ HugeTLB Interface Files
>  	are local to the cgroup i.e. not hierarchical. The file modified event
>  	generated on this file reflects only the local events.
> 
> +  hugetlb.<hugepagesize>.numa_stat
> +	Similar to memory.numa_stat, it shows the numa information of the
> +	memory in this cgroup:

        hugetlb pages of <hugepagesize> in this cgroup.  Only active in 
	use hugetlb pages are included.  The per-node values are in bytes.

> +
> +	/dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
> +	total=0 N0=0 N1=0

I would not include the 'example' above as nothing like this is used
elsewhere in the file.

> +
>  Misc
>  ----
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..0445faaa636e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -613,8 +613,8 @@ struct hstate {
>  #endif
>  #ifdef CONFIG_CGROUP_HUGETLB
>  	/* cgroup control files */
> -	struct cftype cgroup_files_dfl[7];
> -	struct cftype cgroup_files_legacy[9];
> +	struct cftype cgroup_files_dfl[8];
> +	struct cftype cgroup_files_legacy[10];
>  #endif
>  	char name[HSTATE_NAME_LEN];
>  };
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index c137396129db..54ff6ec68ed3 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
>  	HUGETLB_NR_MEMORY_EVENTS,
>  };
> 
> +struct hugetlb_cgroup_per_node {
> +	/* hugetlb usage in bytes over all hstates. */
> +	unsigned long usage[HUGE_MAX_HSTATE];
> +};
> +
>  struct hugetlb_cgroup {
>  	struct cgroup_subsys_state css;
> 
> @@ -57,6 +62,8 @@ struct hugetlb_cgroup {
> 
>  	/* Handle for "hugetlb.events.local" */
>  	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
> +
> +	struct hugetlb_cgroup_per_node *nodeinfo[];
>  };
> 
>  static inline struct hugetlb_cgroup *
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 5383023d0cca..4b807292f7e8 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>  				struct hugetlb_cgroup *parent_h_cgroup)
>  {
>  	int idx;
> +	int node;
> 
>  	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
>  		struct page_counter *fault_parent = NULL;
> @@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>  			limit);
>  		VM_BUG_ON(ret);
>  	}
> +
> +	for_each_node(node) {
> +		/* Set node_to_alloc to -1 for offline nodes. */
> +		int node_to_alloc =
> +			node_state(node, N_NORMAL_MEMORY) ? node : -1;
> +		h_cgroup->nodeinfo[node] =
> +			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
> +				     GFP_KERNEL, node_to_alloc);

Do you need to handle kzalloc_node failure here?

alloc_mem_cgroup_per_node_info provides similar functionality and has
the following comment.

	 * TODO: this routine can waste much memory for nodes which will
	 *       never be onlined. It's better to use memory hotplug callback
	 *       function.

Extra credit if you do this here and in alloc_mem_cgroup_per_node_info. :)

> +	}
>  }
> 
>  static struct cgroup_subsys_state *
> @@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
>  	struct hugetlb_cgroup *h_cgroup;
> 
> -	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> +	unsigned int size =
> +		sizeof(*h_cgroup) +
> +		MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
> +	h_cgroup = kzalloc(size, GFP_KERNEL);
>  	if (!h_cgroup)
>  		return ERR_PTR(-ENOMEM);
> 
> @@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>  		return;
> 
>  	__set_hugetlb_cgroup(page, h_cg, rsvd);
> -	return;
> +	if (!rsvd && h_cg)
> +		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
> +								 << PAGE_SHIFT;
>  }
> 
>  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> @@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> 
>  	if (rsvd)
>  		css_put(&h_cg->css);
> -
> +	else
> +		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
> +								 << PAGE_SHIFT;
>  	return;
>  }
> 
> @@ -421,6 +438,56 @@ enum {
>  	RES_RSVD_FAILCNT,
>  };
> 
> +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> +{
> +	int nid;
> +	struct cftype *cft = seq_cft(seq);
> +	int idx = MEMFILE_IDX(cft->private);
> +	bool legacy = MEMFILE_ATTR(cft->private);
> +	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> +	struct cgroup_subsys_state *css;
> +	unsigned long usage;
> +
> +	if (legacy) {
> +		/* Add up usage across all nodes for the non-hierarchical total. */
> +		usage = 0;
> +		for_each_node_state(nid, N_MEMORY)
> +			usage += h_cg->nodeinfo[nid]->usage[idx];
> +		seq_printf(seq, "total=%lu", usage);
> +
> +		/* Simply print the per-node usage for the non-hierarchical total. */
> +		for_each_node_state(nid, N_MEMORY)
> +			seq_printf(seq, " N%d=%lu", nid,
> +				   h_cg->nodeinfo[nid]->usage[idx]);
> +		seq_putc(seq, '\n');
> +	}
> +
> +	/* The hierarchical total is pretty much the value recorded by the
> +	 * counter, so use that.

'pretty much' ... so use that.  Here is one place in particular where
input from someone with more cgroup experience would be appreciated.

-- 
Mike Kravetz

> +	 */
> +	seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> +		   page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> +
> +	/* For each node, transverse the css tree to obtain the hierarichal
> +	 * node usage.
> +	 */
> +	for_each_node_state(nid, N_MEMORY) {
> +		usage = 0;
> +		rcu_read_lock();
> +		css_for_each_descendant_pre(css, &h_cg->css) {
> +			usage += hugetlb_cgroup_from_css(css)
> +					 ->nodeinfo[nid]
> +					 ->usage[idx];
> +		}
> +		rcu_read_unlock();
> +		seq_printf(seq, " N%d=%lu", nid, usage);
> +	}
> +
> +	seq_putc(seq, '\n');
> +
> +	return 0;
> +}
> +
>  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>  				   struct cftype *cft)
>  {
> @@ -654,8 +721,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>  	cft->seq_show = hugetlb_cgroup_read_u64_max;
>  	cft->flags = CFTYPE_NOT_ON_ROOT;
> 
> -	/* Add the events file */
> +	/* Add the numa stat file */
>  	cft = &h->cgroup_files_dfl[4];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> +	cft->seq_show = hugetlb_cgroup_read_numa_stat;
> +	cft->flags = CFTYPE_NOT_ON_ROOT;
> +
> +	/* Add the events file */
> +	cft = &h->cgroup_files_dfl[5];
>  	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
>  	cft->private = MEMFILE_PRIVATE(idx, 0);
>  	cft->seq_show = hugetlb_events_show;
> @@ -663,7 +736,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>  	cft->flags = CFTYPE_NOT_ON_ROOT;
> 
>  	/* Add the events.local file */
> -	cft = &h->cgroup_files_dfl[5];
> +	cft = &h->cgroup_files_dfl[6];
>  	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
>  	cft->private = MEMFILE_PRIVATE(idx, 0);
>  	cft->seq_show = hugetlb_events_local_show;
> @@ -672,7 +745,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>  	cft->flags = CFTYPE_NOT_ON_ROOT;
> 
>  	/* NULL terminate the last cft */
> -	cft = &h->cgroup_files_dfl[6];
> +	cft = &h->cgroup_files_dfl[7];
>  	memset(cft, 0, sizeof(*cft));
> 
>  	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
> @@ -742,8 +815,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
>  	cft->write = hugetlb_cgroup_reset;
>  	cft->read_u64 = hugetlb_cgroup_read_u64;
> 
> +	/* Add the numa stat file */
> +	cft = &h->cgroup_files_dfl[8];
> +	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> +	cft->private = MEMFILE_PRIVATE(idx, 1);
> +	cft->seq_show = hugetlb_cgroup_read_numa_stat;
> +
>  	/* NULL terminate the last cft */
> -	cft = &h->cgroup_files_legacy[8];
> +	cft = &h->cgroup_files_legacy[9];
>  	memset(cft, 0, sizeof(*cft));
> 
>  	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> index 6a2caba19ee1..d2da6315a40c 100644
> --- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
> +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> @@ -37,8 +37,8 @@ static int shmid;
>  static void exit_usage(void)
>  {
>  	printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
> -	       "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
> -	       "[-o] [-w] [-n]\n",
> +	       "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
> +	       "[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
>  	       self);
>  	exit(EXIT_FAILURE);
>  }
> @@ -161,6 +161,11 @@ int main(int argc, char **argv)
>  	else
>  		printf("RESERVE mapping.\n");
> 
> +	if (want_sleep)
> +		printf("Sleeping\n");
> +	else
> +		printf("Not sleeping\n");
> +
>  	switch (method) {
>  	case HUGETLBFS:
>  		printf("Allocating using HUGETLBFS.\n");
> --
> 2.33.0.1079.g6e70778dc9-goog
>