lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241101204402.1885383-1-joshua.hahnjy@gmail.com>
Date: Fri,  1 Nov 2024 13:44:02 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: akpm@...ux-foundation.org,
	hannes@...xchg.org
Cc: nphamcs@...il.com,
	shakeel.butt@...ux.dev,
	roman.gushchin@...ux.dev,
	muchun.song@...ux.dev,
	chris@...isdown.name,
	tj@...nel.org,
	lizefan.x@...edance.com,
	mkoutny@...e.com,
	corbet@....net,
	lnyng@...a.com,
	cgroups@...r.kernel.org,
	linux-mm@...ck.org,
	linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	kernel-team@...a.com
Subject: [PATCH v4 1/1] memcg/hugetlb: Add hugeTLB counters to memcg

This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current. This
feature is enabled the same way hugeTLB accounting is enabled, via
the memory_hugetlb_accounting mount flag for cgroupsv2.

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves this problem by including
the metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat, but not in memory.current)

Aside from the consistency between the two files, we also see benefits
in observability. Userspace might be interested in the hugeTLB footprint
of cgroups for many reasons. For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e.
memory-intensive tasks are using more hugeTLB pages than tasks that
don't consume a lot of memory, or are seen to fault frequently. Note that
this is separate from wanting to inspect the distribution for limiting
purposes (in which case, hugeTLB controller makes more sense).

2. We already have a hugeTLB controller. Why not use that?
It is true that hugeTLB tracks the exact value that we want. In fact, by
enabling the hugeTLB controller, we get all of the observability
benefits that I mentioned above, and users can check the total hugeTLB
usage, verify if it is distributed as expected, etc.

With this said, there are 2 problems:
(a) They are still not reported in memory.stat, which means the
    disparity between the memcg reports are still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
    just for the sake of hugeTLB usage reporting, especially since
    they don't have any use for hugeTLB usage enforcing [2].

3. Implementation Details:
In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge
which is called from alloc_hugetlb_folio will set memcg for the folio
only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is
used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters
only if the feature is enabled. Regardless of whether memcg accounts
for hugetlb, the newly added global counter is updated and shown
in /proc/vmstat.

The global counter is added because vmstats is the preferred framework
for cgroup stats. It makes stat items consistent between global and
cgroups. It also provides a per-node breakdown, which is useful.
Because it does not use cgroup-specific hooks, we also keep generic
MM code separate from memcg code.

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since the existing solution of enabling the
    hugeTLB controller is an imperfect solution that still leaves a
    discrepancy between memory.stat and memory.curent, I think that it
    is reasonable to isolate the feature in this case.
 
Suggested-by: Nhat Pham <nphamcs@...il.com>
Suggested-by: Shakeel Butt <shakeel.butt@...ux.dev>
Suggested-by: Johannes Weiner <hannes@...xchg.org>
Acked-by: Shakeel Butt <shakeel.butt@...ux.dev>
Acked-by: Johannes Weiner <hannes@...xchg.org>
Acked-by: Chris Down <chris@...isdown.name>
Acked-by: Michal Hocko <mhocko@...e.com>
Reviewed-by: Roman Gushchin <roman.gushchin@...ux.dev>
Reviewed-by: Nhat Pham <nphamcs@...il.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>

---
Changelog
v4:
  * Added {acked,suggested,reviewed}-by to the list
  * Added an extra section detailing why having no checks for mount
    options or configs is ok (handled by lruvec_stat_mod_folio). Also
    includes justifications for adding a global counter
v3:
  * Removed check for whether CGRP_ROOT_HUGETLB_ACCOUNTING is on, since
    this check is already handled by lruvec_stat_mod (and doing the
    check in hugetlb.c actually breaks the build if MEMCG is not
    enabled.
  * Because there is now only one check for the flags, I've opted to
    do all of the cleanup in a separate patch series.
  * Added hugetlb information in cgroup-v2.rst
  * Added Suggested-by: Shakeel Butt
v2:
  * Enables the feature only if memcg accounts for hugeTLB usage
  * Moves the counter from memcg_stat_item to node_stat_item
  * Expands on motivation & justification in commitlog
  * Added Suggested-by: Nhat Pham

 Documentation/admin-guide/cgroup-v2.rst |  5 +++++
 include/linux/mmzone.h                  |  3 +++
 mm/hugetlb.c                            |  2 ++
 mm/memcontrol.c                         | 11 +++++++++++
 mm/vmstat.c                             |  3 +++
 5 files changed, 24 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69af2173555f..bd7e81c2aa2b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1646,6 +1646,11 @@ The following nested keys are defined.
 	  pgdemote_khugepaged
 		Number of pages demoted by khugepaged.
 
+	  hugetlb
+		Amount of memory used by hugetlb pages. This metric only shows
+		up if hugetlb usage is accounted for in memory.current (i.e.
+		cgroup is mounted with the memory_hugetlb_accounting option).
+
   memory.numa_stat
 	A read-only nested-keyed file which exists on non-root cgroups.
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..972795ae5946 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,6 +220,9 @@ enum node_stat_item {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..fbb10e52d7ea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,7 @@ void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, -pages_per_huge_page(h));
 	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
@@ -3093,6 +3094,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	if (!memcg_charge_ret)
 		mem_cgroup_commit_charge(folio, memcg);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
 	mem_cgroup_put(memcg);
 
 	return folio;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..5444d0e7bb64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -310,6 +310,9 @@ static const unsigned int memcg_node_stat_items[] = {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1346,6 +1349,9 @@ static const struct memory_stat memory_stats[] = {
 	{ "unevictable",		NR_UNEVICTABLE			},
 	{ "slab_reclaimable",		NR_SLAB_RECLAIMABLE_B		},
 	{ "slab_unreclaimable",		NR_SLAB_UNRECLAIMABLE_B		},
+#ifdef CONFIG_HUGETLB_PAGE
+	{ "hugetlb",			NR_HUGETLB			},
+#endif
 
 	/* The memory events */
 	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON		},
@@ -1441,6 +1447,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+		if (unlikely(memory_stats[i].idx == NR_HUGETLB) &&
+		    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+			continue;
+#endif
 		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b5a4cea423e1..871566b04b79 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1273,6 +1273,9 @@ const char * const vmstat_text[] = {
 	"pgdemote_kswapd",
 	"pgdemote_direct",
 	"pgdemote_khugepaged",
+#ifdef CONFIG_HUGETLB_PAGE
+	"nr_hugetlb",
+#endif
 	/* system-wide enum vm_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
-- 
2.43.5


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ