lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210128043547.1560435-1-saravanand@fb.com>
Date:   Wed, 27 Jan 2021 20:35:47 -0800
From:   Saravanan D <saravanand@...com>
To:     <x86@...nel.org>, <dave.hansen@...ux.intel.com>, <luto@...nel.org>,
        <peterz@...radead.org>, <corbet@....net>
CC:     <linux-kernel@...r.kernel.org>, <kernel-team@...com>,
        <linux-doc@...r.kernel.org>, Saravanan D <saravanand@...com>
Subject: [PATCH V4] x86/mm: Tracking linear mapping split events

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.

Signed-off-by: Saravanan D <saravanand@...com>
---
 .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
 Documentation/admin-guide/mm/index.rst        |  1 +
 arch/x86/mm/pat/set_memory.c                  | 13 ++++
 include/linux/vm_event_item.h                 |  4 ++
 mm/vmstat.c                                   |  4 ++
 5 files changed, 81 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst

diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+Kernel maps all of physical memory in linear/direct mapped pages with
+translation of virtual kernel address to physical address is achieved
+through a simple subtraction of offset. CPUs maintain a cache of these
+translations on fast caches called TLBs. CPU architectures like x86 allow
+direct mapping large portions of memory into hugepages (2M, 1G, etc) in
+various page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones does result in
+a measurable performance hit caused by frequent TLB miss and reloads.
+
+One of the many lasting (as we don't coalesce back) sources for huge page
+splits is tracing as the granular page attribute/permission changes would
+force the kernel to split code segments mapped to hugepages to smaller
+ones thus increasing the probability of TLB miss/reloads even after
+tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+	direct_map_level2_splits xxx
+	direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+	are 2M/4M hugepage split events
+direct_map_level3_splits
+	are 1G hugepage split events
+
+The distribution of direct mapped system memory in various page sizes
+post splits can be viewed through ``/proc/meminfo`` whose output
+will include the following lines depending upon supporting CPU
+architecture
+
+	DirectMap4k:    xxxxx kB
+	DirectMap2M:    yyyyy kB
+	DirectMap1G:    zzzzz kB
+
+where:
+
+DirectMap4k
+	is the total amount of direct mapped memory (in kB)
+	accessed through 4k pages
+DirectMap2M
+	is the total amount of direct mapped memory (in kB)
+	accessed through 2M pages
+DirectMap1G
+	is the total amount of direct mapped memory (in kB)
+	accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@ the Linux memory management.
    soft-dirty
    transhuge
    userfaultfd
+   direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..767cade53bdc 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }
 
+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;
 
 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ