lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210406065944.08d8aa76@mail.inbox.lv>
Date:   Tue, 6 Apr 2021 06:59:44 +0900
From:   Alexey Avramov <hakavlad@...ox.lv>
To:     Stillinux <stillinux@...il.com>
Cc:     akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, liuzhengyuan@...inos.cn,
        liuyun01@...inos.cn
Subject: Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop

> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule

> For the first time involving the swap part, there is no good way to fix
> the problem

The solution is protecting the clean file pages.

Look at this:

> On ChromiumOS, we do not use swap. When memory is low, the only 
> way to free memory is to reclaim pages from the file list. This 
> results in a lot of thrashing under low memory conditions. We see 
> the system become unresponsive for minutes before it eventually OOMs. 
> We also see very slow browser tab switching under low memory. Instead 
> of an unresponsive system, we'd really like the kernel to OOM as soon 
> as it starts to thrash. If it can't keep the working set in memory, 
> then OOM. Losing one of many tabs is a better behaviour for the user 
> than an unresponsive system.

> This patch create a new sysctl, min_filelist_kbytes, which disables 
> reclaim of file-backed pages when when there are less than min_filelist_bytes 
> worth of such pages in the cache. This tunable is handy for low memory 
> systems using solid-state storage where interactive response is more important 
> than not OOMing.

> With this patch and min_filelist_kbytes set to 50000, I see very little block 
> layer activity during low memory. The system stays responsive under low 
> memory and browser tab switching is fast. Eventually, a process a gets killed 
> by OOM. Without this patch, the system gets wedged for minutes before it 
> eventually OOMs.

— https://lore.kernel.org/patchwork/patch/222042/

This patch can almost completely eliminate thrashing under memory pressure.

Effects
- Improving system responsiveness under low-memory conditions;
- Improving performans in I/O bound tasks under memory pressure;
- OOM killer comes faster (with hard protection);
- Fast system reclaiming after OOM.

Read more: https://github.com/hakavlad/le9-patch

The patch:

From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov <hakavlad@...ox.lv>
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
 amount of clean file cache

The kernel does not have a mechanism for targeted protection of clean
file pages (CFP). A certain amount of the CFP is required by the userspace
for normal operation. First of all, you need a cache of shared libraries
and executable files. If the volume of the CFP cache falls below a certain
level, thrashing and even livelock occurs.

Protection of CFP may be used to prevent thrashing and reducing I/O under
memory pressure. Hard protection of CFP may be used to avoid high latency
and prevent livelock in near-OOM conditions. The patch provides sysctl
knobs for protecting the specified amount of clean file cache under memory
pressure.

The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
CFP. The CFP on the current node won't be reclaimed uder memory pressure
when their volume is below vm.clean_low_kbytes *unless* we threaten to OOM
or have no swap space or vm.swappiness=0. Setting it to a high value may
result in a early eviction of anonymous pages into the swap space by
attempting to hold the protected amount of clean file pages in memory. The
default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
Kconfig).

The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_min_kbytes. Setting it to a high value may result
in a early out-of-memory condition due to the inability to reclaim the
protected amount of CFP when other types of pages cannot be reclaimed. The
default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
Kconfig).

Reported-by: Artem S. Tashkinov <aros@....com>
Signed-off-by: Alexey Avramov <hakavlad@...ox.lv>
---
 Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++++++++++
 include/linux/mm.h                      |  3 ++
 kernel/sysctl.c                         | 14 ++++++++
 mm/Kconfig                              | 35 +++++++++++++++++++
 mm/vmscan.c                             | 59 +++++++++++++++++++++++++++++++++
 5 files changed, 148 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f455fa00c..5d5ddfc85 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:

 - admin_reserve_kbytes
 - block_dump
+- clean_low_kbytes
+- clean_min_kbytes
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
 information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


+clean_low_kbytes
+=====================
+
+This knob provides *best-effort* protection of clean file pages. The clean file
+pages on the current node won't be reclaimed uder memory pressure when their
+volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
+swap space or vm.swappiness=0.
+
+Protection of clean file pages may be used to prevent thrashing and
+reducing I/O under low-memory conditions.
+
+Setting it to a high value may result in a early eviction of anonymous pages
+into the swap space by attempting to hold the protected amount of clean file
+pages in memory.
+
+The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
+
+
+clean_min_kbytes
+=====================
+
+This knob provides *hard* protection of clean file pages. The clean file pages
+on the current node won't be reclaimed under memory pressure when their volume
+is below vm.clean_min_kbytes.
+
+Hard protection of clean file pages may be used to avoid high latency and
+prevent livelock in near-OOM conditions.
+
+Setting it to a high value may result in a early out-of-memory condition due to
+the inability to reclaim the protected amount of clean file pages when other
+types of pages cannot be reclaimed.
+
+The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
+
+
 compact_memory
 ==============

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3f..7799f1555 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -202,6 +202,9 @@ static inline void __mm_zero_struct_page(struct page *page)

 extern int sysctl_max_map_count;

+extern unsigned long sysctl_clean_low_kbytes;
+extern unsigned long sysctl_clean_min_kbytes;
+
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad08596..854b311cd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3083,6 +3083,20 @@ static struct ctl_table vm_table[] = {
 	},
 #endif
 	{
+		.procname	= "clean_low_kbytes",
+		.data		= &sysctl_clean_low_kbytes,
+		.maxlen		= sizeof(sysctl_clean_low_kbytes),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
+	{
+		.procname	= "clean_min_kbytes",
+		.data		= &sysctl_clean_min_kbytes,
+		.maxlen		= sizeof(sysctl_clean_min_kbytes),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
+	{
 		.procname	= "user_reserve_kbytes",
 		.data		= &sysctl_user_reserve_kbytes,
 		.maxlen		= sizeof(sysctl_user_reserve_kbytes),
diff --git a/mm/Kconfig b/mm/Kconfig
index 390165ffb..3915c71e1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
 	  pfn_to_page and page_to_pfn operations.  This is the most
 	  efficient option when sufficient kernel resources are available.

+config CLEAN_LOW_KBYTES
+	int "Default value for vm.clean_low_kbytes"
+	depends on SYSCTL
+	default "0"
+	help
+	  The vm.clean_file_low_kbytes sysctl knob provides *best-effort*
+	  protection of clean file pages. The clean file pages on the current
+	  node won't be reclaimed uder memory pressure when their volume is
+	  below vm.clean_low_kbytes *unless* we threaten to OOM or have
+	  no swap space or vm.swappiness=0.
+
+	  Protection of clean file pages may be used to prevent thrashing and
+	  reducing I/O under low-memory conditions.
+
+	  Setting it to a high value may result in a early eviction of anonymous
+	  pages into the swap space by attempting to hold the protected amount of
+	  clean file pages in memory.
+
+config CLEAN_MIN_KBYTES
+	int "Default value for vm.clean_min_kbytes"
+	depends on SYSCTL
+	default "0"
+	help
+	  The vm.clean_file_min_kbytes sysctl knob provides *hard* protection
+	  of clean file pages. The clean file pages on the current node won't be
+	  reclaimed under memory pressure when their volume is below
+	  vm.clean_min_kbytes.
+
+	  Hard protection of clean file pages may be used to avoid high latency and
+	  prevent livelock in near-OOM conditions.
+
+	  Setting it to a high value may result in a early out-of-memory condition
+	  due to the inability to reclaim the protected amount of clean file pages
+	  when other types of pages cannot be reclaimed.
+
 config HAVE_MEMBLOCK_PHYS_MAP
 	bool

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b4e31eac..77e98c43e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -120,6 +120,19 @@ struct scan_control {
 	/* The file pages on the current node are dangerously low */
 	unsigned int file_is_tiny:1;

+	/*
+	 * The clean file pages on the current node won't be reclaimed when
+	 * their volume is below vm.clean_low_kbytes *unless* we threaten
+	 * to OOM or have no swap space or vm.swappiness=0.
+	 */
+	unsigned int clean_below_low:1;
+
+	/*
+	 * The clean file pages on the current node won't be reclaimed when
+	 * their volume is below vm.clean_min_kbytes.
+	 */
+	unsigned int clean_below_min:1;
+
 	/* Allocation order */
 	s8 order;

@@ -166,6 +179,17 @@ struct scan_control {
 #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
 #endif

+#if CONFIG_CLEAN_LOW_KBYTES < 0
+#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
+#endif
+
+#if CONFIG_CLEAN_MIN_KBYTES < 0
+#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
+#endif
+
+unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
+unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
+
 /*
  * From 0 .. 200.  Higher means more swappy.
  */
@@ -2283,6 +2307,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}

 	/*
+	 * Force-scan anon if clean file pages is under vm.clean_min_kbytes
+	 * or vm.clean_low_kbytes (unless the swappiness setting
+	 * disagrees with swapping).
+	 */
+	if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
+		scan_balance = SCAN_ANON;
+		goto out;
+	}
+
+	/*
 	 * If there is enough inactive page cache, we do not reclaim
 	 * anything from the anonymous working right now.
 	 */
@@ -2418,6 +2452,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			BUG();
 		}

+		/*
+		 * Don't reclaim clean file pages when their volume is below
+		 * vm.clean_min_kbytes.
+		 */
+		if (file && sc->clean_below_min)
+			scan = 0;
+
 		nr[lru] = scan;
 	}
 }
@@ -2768,6 +2809,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			anon >> sc->priority;
 	}

+	if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
+		unsigned long reclaimable_file, dirty, clean;
+
+		reclaimable_file =
+			node_page_state(pgdat, NR_ACTIVE_FILE) +
+			node_page_state(pgdat, NR_INACTIVE_FILE) +
+			node_page_state(pgdat, NR_ISOLATED_FILE);
+		dirty = node_page_state(pgdat, NR_FILE_DIRTY);
+		if (reclaimable_file > dirty)
+			clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
+
+		sc->clean_below_low = clean < sysctl_clean_low_kbytes;
+		sc->clean_below_min = clean < sysctl_clean_min_kbytes;
+	} else {
+		sc->clean_below_low = false;
+		sc->clean_below_min = false;
+	}
+
 	shrink_node_memcgs(pgdat, sc);

 	if (reclaim_state) {
--
2.11.0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ