linux-kernel - Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for protecting the working set

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 3 Nov 2010 09:48:42 +0900
From:	Minchan Kim <minchan.kim@...il.com>
To:	Rik van Riel <riel@...hat.com>
Cc:	Mandeep Singh Baines <msb@...omium.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mel@....ul.ie>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, wad@...omium.org,
	olofj@...omium.org, hughd@...omium.org
Subject: Re: [PATCH] RFC: vmscan: add min_filelist_kbytes sysctl for
 protecting the working set

Hi Rik,

On Tue, Nov 2, 2010 at 12:11 PM, Rik van Riel <riel@...hat.com> wrote:
> On 11/01/2010 03:43 PM, Mandeep Singh Baines wrote:
>
>> Yes, this prevents you from reclaiming the active list all at once. But if
>> the
>> memory pressure doesn't go away, you'll start to reclaim the active list
>> little by little. First you'll empty the inactive list, and then
>> you'll start scanning
>> the active list and pulling pages from inactive to active. The problem is
>> that
>> there is no minimum time limit to how long a page will sit in the inactive
>> list
>> before it is reclaimed. Just depends on scan rate which does not depend
>> on time.
>>
>> In my experiments, I saw the active list get smaller and smaller
>> over time until eventually it was only a few MB at which point the system
>> came
>> grinding to a halt due to thrashing.
>
> I believe that changing the active/inactive ratio has other
> potential thrashing issues.  Specifically, when the inactive
> list is too small, pages may not stick around long enough to
> be accessed multiple times and get promoted to the active
> list, even when they are in active use.
>
> I prefer a more flexible solution, that automatically does
> the right thing.

I agree. Ideally, it's the best if we handle it well in kernel internal.

>
> The problem you see is that the file list gets reclaimed
> very quickly, even when it is already very small.
>
> I wonder if a possible solution would be to limit how fast
> file pages get reclaimed, when the page cache is very small.
> Say, inactive_file * active_file < 2 * zone->pages_high ?

Why do you multiply inactive_file and active_file?
What's meaning?

I think it's very difficult to fix _a_ threshold.
At least, user have to set it with proper value to use the feature.
Anyway, we need default value. It needs some experiments in desktop
and embedded.

>
> At that point, maybe we could slow down the reclaiming of
> page cache pages to be significantly slower than they can
> be refilled by the disk.  Maybe 100 pages a second - that
> can be refilled even by an actual spinning metal disk
> without even the use of readahead.
>
> That can be rounded up to one batch of SWAP_CLUSTER_MAX
> file pages every 1/4 second, when the number of page cache
> pages is very low.

How about reducing scanning window size?
I think it could approximate the idea.

>
> This way HPC and virtual machine hosting nodes can still
> get rid of totally unused page cache, but on any system
> that actually uses page cache, some minimal amount of
> cache will be protected under heavy memory pressure.
>
> Does this sound like a reasonable approach?
>
> I realize the threshold may have to be tweaked...

Absolutely.

>
> The big question is, how do we integrate this with the
> OOM killer?  Do we pretend we are out of memory when
> we've hit our file cache eviction quota and kill something?

I think "Yes".
But I think killing isn't best if oom_badness can't select proper victim.
Normally, embedded system doesn't have swap. And it could try to keep
many task in memory due to application startup latency.
It means some tasks never executed during long time and just stay in
memory with consuming the memory.
OOM have to kill it. Anyway it's off topic.

>
> Would there be any downsides to this approach?

At first feeling, I have a concern unbalance aging of anon/file.
But I think it's no problem. It a result user want. User want to
protect file-backed page(ex, code page) so many anon swapout is
natural result to go on the system. If the system has no swap, we have
no choice except OOM.

>
> Are there any volunteers for implementing this idea?
> (Maybe someone who needs the feature?)

I made quick patch to discuss as combining your idea and Mandeep.
(Just pass the compile test.)


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7687228..98380ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,7 @@ extern unsigned long num_physpages;
 extern unsigned long totalram_pages;
 extern void * high_memory;
 extern int page_cluster;
+extern int min_filelist_kbytes;

 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3a45c22..c61f0c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1320,6 +1320,14 @@ static struct ctl_table vm_table[] = {
                .extra2         = &one,
        },
 #endif
+       {
+               .procname       = "min_filelist_kbytes",
+               .data           = &min_filelist_kbytes,
+               .maxlen         = sizeof(min_filelist_kbytes),
+               .mode           = 0644,
+               .proc_handler   = &proc_dointvec,
+               .extra1         = &zero,
+       },

 /*
  * NOTE: do not add new entries to this table unless you have read
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..3b0e95d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,11 @@ struct scan_control {
 int vm_swappiness = 60;
 long vm_total_pages;   /* The total number of pages which the VM controls */

+/*
+ * Low watermark used to prevent fscache thrashing during low memory.
+ * 20M is a arbitrary value. We need more discussion.
+ */
+int min_filelist_kbytes = 1024 * 20;
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);

@@ -1635,6 +1640,7 @@ static void get_scan_count(struct zone *zone,
struct scan_control *sc,
        u64 fraction[2], denominator;
        enum lru_list l;
        int noswap = 0;
+       int low_pagecache = 0;

        /* If we have no swap space, do not bother scanning anon pages. */
        if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1651,6 +1657,7 @@ static void get_scan_count(struct zone *zone,
struct scan_control *sc,
                zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);

        if (scanning_global_lru(sc)) {
+               unsigned long pagecache_threshold;
                free  = zone_page_state(zone, NR_FREE_PAGES);
                /* If we have very few page cache pages,
                   force-scan anon pages. */
@@ -1660,6 +1667,10 @@ static void get_scan_count(struct zone *zone,
struct scan_control *sc,
                        denominator = 1;
                        goto out;
                }
+
+               pagecache_threshold = min_filelist_kbytes >> (PAGE_SHIFT - 10);
+               if (file < pagecache_threshold)
+                       low_pagecache = 1;
        }

        /*
@@ -1715,6 +1726,12 @@ out:
                if (priority || noswap) {
                        scan >>= priority;
                        scan = div64_u64(scan * fraction[file], denominator);
+                       /*
+                        * If the system has low page cache, we slow down
+                        * scanning speed with 1/8 to protect working set.
+                        */
+                       if (low_pagecache)
+                               scan >>= 3;
                }
                nr[l] = nr_scan_try_batch(scan,
                                          &reclaim_stat->nr_saved_scan[l]);



> --
> All rights reversed
>



-- 
Kind regards,
Minchan Kim

View attachment "slow_down_file_lru.patch" of type "text/x-patch" (2705 bytes)