Message-ID: <CAKYFi-7H66SvwxpbZLSXankOTQ9LfCyUnL5_+hA6tpJYkJOZ8A@mail.gmail.com>
Date: Mon, 10 Sep 2018 11:17:42 +0200
From: Marcus Linsner <constantoverride@...il.com>
To: linux-kernel@...r.kernel.org
Subject: Re: Howto prevent kernel from evicting code pages ever? (to avoid
disk thrashing when about to run out of RAM)
On Wed, Aug 22, 2018 at 11:25 AM Marcus Linsner
<constantoverride@...il.com> wrote:
>
> Hi. How can I make the kernel keep (lock?) all code pages in RAM so
> that kswapd0 won't evict them when the system is under low-memory
> conditions?
>
> The purpose is to prevent the kernel from causing lots of disk reads
> (effectively freezing the whole system) when it is about to run out
> of RAM, even with no swap enabled, and well before (minutes of real
> time) the OOM-killer triggers to kill the offending process (e.g. ld)!
>
> I can replicate this consistently with 4G (and 12G) max RAM inside a
> Qubes OS R4.0 AppVM running Fedora 28 while trying to compile Firefox.
> The disk thrashing (continuous 192+ MiB/sec reads) occurs well before
> the OOM-killer triggers to kill the 'ld' (or 'rustc') process, and
> everything is frozen for (real-time) minutes. I've also encountered
> this on bare metal, if it matters at all.
>
> I tried asking this question on SO here:
> https://stackoverflow.com/q/51927528/10239615
> but maybe I'll have better luck on this mailing list, where the
> kernel experts are.
>
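(Aside: a single process can pin its own pages with mlockall(2), but
that is per-process, needs CAP_IPC_LOCK or a large enough
RLIMIT_MEMLOCK, and does nothing for other processes' code pages, so
it isn't the system-wide answer I was after. A minimal sketch, for
completeness:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Pin all pages currently mapped and any mapped in the future;
	 * the kernel will then never evict this process's pages. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return 1;
	}
	/* ... real work here; every touched page stays resident ... */
	return 0;
}

What I really want is for the kernel to treat all Active(file) pages
roughly that way, which is what the patch below attempts.)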
This is what I got working so far to prevent the disk thrashing
(constant re-reading of active executable pages from disk) that would
otherwise freeze the OS before it actually runs out of memory.
The patch below can also be seen here:
https://github.com/constantoverride/qubes-linux-kernel/blob/devel-4.18/patches.addon/le9d.patch

revision 3
preliminary patch to avoid disk thrashing (constant reading) under
memory pressure, before the OOM-killer triggers
more info: https://gist.github.com/constantoverride/84eba764f487049ed642eb2111a20830
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..7636498 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -208,7 +208,7 @@ enum lru_list {
 
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_INACTIVE_FILE; lru++)
 
 static inline int is_file_lru(enum lru_list lru)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f8..1f3ffb5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2086,9 +2086,9 @@ static unsigned long shrink_list(enum lr
 				 struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, is_file_lru(lru),
-					 memcg, sc, true))
-			shrink_active_list(nr_to_scan, lruvec, sc, lru);
+		//if (inactive_list_is_low(lruvec, is_file_lru(lru),
+		//			 memcg, sc, true))
+		//	shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -2234,7 +2234,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, MAX_NR_ZONES);
-	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
+	file  = //lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
 	spin_lock_irq(&pgdat->lru_lock);
@@ -2345,7 +2345,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 			 sc->priority == DEF_PRIORITY);
 
 	blk_start_plug(&plug);
-	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+	while (nr[LRU_INACTIVE_ANON] || //nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		unsigned long nr_anon, nr_file, percentage;
 		unsigned long nr_scanned;
@@ -2372,7 +2372,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 		 * stop reclaiming one LRU and reduce the amount scanning
 		 * proportional to the original scan target.
 		 */
-		nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
+		nr_file = nr[LRU_INACTIVE_FILE] //+ nr[LRU_ACTIVE_FILE]
+			;
 		nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
 
 		/*
@@ -2391,7 +2392,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 			percentage = nr_anon * 100 / scan_target;
 		} else {
 			unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
-						targets[LRU_ACTIVE_FILE] + 1;
+						//targets[LRU_ACTIVE_FILE] +
+						1;
 			lru = LRU_FILE;
 			percentage = nr_file * 100 / scan_target;
 		}
@@ -2409,10 +2411,12 @@ static void shrink_node_memcg(struct pgl
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
 
+		if (LRU_FILE != lru) { //avoid this block for LRU_ACTIVE_FILE
 		lru += LRU_ACTIVE;
 		nr_scanned = targets[lru] - nr[lru];
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
+		}
 
 		scan_adjusted = true;
 	}
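For reference, the mmzone.h change relies on the order of the LRU
lists in enum lru_list (include/linux/mmzone.h in 4.18): ending the
for_each_evictable_lru() loop at LRU_INACTIVE_FILE instead of
LRU_ACTIVE_FILE makes every reclaim loop skip the active-file list
entirely:

#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,	/* skipped by the patched loop */
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};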
Tested on kernel 4.18.5 under Qubes OS, in both dom0 and VMs. It gets
rid of the disk thrashing that would otherwise seemingly permanently
freeze a qube (VM) with continuous disk reading (as seen from dom0 via
sudo iotop). With the above, the system freezes for at most one second
before the OOM-killer triggers and recovers the RAM by killing some
process.
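A quicker way to reproduce the pressure than compiling Firefox should
be a trivial allocator that dirties everything it gets until the
OOM-killer steps in. Just a test sketch, not part of the patch, and it
will get a process killed by design:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64UL << 20)	/* allocate 64 MiB per step */

int main(void)
{
	char *p;

	/* Keep allocating and touching pages until malloc fails or
	 * the OOM-killer kills us; touching every page forces the
	 * allocation to be really backed by RAM despite overcommit. */
	while ((p = malloc(CHUNK)) != NULL)
		memset(p, 0xaa, CHUNK);
	pause();	/* hold on to the memory we did get */
	return 0;
}

Run it alongside some file-backed activity (e.g. a build) and watch
whether the system thrashes before the OOM-killer fires.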
If anyone has a better idea, please let me know. I am hoping someone
knowledgeable can step in :)
I tried to find a way to also keep the Inactive(file) pages in RAM,
just for testing(!), but couldn't figure out how (I'm not a
programmer). So keeping just the Active(file) pages seems good enough
for now, even though I can clearly see (via vm.block_dump=1) that some
pages are still re-read under high memory pressure; for some reason
they don't cause any (or much) disk thrashing.
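To measure this rather than eyeball block_dump, something like
mincore(2) ought to work for checking how much of a given binary is
still resident in the page cache. A rough sketch (the file argument is
whatever executable you care about, e.g. /usr/bin/ld):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	size_t pagesize, npages, resident = 0, i;
	unsigned char *vec;
	void *map;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) != 0) {
		perror(argv[1]);
		return 1;
	}
	pagesize = sysconf(_SC_PAGESIZE);
	npages = (st.st_size + pagesize - 1) / pagesize;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	vec = malloc(npages);
	if (map == MAP_FAILED || !vec || mincore(map, st.st_size, vec) != 0) {
		perror("mmap/mincore");
		return 1;
	}
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;	/* low bit set = page is in RAM */
	printf("%zu of %zu pages resident\n", resident, npages);
	return 0;
}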
Cheers!