[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250514070820.51793-1-21cnbao@gmail.com>
Date: Wed, 14 May 2025 19:08:20 +1200
From: Barry Song <21cnbao@...il.com>
To: akpm@...ux-foundation.org,
linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org,
zhengtangquan@...o.com,
Barry Song <v-songbaohua@...o.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
David Hildenbrand <david@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Matthew Wilcox <willy@...radead.org>,
Oscar Salvador <osalvador@...e.de>,
Ryan Roberts <ryan.roberts@....com>,
Zi Yan <ziy@...dia.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>
Subject: [PATCH RFC v2] mm: add support for dropping LRU recency on process exit
From: Barry Song <v-songbaohua@...o.com>
Currently, both zap_pmd and zap_pte always promote young file folios,
regardless of whether the processes are dying.
However, in systems where the process recency fades upon dying, we may
want to reverse this behavior. The goal is to reclaim the folios from
the dying process as quickly as possible, allowing new processes to
acquire memory ASAP.
For example, while Firefox is killed and LibreOffice is launched,
activating Firefox's young file-backed folios makes it harder to
reclaim memory that LibreOffice doesn't use at all.
On systems like Android, processes are either explicitly stopped by
user action or reaped due to OOM after being inactive for a long time.
These processes are unlikely to restart in the near future. Rather than
promoting their folios, we skip promoting and demote their exclusive
folios so that memory can be reclaimed and made available for new
user-facing processes.
Users possibly do not care about the recency of a dying process.
However, we still need an explicit user indication to take this action.
Thus, we introduced a prctl to provide that necessary user-level hint
as suggested by Johannes and David.
We observed noticeable improvements in refaults, swap-ins, and swap-outs
on a hooked Android kernel. More data for this specific version will
follow.
Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: David Hildenbrand <david@...hat.com>
Cc: Johannes Weiner <hannes@...xchg.org>
Cc: Matthew Wilcox (Oracle) <willy@...radead.org>
Cc: Oscar Salvador <osalvador@...e.de>
Cc: Ryan Roberts <ryan.roberts@....com>
Cc: Zi Yan <ziy@...dia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: Mike Rapoport <rppt@...nel.org>
Cc: Suren Baghdasaryan <surenb@...gle.com>
Cc: Michal Hocko <mhocko@...e.com>
Signed-off-by: Barry Song <v-songbaohua@...o.com>
---
-v2:
* add prctl as suggested by Johannes and David
* demote exclusive file folios if drop_recency can apply
-v1:
https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
include/linux/mm_types.h | 1 +
include/uapi/linux/prctl.h | 3 +++
kernel/sys.c | 16 ++++++++++++++++
mm/huge_memory.c | 12 ++++++++++--
mm/internal.h | 14 ++++++++++++++
mm/memory.c | 12 +++++++++++-
6 files changed, 55 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 15808cad2bc1..84ab113c54a2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1733,6 +1733,7 @@ enum {
* on NFS restore
*/
//#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
+#define MMF_FADE_ON_DEATH 18 /* Recency is discarded on process exit */
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..22d861157552 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,7 @@ struct prctl_mm_map {
# define PR_TIMER_CREATE_RESTORE_IDS_ON 1
# define PR_TIMER_CREATE_RESTORE_IDS_GET 2
+#define PR_SET_FADE_ON_DEATH 78
+#define PR_GET_FADE_ON_DEATH 79
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..cabe1bbb35a4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
clear_bit(MMF_DISABLE_THP, &me->mm->flags);
mmap_write_unlock(me->mm);
break;
+ case PR_GET_FADE_ON_DEATH:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = !!test_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
+ break;
+ case PR_SET_FADE_ON_DEATH:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (mmap_write_lock_killable(me->mm))
+ return -EINTR;
+ if (arg2)
+ set_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
+ else
+ clear_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
+ mmap_write_unlock(me->mm);
+ break;
case PR_MPX_ENABLE_MANAGEMENT:
case PR_MPX_DISABLE_MANAGEMENT:
/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..c99894611d4a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2204,6 +2204,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
+ bool drop_recency = false;
pmd_t orig_pmd;
spinlock_t *ptl;
@@ -2260,13 +2261,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
add_mm_counter(tlb->mm, mm_counter_file(folio),
-HPAGE_PMD_NR);
+ drop_recency = zap_need_to_drop_recency(tlb->mm);
/*
* Use flush_needed to indicate whether the PMD entry
* is present, instead of checking pmd_present() again.
*/
- if (flush_needed && pmd_young(orig_pmd) &&
- likely(vma_has_recency(vma)))
+ if (flush_needed && pmd_young(orig_pmd) && !drop_recency &&
+ likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
+ /*
+ * Userspace explicitly marks recency to fade when the process
+ * dies; demote exclusive file folios to aid reclamation.
+ */
+ if (drop_recency && !folio_maybe_mapped_shared(folio))
+ deactivate_file_folio(folio);
}
spin_unlock(ptl);
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..af9649b3e84a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,7 @@
#include <linux/khugepaged.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>
+#include <linux/oom.h>
#include <linux/pagemap.h>
#include <linux/pagewalk.h>
#include <linux/rmap.h>
@@ -130,6 +131,19 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
}
+/*
+ * Returns true if the process attached to the mm is dying or undergoing
+ * OOM reaping, and its recency—explicitly marked by userspace—will also
+ * fade; otherwise, returns false.
+ */
+static inline bool zap_need_to_drop_recency(struct mm_struct *mm)
+{
+ if (!atomic_read(&mm->mm_users) || check_stable_address_space(mm))
+ return !!test_bit(MMF_FADE_ON_DEATH, &mm->flags);
+
+ return false;
+}
+
/*
* Retrieve the first entry of a folio based on a provided entry within the
* folio. We cannot rely on folio->swap as there is no guarantee that it has
diff --git a/mm/memory.c b/mm/memory.c
index 5a7e4c0e89c7..6dd01a7736a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1505,6 +1505,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
bool *force_flush, bool *force_break, bool *any_skipped)
{
struct mm_struct *mm = tlb->mm;
+ bool drop_recency = false;
bool delay_rmap = false;
if (!folio_test_anon(folio)) {
@@ -1516,9 +1517,18 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
*force_flush = true;
}
}
- if (pte_young(ptent) && likely(vma_has_recency(vma)))
+
+ drop_recency = zap_need_to_drop_recency(mm);
+ if (pte_young(ptent) && !drop_recency &&
+ likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
rss[mm_counter(folio)] -= nr;
+ /*
+ * Userspace explicitly marks recency to fade when the process dies;
+ * demote exclusive file folios to aid reclamation.
+ */
+ if (drop_recency && !folio_maybe_mapped_shared(folio))
+ deactivate_file_folio(folio);
} else {
/* We don't need up-to-date accessed/dirty bits. */
clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
--
2.39.3 (Apple Git-146)
Powered by blists - more mailing lists