[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xymq3lezzjc3hzh2eduogqpn6okrbbkodjdwu6ylpeszakb22a@ck6yhdfdcayq>
Date: Tue, 20 May 2025 12:19:58 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, zhengtangquan@...o.com,
Barry Song <v-songbaohua@...o.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
David Hildenbrand <david@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Matthew Wilcox <willy@...radead.org>,
Oscar Salvador <osalvador@...e.de>,
Ryan Roberts <ryan.roberts@....com>, Zi Yan <ziy@...dia.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH RFC v2] mm: add support for dropping LRU recency on
process exit
* Barry Song <21cnbao@...il.com> [250514 03:08]:
> From: Barry Song <v-songbaohua@...o.com>
>
> Currently, both zap_pmd and zap_pte always promote young file folios,
> regardless of whether the processes are dying.
> However, in systems where the process recency fades upon dying, we may
> want to reverse this behavior. The goal is to reclaim the folios from
> the dying process as quickly as possible, allowing new processes to
> acquire memory ASAP.
> For example, while Firefox is killed and LibreOffice is launched,
> activating Firefox's young file-backed folios makes it harder to
> reclaim memory that LibreOffice doesn't use at all.
>
> On systems like Android, processes are either explicitly stopped by
> user action or reaped due to OOM after being inactive for a long time.
> These processes are unlikely to restart in the near future. Rather than
> promoting their folios, we skip promoting and demote their exclusive
> folios so that memory can be reclaimed and made available for new
> user-facing processes.
>
> Users possibly do not care about the recency of a dying process.
> However, we still need an explicit user indication to take this action.
Can you add why? It'd be nice to capture the reasons pointed out in v1
discussion as they seem important to why this isn't set as a default for
all tasks.
> Thus, we introduced a prctl to provide that necessary user-level hint
> as suggested by Johannes and David.
I'm not sure it really makes much of a difference if we update the lru
or not in this case. Johannes point about this small change having
unknown results for the larger community is certainly the best argument
as to why we need this to be opt-in.
We should probably document it so that people can opt-in though :)
>
> We observed noticeable improvements in refaults, swap-ins, and swap-outs
> on a hooked Android kernel. More data for this specific version will
> follow.
Looking forward to the results. What happens when I kill my app and
reopen it? (close all apps, open the one that was being annoying?)
>
> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> Cc: Matthew Wilcox (Oracle) <willy@...radead.org>
> Cc: Oscar Salvador <osalvador@...e.de>
> Cc: Ryan Roberts <ryan.roberts@....com>
> Cc: Zi Yan <ziy@...dia.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Mike Rapoport <rppt@...nel.org>
> Cc: Suren Baghdasaryan <surenb@...gle.com>
> Cc: Michal Hocko <mhocko@...e.com>
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
> ---
> -v2:
> * add prctl as suggested by Johannes and David
> * demote exclusive file folios if drop_recency can apply
> -v1:
> https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
>
> include/linux/mm_types.h | 1 +
> include/uapi/linux/prctl.h | 3 +++
> kernel/sys.c | 16 ++++++++++++++++
> mm/huge_memory.c | 12 ++++++++++--
> mm/internal.h | 14 ++++++++++++++
> mm/memory.c | 12 +++++++++++-
> 6 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 15808cad2bc1..84ab113c54a2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1733,6 +1733,7 @@ enum {
> * on NFS restore
> */
> //#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
> +#define MMF_FADE_ON_DEATH 18 /* Recency is discarded on process exit */
Why is recency not in the MMF name? Why not MMF_NO_RECENCY or
something?
I guess we are back to no space in this flag.
>
> #define MMF_HAS_UPROBES 19 /* has uprobes */
> #define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..22d861157552 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,7 @@ struct prctl_mm_map {
> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1
> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2
>
> +#define PR_SET_FADE_ON_DEATH 78
> +#define PR_GET_FADE_ON_DEATH 79
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..cabe1bbb35a4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> mmap_write_unlock(me->mm);
> break;
> + case PR_GET_FADE_ON_DEATH:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = !!test_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> + break;
Is there a usecase for get?
> + case PR_SET_FADE_ON_DEATH:
Could you just check the value prior to setting and just return if it's
what you want? In which case, the setting is just change_bit(), and
there probably isn't a need for a get?
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> + if (arg2)
> + set_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> + else
> + clear_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> + mmap_write_unlock(me->mm);
> + break;
> case PR_MPX_ENABLE_MANAGEMENT:
> case PR_MPX_DISABLE_MANAGEMENT:
> /* No longer implemented: */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f0..c99894611d4a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2204,6 +2204,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
> int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> pmd_t *pmd, unsigned long addr)
> {
> + bool drop_recency = false;
> pmd_t orig_pmd;
> spinlock_t *ptl;
>
> @@ -2260,13 +2261,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> add_mm_counter(tlb->mm, mm_counter_file(folio),
> -HPAGE_PMD_NR);
>
> + drop_recency = zap_need_to_drop_recency(tlb->mm);
> /*
> * Use flush_needed to indicate whether the PMD entry
> * is present, instead of checking pmd_present() again.
> */
> - if (flush_needed && pmd_young(orig_pmd) &&
> - likely(vma_has_recency(vma)))
> + if (flush_needed && pmd_young(orig_pmd) && !drop_recency &&
> + likely(vma_has_recency(vma)))
> folio_mark_accessed(folio);
> + /*
> + * Userspace explicitly marks recency to fade when the process
> + * dies; demote exclusive file folios to aid reclamation.
> + */
> + if (drop_recency && !folio_maybe_mapped_shared(folio))
> + deactivate_file_folio(folio);
> }
>
> spin_unlock(ptl);
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b8ed2017743..af9649b3e84a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -11,6 +11,7 @@
> #include <linux/khugepaged.h>
> #include <linux/mm.h>
> #include <linux/mm_inline.h>
> +#include <linux/oom.h>
> #include <linux/pagemap.h>
> #include <linux/pagewalk.h>
> #include <linux/rmap.h>
> @@ -130,6 +131,19 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
> return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> }
>
> +/*
> + * Returns true if the process attached to the mm is dying or undergoing
> + * OOM reaping, and its recency—explicitly marked by userspace—will also
> + * fade; otherwise, returns false.
> + */
> +static inline bool zap_need_to_drop_recency(struct mm_struct *mm)
This name is confusing. We are zapping the need to drop the recency? If
this returns false, then the need to drop recency is false.. It is not
very easy to read and harder to understand how it translates to the
values it returns.
How about mm_has_exit_recency(), like vma_has_recency()?
Or mmf_update_recency()?
> +{
> + if (!atomic_read(&mm->mm_users) || check_stable_address_space(mm))
FYI, failed forks may also set the address space as unstable.
> + return !!test_bit(MMF_FADE_ON_DEATH, &mm->flags);
> +
> + return false;
> +}
> +
> /*
> * Retrieve the first entry of a folio based on a provided entry within the
> * folio. We cannot rely on folio->swap as there is no guarantee that it has
> diff --git a/mm/memory.c b/mm/memory.c
> index 5a7e4c0e89c7..6dd01a7736a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1505,6 +1505,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> bool *force_flush, bool *force_break, bool *any_skipped)
> {
> struct mm_struct *mm = tlb->mm;
> + bool drop_recency = false;
> bool delay_rmap = false;
>
> if (!folio_test_anon(folio)) {
> @@ -1516,9 +1517,18 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> *force_flush = true;
> }
> }
> - if (pte_young(ptent) && likely(vma_has_recency(vma)))
> +
> + drop_recency = zap_need_to_drop_recency(mm);
> + if (pte_young(ptent) && !drop_recency &&
> + likely(vma_has_recency(vma)))
I really don't like that you are calling an atomic_read() and two flag
checks every time this block of code it executed. This must impact your
performance?
How about this:
1. Check in unmap_vmas() that the range is 0 - ULONG_MAX, and if the OOM
flag is set.
2. set a new zap_flags_t flag (mmf_update_recency, maybe?) if
test_bit(MMF_FADE_ON_DEATH)
3. check zap_details->zap_flags if that bit is set in this function.
4. (hopefully) profit with better performance :)
Since this really is a zap flag, it fits to make it one. It also means
that you will not need to check an atomic and will only check the one
flag as apposed to two.
I think we can live with some user (probably syzbot) unmapping 0 -
ULONG_MAX and incorrectly checking a flag and, in the very rare case of
actually using this flag, does not do the correct LRU aging. If you
unmap everything, we can be pretty confident that you will be on the
exit path rather quickly.
> folio_mark_accessed(folio);
> rss[mm_counter(folio)] -= nr;
> + /*
> + * Userspace explicitly marks recency to fade when the process dies;
> + * demote exclusive file folios to aid reclamation.
> + */
> + if (drop_recency && !folio_maybe_mapped_shared(folio))
> + deactivate_file_folio(folio);
Thanks,
Liam
Powered by blists - more mailing lists