linux-kernel - Re: [PATCH RFC v2] mm: add support for dropping LRU recency on process exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGsJ_4yKpRv392rbM=insAPZMA-cwvTNbcPzGAEycfPDzsbJZQ@mail.gmail.com>
Date: Thu, 22 May 2025 14:05:37 +1200
From: Barry Song <21cnbao@...il.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>, Barry Song <21cnbao@...il.com>, 
	akpm@...ux-foundation.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	zhengtangquan@...o.com, Barry Song <v-songbaohua@...o.com>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, David Hildenbrand <david@...hat.com>, 
	Johannes Weiner <hannes@...xchg.org>, Matthew Wilcox <willy@...radead.org>, 
	Oscar Salvador <osalvador@...e.de>, Ryan Roberts <ryan.roberts@....com>, Zi Yan <ziy@...dia.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Mike Rapoport <rppt@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH RFC v2] mm: add support for dropping LRU recency on
 process exit

Hi Liam,
I really appreciate your review—thank you!

On Wed, May 21, 2025 at 4:20 AM Liam R. Howlett <Liam.Howlett@...cle.com> wrote:
>
> * Barry Song <21cnbao@...il.com> [250514 03:08]:
> > From: Barry Song <v-songbaohua@...o.com>
> >
> > Currently, both zap_pmd and zap_pte always promote young file folios,
> > regardless of whether the processes are dying.
> > However, in systems where the process recency fades upon dying, we may
> > want to reverse this behavior. The goal is to reclaim the folios from
> > the dying process as quickly as possible, allowing new processes to
> > acquire memory ASAP.
> > For example, while Firefox is killed and LibreOffice is launched,
> > activating Firefox's young file-backed folios makes it harder to
> > reclaim memory that LibreOffice doesn't use at all.
> >
> > On systems like Android, processes are either explicitly stopped by
> > user action or reaped due to OOM after being inactive for a long time.
> > These processes are unlikely to restart in the near future. Rather than
> > promoting their folios, we skip promoting and demote their exclusive
> > folios so that memory can be reclaimed and made available for new
> > user-facing processes.
> >
> > Users possibly do not care about the recency of a dying process.
> > However, we still need an explicit user indication to take this action.
>
> Can you add why?  It'd be nice to capture the reasons pointed out in v1
> discussion as they seem important to why this isn't set as a default for
> all tasks.

Essentially, I took Johannes’ point (and to some extent David’s as
well) to be that it behaves somewhat unpredictably in broader
application scenarios—for example, when repeatedly executing a file
in a script or restarting an application shortly after it exits.

Also, when a shared library is mapped by multiple processes, we might
still want to retain recency information from a process that is exiting.
So we might only want to do that only for exclusive folios.

This actually leads to two questions:

1. Are we confident that the recency of a dead process is no longer
   useful within a period of time?

2. Should we limit the optimization only to exclusive folios—for
   example, shared objects (.so files) that are specific to the
   exiting process?

For both questions, the answer seems to be yes.

Though in the first case—when we repeatedly restart the same
application—the folios are likely still in the LRU and may still be
hit even if we unconditionally demote them. But that's not guaranteed.
So we likely need a userspace hint to eliminate the uncertainty.

>
> > Thus, we introduced a prctl to provide that necessary user-level hint
> > as suggested by Johannes and David.
>
> I'm not sure it really makes much of a difference if we update the lru
> or not in this case.  Johannes point about this small change having
> unknown results for the larger community is certainly the best argument
> as to why we need this to be opt-in.
>
> We should probably document it so that people can opt-in though :)
>
> >
> > We observed noticeable improvements in refaults, swap-ins, and swap-outs
> > on a hooked Android kernel. More data for this specific version will
> > follow.
>
> Looking forward to the results.  What happens when I kill my app and
> reopen it? (close all apps, open the one that was being annoying?)

I'm not sure I fully understand your question. In Android, we're
primarily concerned with smooth app switching. For example, in a
sequence like A → B → C → D → E, if we can quickly reclaim folios
from dead processes, it helps us launch new (different) apps faster.

However, if we do A → kill A → start A → kill A → start A repeatedly,
it’s likely not a problem because our memory can hold the same
application. The issue arises when memory isn’t enough to hold
A + B + C + D + E simultaneously.

I’m not overly concerned about repeatedly restarting the same
application in Android. However, for wider scenarios across various
industries, I’m uncertain.

>
> >
> > Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
> > Cc: David Hildenbrand <david@...hat.com>
> > Cc: Johannes Weiner <hannes@...xchg.org>
> > Cc: Matthew Wilcox (Oracle) <willy@...radead.org>
> > Cc: Oscar Salvador <osalvador@...e.de>
> > Cc: Ryan Roberts <ryan.roberts@....com>
> > Cc: Zi Yan <ziy@...dia.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>
> > Cc: Vlastimil Babka <vbabka@...e.cz>
> > Cc: Mike Rapoport <rppt@...nel.org>
> > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > Cc: Michal Hocko <mhocko@...e.com>
> > Signed-off-by: Barry Song <v-songbaohua@...o.com>
> > ---
> >  -v2:
> >  * add prctl as suggested by Johannes and David
> >  * demote exclusive file folios if drop_recency can apply
> >  -v1:
> >  https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
> >
> >  include/linux/mm_types.h   |  1 +
> >  include/uapi/linux/prctl.h |  3 +++
> >  kernel/sys.c               | 16 ++++++++++++++++
> >  mm/huge_memory.c           | 12 ++++++++++--
> >  mm/internal.h              | 14 ++++++++++++++
> >  mm/memory.c                | 12 +++++++++++-
> >  6 files changed, 55 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 15808cad2bc1..84ab113c54a2 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1733,6 +1733,7 @@ enum {
> >   * on NFS restore
> >   */
> >  //#define MMF_EXE_FILE_CHANGED       18      /* see prctl_set_mm_exe_file() */
> > +#define MMF_FADE_ON_DEATH    18      /* Recency is discarded on process exit */
>
> Why is recency not in the MMF name?  Why not MMF_NO_RECENCY or
> something?

I included RECENCY in the name but found it too long. On the other hand,
MMF_NO_RECENCY seems insufficient to convey the true meaning, since we
do have recency—it’s just lost on death. So perhaps the original, longer
names I considered are better: MMF_RECENCY_FADE_ON_DEATH or
MMF_NO_RECENCY_ON_DEATH?

>
> I guess we are back to no space in this flag.

Yes, it is 32 bits.

>
> >
> >  #define MMF_HAS_UPROBES              19      /* has uprobes */
> >  #define MMF_RECALC_UPROBES   20      /* MMF_HAS_UPROBES can be wrong */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 15c18ef4eb11..22d861157552 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -364,4 +364,7 @@ struct prctl_mm_map {
> >  # define PR_TIMER_CREATE_RESTORE_IDS_ON              1
> >  # define PR_TIMER_CREATE_RESTORE_IDS_GET     2
> >
> > +#define PR_SET_FADE_ON_DEATH                 78
> > +#define PR_GET_FADE_ON_DEATH                 79
> > +
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index c434968e9f5d..cabe1bbb35a4 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >                       clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> >               mmap_write_unlock(me->mm);
> >               break;
> > +     case PR_GET_FADE_ON_DEATH:
> > +             if (arg2 || arg3 || arg4 || arg5)
> > +                     return -EINVAL;
> > +             error = !!test_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> > +             break;
>
> Is there a usecase for get?

Probably not. I was just trying to implement put/get for a pair.
I’m happy to remove it if you feel it’s redundant.

>
> > +     case PR_SET_FADE_ON_DEATH:
>
> Could you just check the value prior to setting and just return if it's
> what you want?  In which case, the setting is just change_bit(), and
> there probably isn't a need for a get?

Ok.

>
> > +             if (arg3 || arg4 || arg5)
> > +                     return -EINVAL;
> > +             if (mmap_write_lock_killable(me->mm))
> > +                     return -EINTR;
> > +             if (arg2)
> > +                     set_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> > +             else
> > +                     clear_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> > +             mmap_write_unlock(me->mm);
> > +             break;
> >       case PR_MPX_ENABLE_MANAGEMENT:
> >       case PR_MPX_DISABLE_MANAGEMENT:
> >               /* No longer implemented: */
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 2780a12b25f0..c99894611d4a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2204,6 +2204,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
> >  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                pmd_t *pmd, unsigned long addr)
> >  {
> > +     bool drop_recency = false;
> >       pmd_t orig_pmd;
> >       spinlock_t *ptl;
> >
> > @@ -2260,13 +2261,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                       add_mm_counter(tlb->mm, mm_counter_file(folio),
> >                                      -HPAGE_PMD_NR);
> >
> > +                     drop_recency = zap_need_to_drop_recency(tlb->mm);
> >                       /*
> >                        * Use flush_needed to indicate whether the PMD entry
> >                        * is present, instead of checking pmd_present() again.
> >                        */
> > -                     if (flush_needed && pmd_young(orig_pmd) &&
> > -                         likely(vma_has_recency(vma)))
> > +                     if (flush_needed && pmd_young(orig_pmd) && !drop_recency &&
> > +                                     likely(vma_has_recency(vma)))
> >                               folio_mark_accessed(folio);
> > +                     /*
> > +                      * Userspace explicitly marks recency to fade when the process
> > +                      * dies; demote exclusive file folios to aid reclamation.
> > +                      */
> > +                     if (drop_recency && !folio_maybe_mapped_shared(folio))
> > +                             deactivate_file_folio(folio);
> >               }
> >
> >               spin_unlock(ptl);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6b8ed2017743..af9649b3e84a 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -11,6 +11,7 @@
> >  #include <linux/khugepaged.h>
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> > +#include <linux/oom.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/pagewalk.h>
> >  #include <linux/rmap.h>
> > @@ -130,6 +131,19 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
> >       return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> >  }
> >
> > +/*
> > + * Returns true if the process attached to the mm is dying or undergoing
> > + * OOM reaping, and its recency—explicitly marked by userspace—will also
> > + * fade; otherwise, returns false.
> > + */
> > +static inline bool zap_need_to_drop_recency(struct mm_struct *mm)
>
> This name is confusing.  We are zapping the need to drop the recency? If
> this returns false, then the need to drop recency is false..  It is not
> very easy to read and harder to understand how it translates to the
> values it returns.
>
> How about mm_has_exit_recency(), like vma_has_recency()?
> Or mmf_update_recency()?

It seems mm_has_exit_recency() is good.

>
> > +{
> > +     if (!atomic_read(&mm->mm_users) || check_stable_address_space(mm))
>
> FYI, failed forks may also set the address space as unstable.
>
> > +             return !!test_bit(MMF_FADE_ON_DEATH, &mm->flags);
> > +
> > +     return false;
> > +}
> > +
> >  /*
> >   * Retrieve the first entry of a folio based on a provided entry within the
> >   * folio. We cannot rely on folio->swap as there is no guarantee that it has
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5a7e4c0e89c7..6dd01a7736a8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1505,6 +1505,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> >               bool *force_flush, bool *force_break, bool *any_skipped)
> >  {
> >       struct mm_struct *mm = tlb->mm;
> > +     bool drop_recency = false;
> >       bool delay_rmap = false;
> >
> >       if (!folio_test_anon(folio)) {
> > @@ -1516,9 +1517,18 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> >                               *force_flush = true;
> >                       }
> >               }
> > -             if (pte_young(ptent) && likely(vma_has_recency(vma)))
> > +
> > +             drop_recency = zap_need_to_drop_recency(mm);
> > +             if (pte_young(ptent) && !drop_recency &&
> > +                             likely(vma_has_recency(vma)))
>
>
> I really don't like that you are calling an atomic_read() and two flag
> checks every time this block of code it executed.  This must impact your
> performance?

Fair enough. That seems like a valid point to consider regarding atomic
operations.

>
> How about this:
> 1. Check in unmap_vmas() that the range is 0 - ULONG_MAX, and if the OOM
> flag is set.
> 2. set a new zap_flags_t flag (mmf_update_recency, maybe?) if
> test_bit(MMF_FADE_ON_DEATH)
> 3. check zap_details->zap_flags if that bit is set in this function.
> 4. (hopefully) profit with better performance :)
>
> Since this really is a zap flag, it fits to make it one.  It also means
> that you will not need to check an atomic and will only check the one
> flag as apposed to two.
>
> I think we can live with some user (probably syzbot) unmapping 0 -
> ULONG_MAX and incorrectly checking a flag and, in the very rare case of
> actually using this flag, does not do the correct LRU aging.  If you
> unmap everything, we can be pretty confident that you will be on the
> exit path rather quickly.

Good idea—let me give this a try.

>
> >                       folio_mark_accessed(folio);
> >               rss[mm_counter(folio)] -= nr;
> > +             /*
> > +              * Userspace explicitly marks recency to fade when the process dies;
> > +              * demote exclusive file folios to aid reclamation.
> > +              */
> > +             if (drop_recency && !folio_maybe_mapped_shared(folio))
> > +                     deactivate_file_folio(folio);
>
> Thanks,
> Liam
>

Thanks
Barry