linux-kernel - Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async release

Message-ID: <fe38e328-5e64-44b2-9e62-f764c4b307bd@vivo.com>
Date: Wed, 10 Sep 2025 22:07:42 +0800
From: Lei Liu <liulei.rjpt@...o.com>
To: Barry Song <21cnbao@...il.com>, Kairui Song <ryncsn@...il.com>
Cc: Michal Hocko <mhocko@...e.com>, David Rientjes <rientjes@...gle.com>,
 Shakeel Butt <shakeel.butt@...ux.dev>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>,
 Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>,
 Roman Gushchin <roman.gushchin@...ux.dev>,
 Muchun Song <muchun.song@...ux.dev>, David Hildenbrand <david@...hat.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
 <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Brendan Jackman
 <jackmanb@...gle.com>, Zi Yan <ziy@...dia.com>,
 "Peter Zijlstra (Intel)" <peterz@...radead.org>,
 Chen Yu <yu.c.chen@...el.com>, Hao Jia <jiahao1@...iang.com>,
 "Kirill A. Shutemov" <kas@...nel.org>, Usama Arif <usamaarif642@...il.com>,
 Oleg Nesterov <oleg@...hat.com>, Christian Brauner <brauner@...nel.org>,
 Mateusz Guzik <mjguzik@...il.com>, Steven Rostedt <rostedt@...dmis.org>,
 Andrii Nakryiko <andrii@...nel.org>, Al Viro <viro@...iv.linux.org.uk>,
 Fushuai Wang <wangfushuai@...du.com>,
 "open list:MEMORY MANAGEMENT - OOM KILLER" <linux-mm@...ck.org>,
 open list <linux-kernel@...r.kernel.org>,
 "open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)"
 <cgroups@...r.kernel.org>
Subject: Re: [PATCH v0 0/2] mm: swap: Gather swap entries and batch async
 release


On 2025/9/9 17:24, Barry Song wrote:
>
> On Tue, Sep 9, 2025 at 3:30 PM Kairui Song <ryncsn@...il.com> wrote:
>> On Tue, Sep 9, 2025 at 3:04 PM Lei Liu <liulei.rjpt@...o.com> wrote:
>> Hi Lei,
>>
>>> 1. Problem Scenario
>>> On systems with ZRAM and swap enabled, simultaneous process exits create
>>> contention. The primary bottleneck occurs during swap entry release
>>> operations, causing exiting processes to monopolize CPU resources. This
>>> leads to scheduling delays for high-priority processes.
>>>
>>> 2. Android Use Case
>>> During camera launch, LMKD terminates background processes to free memory.
>>> Exiting processes compete for CPU cycles, delaying the camera preview
>>> thread and causing visible stuttering - directly impacting user
>>> experience.
>>>
>>> 3. Root Cause Analysis
>>> When background applications heavily utilize swap space, process exit
>>> profiling reveals 55% of time spent in free_swap_and_cache_nr():
>>>
>>> Function              Duration (ms)   Percentage
>>> do_signal               791.813     **********100%
>>> do_group_exit           791.813     **********100%
>>> do_exit                 791.813     **********100%
>>> exit_mm                 577.859        *******73%
>>> exit_mmap               577.497        *******73%
>>> zap_pte_range           558.645        *******71%
>>> free_swap_and_cache_nr  433.381          *****55%
>>> free_swap_slot          403.568          *****51%
>> Thanks for sharing this case.
>>
>> One problem is that now the free_swap_slot function no longer exists
>> after 0ff67f990bd4. Have you tested the latest kernel? Or what is the
>> actual overhead here?
>>
>> Some batch freeing optimizations are introduced. And we have reworked
>> the whole locking mechanism for swap, so even on a system with 96t the
>> contention seems barely observable with common workloads.
>>
>> And another series is further reducing the contention and the overall
>> overhead (24% faster freeing for phase 1):
>> https://lore.kernel.org/linux-mm/20250905191357.78298-1-ryncsn@gmail.com/
>>
>> Will these be helpful for you? I think optimizing the root problem is
>> better than just deferring the overhead with async workers, which may
>> increase the overall overhead and complexity.
>>
> I feel the cover letter does not clearly describe where the bottleneck
> occurs or where the performance gains originate. To be honest, even
> the versions submitted last year did not present the bottleneck clearly.
>
> For example, is this due to lock contention (in which case we would
> need performance data to see how much CPU time is spent waiting for
> locks), or simply because we can simultaneously zap present and
> non-present PTEs?
>
> Thanks
> Barry

Hi Barry

Thank you for your question. Here is the issue we are encountering:

Flame graph of time distribution for douyin process exit (~400MB swapped):
do_notify_resume         3.89%
get_signal               3.89%
do_signal_exit           3.88%
do_exit                  3.88%
mmput                    3.22%
exit_mmap                3.22%
unmap_vmas               3.08%
unmap_page_range         3.07%
free_swap_and_cache_nr   1.31%****
swap_entry_range_free    1.17%****
zram_slot_free_notify    1.11%****
zram_free_hw_entry_dc    0.43%
free_zspage[zsmalloc]    0.09%

CPU: 8-core ARM64 (1*4.21GHz + 3*3.5GHz + 4*2.7GHz), 12GB RAM

Exit of a process with ~400MB in swap:
Exit takes 200-300ms at ~4% CPU load.
With more zram-compressed/swapped data, exit time increases to 400-500ms.
free_swap_and_cache_nr avg: 0.5ms, max: ~1.5ms (running time)
free_swap_and_cache_nr dominates exit time (33%, up to 50% in the worst
cases). Most of that time goes to freeing zram resources (0.25ms per
operation). With dozens of simultaneous exits, the cumulative time becomes
significant.
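
As a rough cross-check of the 33% figure against the flame graph above:
free_swap_and_cache_nr accounts for 1.31% of samples and do_exit for 3.88%.
A minimal sketch of that arithmetic, assuming both percentages are taken
against the same sample total (the helper name share_of_parent is made up
for illustration):

```c
#include <stdio.h>

/*
 * Hypothetical helper: percentage of a parent frame's samples that fall
 * inside a child frame, given both as percentages of the same total.
 */
double share_of_parent(double child_pct, double parent_pct)
{
    if (parent_pct <= 0.0)
        return 0.0; /* avoid dividing by zero for an empty parent frame */
    return 100.0 * child_pct / parent_pct;
}

/* Print the share of do_exit spent in free_swap_and_cache_nr. */
void print_swap_free_share(void)
{
    printf("%.1f%% of do_exit\n", share_of_parent(1.31, 3.88));
}
```

With the numbers above this comes out just under 34%, which matches the
33% quoted for the typical case.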

Optimization approach:
The focus is not on optimizing the hot functions themselves, where the room
for improvement is limited. The high load comes from too many processes
exiting simultaneously. We will make the time-consuming interfaces in
do_exit asynchronous, to accelerate exit completion while allowing non-swap
pages (file/anonymous) to be freed for other processes.
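
To illustrate the idea of gathering entries on the exit path and releasing
them later in batches, here is a minimal user-space sketch. It is not the
code from the patch series: the struct names, BATCH_MAX, and the
release_entry() callback are all made up for illustration, entries are plain
unsigned long values rather than swp_entry_t, and there is no locking or
real worker thread. In the kernel, the drain step would presumably run from
a workqueue or similar context.

```c
#include <stddef.h>
#include <stdlib.h>

#define BATCH_MAX 4 /* assumed batch size, chosen small for the example */

struct batch {
    unsigned long entries[BATCH_MAX];
    size_t count;
    struct batch *next;
};

struct deferred_queue {
    struct batch *head; /* batches handed over, waiting for the worker */
    struct batch *tail;
    struct batch *cur;  /* batch currently being filled by the exit path */
};

/* Hand the current (possibly partial) batch over to the worker list. */
void queue_flush(struct deferred_queue *q)
{
    struct batch *b = q->cur;

    if (!b)
        return;
    q->cur = NULL;
    b->next = NULL;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
}

/*
 * Exit path: record the entry and return without doing the expensive free.
 * Returns 0 on success, -1 if no batch could be allocated, in which case
 * the caller would fall back to freeing the entry synchronously.
 */
int gather_entry(struct deferred_queue *q, unsigned long entry)
{
    if (!q->cur) {
        q->cur = calloc(1, sizeof(*q->cur));
        if (!q->cur)
            return -1;
    }
    q->cur->entries[q->cur->count++] = entry;
    if (q->cur->count == BATCH_MAX)
        queue_flush(q);
    return 0;
}

/* Stand-in for the expensive per-entry release; it only sums the values. */
unsigned long released_sum;

void release_entry(unsigned long entry)
{
    released_sum += entry;
}

/* Worker side: release every queued entry, batch by batch. */
size_t drain_queue(struct deferred_queue *q, void (*release)(unsigned long))
{
    size_t released = 0;
    struct batch *b = q->head;

    q->head = q->tail = NULL;
    while (b) {
        struct batch *next = b->next;

        for (size_t i = 0; i < b->count; i++) {
            release(b->entries[i]);
            released++;
        }
        free(b);
        b = next;
    }
    return released;
}
```

The exiting task would call gather_entry() for each swap entry it hits,
call queue_flush() once at the end to hand over the last partial batch, and
leave drain_queue() to the deferred context. This only moves the cost out of
the exit path; it does not reduce the total work, which is the concern
raised earlier in the thread.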

Camera startup scenario:
20-30 background apps are running, with anonymous pages compressed to zram
(200-500MB). Camera launch triggers lmkd to kill 10+ apps, and their exits
consume 25%+ CPU. System services and third-party processes use 60%+ CPU,
leaving the camera startup process CPU-starved and delayed.

Sincere wishes,
Lei