linux-kernel - Re: [PATCH v3 2/2] mm: tlb: add tlb swap entries batch async release

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5589650f-0511-46de-b08f-8accba10de4a@vivo.com>
Date: Tue, 10 Sep 2024 17:27:41 +0800
From: zhiguojiang <justinjiang@...o.com>
To: Barry Song <21cnbao@...il.com>
Cc: David Hildenbrand <david@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Will Deacon <will@...nel.org>,
 "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>, Nick Piggin
 <npiggin@...il.com>, Peter Zijlstra <peterz@...radead.org>,
 Arnd Bergmann <arnd@...db.de>, Johannes Weiner <hannes@...xchg.org>,
 Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
 Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>,
 linux-arch@...r.kernel.org, cgroups@...r.kernel.org,
 opensource.kernel@...o.com
Subject: Re: [PATCH v3 2/2] mm: tlb: add tlb swap entries batch async release



在 2024/9/10 12:22, Barry Song 写道:
> On Tue, Sep 10, 2024 at 2:44 AM zhiguojiang <justinjiang@...o.com> wrote:
>>
>>
>> 在 2024/9/9 14:49, David Hildenbrand 写道:
>>> On 05.08.24 17:36, Zhiguo Jiang wrote:
>>>> One of the main reasons for the prolonged exit of the process with
>>>> independent mm is the time-consuming release of its swap entries.
>>>> The proportion of swap memory occupied by the process increases over
>>>> time due to high memory pressure triggering to reclaim anonymous folio
>>>> into swapspace, e.g., in Android devices, we found this proportion can
>>>> reach 60% or more after a period of time. Additionally, the relatively
>>>> lengthy path for releasing swap entries further contributes to the
>>>> longer time required to release swap entries.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at different
>>>> time points.
>>>> Example: The process launched last: com.qiyi.video
>>>> |  memory type  |  0min  |  1min  |   5min  |   10min  | 15min  |
>>>> -------------------------------------------------------------------
>>>> |     VmRSS(KB) | 453832 | 252300 |  204364 |   199944 | 199748  |
>>>> |   RssAnon(KB) | 247348 |  99296 |   71268 |    67808 | 67660  |
>>>> |   RssFile(KB) | 205536 | 152020 |  132144 |   131184 | 131136  |
>>>> |  RssShmem(KB) |   1048 |    984 |     952 |     952  | 952  |
>>>> |    VmSwap(KB) | 202692 | 334852 |  362880 |   366340 | 366488  |
>>>> | Swap ratio(%) | 30.87% | 57.03% |  63.97% |   64.69% | 64.72%  |
>>>> Note: min - minute.
>>>>
>>>> When there are multiple processes with independent mm and the high
>>>> memory pressure in system, if the large memory required process is
>>>> launched at this time, system will is likely to trigger the
>>>> instantaneous
>>>> killing of many processes with independent mm. Due to multiple exiting
>>>> processes occupying multiple CPU core resources for concurrent
>>>> execution,
>>>> leading to some issues such as the current non-exiting and important
>>>> processes lagging.
>>>>
>>>> To solve this problem, we have introduced the multiple exiting process
>>>> asynchronous swap entries release mechanism, which isolates and caches
>>>> swap entries occupied by multiple exiting processes, and hands them over
>>>> to an asynchronous kworker to complete the release. This allows the
>>>> exiting processes to complete quickly and release CPU resources. We have
>>>> validated this modification on the Android products and achieved the
>>>> expected benefits.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After restarting the machine, start 15 app processes first, and then
>>>> start the camera app processes, we monitor the cold start and preview
>>>> time datas of the camera app processes.
>>>>
>>>> Test datas of camera processes cold start time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 |   1512  |
>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 |   1204  |
>>>>
>>>> Test datas of camera processes preview time (unit: millisecond):
>>>> |  seq   |   1  |   2  |   3  |   4  |   5  |   6  | average |
>>>> | before |  267 |  402 |  504 |  513 |  161 |  265 |   352   |
>>>> | after  |  188 |  223 |  301 |  203 |  162 |  154 |   205   |
>>>>
>>>> Base on the average of the six sets of test datas above, we can see that
>>>> the benefit datas of the modified patch:
>>>> 1. The cold start time of camera app processes has reduced by about 20%.
>>>> 2. The preview time of camera app processes has reduced by about 42%.
>>>>
>>>> It offers several benefits:
>>>> 1. Alleviate the high system cpu loading caused by multiple exiting
>>>>      processes running simultaneously.
>>>> 2. Reduce lock competition in swap entry free path by an asynchronous
>>>>      kworker instead of multiple exiting processes parallel execution.
>>>> 3. Release pte_present memory occupied by exiting processes more
>>>>      efficiently.
>>>>
>>>> Signed-off-by: Zhiguo Jiang <justinjiang@...o.com>
>>>> ---
>>>>    arch/s390/include/asm/tlb.h |   8 +
>>>>    include/asm-generic/tlb.h   |  44 ++++++
>>>>    include/linux/mm_types.h    |  58 +++++++
>>>>    mm/memory.c                 |   3 +-
>>>>    mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>>>>    5 files changed, 408 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>>> index e95b2c8081eb..3f681f63390f
>>>> --- a/arch/s390/include/asm/tlb.h
>>>> +++ b/arch/s390/include/asm/tlb.h
>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct
>>>> mmu_gather *tlb,
>>>>            struct page *page, bool delay_rmap, int page_size);
>>>>    static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>            struct page *page, unsigned int nr_pages, bool delay_rmap);
>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +        swp_entry_t entry, int nr);
>>>
>>> The problem I am having is that swap entries don't have any
>>> intersection with the TLB. It sounds like we're squeezing something
>>> into an existing concept (MMU gather) that just doesn't belong in there.
>> I referred to the mechanism of batch release in tlb, and perhaps a new
>> structure needs to be created to implement this feature.
> We already use swap_slots_cache to collect multiple swap entries and
> free them in
> batches. Would it be better to incorporate our new logic there? might
> be much less
> change and don't need to touch zap_pte_range() ?  for example, while slot_caches
> are almost full, wake up the async thread to free? Or, do you think
> that cache->free_lock
> is also a contended lock?
There should be no need to add mmu_swap_batch logic in the sync release
method, and this may lead to repetitive logic. The number of async
threads for releasing slot_caches seems also uncertain for a large
number of CPUs architectures. The cache->free_lock is a percpu spinlock,
which will disable preemption, so I guess there should be no lock
contention issue with it.

I understand that you want to incorporate this feature through sync
manner, and I think it may be a bit difficult.
>
>> Thanks
>> Zhiguo
>>
> Thanks
> Barry
Thanks
Zhiguo