Message-ID: <d42c8d30-98be-4251-82d3-63704841bc25@redhat.com>
Date: Tue, 10 Sep 2024 10:52:34 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>, zhiguojiang <justinjiang@...o.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Will Deacon <will@...nel.org>,
"Aneesh Kumar K.V" <aneesh.kumar@...nel.org>, Nick Piggin
<npiggin@...il.com>, Peter Zijlstra <peterz@...radead.org>,
Arnd Bergmann <arnd@...db.de>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>,
linux-arch@...r.kernel.org, cgroups@...r.kernel.org,
opensource.kernel@...o.com
Subject: Re: [PATCH v3 2/2] mm: tlb: add tlb swap entries batch async release
On 10.09.24 06:22, Barry Song wrote:
> On Tue, Sep 10, 2024 at 2:44 AM zhiguojiang <justinjiang@...o.com> wrote:
>>
>>
>>
>>> On 2024/9/9 14:49, David Hildenbrand wrote:
>>> On 05.08.24 17:36, Zhiguo Jiang wrote:
>>>> One of the main reasons for the prolonged exit of a process with an
>>>> independent mm is the time-consuming release of its swap entries.
>>>> The proportion of swap memory occupied by the process increases over
>>>> time, because high memory pressure triggers the reclaim of anonymous
>>>> folios into swap space; e.g., on Android devices we found that this
>>>> proportion can reach 60% or more after a period of time. Additionally,
>>>> the relatively lengthy path for releasing swap entries further
>>>> contributes to the long time required to release them.
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After booting up, start 15 processes first, and then observe the
>>>> physical memory size occupied by the last launched process at different
>>>> time points.
>>>> Example: the last launched process is com.qiyi.video
>>>> | memory type | 0min | 1min | 5min | 10min | 15min |
>>>> -------------------------------------------------------------------
>>>> | VmRSS(KB) | 453832 | 252300 | 204364 | 199944 | 199748 |
>>>> | RssAnon(KB) | 247348 | 99296 | 71268 | 67808 | 67660 |
>>>> | RssFile(KB) | 205536 | 152020 | 132144 | 131184 | 131136 |
>>>> | RssShmem(KB) | 1048 | 984 | 952 | 952 | 952 |
>>>> | VmSwap(KB) | 202692 | 334852 | 362880 | 366340 | 366488 |
>>>> | Swap ratio(%) | 30.87% | 57.03% | 63.97% | 64.69% | 64.72% |
>>>> Note: min - minute.
>>>>
>>>> When there are multiple processes with independent mm and high memory
>>>> pressure in the system, launching a process that requires a large
>>>> amount of memory is likely to trigger the near-instantaneous killing of
>>>> many processes with independent mm. Because these multiple exiting
>>>> processes occupy multiple CPU cores for concurrent execution, this
>>>> leads to issues such as lagging of the important, non-exiting
>>>> processes.
>>>>
>>>> To solve this problem, we have introduced an asynchronous release
>>>> mechanism for the swap entries of multiple exiting processes: the swap
>>>> entries occupied by the exiting processes are isolated and cached, then
>>>> handed over to an asynchronous kworker to complete the release. This
>>>> allows the exiting processes to finish quickly and release their CPU
>>>> resources. We have validated this modification on Android products and
>>>> achieved the expected benefits.
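>>>>
>>>> A minimal sketch of the core idea, for illustration only (the helper
>>>> names below are made up and this is not the patch's actual
>>>> implementation): swap entries found while tearing down the exiting mm
>>>> are pushed onto a lockless list and freed later from a kworker.
>>>>
>>>>   #include <linux/llist.h>
>>>>   #include <linux/mm_types.h>
>>>>   #include <linux/slab.h>
>>>>   #include <linux/swap.h>
>>>>   #include <linux/workqueue.h>
>>>>
>>>>   /* One deferred swap entry, queued on a lockless list. */
>>>>   struct swap_free_node {
>>>>           struct llist_node node;
>>>>           swp_entry_t entry;
>>>>   };
>>>>
>>>>   static LLIST_HEAD(deferred_swap_frees);
>>>>   static void swap_free_workfn(struct work_struct *work);
>>>>   static DECLARE_WORK(swap_free_work, swap_free_workfn);
>>>>
>>>>   /* Exit path: stash the entry instead of freeing it synchronously. */
>>>>   static void defer_swap_entry_free(swp_entry_t entry)
>>>>   {
>>>>           struct swap_free_node *n = kmalloc(sizeof(*n), GFP_ATOMIC);
>>>>
>>>>           if (!n) {
>>>>                   /* Allocation failed: fall back to the usual sync free. */
>>>>                   free_swap_and_cache(entry);
>>>>                   return;
>>>>           }
>>>>           n->entry = entry;
>>>>           llist_add(&n->node, &deferred_swap_frees);
>>>>           schedule_work(&swap_free_work);
>>>>   }
>>>>
>>>>   /* Kworker: the actual release runs off the exiting task's CPU. */
>>>>   static void swap_free_workfn(struct work_struct *work)
>>>>   {
>>>>           struct llist_node *batch = llist_del_all(&deferred_swap_frees);
>>>>           struct swap_free_node *n, *tmp;
>>>>
>>>>           llist_for_each_entry_safe(n, tmp, batch, node) {
>>>>                   free_swap_and_cache(n->entry);
>>>>                   kfree(n);
>>>>           }
>>>>   }
>>>>
>>>> (In the actual patch the entries are batched via the mmu_gather path,
>>>> as the diff below shows.)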
>>>>
>>>> Testing Platform: 8GB RAM
>>>> Testing procedure:
>>>> After restarting the machine, start 15 app processes first, then start
>>>> the camera app process, and monitor the cold-start and preview time
>>>> data of the camera app process.
>>>>
>>>> Test data for the camera process cold-start time (unit: millisecond):
>>>> | seq | 1 | 2 | 3 | 4 | 5 | 6 | average |
>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 | 1512 |
>>>> | after | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 | 1204 |
>>>>
>>>> Test data for the camera process preview time (unit: millisecond):
>>>> | seq | 1 | 2 | 3 | 4 | 5 | 6 | average |
>>>> | before | 267 | 402 | 504 | 513 | 161 | 265 | 352 |
>>>> | after | 188 | 223 | 301 | 203 | 162 | 154 | 205 |
>>>>
>>>> Based on the averages of the six test runs above, the benefits of this
>>>> patch are:
>>>> 1. The cold-start time of the camera app process is reduced by about 20%.
>>>> 2. The preview time of the camera app process is reduced by about 42%.
>>>>
>>>> It offers several benefits:
>>>> 1. Alleviates the high system CPU load caused by multiple exiting
>>>> processes running simultaneously.
>>>> 2. Reduces lock contention on the swap entry free path, since a single
>>>> asynchronous kworker frees the entries instead of multiple exiting
>>>> processes running in parallel.
>>>> 3. Releases the pte_present memory occupied by exiting processes more
>>>> efficiently.
>>>>
>>>> Signed-off-by: Zhiguo Jiang <justinjiang@...o.com>
>>>> ---
>>>> arch/s390/include/asm/tlb.h | 8 +
>>>> include/asm-generic/tlb.h | 44 ++++++
>>>> include/linux/mm_types.h | 58 +++++++
>>>> mm/memory.c | 3 +-
>>>> mm/mmu_gather.c | 296 ++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 408 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
>>>> index e95b2c8081eb..3f681f63390f
>>>> --- a/arch/s390/include/asm/tlb.h
>>>> +++ b/arch/s390/include/asm/tlb.h
>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>>>>          struct page *page, bool delay_rmap, int page_size);
>>>>  static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>>>>          struct page *page, unsigned int nr_pages, bool delay_rmap);
>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
>>>> +        swp_entry_t entry, int nr);
>>>
>>>
>>> The problem I am having is that swap entries don't have any
>>> intersection with the TLB. It sounds like we're squeezing something
>>> into an existing concept (MMU gather) that just doesn't belong in there.
>> I referred to the TLB batch-release mechanism; perhaps a new structure
>> needs to be created to implement this feature.
>
> We already use swap_slots_cache to collect multiple swap entries and
> free them in batches. Would it be better to incorporate our new logic
> there?

Yes, very likely.

> It might be much less change, and wouldn't need to touch
> zap_pte_range()? For example, when the slot caches are almost full,
> wake up the async thread to free them?
FWIW, the async part over-complicates the whole thing. I agree that if
possible, we should try to avoid it; maybe it's not easily possible.
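
Just to make that direction concrete, a rough sketch (the names and sizes
are made up; the real swap_slots_cache in mm/swap_slots.c is per-CPU and
more involved): entries are stashed in a small cache and the whole batch
is released once the cache fills up.

  #include <linux/spinlock.h>
  #include <linux/swap.h>

  #define SLOT_RET_BATCH 64

  struct slot_ret_cache {
          spinlock_t lock;
          int nr;
          swp_entry_t slots[SLOT_RET_BATCH];
  };

  /*
   * Free path: stash the entry and release the whole batch once the
   * cache is full. Barry's variant would instead wake a worker at some
   * watermark and let it drain the cache asynchronously.
   */
  static void slot_cache_free_entry(struct slot_ret_cache *cache,
                                    swp_entry_t entry)
  {
          spin_lock(&cache->lock);
          cache->slots[cache->nr++] = entry;
          if (cache->nr == SLOT_RET_BATCH) {
                  /* Batch free, like swap_slots_cache does via swapcache_free_entries(). */
                  swapcache_free_entries(cache->slots, cache->nr);
                  cache->nr = 0;
          }
          spin_unlock(&cache->lock);
  }

Keeping the flush synchronous like that avoids the async machinery
entirely; the watermark + worker wake-up would only be worth it if the
batched flush itself still shows up in the exit latency.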
--
Cheers,
David / dhildenb