Message-ID: <87v7xk4p9z.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 22 Oct 2024 14:36:56 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: MengEn Sun <mengensun88@...il.com>
Cc: akpm@...ux-foundation.org, alexjlzheng@...cent.com,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, mengensun@...cent.com
Subject: Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with
NOHZ full
MengEn Sun <mengensun88@...il.com> writes:
> Thank you for your suggestion. I understand and am ready to make
> some changes
>
>>
>> Have you verified the issue with some test? If not, I suggest that you
>> do that.
>>
>
> I have conducted tests:
> Applying this patch or not does not have a significant impact on the
> test results.
I don't expect any measurable performance difference with the patch.
If we can observe that the PCP size isn't tuned down to high_min before
the patch and is after it, that would be a valid test result to show the
value of the patch. Can you try that? The PCP size can be observed via
/proc/zoneinfo.
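
As a rough sketch of that check (not a kernel API; struct pcp_sample and
parse_pcp() are hypothetical names made up for this example), the C helper
below pulls the per-CPU "count:" and "high:" values out of
zoneinfo-formatted text. It assumes the usual "Node N, zone NAME",
"cpu: N", "count:" and "high:" layout of the pagesets section. The
high_min value to compare against is not parsed here and has to come from
elsewhere.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record: one per-CPU entry from the "pagesets" section. */
struct pcp_sample {
	int node;
	char zone[32];
	int cpu;
	long count;	/* pages currently held on the PCP lists */
	long high;	/* current value printed as "high:" */
};

/*
 * Parse zoneinfo-formatted text from @in and store up to @max per-CPU
 * entries in @out. Returns the number of entries stored.
 * Assumption: "count:" precedes "high:" within each "cpu:" block.
 */
static int parse_pcp(FILE *in, struct pcp_sample *out, int max)
{
	char line[256];
	char zone[32] = "";
	int node = -1, cpu = -1, n = 0;
	long count = -1, val;

	while (fgets(line, sizeof(line), in)) {
		if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2) {
			cpu = -1;	/* new zone, no CPU block seen yet */
		} else if (sscanf(line, " cpu: %d", &cpu) == 1) {
			count = -1;	/* new CPU block */
		} else if (cpu >= 0 && sscanf(line, " count: %ld", &val) == 1) {
			count = val;
		} else if (cpu >= 0 && sscanf(line, " high: %ld", &val) == 1) {
			/* The watermark line "high N" has no colon, so it is skipped. */
			if (n < max) {
				out[n].node = node;
				snprintf(out[n].zone, sizeof(out[n].zone), "%s", zone);
				out[n].cpu = cpu;
				out[n].count = count;
				out[n].high = val;
				n++;
			}
		}
	}
	return n;
}
```

Calling parse_pcp() on a FILE opened from /proc/zoneinfo before and after
the idle period, and comparing the high field per CPU and zone, should
show whether pcp->high has decayed on the NOHZ CPU.
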
> Perhaps my testing was not thorough enough. #^_^
>
> But the logic of the code is as follows:
>     CPU0                        CPUx
>     ----                        -----
>                                 T0: vmstat_work is pending
> T1: vmstat_shepherd
>     check vmstat_work
>     and do nothing
>                                 T2: vmstat_work is in unpending state.
>
>                                 T3: alloc many pages
>                                 T4: free all the pages allocated at T3
>                                 T5: entry NOHZ, flushing all zonestats
>                                     and nodestats
> T6: next vmstat_shepherd fired
>
> In my opinion, there are indeed some issues. I'm not sure if there's
> something I haven't understood?
>
>
> By the way,
> I have two other questions:
> Q1:
> vmstat_work is a **deferrable work**, so it may be delayed for a long time
> by NOHZ. As a result, vmstat_update() may not be executed once every
> second in the above scenario. Therefore, I'm not sure if using a deferrable
> work to reduce pcp->high is appropriate. In my tests, if I don't use
> deferrable work, it takes about a minute to reduce high to high_min, but
> using deferrable work may take several minutes to reduce high to high_min.
It's not a big issue if it takes longer to decay pcp->high.
> Q2:
> On a big machine, for example, with 1TB of memory, the default maximum
> memory on PCP can be 1TB * 0.125.
> This portion of memory is not accounted for in MemFree in /proc/meminfo.
> Users can see this portion of memory from /proc/zoneinfo, but the memory
> reported by the `free` command is reduced.
> Can we include the PCP memory in the MemFree statistic in /proc/meminfo?
This has been discussed before.
https://lore.kernel.org/linux-mm/20220816084426.135528-1-wangkefeng.wang@huawei.com/
https://lore.kernel.org/linux-mm/20240830014453.3070909-1-mawupeng1@huawei.com/
>> > However, this seems to be fine:
>> > - if freeing and allocating memory occur later, the high_max
>> > may be adjusted automatically
>> > - If memory is tight, the memory reclamation process will
>> > release the pcp
>>
>> This could be a real issue for me.
>
> Thanks, I will test more carefully for those issues.
>
>>
>> > In any case, we make vmstat_shepherd check whether we need to
>> > decay pcp high_max, and fire pcp_decay_high early if needed.
>> >
>> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
>> > Reviewed-by: Jinliang Zheng <alexjlzheng@...cent.com>
>> > Signed-off-by: MengEn Sun <mengensun@...cent.com>
>> > ---
>> > changelog:
>> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
>> > v2: Make the commit message clearer by adding some comments.
>> > ---
>> > mm/vmstat.c | 9 +++++++++
>> > 1 file changed, 9 insertions(+)
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index 1917c034c045..07b494b06872 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
>> >
>> > for_each_populated_zone(zone) {
>> > struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>> > + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> > struct per_cpu_nodestat *n;
>> >
>> > + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
>> > + * entering NOHZ full, see quiet_vmstat. so, we check pcp
>> > + * high_{min,max} to determine whether it is necessary to run
>> > + * decay_pcp_high on the corresponding CPU
>> > + */
>>
>> Please follow the comments coding style.
>>
>> /*
>> * comments line 1
>> * comments line 2
>> */
>>
>
> Thank you for your suggestion. I understand and am ready to make
> some changes
>
>> > + if (pcp->high_max > pcp->high_min)
>> > + return true;
>> > +
>>
>> We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high.
>> Your code may make need_update() return true in most cases.
>
> You are right, using high_max is incorrect. May I use pcp->high > pcp->high_min?
>
>>
>> > /*
>> > * The fast way of checking if there are any vmstat diffs.
>> > */
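
To illustrate the difference between the two checks discussed above, here
is a minimal userspace sketch. struct pcp_model and both helper names are
hypothetical stand-ins made up for this example, not kernel code, and the
second check is only the one asked about in the reply, not a confirmed
fix. It shows why comparing high_max with high_min is true whenever the
tunable range is non-empty, while comparing high with high_min is true
only while there is still something left to decay.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three per_cpu_pages fields discussed. */
struct pcp_model {
	int high;	/* value that is actually tuned at run time */
	int high_min;	/* lower bound of the tunable range */
	int high_max;	/* upper bound of the tunable range */
};

/* Check as written in v2: independent of the current pcp->high. */
static bool v2_needs_decay(const struct pcp_model *pcp)
{
	return pcp->high_max > pcp->high_min;
}

/* Check asked about in the reply: true only while high is above the floor. */
static bool proposed_needs_decay(const struct pcp_model *pcp)
{
	return pcp->high > pcp->high_min;
}
```

With high already at high_min and a non-empty range, the first helper
still returns true and the second returns false, which matches the
concern that need_update() would return true in most cases.
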
--
Best Regards,
Huang, Ying