linux-kernel - Re: [PATCH linux-mm v2] mm: make pcp_decay


Message-Id: <1729574046-3392-1-git-send-email-mengensun@tencent.com>
Date: Tue, 22 Oct 2024 13:14:06 +0800
From: MengEn Sun <mengensun88@...il.com>
To: ying.huang@...el.com
Cc: akpm@...ux-foundation.org,
	alexjlzheng@...cent.com,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	mengensun88@...il.com,
	mengensun@...cent.com
Subject: Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full

Thank you for your suggestion. I understand and am ready to make
some changes.

> 
> Have verified the issue with some test?  If not, I suggest you to do
> that.
> 

I have run tests:
Applying this patch or not makes no significant difference to the test results.
Perhaps my testing was not thorough enough. #^_^

However, the logic of the code is as follows:
CPU0                              CPUx
----                              -----
                                  T0: vmstat_work is pending
T1: vmstat_shepherd
    checks vmstat_work
    and does nothing
                                  T2: vmstat_work is no longer pending

                                  T3: allocates many pages
                                  T4: frees all the pages allocated at T3
                                  T5: enters NOHZ, flushing all zonestats
                                      and nodestats
T6: next vmstat_shepherd fires

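In my opinion, there is indeed an issue here: at T6, need_update() finds
no diffs on CPUx because they were flushed at T5, so vmstat_work is not
queued and pcp->high is not decayed. Or is there something I have
misunderstood?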
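
To make the timeline concrete, here is a minimal user-space sketch of it. It is only a model, not kernel code: struct cpu_model and all helper names are hypothetical, the stat diffs are collapsed into one counter, and the assumption that pcp->high grows by the number of freed pages is a simplification.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state; not a kernel type. */
struct cpu_model {
	int stat_diff;      /* pending zonestat/nodestat diffs, collapsed */
	int pcp_high;       /* models pcp->high */
	int pcp_high_min;   /* models pcp->high_min */
	bool work_pending;  /* models vmstat_work being pending */
};

/* Models a need_update() that looks only at the stat diffs. */
static bool need_update_model(const struct cpu_model *c)
{
	return c->stat_diff != 0;
}

/* T3/T4: allocate and free; assumed to leave diffs and a raised high. */
static void alloc_and_free(struct cpu_model *c, int pages)
{
	c->stat_diff += pages;
	c->pcp_high += pages;
}

/* T5: entering NOHZ flushes the diffs (what quiet_vmstat does). */
static void enter_nohz(struct cpu_model *c)
{
	c->stat_diff = 0;
}

/* T1/T6: the shepherd queues work only if none is pending and diffs exist. */
static bool shepherd_tick(struct cpu_model *c)
{
	if (!c->work_pending && need_update_model(c)) {
		c->work_pending = true;
		return true;
	}
	return false;
}
```

Running T1 (work pending), T2 (work no longer pending), alloc_and_free(), enter_nohz() and then a second shepherd_tick() in this model leaves pcp_high above pcp_high_min with no work queued.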


By the way, I have two other questions.
Q1:
vmstat_work is a **deferrable work**, so it may be delayed for a long time
by NOHZ. As a result, vmstat_update() may not be executed once every
second in the above scenario. Therefore, I'm not sure whether using a
deferrable work to reduce pcp->high is appropriate. In my tests, without
deferrable work it takes about a minute to reduce high to high_min, but
with deferrable work it may take several minutes.
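
As a rough way to reason about these timings, the sketch below counts how many decay passes are needed to bring high down to high_min. It assumes each pass lowers high by about one eighth (high - (high >> 3)), which is my reading of decay_pcp_high() and should be checked against the tree; it ignores the limit based on pcp->count. decay_steps() is a hypothetical helper, not a kernel function.

```c
/*
 * Count decay passes until high reaches high_min.
 * Assumption: one pass computes max(high - (high >> 3), high_min).
 */
static int decay_steps(long high, long high_min)
{
	int steps = 0;

	while (high > high_min) {
		long next = high - (high >> 3);

		if (next < high_min)
			next = high_min;
		if (next == high)	/* high < 8: the shift is 0, force progress */
			next = high - 1;
		high = next;
		steps++;
	}
	return steps;
}
```

The number of passes is fixed by high and high_min, so the wall-clock time scales with the interval between passes: if deferral stretches the interval from one second to several, the total decay time stretches by the same factor.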

Q2:
On a big machine, for example one with 1TB of memory, the default maximum
memory held on PCP can be 1TB * 0.125.
This memory is not accounted for in MemFree in /proc/meminfo.
Users can see it in /proc/zoneinfo, but the free memory reported by the
`free` command is correspondingly lower.
Can we include the PCP memory in the MemFree statistic in /proc/meminfo?
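
For scale, the arithmetic behind the 1TB example can be sketched as below. The helper name is hypothetical, and treating 1TB as 2^40 bytes and the 0.125 factor as a plain divide-by-eight are assumptions for illustration.

```c
/* Upper bound on memory held in PCP lists, assuming a 1/8 (0.125) cap. */
static unsigned long long pcp_default_cap_bytes(unsigned long long total_bytes)
{
	return total_bytes >> 3;	/* total * 0.125 */
}
```

With total_bytes = 1ULL << 40 this gives 128ULL << 30, so up to 128GB could be missing from MemFree in the worst case.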

> > While, This seems to be fine:
> > - if freeing and allocating memory occur later, it may the
> >   high_max may be adjust automatically
> > - If memory is tight, the memory reclamation process will
> >   release the pcp
> 
> This could be a real issue for me.

Thanks, I will test more carefully for this issue.

> 
> > Whatever, we make vmstat_shepherd to checking whether we need
> > decay pcp high_max, and fire pcp_decay_high early if we need.
> >
> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
> > Reviewed-by: Jinliang Zheng <alexjlzheng@...cent.com>
> > Signed-off-by: MengEn Sun <mengensun@...cent.com>
> > ---
> > changelog:
> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
> > v2: Make the commit message clearer by adding some comments.
> > ---
> >  mm/vmstat.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 1917c034c045..07b494b06872 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
> >  
> >  	for_each_populated_zone(zone) {
> >  		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> > +		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> >  		struct per_cpu_nodestat *n;
> >  
> > +		/* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
> > +		 * entering NOHZ full, see quiet_vmstat. so, we check pcp
> > +		 * high_{min,max} to determine whether it is necessary to run
> > +		 * decay_pcp_high on the corresponding CPU
> > +		 */
> 
> Please follow the comments coding style.
> 
>                 /*
>                  * comments line 1
>                  * comments line 2
>                  */
> 

Thank you for your suggestion. I understand and am ready to make
some changes.

> > +		if (pcp->high_max > pcp->high_min)
> > +			return true;
> > +
> 
> We don't tune pcp->high_max/min in fact.  Instead, we tune pcp->high.
> Your code may make need_update() return true in most cases.

You are right, using high_max is incorrect. May I use pcp->high > pcp->high_min instead?
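
To show the difference between the two conditions, here is a small user-space sketch. struct pcp_model is a hypothetical stand-in with only the three fields discussed, not the kernel's struct per_cpu_pages, and the values used are made up.

```c
#include <stdbool.h>

/* Hypothetical model of the three fields under discussion. */
struct pcp_model {
	int high;      /* current, auto-tuned limit */
	int high_min;  /* lower bound of the tuning range */
	int high_max;  /* upper bound of the tuning range */
};

/* v2 condition: true whenever a tuning range exists at all. */
static bool check_v2(const struct pcp_model *p)
{
	return p->high_max > p->high_min;
}

/* Proposed condition: true only while high still needs to decay. */
static bool check_proposed(const struct pcp_model *p)
{
	return p->high > p->high_min;
}
```

For an idle pcp such as { .high = 64, .high_min = 64, .high_max = 1024 }, check_v2() still returns true while check_proposed() returns false, which matches your point that the v2 code would make need_update() return true in most cases.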

> 
> >  		/*
> >  		 * The fast way of checking if there are any vmstat diffs.
> >  		 */
> 
> --
> Best Regards,
> Huang, Ying

Best Regards,
MengEn, Sun