linux-kernel - Re: [PATCH] mm: fix draining remote pageset


Message-ID: <87msykc9ip.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date:   Tue, 22 Aug 2023 06:31:42 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Christoph Lameter <cl@...ux.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] mm: fix draining remote pageset

Michal Hocko <mhocko@...e.com> writes:

> On Mon 21-08-23 16:30:18, Huang, Ying wrote:
>> Michal Hocko <mhocko@...e.com> writes:
>> 
>> > On Wed 16-08-23 15:08:23, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@...e.com> writes:
>> >> 
>> >> > On Mon 14-08-23 09:59:51, Huang, Ying wrote:
>> >> >> Hi, Michal,
>> >> >> 
>> >> >> Michal Hocko <mhocko@...e.com> writes:
>> >> >> 
>> >> >> > On Fri 11-08-23 17:08:19, Huang Ying wrote:
>> >> >> >> If there is no memory allocation/freeing in the remote pageset after
>> >> >> >> some time (3 seconds for now), the remote pageset will be drained to
>> >> >> >> avoid memory wastage.
>> >> >> >> 
>> >> >> >> But in the current implementation, vmstat updater worker may not be
>> >> >> >> re-queued when we are waiting for the timeout (pcp->expire != 0) if
>> >> >> >> there are no vmstat changes, for example, when CPU goes idle.
>> >> >> >
>> >> >> > Why is that a problem?
>> >> >> 
>> >> >> The pages of the remote zone may be kept in the local per-CPU pageset
>> >> >> for long time as long as there's no page allocation/freeing on the
>> >> >> logical CPU.  In addition to the logical CPU goes idle, this is also
>> >> >> possible if the logical CPU is busy in the user space.
>> >> >
>> >> > But why is this a problem? Is the scale of the problem sufficient to
>> >> > trigger out of memory situations or be otherwise harmful?
>> >> 
>> >> This may trigger premature page reclaiming.  The pages in the PCP of the
>> >> remote zone would have been freed to satisfy the page allocation for the
>> >> remote zone to avoid page reclaiming.  It's highly possible that the
>> >> local CPU just allocate/free from/to the remote zone temporarily.
>> >
>> > I am slightly confused here but I suspect by zone you mean remote pcp.
>> > But more importantly is this a concern seen in real workload? Can you
>> > quantify it in some manner? E.g. with this patch we have X more kswapd
>> > scanning or even hit direct reclaim much less often.
>> >> So,
>> >> we should free PCP pages of the remote zone if there is no page
>> >> allocation/freeing from/to the remote zone for 3 seconds.
>> >
>> > Well, I would argue this depends a lot. There are workloads which really
>> > like to have CPUs idle and yet they would like to benefit from the
>> > allocator fast path after that CPU goes out of idle because idling is
>> > their power saving opportunity while workloads want to act quickly after
>> > there is something to run.
>> >
>> > That being said, we really need some numbers (ideally from real world)
>> > that proves this is not just a theoretical concern.
>> 
>> The behavior to drain the PCP of the remote zone (that is, remote PCP)
>> was introduced in commit 4ae7c03943fc ("[PATCH] Periodically drain non
>> local pagesets").  The goal of draining was well documented in the
>> change log.  IIUC, some of your questions can be answered there?
>> 
>> This patch just restores the original behavior changed by commit
>> 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8").
>
> Let me repeat. You need some numbers to show this is needed.

I have run the following test for this patch:

- Run some workloads, using `numactl` to bind the CPUs to node 0 and the
  memory to node 1, so that the PCP of the CPUs on node 0 for the zone on
  node 1 gets filled.

- After the workloads finish, stay idle for 60s.

- Check /proc/zoneinfo

With the original kernel, the number of pages in the PCP of the CPU on
node 0 for the zone on node 1 is still non-zero after the idle period.
With the patched kernel, it drops to 0.  So the patch avoids keeping
pages in the remote PCP while the CPU is idle.
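
The /proc/zoneinfo check could be scripted roughly as below.  This is
only a sketch, not the script used for the test above: the helper names
`pcp_counts` and `remote_pcp_pages` are made up for illustration, and the
parser assumes the usual layout in which each "Node N, zone NAME" header
is followed by a "pagesets" section with one "cpu: N" / "count: N" pair
per CPU.

```python
import re

# Patterns for the assumed /proc/zoneinfo layout.
NODE_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
CPU_RE = re.compile(r"^\s*cpu:\s*(\d+)")
COUNT_RE = re.compile(r"^\s*count:\s*(\d+)")


def pcp_counts(zoneinfo_text):
    """Return {(node, zone, cpu): pages held in that CPU's PCP}."""
    counts = {}
    node = zone = cpu = None
    for line in zoneinfo_text.splitlines():
        m = NODE_RE.match(line)
        if m:
            # New zone block: remember which node/zone we are in.
            node, zone, cpu = int(m.group(1)), m.group(2), None
            continue
        m = CPU_RE.match(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = COUNT_RE.match(line)
        if m and node is not None and cpu is not None:
            counts[(node, zone, cpu)] = int(m.group(1))
            cpu = None  # wait for the next "cpu:" line
    return counts


def remote_pcp_pages(counts, cpus, node):
    """Sum the PCP pages that the given CPUs hold for zones on `node`."""
    return sum(n for (nd, _zone, c), n in counts.items()
               if nd == node and c in cpus)
```

For the setup above, one would read /proc/zoneinfo after the 60s idle
period, pass its contents to `pcp_counts()`, and call
`remote_pcp_pages()` with the CPUs of node 0 and `node=1`.  A non-zero
result means pages are still sitting in the remote PCP.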

These are the numbers I have.  If you think they aren't enough to
justify the patch, I'm OK with that too (although I think they are),
because the remote PCP will be drained later anyway, when some pages
are allocated/freed on the CPU.

--
Best Regards,
Huang, Ying