linux-kernel - Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining

Message-ID: <ZBnp69SHwM5SOttF@tpad>
Date:   Tue, 21 Mar 2023 14:31:23 -0300
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     David Hildenbrand <david@...hat.com>,
        Christoph Lameter <cl@...ux.com>,
        Aaron Tomlin <atomlin@...mlin.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining

On Tue, Mar 21, 2023 at 03:20:31PM +0000, Mel Gorman wrote:
> On Thu, Mar 02, 2023 at 11:10:03AM +0100, David Hildenbrand wrote:
> > [...]
> > 
> > > 
> > > > (2) drain_zone_pages() documents that we're draining the PCP
> > > >      (bulk-freeing them) of the current CPU on remote nodes. That bulk-
> > > >      freeing will properly adjust free memory counters. What exactly is
> > > >      the impact when no longer doing that? Won't the "snapshot" of some
> > > >      counters eventually be wrong? Do we care?
> > > 
> > > Don't see why the snapshot of counters will be wrong.
> > > 
> > > Instead of freeing pages on pcp list of remote nodes after they are
> > > considered idle ("3 seconds idle till flush"), what will happen is that
> > > drain_all_pages() will free those pcps, for example after an allocation
> > > fails on direct reclaim:
> > > 
> > >          page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > > 
> > >          /*
> > >           * If an allocation failed after direct reclaim, it could be because
> > >           * pages are pinned on the per-cpu lists or in high alloc reserves.
> > >           * Shrink them and try again
> > >           */
> > >          if (!page && !drained) {
> > >                  unreserve_highatomic_pageblock(ac, false);
> > >                  drain_all_pages(NULL);
> > >                  drained = true;
> > >                  goto retry;
> > >          }
> > > 
> > > In both cases the pages are freed (and counters maintained) here:
> > > 
> > > static inline void __free_one_page(struct page *page,
> > >                  unsigned long pfn,
> > >                  struct zone *zone, unsigned int order,
> > >                  int migratetype, fpi_t fpi_flags)
> > > {
> > >          struct capture_control *capc = task_capc(zone);
> > >          unsigned long buddy_pfn = 0;
> > >          unsigned long combined_pfn;
> > >          struct page *buddy;
> > >          bool to_tail;
> > > 
> > >          VM_BUG_ON(!zone_is_initialized(zone));
> > >          VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > > 
> > >          VM_BUG_ON(migratetype == -1);
> > >          if (likely(!is_migrate_isolate(migratetype)))
> > >                  __mod_zone_freepage_state(zone, 1 << order, migratetype);
> > > 
> > >          VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> > >          VM_BUG_ON_PAGE(bad_range(zone, page), page);
> > > 
> > >          while (order < MAX_ORDER - 1) {
> > >                  if (compaction_capture(capc, page, order, migratetype)) {
> > >                          __mod_zone_freepage_state(zone, -(1 << order),
> > >                                                                  migratetype);
> > >                          return;
> > >                  }
> > > 
> > > > Describing the difference between instructed refresh of vmstat and "remotely
> > > > drain per-cpu lists" in order to move free memory from the pcp to the buddy
> > > > would be great.
> > > 
> > > The difference is that now remote PCPs will be drained on demand, either via
> > > kcompactd or direct reclaim (through drain_all_pages), when memory is
> > > low.
> > > 
> > > For example, with the following test:
> > > 
> > > dd if=/dev/zero of=file bs=1M count=32000 on a tmpfs filesystem:
> > > 
> > >        kcompactd0-116     [005] ...1 228232.042873: drain_all_pages <-kcompactd_do_work
> > >        kcompactd0-116     [005] ...1 228232.042873: __drain_all_pages <-kcompactd_do_work
> > >                dd-479485  [003] ...1 228232.455130: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > >                dd-479485  [011] ...1 228232.721994: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > >       gnome-shell-3750    [015] ...1 228232.723729: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > > 
> > > The commit message was indeed incorrect. Updated one:
> > > 
> > > "mm/vmstat: remove remote node draining
> > > 
> > > Draining of pages from the local pcp for a remote zone should not be
> > > necessary, since once the system is low on memory (or compaction on a
> > > zone is in effect), drain_all_pages should be called freeing any unused
> > > pcps."
> > > 
> > > Thanks!
> > 
> > Thanks for the explanation, that makes sense to me. Feel free to add my
> > 
> > Acked-by: David Hildenbrand <david@...hat.com>
> > 
> > ... hoping that some others (Mel, Vlastimil?) can have another look.
> > 
> 
> I was on extended leave and am still in the process of triaging a few
> thousand mails so I'm working off memory here instead of the code. This
> is a straight-forward enough question to answer quickly in case I forget
> later.
> 
> Short answer: I'm not a fan of the patch in concept and I do not think it
> should be merged.
> 
> I agree that drain_all_pages() would free the PCP pages on demand in
> direct reclaim context but it happens after reclaim has already
> happened. Hence, the reclaim may be unnecessary and may cause overreclaim
> in some situations due to remote CPUs pinning memory in PCP lists.
> 
> Similarly, kswapd may trigger early because PCP pages do not contribute
> to NR_FREE_PAGES so watermark checks can fail even though pages are
> free, just inaccessible.
> 
> Finally, remote pages expire because ideally CPUs allocate local memory
> assuming memory policies are not forcing use of remote nodes. The expiry
> means that remote pages get freed back to the buddy lists after a short
> period. By removing the expiry, it's possible that a local allocation will
> fail and spill over to a remote node prematurely because free pages were
> pinned on the PCP lists.
> 
> As this patch has the possibility of reclaiming early in both direct and
> kswapd context and increases the risk of remote node fallback, I think it
> needs much stronger justification and a warning about the side-effects. For
> this version unless I'm very wrong -- NAK :(

Mel,

Agreed. -v7 of the series drops this patch and instead drains non-local
NUMA node memory held in pcp caches from cpu_vm_stats_fold (which may
execute remotely, from vmstat_shepherd).
If you could take a look at -v7, that would be great.

Subject: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold

Large NUMA systems might have significant portions of system memory
trapped in pcp queues. The number of pcps is determined by the number
of processors and nodes in a system. A system with 4 processors and
2 nodes has 8 pcps, which is okay. But a system with 1024 processors
and 512 nodes has 512k pcps, with a high potential for a large amount
of memory being caught in them.

Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case,
where vmstat_shepherd will perform the aging and draining via
cpu_vm_stats_fold.

Suggested-by: Vlastimil Babka <vbabka@...e.cz>

> 
> -- 
> Mel Gorman
> SUSE Labs
> 
>