[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YbD8naaJrZQANahP@google.com>
Date: Wed, 8 Dec 2021 10:42:37 -0800
From: Minchan Kim <minchan@...nel.org>
To: 胡玮文 <sehuww@...l.scut.edu.cn>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
linux-mm <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
Michal Hocko <mhocko@...e.com>,
David Hildenbrand <david@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>,
John Dias <joaodias@...gle.com>
Subject: Re: [RFC] mm: introduce page pinner
On Wed, Dec 08, 2021 at 07:54:35PM +0800, 胡玮文 wrote:
> On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> > The contiguous memory allocation fails if one of the pages in
> > requested range has unexpected elevated reference count since
> > VM couldn't migrate the page out. It's very common pattern for
> > CMA allocation failure. The temporal elevated page refcount
> > could happen from various places and it's really hard to chase
> > who held the temporal page refcount at that time, which is the
> > vital information to debug the allocation failure.
Hi,
Please don't cut down original Cc list without special reason.
>
> Hi Minchan,
>
> I'm a newbie here. We are debugging a problem where every CPU core is doing
> compaction but making no progress, because of the unexpected page refcount. I'm
> interested in your approach, but this patch seems only to cover the CMA
> allocation path. So could it be extended to debugging migrate failure during
> compaction? I'm not familiar with the kernel codebase, here is my untested
> thought:
The compaction failure will produce a lot events I wanted to avoid
in my system but I think your case is reasonable if you doesn't
mind the large events.
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index cf25b00f03c8..85dacbca8fa0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -46,6 +46,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/page_idle.h>
> #include <linux/page_owner.h>
> +#include <linux/page_pinner.h>
> #include <linux/sched/mm.h>
> #include <linux/ptrace.h>
> #include <linux/oom.h>
> @@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>
> if (!mapping) {
> /* Anonymous page without mapping */
> - if (folio_ref_count(folio) != expected_count)
> + if (folio_ref_count(folio) != expected_count) {
> + page_pinner_failure(&folio->page);
> return -EAGAIN;
> + }
>
> /* No turning back from here */
> newfolio->index = folio->index;
> @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> xas_lock_irq(&xas);
> if (!folio_ref_freeze(folio, expected_count)) {
> xas_unlock_irq(&xas);
> + page_pinner_failure(&folio->page);
> return -EAGAIN;
> }
>
> I'm not sure what to do with the new folio, it seems using folio->page in new
> codes is not correct.
If you want to cover compaction only, maybe this one:
diff --git a/mm/compaction.c b/mm/compaction.c
index bfc93da1c2c7..7bfbf7205fb8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2400,6 +2400,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
/* All pages were either migrated or will be released */
cc->nr_migratepages = 0;
if (err) {
+ struct page *failed_page;
+
+ list_for_each_entry(failed_page, &cc->migratepages, lru)
+ page_pinner_failure(failed_page);
+
putback_movable_pages(&cc->migratepages);
/*
* migrate_pages() may return -ENOMEM when scanners meet
However, for the case, I want to introduce some filter options like
failure reason(?)
page_pinner_failure(pfn, reason)
So, I could keep getting only CMA allocation failure events, not
compaction failure.
>
> > This patch introduces page pinner to keep track of Page Pinner
> > who caused the CMA allocation failure. How page pinner work is
> > once VM found the non-migrated page after trying migration
> > during contiguos allocation, it marks the page and every page-put
> > operation on the page since then will have event trace. Since
> > page-put is always with page-get, the page-put event trace helps
> > to deduce where the pair page-get originated from.
> >
> > The reason why the feature tracks page-put instead of page-get
> > indirectly is that since VM couldn't expect when page migration
> > fails, it should keep track of every page-get for migratable page
> > to dump information at failure. Considering backtrace as vitial
> > information as well as page's get/put is one of hottest path,
> > it's too heavy approach. Thus, to minimize runtime overhead,
> > this feature adds a new PAGE_EXT_PINNER flag under PAGE_EXT
> > debugging option to indicate migration-failed page and only
> > tracks every page-put operation for the page since the failure.
> >
> > usage:
> >
> > trace_dir="/sys/kernel/tracing"
> > echo 1 > $trace_dir/events/page_pinner/enable
> > echo 1 > $trace_dir/options/stacktrace
> > ..
> > run workload
> > ..
> > ..
> >
> > cat $trace_dir/trace
> >
> > <...>-498 [006] .... 33306.301621: page_pinner_failure: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=1 mapcount=0 mapping=00000000aec7812a mt=5
> > <...>-498 [006] .... 33306.301625: <stack trace>
> > => __page_pinner_failure
> > => test_pages_isolated
> > => alloc_contig_range
> > => cma_alloc
> > => cma_heap_allocate
> > => dma_heap_ioctl
> > => __arm64_sys_ioctl
> > => el0_svc_common
> > => do_el0_svc
> > => el0_svc
> > => el0_sync_handler
> > => el0_sync
> > <...>-24965 [001] .... 33306.392836: page_pinner_put: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=00000000aec7812a mt=5
> > <...>-24965 [001] .... 33306.392846: <stack trace>
> > => __page_pinner_put
> > => release_pages
> > => free_pages_and_swap_cache
> > => tlb_flush_mmu_free
> > => tlb_flush_mmu
> > => zap_pte_range
> > => unmap_page_range
> > => unmap_vmas
> > => exit_mmap
> > => __mmput
> > => mmput
> > => exit_mm
> > => do_exit
> > => do_group_exit
> > => get_signal
> > => do_signal
> > => do_notify_resume
> > => work_pending
> >
> > Signed-off-by: Minchan Kim <minchan@...nel.org>
> > ---
> > The PagePinner named after PageOwner since I wanted to keep track of
> > page refcount holder. Feel free to suggest better names.
> > Actually, I had alloc_contig_failure tracker as a candidate.
> >
> > include/linux/mm.h | 7 ++-
> > include/linux/page_ext.h | 3 +
> > include/linux/page_pinner.h | 47 ++++++++++++++++
> > include/trace/events/page_pinner.h | 60 ++++++++++++++++++++
> > mm/Kconfig.debug | 13 +++++
> > mm/Makefile | 1 +
> > mm/page_alloc.c | 3 +
> > mm/page_ext.c | 4 ++
> > mm/page_isolation.c | 3 +
> > mm/page_pinner.c | 90 ++++++++++++++++++++++++++++++
> > 10 files changed, 230 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/page_pinner.h
> > create mode 100644 include/trace/events/page_pinner.h
> > create mode 100644 mm/page_pinner.c
< snip>
> > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > index 1e73717802f8..0ad4a3b8f4eb 100644
> > --- a/mm/Kconfig.debug
> > +++ b/mm/Kconfig.debug
> > @@ -62,6 +62,19 @@ config PAGE_OWNER
> >
> > If unsure, say N.
> >
> > +config PAGE_PINNER
> > + bool "Track page pinner"
> > + select PAGE_EXTENSION
> > + depends on DEBUG_KERNEL && TRACEPOINTS
> > + help
> > + This keeps track of what call chain is the pinner of a page, may
> > + help to find contiguos page allocation failure. Even if you include
> > + this feature in your build, it is disabled by default. You should
> > + pass "page_pinner=on" to boot parameter in order to enable it. Eats
> > + a fair amount of memory if enabled.
>
> I'm a bit confused. It seems page pinner does not allocate any additional
> memory if you enable it by boot parameter. So the description seems inaccurate.
It will allocate page_ext descriptors so consumes the memory.
>
> > +
> > + If unsure, say N.
> > +
> > config PAGE_POISONING
> > bool "Poison pages after freeing"
> > help
> > diff --git a/mm/Makefile b/mm/Makefile
> > index fc60a40ce954..0c9b78b15070 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -102,6 +102,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> > obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
> > obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
> > obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> > +obj-$(CONFIG_PAGE_PINNER) += page_pinner.o
> > obj-$(CONFIG_CLEANCACHE) += cleancache.o
> > obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> > obj-$(CONFIG_ZPOOL) += zpool.o
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f41a5e990ac0..6e3a6f875a40 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -63,6 +63,7 @@
> > #include <linux/sched/rt.h>
> > #include <linux/sched/mm.h>
> > #include <linux/page_owner.h>
> > +#include <linux/page_pinner.h>
> > #include <linux/kthread.h>
> > #include <linux/memcontrol.h>
> > #include <linux/ftrace.h>
> > @@ -1299,6 +1300,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > if (memcg_kmem_enabled() && PageMemcgKmem(page))
> > __memcg_kmem_uncharge_page(page, order);
> > reset_page_owner(page, order);
> > + reset_page_pinner(page, order);
> > return false;
> > }
> >
> > @@ -1338,6 +1340,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > page_cpupid_reset_last(page);
> > page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > reset_page_owner(page, order);
> > + reset_page_pinner(page, order);
> >
> > if (!PageHighMem(page)) {
> > debug_check_no_locks_freed(page_address(page),
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index 2a52fd9ed464..0dafe968b212 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -8,6 +8,7 @@
> > #include <linux/kmemleak.h>
> > #include <linux/page_owner.h>
> > #include <linux/page_idle.h>
> > +#include <linux/page_pinner.h>
> >
> > /*
> > * struct page extension
> > @@ -75,6 +76,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> > #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
> > &page_idle_ops,
> > #endif
> > +#ifdef CONFIG_PAGE_PINNER
> > + &page_pinner_ops,
> > +#endif
> > };
> >
> > unsigned long page_ext_size = sizeof(struct page_ext);
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index a95c2c6562d0..a9ddea1c9166 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -9,6 +9,7 @@
> > #include <linux/memory.h>
> > #include <linux/hugetlb.h>
> > #include <linux/page_owner.h>
> > +#include <linux/page_pinner.h>
> > #include <linux/migrate.h>
> > #include "internal.h"
> >
> > @@ -310,6 +311,8 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> >
> > out:
> > trace_test_pages_isolated(start_pfn, end_pfn, pfn);
> > + if (ret < 0)
> > + page_pinner_failure(pfn_to_page(pfn));
> >
> > return ret;
> > }
> > diff --git a/mm/page_pinner.c b/mm/page_pinner.c
> > new file mode 100644
> > index 000000000000..300a90647557
> > --- /dev/null
> > +++ b/mm/page_pinner.c
> > @@ -0,0 +1,90 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/mm.h>
> > +#include <linux/page_pinner.h>
> > +
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/page_pinner.h>
> > +
> > +static bool page_pinner_enabled;
> > +DEFINE_STATIC_KEY_FALSE(page_pinner_inited);
> > +EXPORT_SYMBOL(page_pinner_inited);
> > +
> > +static int __init early_page_pinner_param(char *buf)
> > +{
> > + return kstrtobool(buf, &page_pinner_enabled);
> > +}
> > +early_param("page_pinner", early_page_pinner_param);
> > +
> > +static bool need_page_pinner(void)
> > +{
> > + return page_pinner_enabled;
> > +}
> > +
> > +static void init_page_pinner(void)
> > +{
> > + if (!page_pinner_enabled)
> > + return;
> > +
> > + static_branch_enable(&page_pinner_inited);
> > +}
> > +
> > +struct page_ext_operations page_pinner_ops = {
> > + .need = need_page_pinner,
> > + .init = init_page_pinner,
> > +};
> > +
> > +void __reset_page_pinner(struct page *page, unsigned int order)
> > +{
> > + struct page_ext *page_ext;
> > + int i;
> > +
> > + page_ext = lookup_page_ext(page);
> > + if (unlikely(!page_ext))
> > + return;
> > +
> > + for (i = 0; i < (1 << order); i++) {
> > + if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
> > + break;
> > +
> > + clear_bit(PAGE_EXT_PINNER, &page_ext->flags);
> > + page_ext = page_ext_next(page_ext);
> > + }
> > +}
> > +
> > +void __page_pinner_failure(struct page *page)
> > +{
> > + struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > + if (unlikely(!page_ext))
> > + return;
> > +
> > + trace_page_pinner_failure(page);
> > + test_and_set_bit(PAGE_EXT_PINNER, &page_ext->flags);
> > +}
> > +
> > +void __page_pinner_put(struct page *page)
> > +{
> > + struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > + if (unlikely(!page_ext))
> > + return;
> > +
> > + if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
> > + return;
> > +
> > + trace_page_pinner_put(page);
> > +}
> > +EXPORT_SYMBOL(__page_pinner_put);
> > +
> > +
> > +static int __init page_pinner_init(void)
> > +{
> > + if (!static_branch_unlikely(&page_pinner_inited)) {
> > + pr_info("page_pinner is disabled\n");
> > + return 0;
> > + }
> > +
> > + pr_info("page_pinner is enabled\n");
> > + return 0;
> > +}
> > +late_initcall(page_pinner_init)
> > --
> > 2.34.1.400.ga245620fadb-goog
> >
>
> More info about my compaction issue:
>
> This call stack returns -EAGAIN in 99.9% cases on the problematic host
> (Ubuntu 20.04 with kernel 5.11.0-40):
>
> migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
> migrate_page
> fallback_migrate_page
> move_to_new_page
> migrate_pages
> compact_zone
> compact_zone_order
> try_to_compact_pages
> __alloc_pages_direct_compact
> __alloc_pages_slowpath.constprop.0
> __alloc_pages_nodemask
> alloc_pages_vma
> do_huge_pmd_anonymous_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
>
> The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
> machine learning program. They may have relationships with NVIDIA CUDA, but I
> want to confirm this, and make improvements if possible.
So you are suspecting some kernel driver hold a addtional refcount
using get_user_pages or page get API?
>
> When the issue reproduce, a single page fault that triggers a sync compaction
> can take tens of seconds. Then all 40 CPU threads are doing compaction, and
> application runs several order of magnitude slower.
>
> Disabling sync compaction is a workaround (the default is "madvise"):
>
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
>
> Previously I asked for help at https://lore.kernel.org/linux-mm/20210516085644.13800-1-hdanton@sina.com/
> Now I have more information but still cannot pinpoint the root cause.
>
> Thanks,
> Hu Weiwen
>
>
Powered by blists - more mailing lists