Open Source and information security mailing list archives
 
Message-ID: <YdcxlGPLCcIwVKIg@google.com>
Date:   Thu, 6 Jan 2022 10:14:44 -0800
From:   Minchan Kim <minchan@...nel.org>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     Michal Hocko <mhocko@...e.com>,
        David Hildenbrand <david@...hat.com>,
        linux-mm <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Suren Baghdasaryan <surenb@...gle.com>,
        John Dias <joaodias@...gle.com>, huww98@...look.com,
        John Hubbard <jhubbard@...dia.com>
Subject: Re: [RFC v2] mm: introduce page pin owner

On Tue, Dec 28, 2021 at 09:59:04AM -0800, Minchan Kim wrote:
> A Contiguous Memory Allocator (CMA) allocation can fail if any page
> within the requested range has an elevated refcount (i.e., a pinned
> page).
> 
> Debugging such failures is difficult because the struct pages only
> show a combined refcount; they do not record the callstacks or
> backtraces of the code that acquired each refcount. So the source
> of the page pins remains a mystery at the time of the CMA failure.
> 
> In order to solve this without adding much overhead, do nothing
> most of the time. Once a CMA failure occurs, mark the page (this
> requires a pointer's worth of space in struct page, but it uses page
> extensions to get that) and start tracing the subsequent put_page()
> calls. As the program finishes up, each page pin will be undone and
> traced with a backtrace. The programmer reads the trace output and
> sees the list of all page pinning code paths.
> 
> This will consume an additional 8 bytes per 4KB page, or an
> additional 0.2% of RAM. In addition to the storage space, it will
> have some performance cost, due to increasing the size of struct
> page so that it is greater than the cacheline size (or multiples
> thereof) of popular (x86, ...) CPUs.
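
(Sanity-checking the 0.2% figure above with some quick arithmetic: 8 bytes
of tracking data per 4096-byte page works out to about 0.195% of RAM.)

```shell
# 8 bytes of per-page tracking data, as a percentage of a 4 KiB page
awk 'BEGIN { printf "%.3f%%\n", 8 / 4096 * 100 }'
# prints 0.195%
```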
> 
> The idea can be applied to every user of migrate_pages, not just CMA,
> to learn why a page migration failed. To support this, the
> implementation takes the "enum migrate_reason" string as a filter for
> the tracepoint (see below).

Hi Folks,

Any comments on how to proceed with this work?

Thanks.

> 
> Usage)
> 
> trace_dir="/sys/kernel/tracing"
> echo 1 > $trace_dir/options/stacktrace
> echo 1 > $trace_dir/events/page_pin_owner/enable
> echo "contig_range" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
> 
> ===
> example:
> If you are interested in compaction failures, use "compaction" as the
> filter instead of "contig_range". For the other filter strings, refer
> to migrate_reason_names in mm/debug.c.
> ===
> 
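
The steps above can be wrapped in a small helper script. This is only a
sketch: it assumes tracefs is mounted at /sys/kernel/tracing, the kernel
was built with CONFIG_PAGE_PIN_OWNER and booted with page_pin_owner=on,
and the script name (ppo-trace.sh) is made up for illustration.

```shell
#!/bin/sh
# Enable page_pin_owner tracing with stack traces, filtered by migrate reason.
# Usage: ./ppo-trace.sh [filter]   (filter defaults to "contig_range")
set -e
filter="${1:-contig_range}"
trace_dir=/sys/kernel/tracing

echo 1 > "$trace_dir/options/stacktrace"
echo 1 > "$trace_dir/events/page_pin_owner/enable"
echo "$filter" > "$trace_dir/events/page_pin_owner/report_page_pinners/filter"

echo "tracing enabled; run your workload, then: cat $trace_dir/trace"
```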
> ..
> run workload
> ..
> 
> cat $trace_dir/trace
> ..
> 
>             bash-7442    [007] ...1.   288.504690: report_page_pinners: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5 reason=contig_range err=-11
>             bash-7442    [007] ...1.   288.504691: <stack trace>
>  => trace_event_raw_event_report_page_pinners
>  => __report_page_pinners
>  => migrate_pages
>  => alloc_contig_range
>  => cma_alloc.cold
>  => cma_alloc_write
>  => simple_attr_write
>  => debugfs_attr_write
>  => full_proxy_write
>  => vfs_write
>  => ksys_write
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> ..
> ..
> 
>             bash-7442    [007] ...1.   288.506426: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5
>             bash-7442    [007] ...1.   288.506427: <stack trace>
>  => trace_event_raw_event_page_pin_owner_put
>  => __page_pin_owner_put
>  => put_page
>  => putback_movable_pages
>  => alloc_contig_range
>  => cma_alloc.cold
>  => cma_alloc_write
>  => simple_attr_write
>  => debugfs_attr_write
>  => full_proxy_write
>  => vfs_write
>  => ksys_write
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> ..
>              cc1-31104   [001] d..2.   288.507632: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=1 mapcount=0 mapping=ffff88f5480a2709 mt=4
>              cc1-31104   [001] d..2.   288.507636: <stack trace>
>  => trace_event_raw_event_page_pin_owner_put
>  => __page_pin_owner_put
>  => release_pages
>  => tlb_flush_mmu
>  => unmap_page_range
>  => unmap_vmas
>  => exit_mmap
>  => mmput
>  => do_exit
>  => do_group_exit
>  => __x64_sys_exit_group
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> ..
> 
>             make-31157   [007] d..3.   288.508333: page_pin_owner_put: pfn=0x91430 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=ffff88f5480a2709 mt=5
>             make-31157   [007] d..3.   288.508335: <stack trace>
>  => trace_event_raw_event_page_pin_owner_put
>  => __page_pin_owner_put
>  => release_pages
>  => __pagevec_lru_add
>  => lru_add_drain_cpu
>  => lru_add_drain
>  => exit_mmap
>  => mmput
>  => begin_new_exec
>  => load_elf_binary
>  => bprm_execve
>  => do_execveat_common.isra.0
>  => __x64_sys_execve
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> 
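
When many pages are pinned, the raw trace gets long. Assuming the trace
has been saved to a file (trace.txt here, just for illustration), a quick
awk pass can aggregate the put events per pfn, matching the field layout
of the sample output above:

```shell
# Count page_pin_owner_put events per pfn in a saved trace file.
awk '/page_pin_owner_put:/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^pfn=/) { sub(/^pfn=/, "", $i); n[$i]++ }
}
END { for (p in n) printf "pfn %s: %d put events\n", p, n[p] }' trace.txt
```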
> Signed-off-by: Minchan Kim <minchan@...nel.org>
> ---
> * from v1 - https://lore.kernel.org/lkml/20211206184730.858850-1-minchan@kernel.org/
>   * PagePinOwner naming suggestion - jhubbard@
>   * Description/Kconfig suggestion - jhubbard@
>   * General migrate_page supports - huww98@
> 
>  include/linux/mm.h                    |  7 ++-
>  include/linux/page_ext.h              |  3 +
>  include/linux/page_pin_owner.h        | 48 ++++++++++++++
>  include/trace/events/page_pin_owner.h | 81 ++++++++++++++++++++++++
>  mm/Kconfig.debug                      | 15 +++++
>  mm/Makefile                           |  1 +
>  mm/migrate.c                          |  5 +-
>  mm/page_alloc.c                       |  3 +
>  mm/page_ext.c                         |  4 ++
>  mm/page_pin_owner.c                   | 91 +++++++++++++++++++++++++++
>  10 files changed, 256 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/page_pin_owner.h
>  create mode 100644 include/trace/events/page_pin_owner.h
>  create mode 100644 mm/page_pin_owner.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 82be88c1fdbb..5c437faf129c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -26,6 +26,7 @@
>  #include <linux/err.h>
>  #include <linux/page-flags.h>
>  #include <linux/page_ref.h>
> +#include <linux/page_pin_owner.h>
>  #include <linux/memremap.h>
>  #include <linux/overflow.h>
>  #include <linux/sizes.h>
> @@ -714,8 +715,12 @@ static inline unsigned int folio_order(struct folio *folio)
>   */
>  static inline int put_page_testzero(struct page *page)
>  {
> +	int ret;
> +
>  	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> -	return page_ref_dec_and_test(page);
> +	ret = page_ref_dec_and_test(page);
> +	page_pin_owner_put(page);
> +	return ret;
>  }
>  
>  static inline int folio_put_testzero(struct folio *folio)
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index fabb2e1e087f..3236efd5ab07 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -23,6 +23,9 @@ enum page_ext_flags {
>  	PAGE_EXT_YOUNG,
>  	PAGE_EXT_IDLE,
>  #endif
> +#if defined(CONFIG_PAGE_PIN_OWNER)
> +	PAGE_EXT_PIN_OWNER,
> +#endif
>  };
>  
>  /*
> diff --git a/include/linux/page_pin_owner.h b/include/linux/page_pin_owner.h
> new file mode 100644
> index 000000000000..e68adcdd6799
> --- /dev/null
> +++ b/include/linux/page_pin_owner.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_PAGE_PIN_OWNER_H
> +#define __LINUX_PAGE_PIN_OWNER_H
> +
> +#include <linux/jump_label.h>
> +
> +#ifdef CONFIG_PAGE_PIN_OWNER
> +extern struct static_key_false page_pin_owner_inited;
> +extern struct page_ext_operations page_pin_owner_ops;
> +
> +void __report_page_pinners(struct page *page, int reason, int err);
> +void __page_pin_owner_put(struct page *page);
> +void __reset_page_pin_owner(struct page *page, unsigned int order);
> +
> +static inline void reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +	if (static_branch_unlikely(&page_pin_owner_inited))
> +		__reset_page_pin_owner(page, order);
> +}
> +
> +static inline void report_page_pinners(struct page *page, int reason, int err)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited))
> +		return;
> +
> +	__report_page_pinners(page, reason, err);
> +}
> +
> +static inline void page_pin_owner_put(struct page *page)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited))
> +		return;
> +
> +	__page_pin_owner_put(page);
> +}
> +
> +#else
> +static inline void reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +}
> +static inline void report_page_pinners(struct page *page, int reason, int err)
> +{
> +}
> +static inline void page_pin_owner_put(struct page *page)
> +{
> +}
> +#endif /* CONFIG_PAGE_PIN_OWNER */
> +#endif /*__LINUX_PAGE_PIN_OWNER_H */
> diff --git a/include/trace/events/page_pin_owner.h b/include/trace/events/page_pin_owner.h
> new file mode 100644
> index 000000000000..0096297312f5
> --- /dev/null
> +++ b/include/trace/events/page_pin_owner.h
> @@ -0,0 +1,81 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM page_pin_owner
> +
> +#if !defined(_TRACE_PAGE_PIN_OWNER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_PAGE_PIN_OWNER_H
> +
> +#include <linux/types.h>
> +#include <linux/tracepoint.h>
> +#include <trace/events/mmflags.h>
> +
> +TRACE_EVENT(page_pin_owner_put,
> +
> +	TP_PROTO(struct page *page),
> +
> +	TP_ARGS(page),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, pfn)
> +		__field(unsigned long, flags)
> +		__field(int, count)
> +		__field(int, mapcount)
> +		__field(void *, mapping)
> +		__field(int, mt)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->pfn = page_to_pfn(page);
> +		__entry->flags = page->flags;
> +		__entry->count = page_ref_count(page);
> +		__entry->mapcount = page_mapcount(page);
> +		__entry->mapping = page->mapping;
> +		__entry->mt = get_pageblock_migratetype(page);
> +		),
> +
> +	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
> +			__entry->pfn,
> +			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> +			__entry->count,
> +			__entry->mapcount, __entry->mapping, __entry->mt)
> +);
> +
> +TRACE_EVENT(report_page_pinners,
> +
> +	TP_PROTO(struct page *page, const char *reason, int err),
> +
> +	TP_ARGS(page, reason, err),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, pfn)
> +		__field(unsigned long, flags)
> +		__field(int, count)
> +		__field(int, mapcount)
> +		__field(void *, mapping)
> +		__field(int, mt)
> +		__field(const char *, reason)
> +		__field(int, err)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->pfn = page_to_pfn(page);
> +		__entry->flags = page->flags;
> +		__entry->count = page_ref_count(page);
> +		__entry->mapcount = page_mapcount(page);
> +		__entry->mapping = page->mapping;
> +		__entry->mt = get_pageblock_migratetype(page);
> +		__entry->reason = reason;
> +		__entry->err = err;
> +		),
> +
> +	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d reason=%s err=%d",
> +			__entry->pfn,
> +			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> +			__entry->count, __entry->mapcount, __entry->mapping,
> +			__entry->mt, __entry->reason, __entry->err)
> +);
> +
> +#endif /* _TRACE_PAGE_PIN_OWNER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 5bd5bb097252..ed2d5e6450b7 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -86,6 +86,21 @@ config PAGE_TABLE_CHECK_ENFORCED
>  
>  	  If unsure say "n".
>  
> +config PAGE_PIN_OWNER
> +	bool "Track page pinner"
> +	depends on DEBUG_KERNEL && TRACEPOINTS && STACKTRACE_SUPPORT
> +	select PAGE_EXTENSION
> +	select STACKTRACE
> +	help
> +	  This keeps track of what call chain is the pinner of a page, may
> +	  help to find contiguous page allocation(memory-hotplug, compaction,
> +	  cma and so on) failure. Even if you include this feature in your
> +	  build, it is disabled by default. You should pass "page_pin_owner=on"
> +	  to boot parameter in order to enable it. Eats a fair amount of memory
> +	  if enabled.
> +
> +	  If unsure, say N.
> +
>  config PAGE_POISONING
>  	bool "Poison pages after freeing"
>  	help
> diff --git a/mm/Makefile b/mm/Makefile
> index 588d3113f3b0..ca89f4fa38ce 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -104,6 +104,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
>  obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
>  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> +obj-$(CONFIG_PAGE_PIN_OWNER) += page_pin_owner.o
>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>  obj-$(CONFIG_ZPOOL)	+= zpool.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c7da064b4781..aa8a2c21da8e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1105,6 +1105,8 @@ static int unmap_and_move(new_page_t get_new_page,
>  	rc = __unmap_and_move(page, newpage, force, mode);
>  	if (rc == MIGRATEPAGE_SUCCESS)
>  		set_page_owner_migrate_reason(newpage, reason);
> +	else
> +		report_page_pinners(page, reason, rc);
>  
>  out:
>  	if (rc != -EAGAIN) {
> @@ -1273,7 +1275,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	if (rc == MIGRATEPAGE_SUCCESS) {
>  		move_hugetlb_state(hpage, new_hpage, reason);
>  		put_new_page = NULL;
> -	}
> +	} else
> +		report_page_pinners(hpage, reason, rc);
>  
>  out_unlock:
>  	unlock_page(hpage);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5d62e1c8d81..98125b1b6e7e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -64,6 +64,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/mm.h>
>  #include <linux/page_owner.h>
> +#include <linux/page_pin_owner.h>
>  #include <linux/page_table_check.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
> @@ -1310,6 +1311,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  		if (memcg_kmem_enabled() && PageMemcgKmem(page))
>  			__memcg_kmem_uncharge_page(page, order);
>  		reset_page_owner(page, order);
> +		reset_page_pin_owner(page, order);
>  		page_table_check_free(page, order);
>  		return false;
>  	}
> @@ -1350,6 +1352,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  	page_cpupid_reset_last(page);
>  	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>  	reset_page_owner(page, order);
> +	reset_page_pin_owner(page, order);
>  	page_table_check_free(page, order);
>  
>  	if (!PageHighMem(page)) {
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 2e66d934d63f..3c0df97b9b8b 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
> +#include <linux/page_pin_owner.h>
>  #include <linux/page_idle.h>
>  #include <linux/page_table_check.h>
>  
> @@ -79,6 +80,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
>  #ifdef CONFIG_PAGE_TABLE_CHECK
>  	&page_table_check_ops,
>  #endif
> +#ifdef CONFIG_PAGE_PIN_OWNER
> +	&page_pin_owner_ops,
> +#endif
>  };
>  
>  unsigned long page_ext_size = sizeof(struct page_ext);
> diff --git a/mm/page_pin_owner.c b/mm/page_pin_owner.c
> new file mode 100644
> index 000000000000..736617df093c
> --- /dev/null
> +++ b/mm/page_pin_owner.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/mm.h>
> +#include <linux/page_pin_owner.h>
> +#include <linux/migrate.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/page_pin_owner.h>
> +
> +static bool page_pin_owner_enabled;
> +DEFINE_STATIC_KEY_FALSE(page_pin_owner_inited);
> +EXPORT_SYMBOL(page_pin_owner_inited);
> +
> +static int __init early_page_pin_owner_param(char *buf)
> +{
> +	return kstrtobool(buf, &page_pin_owner_enabled);
> +}
> +early_param("page_pin_owner", early_page_pin_owner_param);
> +
> +static bool need_page_pin_owner(void)
> +{
> +	return page_pin_owner_enabled;
> +}
> +
> +static void init_page_pin_owner(void)
> +{
> +	if (!page_pin_owner_enabled)
> +		return;
> +
> +	static_branch_enable(&page_pin_owner_inited);
> +}
> +
> +struct page_ext_operations page_pin_owner_ops = {
> +	.need = need_page_pin_owner,
> +	.init = init_page_pin_owner,
> +};
> +
> +void __reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +	struct page_ext *page_ext;
> +	int i;
> +
> +	page_ext = lookup_page_ext(page);
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
> +			break;
> +
> +		clear_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
> +		page_ext = page_ext_next(page_ext);
> +	}
> +}
> +
> +void __report_page_pinners(struct page *page, int reason, int err)
> +{
> +	struct page_ext *page_ext;
> +
> +	page_ext = lookup_page_ext(page);
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	test_and_set_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
> +	trace_report_page_pinners(page, migrate_reason_names[reason], err);
> +}
> +
> +void __page_pin_owner_put(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
> +		return;
> +
> +	trace_page_pin_owner_put(page);
> +}
> +EXPORT_SYMBOL(__page_pin_owner_put);
> +
> +static int __init page_pin_owner_init(void)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited)) {
> +		pr_info("page_pin_owner is disabled\n");
> +		return 0;
> +	}
> +
> +	pr_info("page_pin_owner is enabled\n");
> +	return 0;
> +}
> +late_initcall(page_pin_owner_init)
> -- 
> 2.34.1.448.ga2b2bfdf31-goog
> 
