linux-kernel - Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

Message-ID: <564C9A86.1090906@suse.cz>
Date:	Wed, 18 Nov 2015 16:34:30 +0100
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Joonsoo Kim <js1304@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Michal Nazarewicz <mina86@...a86.com>,
	Minchan Kim <minchan@...nel.org>, Mel Gorman <mgorman@...e.de>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	linux-api@...r.kernel.org, Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page
 reference manipulation

On 11/09/2015 08:23 AM, Joonsoo Kim wrote:
> CMA allocation should be guaranteed to succeed by definition, but,
> unfortunately, it would be failed sometimes. It is hard to track down
> the problem, because it is related to page reference manipulation and
> we don't have any facility to analyze it.

Reminds me of PeterZ's VM_PINNED patchset. What happened to it?
https://lwn.net/Articles/600502/

> This patch adds tracepoints to track down page reference manipulation.
> With it, we can find exact reason of failure and can fix the problem.
> Following is an example of tracepoint output.
> 
> <...>-9018  [004]    92.678375: page_ref_set:         pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
> <...>-9018  [004]    92.678378: kernel_stack:
>  => get_page_from_freelist (ffffffff81176659)
>  => __alloc_pages_nodemask (ffffffff81176d22)
>  => alloc_pages_vma (ffffffff811bf675)
>  => handle_mm_fault (ffffffff8119e693)
>  => __do_page_fault (ffffffff810631ea)
>  => trace_do_page_fault (ffffffff81063543)
>  => do_async_page_fault (ffffffff8105c40a)
>  => async_page_fault (ffffffff817581d8)
> [snip]
> <...>-9018  [004]    92.678379: page_ref_mod:         pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
> [snip]
> ...
> ...
> <...>-9131  [001]    93.174468: test_pages_isolated:  start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
> [snip]
> <...>-9018  [004]    93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
>  => release_pages (ffffffff8117c9e4)
>  => free_pages_and_swap_cache (ffffffff811b0697)
>  => tlb_flush_mmu_free (ffffffff81199616)
>  => tlb_finish_mmu (ffffffff8119a62c)
>  => exit_mmap (ffffffff811a53f7)
>  => mmput (ffffffff81073f47)
>  => do_exit (ffffffff810794e9)
>  => do_group_exit (ffffffff81079def)
>  => SyS_exit_group (ffffffff81079e74)
>  => entry_SYSCALL_64_fastpath (ffffffff817560b6)
> 
> This output shows that problem comes from exit path. In exit path,
> to improve performance, pages are not freed immediately. They are gathered
> and processed by batch. During this process, migration cannot be possible
> and CMA allocation is failed. This problem is hard to find without this
> page reference tracepoint facility.

Yeah, but once you realized this was the problem, what was the fix? Presumably
not removing batching from the exit path? Shouldn't CMA in this case just try
waiting for the pins to go away, which would eventually happen? And for long-term
pins, VM_PINNED would make sure the pages are migrated away from CMA pageblocks
first?
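
To make the "just wait for the pins to go away" idea concrete, here is a rough
user-space sketch of a bounded retry loop. It is not CMA code: every name in it
(cma_alloc_with_retry_sketch, range_free_fn, backoff_fn, pins_drained) is
hypothetical, and the "pin" is only a counter standing in for pages still held
by a batched free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the range has no transient pins. */
typedef bool (*range_free_fn)(void *ctx);
/* Hypothetical callback: wait a little between attempts (sleep, yield, ...). */
typedef void (*backoff_fn)(int attempt);

/*
 * Retry until the range is free or max_attempts is exhausted.
 * Returns the attempt number that succeeded, or -1 if the pins never went away.
 */
static int cma_alloc_with_retry_sketch(range_free_fn range_is_free, void *ctx,
				       backoff_fn backoff, int max_attempts)
{
	assert(max_attempts > 0);

	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (range_is_free(ctx))
			return attempt;
		/* Transient pin: back off and try again, except after the last try. */
		if (backoff && attempt < max_attempts)
			backoff(attempt);
	}
	return -1;
}

/*
 * Simulated transient pins: ctx points to an int pin count. Each failed
 * check drops one pin, as if one batched free had completed in the meantime.
 */
static bool pins_drained(void *ctx)
{
	int *pins = ctx;

	if (*pins == 0)
		return true;
	(*pins)--;
	return false;
}
```

The bound matters: a transient pin from the exit path goes away on its own, but
a long-term pin does not, so the loop has to give up at some point rather than
wait forever.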

So I'm worried that this is quite a nontrivial change for a very specific use case.

> Enabling this feature bloat kernel text 20 KB in my configuration.

It's not just that, see below.

[...]


>  static inline int page_ref_freeze(struct page *page, int count)
>  {
> -	return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
> +	int ret = likely(atomic_cmpxchg(&page->_count, count, 0) == count);

The "likely" makes no sense anymore, does it?
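
A user-space sketch of what I mean, not the patch's code: likely() here is a
stand-in built on __builtin_expect, struct fake_page and trace_freeze() are
hypothetical, and C11 atomics replace atomic_cmpxchg(). Once the comparison
result is stored in a variable so it can be traced and returned, this function
no longer branches on it, so the hint is dropped.

```c
#include <assert.h>
#include <stdatomic.h>

/* User-space stand-in for the kernel's likely(); an assumption of this sketch. */
#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical stand-in for struct page, holding only the refcount. */
struct fake_page {
	atomic_int count;
};

/* Hypothetical trace hook; here it only records the last traced result. */
static int last_traced_ret = -1;

static void trace_freeze(struct fake_page *page, int count, int ret)
{
	(void)page;
	(void)count;
	last_traced_ret = ret;
}

/*
 * Freeze the refcount: replace `count` with 0 if the refcount still equals
 * `count`. The result is stored, traced, and returned. There is no likely()
 * because the value is not used as a branch condition in this function.
 */
static inline int page_ref_freeze_sketch(struct fake_page *page, int count)
{
	int expected = count;
	int ret;

	assert(count > 0);
	ret = atomic_compare_exchange_strong(&page->count, &expected, 0);
	trace_freeze(page, count, ret);
	return ret;
}
```

A caller that still wants the hint can apply it where the branch actually is,
e.g. `if (likely(page_ref_freeze_sketch(page, 2)))`.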

> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 957d3da..71d2399 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -28,3 +28,7 @@ config DEBUG_PAGEALLOC
>  
>  config PAGE_POISONING
>  	bool
> +
> +config DEBUG_PAGE_REF
> +	bool "Enable tracepoint to track down page reference manipulation"

So you should probably state the costs: the extra memory, and also that all the
page ref manipulations are now turned into function calls, even if the
tracepoints are disabled. Patch 1 didn't change that many callsites, so maybe it
would be feasible to have the tracepoints inline, where being disabled has
near-zero overhead?
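
Roughly the shape I have in mind, as a user-space sketch only. All names
(struct fake_page, page_ref_trace_enabled, page_ref_trace_mod,
page_ref_add_sketch) are hypothetical, and a plain bool stands in for whatever
cheap enabled check the real thing would use (in the kernel, presumably the
tracepoint's static key).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for the kernel's unlikely(); an assumption of this sketch. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for struct page, holding only the refcount. */
struct fake_page {
	atomic_int count;
};

/* Hypothetical on/off switch for tracing; off by default. */
static bool page_ref_trace_enabled;
/* Counts how often the slow path ran, so the effect is observable. */
static int trace_events;

/* Out-of-line slow path, reached only when tracing is enabled. */
static __attribute__((noinline)) void page_ref_trace_mod(struct fake_page *page,
							 int val)
{
	(void)page;
	(void)val;
	trace_events++;
}

/*
 * The refcount update itself stays inline. The only cost when tracing is
 * disabled is one predicted-not-taken test; the function call happens only
 * when tracing is enabled.
 */
static inline void page_ref_add_sketch(struct fake_page *page, int nr)
{
	assert(nr > 0);
	atomic_fetch_add(&page->count, nr);
	if (unlikely(page_ref_trace_enabled))
		page_ref_trace_mod(page, nr);
}
```

That way the fast path keeps the atomic op inline and only the tracing itself
lives out of line.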
