linux-kernel - Re: [PATCH v2] mm: page_alloc: dump migrate-failed pages


Message-ID: <YEefLYiX6rF3Uk4E@google.com>
Date:   Tue, 9 Mar 2021 08:15:41 -0800
From:   Minchan Kim <minchan@...nel.org>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        linux-mm <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        John Dias <joaodias@...gle.com>,
        David Hildenbrand <david@...hat.com>,
        Jason Baron <jbaron@...mai.com>
Subject: Re: [PATCH v2] mm: page_alloc: dump migrate-failed pages

On Tue, Mar 09, 2021 at 10:32:51AM +0100, Michal Hocko wrote:
> On Mon 08-03-21 12:20:47, Minchan Kim wrote:
> > alloc_contig_range is usually used on cma area or movable zone.
> > It's critical if the page migration fails on those areas so
> > dump more debugging message.
> 
> I disagree with this statement. alloc_contig_range is not a reliable
> allocator. Any user, be it CMA or direct users of alloc_contig_range
> have to deal with allocation failures. Debugging information can be
> still useful but considering migration failures critical is
> an overstatement to say the least.

Fair enough. Let's change it.

"Currently, debugging CMA allocation failures is hard because
of the lack of page information. alloc_contig_range is the
proper place to dump that information since it has the list of
pages that failed to migrate."


> 
> > page refcount, mapcount with page flags on dump_page are
> > helpful information to deduce the culprit. Furthermore,
> > dump_page_owner was super helpful to find long term pinner
> > who initiated the page allocation.
> > 
> > Admin could enable the dump like this (disabled by default):
> > 
> > 	echo "func dump_migrate_failure_pages +p" > control
> > 
> > Admin could disable it.
> > 
> > 	echo "func dump_migrate_failure_pages =_" > control
> 
> My original idea was to add few pr_debug and -DDYNAMIC_DEBUG_MODULE for
> page_alloc.c. It makes sense to enable a whole bunch at once though.
> The naming should better reflect this is alloc_contig_range related
> because the above sounds like a generic migration failure thing.

alloc_contig_dump_pages?

> 
> Somebody more familiar with the dynamic debugging infrastructure needs
> to have a look but from a quick look it seems ok.
> 
> Do we really need all the ugly ifdefery, though? Don't we want to have
> this compiled in all the time and just rely on the static branch managed
> by the dynamic debugging framework?

I have no better idea for simplifying it while keeping the flexibility
for arguments and print format.

#if defined(CONFIG_DYNAMIC_DEBUG) || \
        (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
static void alloc_contig_dump_pages(struct list_head *page_list)
{
        static DEFINE_RATELIMIT_STATE(_rs,
                                        DEFAULT_RATELIMIT_INTERVAL,
                                        DEFAULT_RATELIMIT_BURST);

        DEFINE_DYNAMIC_DEBUG_METADATA(descriptor,
                        "migrate failure");
        if (DYNAMIC_DEBUG_BRANCH(descriptor) && __ratelimit(&_rs)) {
                struct page *page;

                WARN(1, "failed callstack");
                list_for_each_entry(page, page_list, lru)
                        dump_page(page, "migration failure");
        }
}
#else
static inline void alloc_contig_dump_pages(struct list_head *page_list)
{
}
#endif
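
For anyone who wants to play with the gating logic outside the kernel,
here is a rough userspace sketch of the same pattern: a runtime switch
that is off by default, combined with a rate limit, in front of the dump
loop. It is not kernel code. struct ratelimit, ratelimit_ok(),
dump_enabled and dump_failed_pages() are all made-up names for
illustration, a plain bool stands in for the dynamic debug static
branch, and a PFN array stands in for the page list.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical userspace stand-in for a ratelimit state. */
struct ratelimit {
	time_t interval;	/* window length in seconds */
	int burst;		/* events allowed per window */
	time_t begin;		/* start of the current window */
	int printed;		/* events emitted in this window */
	bool started;		/* false until the first event */
};

/* Return true if the caller may emit output at time 'now'. */
static bool ratelimit_ok(struct ratelimit *rs, time_t now)
{
	/* Open a new window on first use or once the old one expired. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->started = true;
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return false;
	rs->printed++;
	return true;
}

/* Stand-in for the dynamic debug switch; off by default. */
static bool dump_enabled;

/*
 * Dump 'n' failed entries if the switch is on and the rate limit
 * allows it. Returns the number of entries dumped, 0 otherwise.
 */
static int dump_failed_pages(const unsigned long *pfns, int n, time_t now)
{
	/* Assumed values for this sketch: 10 dumps per 5 seconds. */
	static struct ratelimit rs = { .interval = 5, .burst = 10 };
	int i;

	/* The switch is tested first, so a disabled dump costs no budget. */
	if (!dump_enabled || !ratelimit_ok(&rs, now))
		return 0;

	for (i = 0; i < n; i++)
		fprintf(stderr, "migration failure: pfn %#lx\n", pfns[i]);
	return n;
}
```

Checking the switch before the rate limit mirrors the order in the
function above: while the dump is disabled, calls do not consume any of
the burst budget.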

>  
> [...]
> > +void dump_migrate_failure_pages(struct list_head *page_list)
> > +{
> > +	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor,
> > +			"migrate failure");
> > +	if (DYNAMIC_DEBUG_BRANCH(descriptor) &&
> > +			alloc_contig_ratelimit()) {
> > +		struct page *page;
> > +
> > +		WARN(1, "failed callstack");
> > +		list_for_each_entry(page, page_list, lru)
> > +			dump_page(page, "migration failure");
> > +	}
> 
> Apart from the above, do we have to warn for something that is a
> debugging aid? A similar concern wrt dump_page which uses pr_warn and

Makes sense.

> page owner is using even pr_alert.
> Would it make sense to add a loglevel parameter both into __dump_page
> and dump_page_owner?

Let me try it.
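
To make the idea concrete, a loglevel parameter could look roughly like
the userspace sketch below. This is only an illustration of threading a
level prefix through a dump helper, not the actual __dump_page or
dump_page_owner change. struct fake_page, format_page_dump(),
dump_page_at() and the LVL_* prefixes are hypothetical names, modeled
loosely on the kernel's KERN_* prefixes.

```c
#include <stdio.h>

/* Hypothetical level prefixes for this sketch. */
#define LVL_WARNING	"<4>"
#define LVL_DEBUG	"<7>"

/* Minimal stand-in for the fields a page dump would print. */
struct fake_page {
	unsigned long pfn;
	int refcount;
	int mapcount;
};

/*
 * Format one page dump line into 'buf', prefixed with the level the
 * caller chose. Returns what snprintf() returns.
 */
static int format_page_dump(char *buf, size_t len, const char *loglevel,
			    const struct fake_page *page, const char *reason)
{
	return snprintf(buf, len,
			"%spage: pfn:%#lx refcount:%d mapcount:%d (%s)",
			loglevel, page->pfn, page->refcount,
			page->mapcount, reason);
}

/* Print one page dump line to stderr at the given level. */
static void dump_page_at(const char *loglevel, const struct fake_page *page,
			 const char *reason)
{
	char buf[128];

	format_page_dump(buf, sizeof(buf), loglevel, page, reason);
	fprintf(stderr, "%s\n", buf);
}
```

With that shape, a debugging-only caller would pass LVL_DEBUG while
existing callers keep passing LVL_WARNING, so their output stays
unchanged.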