Message-ID: <b79a4cc4-34d4-a9ca-4d17-367be7d7cc1d@gmail.com>
Date: Wed, 1 Aug 2018 11:37:05 +1000
From: Rashmica <rashmica.g@...il.com>
To: John Allen <jallen@...ux.ibm.com>, linux-kernel@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org
Cc: mhocko@...e.cz, n-horiguchi@...jp.nec.com,
kamezawa.hiroyu@...fujitsu.com, mgorman@...e.de
Subject: Re: Infinite looping observed in __offline_pages
On 26/07/18 04:11, John Allen wrote:
> Hi All,
>
> Under heavy stress and constant memory hot add/remove, I have observed
> the following loop to occasionally loop infinitely:
>
> mm/memory_hotplug.c:__offline_pages
>
> repeat:
>     /* start memory hot removal */
>     ret = -EINTR;
>     if (signal_pending(current))
>         goto failed_removal;
>
>     cond_resched();
>     lru_add_drain_all();
>     drain_all_pages(zone);
>
>     pfn = scan_movable_pages(start_pfn, end_pfn);
>     if (pfn) { /* We have movable pages */
>         ret = do_migrate_range(pfn, end_pfn);
>         goto repeat;
>     }
>
What is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE set to for you?
I have also observed this when hot removing and adding memory. However, I
have only seen it when my kernel has
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n (when memory is onlined
automatically I do not have this issue), so I assumed that I wasn't
onlining the memory properly...
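
For reference, with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n a hot-added
memory block stays offline until something writes "online" to its sysfs
state file. Below is a minimal user-space sketch of that step, assuming
the standard /sys/devices/system/memory/memoryN/state layout. The helper
names (memory_state_path, online_memory_block) are hypothetical and only
illustrate the idea; this is not necessarily how the memory was onlined
in the tests described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Standard sysfs directory that holds one memoryN/ entry per memory block. */
#define MEMORY_SYSFS_DIR "/sys/devices/system/memory"

/*
 * Hypothetical helper: build "<dir>/memory<block>/state" into buf.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int memory_state_path(char *buf, size_t len, const char *dir,
                      unsigned int block)
{
    int n;

    assert(buf != NULL && dir != NULL);
    n = snprintf(buf, len, "%s/memory%u/state", dir, block);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: request onlining of one memory block by writing
 * "online" to its state file. Returns 0 on success, -1 on any error
 * (path too long, file missing, or the write being rejected).
 */
int online_memory_block(const char *dir, unsigned int block)
{
    char path[256];
    FILE *f;
    int ok;

    if (memory_state_path(path, sizeof(path), dir, block) != 0)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fputs("online", f) >= 0;
    /* fclose() flushes the buffer, so a rejected write surfaces here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A caller would use something like online_memory_block(MEMORY_SYSFS_DIR, 32)
as root, where 32 is only an example block number; checking the return
value shows whether the online request was accepted.
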
> What appears to be happening in this case is that do_migrate_range
> returns a failure code which is being ignored. The failure is stemming
> from migrate_pages returning "1" which I'm guessing is the result of
> us hitting the following case:
>
> mm/migrate.c: migrate_pages
>
>         default:
>             /*
>              * Permanent failure (-EBUSY, -ENOSYS, etc.):
>              * unlike -EAGAIN case, the failed page is
>              * removed from migration page list and not
>              * retried in the next outer loop.
>              */
>             nr_failed++;
>             break;
>         }
>
> Does a failure in do_migrate_range indicate that the range is
> unmigratable and the loop in __offline_pages should terminate and goto
> failed_removal? Or should we allow a certain number of retries before we
> give up on migrating the range?
>
> This issue was observed on a ppc64le lpar on a 4.18-rc6 kernel.
>
> -John
>