Message-ID: <b79a4cc4-34d4-a9ca-4d17-367be7d7cc1d@gmail.com>
Date:   Wed, 1 Aug 2018 11:37:05 +1000
From:   Rashmica <rashmica.g@...il.com>
To:     John Allen <jallen@...ux.ibm.com>, linux-kernel@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org
Cc:     mhocko@...e.cz, n-horiguchi@...jp.nec.com,
        kamezawa.hiroyu@...fujitsu.com, mgorman@...e.de
Subject: Re: Infinite looping observed in __offline_pages



On 26/07/18 04:11, John Allen wrote:
> Hi All,
>
> Under heavy stress and constant memory hot add/remove, I have observed
> the following loop to occasionally loop infinitely:
>
> mm/memory_hotplug.c:__offline_pages
>
> repeat:
>        /* start memory hot removal */
>        ret = -EINTR;
>        if (signal_pending(current))
>                goto failed_removal;
>
>        cond_resched();
>        lru_add_drain_all();
>        drain_all_pages(zone);
>
>        pfn = scan_movable_pages(start_pfn, end_pfn);
>        if (pfn) { /* We have movable pages */
>                ret = do_migrate_range(pfn, end_pfn);
>                goto repeat;
>        }
>

What is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE set to for you?

I have also observed this when hot removing and adding memory. However, I
have only seen it when my kernel has
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n (when memory is set to online
automatically I do not have this issue), so I assumed that I wasn't
onlining the memory properly...

> What appears to be happening in this case is that do_migrate_range
> returns a failure code which is being ignored. The failure is stemming
> from migrate_pages returning "1" which I'm guessing is the result of
> us hitting the following case:
>
> mm/migrate.c: migrate_pages
>
>     default:
>         /*
>          * Permanent failure (-EBUSY, -ENOSYS, etc.):
>          * unlike -EAGAIN case, the failed page is
>          * removed from migration page list and not
>          * retried in the next outer loop.
>          */
>         nr_failed++;
>         break;
>     }
>
> Does a failure in do_migrate_range indicate that the range is
> unmigratable and the loop in __offline_pages should terminate and goto
> failed_removal? Or should we allow a certain number of retries before we
> give up on migrating the range?
>
> This issue was observed on a ppc64le lpar on a 4.18-rc6 kernel.
>
> -John
>
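
To make the "certain number of retries" option concrete, here is a minimal
user-space sketch in C. It is not kernel code and not a proposed patch:
struct sim_range, sim_scan(), sim_migrate(), sim_offline() and the limit
MAX_MIGRATE_FAILURES are all hypothetical stand-ins, and the limit of 5 is
an arbitrary assumption. It only illustrates the control flow of checking
the migration result instead of ignoring it.

```c
#include <errno.h>

/* Arbitrary, assumed limit; the thread does not suggest a value. */
#define MAX_MIGRATE_FAILURES 5

/* Hypothetical model of a memory range being offlined. */
struct sim_range {
	int movable_pages;	/* pages that migrate successfully */
	int stuck_pages;	/* pages whose migration always fails */
};

/* Stand-in for scan_movable_pages(): nonzero while pages remain. */
static unsigned long sim_scan(const struct sim_range *r)
{
	return (r->movable_pages + r->stuck_pages) > 0 ? 1UL : 0UL;
}

/*
 * Stand-in for do_migrate_range(): moves every migratable page and
 * returns the number of pages that could not be migrated.
 */
static int sim_migrate(struct sim_range *r)
{
	r->movable_pages = 0;
	return r->stuck_pages;
}

/*
 * Same shape as the "repeat:" loop quoted above, except that a failed
 * migration pass is counted and the loop gives up with -EBUSY once the
 * limit is reached, rather than retrying forever.
 */
static int sim_offline(struct sim_range *r)
{
	int failures = 0;

	while (sim_scan(r)) {
		if (sim_migrate(r) != 0 && ++failures >= MAX_MIGRATE_FAILURES)
			return -EBUSY;
	}
	return 0;
}
```

With a range that has no stuck pages, sim_offline() returns 0 after one
pass. With at least one stuck page, it returns -EBUSY after
MAX_MIGRATE_FAILURES failed passes instead of looping. Whether -EBUSY and a
fixed count are the right choices for __offline_pages is exactly the open
question in this thread.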
