Message-ID: <21d138a5-13e4-9e83-d7fe-e0639a8d180a@collabora.com>
Date: Thu, 7 Mar 2019 09:16:20 +0000
From: Guillaume Tucker <guillaume.tucker@...labora.com>
To: Mike Rapoport <rppt@...ux.ibm.com>
Cc: Dan Williams <dan.j.williams@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Michal Hocko <mhocko@...e.com>,
Mark Brown <broonie@...nel.org>,
Tomeu Vizoso <tomeu.vizoso@...labora.com>,
Matt Hart <matthew.hart@...aro.org>,
Stephen Rothwell <sfr@...b.auug.org.au>, khilman@...libre.com,
enric.balletbo@...labora.com, Nicholas Piggin <npiggin@...il.com>,
Dominik Brodowski <linux@...inikbrodowski.net>,
Masahiro Yamada <yamada.masahiro@...ionext.com>,
Kees Cook <keescook@...omium.org>,
Adrian Reber <adrian@...as.de>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Johannes Weiner <hannes@...xchg.org>,
Linux MM <linux-mm@...ck.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Richard Guy Briggs <rgb@...hat.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>, info@...nelci.org
Subject: Re: next/master boot bisection: next-20190215 on beaglebone-black

On 06/03/2019 14:05, Mike Rapoport wrote:
> On Wed, Mar 06, 2019 at 10:14:47AM +0000, Guillaume Tucker wrote:
>> On 01/03/2019 23:23, Dan Williams wrote:
>>> On Fri, Mar 1, 2019 at 1:05 PM Guillaume Tucker
>>> <guillaume.tucker@...labora.com> wrote:
>>>
>>> Is there an early-printk facility that can be turned on to see how far
>>> we get in the boot?
>>
>> Yes, I've done that now by enabling CONFIG_DEBUG_AM33XXUART1 and
>> earlyprintk in the command line. Here's the result, with the
>> commit cherry picked on top of next-20190304:
>>
>> https://lava.collabora.co.uk/scheduler/job/1526326
>>
>> [ 1.379522] ti-sysc 4804a000.target-module: sysc_flags 00000222 != 00000022
>> [ 1.396718] Unable to handle kernel paging request at virtual address 77bb4003
>> [ 1.404203] pgd = (ptrval)
>> [ 1.406971] [77bb4003] *pgd=00000000
>> [ 1.410650] Internal error: Oops: 5 [#1] ARM
>> [...]
>> [ 1.672310] [<c07051a0>] (clk_hw_create_clk.part.21) from [<c06fea34>] (devm_clk_get+0x4c/0x80)
>> [ 1.681232] [<c06fea34>] (devm_clk_get) from [<c064253c>] (sysc_probe+0x28c/0xde4)
>>
>> It's always failing at that point in the code. Also when
>> enabling "debug" on the kernel command line, the issue goes
>> away (exact same binaries etc..):
>>
>> https://lava.collabora.co.uk/scheduler/job/1526327
>>
>> For the record, here's the branch I've been using:
>>
>> https://gitlab.collabora.com/gtucker/linux/tree/beaglebone-black-next-20190304-debug
>>
>> The board otherwise boots fine with next-20190304 (SMP=n), and
>> also with the patch applied but the shuffle configs set to n.
>>
>>> Were there any boot *successes* on ARM with shuffling enabled? I.e.
>>> clues about what's different about the specific memory setup for
>>> beagle-bone-black.
>>
>> Looking at the KernelCI results from next-20190215, it looks like
>> only the BeagleBone Black with SMP=n failed to boot:
>>
>> https://kernelci.org/boot/all/job/next/branch/master/kernel/next-20190215/
>>
>> Of course that's not all the ARM boards that exist out there, but
>> it's a fairly large coverage already.
>>
>> As the kernel panic always seems to originate in ti-sysc.c,
>> there's a chance it's only visible on that platform... I'm doing
>> a KernelCI run now with my test branch to double check that,
>> it'll take a few hours so I'll send an update later if I get
>> anything useful out of it.

Here's the result. There were a couple of failures, but some were
due to infrastructure errors (nyan-big), and I'm not sure what the
problem was with the meson boards:

https://staging.kernelci.org/boot/all/job/gtucker/branch/kernelci-local/kernel/next-20190304-1-g4f0b547b03da/

So there's no clear indication that the shuffle config is causing
an issue on any platform other than the BeagleBone Black.

>> In the meantime, I'm happy to try out other things with more
>> debug configs turned on or any potential fixes someone might
>> have.
>
> ARM is the only arch that sets ARCH_HAS_HOLES_MEMORYMODEL to 'y'. Maybe the
> failure has something to do with it...
>
> Guillaume, can you try this patch:

Sure, it doesn't seem to fix the problem though:

https://lava.collabora.co.uk/scheduler/job/1527471

I've added the patch to the same branch based on next-20190304.

I guess this needs to be debugged a little further to see what
the panic is really about. I'll see if I can spend a bit more
time on it this week, unless there's a BeagleBone expert
available to help or someone has another fix to try out.

Guillaume

> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 3ce1248..4a04aac 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -58,7 +58,8 @@ module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400);
> * For two pages to be swapped in the shuffle, they must be free (on a
> * 'free_area' lru), have the same order, and have the same migratetype.
> */
> -static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> +static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order,
> + struct zone *z)
> {
> struct page *page;
>
> @@ -80,6 +81,9 @@ static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order)
> if (!PageBuddy(page))
> return NULL;
>
> + if (!memmap_valid_within(pfn, page, z))
> + return NULL;
> +
> /*
> * ...is the page on the same list as the page we will
> * shuffle it with?
> @@ -123,7 +127,7 @@ void __meminit __shuffle_zone(struct zone *z)
> * page_j randomly selected in the span @zone_start_pfn to
> * @spanned_pages.
> */
> - page_i = shuffle_valid_page(i, order);
> + page_i = shuffle_valid_page(i, order, z);
> if (!page_i)
> continue;
>
> @@ -137,7 +141,7 @@ void __meminit __shuffle_zone(struct zone *z)
> j = z->zone_start_pfn +
> ALIGN_DOWN(get_random_long() % z->spanned_pages,
> order_pages);
> - page_j = shuffle_valid_page(j, order);
> + page_j = shuffle_valid_page(j, order, z);
> if (page_j && page_j != page_i)
> break;
> }
>
>
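
For anyone following along without the source at hand, the control
flow the patch above touches can be sketched in user space roughly as
below. This is only an illustration under stated assumptions, not the
code in mm/shuffle.c: struct fake_page, valid_page() and
shuffle_sketch() are hypothetical names, the two boolean fields merely
stand in for the PageBuddy() and memmap_valid_within() checks, and the
retry limit of 10 is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page: one entry per "pfn". */
struct fake_page {
	bool free;	/* stands in for the PageBuddy() check */
	bool in_zone;	/* stands in for memmap_valid_within() */
	int payload;	/* the thing that actually gets swapped */
};

/* Same shape as shuffle_valid_page(): NULL means "skip this slot". */
struct fake_page *valid_page(struct fake_page *map, size_t idx)
{
	struct fake_page *page = &map[idx];

	if (!page->free)
		return NULL;
	if (!page->in_zone)	/* the extra check the patch adds */
		return NULL;
	return page;
}

/* Walk every slot, pick a random partner, swap only if both are valid. */
void shuffle_sketch(struct fake_page *map, size_t n, unsigned int seed)
{
	assert(n > 0);
	srand(seed);

	for (size_t i = 0; i < n; i++) {
		struct fake_page *page_i = valid_page(map, i);
		struct fake_page *page_j = NULL;

		if (!page_i)
			continue;

		/* Bounded retries; the limit is arbitrary in this sketch. */
		for (int tries = 0; tries < 10; tries++) {
			size_t j = (size_t)rand() % n;

			page_j = valid_page(map, j);
			if (page_j && page_j != page_i)
				break;
			page_j = NULL;
		}
		if (!page_j)
			continue;

		int tmp = page_i->payload;
		page_i->payload = page_j->payload;
		page_j->payload = tmp;
	}
}
```

Calling shuffle_sketch() on an array where some entries have free or
in_zone cleared leaves those entries untouched and only permutes the
payloads of the remaining ones, which is the property the extra
validity check is meant to preserve.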