linux-kernel - Re: Regression on ARMs in next-20170531

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170604113257.GA1095@cmpxchg.org>
Date:   Sun, 4 Jun 2017 07:32:57 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Russell King - ARM Linux <linux@...linux.org.uk>
Cc:     Tony Lindgren <tony@...mide.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Josef Bacik <josef@...icpanda.com>,
        Michal Hocko <mhocko@...e.com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Rik van Riel <riel@...hat.com>,
        Mark Brown <broonie@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: Regression on ARMs in next-20170531

On Wed, May 31, 2017 at 06:43:33PM +0100, Russell King - ARM Linux wrote:
> On Wed, May 31, 2017 at 09:45:45AM -0700, Tony Lindgren wrote:
> > Mark Brown noticed that the so far the only booting
> > ARMs are all with CONFIG_SMP disabled and I just
> > confirmed that's the case.
> 
> > 8< --------------------
> > Unable to handle kernel paging request at virtual address 2e116007
> > pgd = c0004000
> > [2e116007] *pgd=00000000
> > Internal error: Oops: 5 [#1] SMP ARM
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc3-00153-gb6bc6724488a #200
> > Hardware name: Generic DRA74X (Flattened Device Tree)
> > task: c0d0adc0 task.stack: c0d00000
> > PC is at __mod_node_page_state+0x2c/0xc8
> > LR is at __per_cpu_offset+0x0/0x8
> > pc : [<c0271de8>]    lr : [<c0d07da4>]    psr: 600000d3
> > sp : c0d01eec  ip : 00000000  fp : c15782f4
> > r10: 00000000  r9 : c1591280  r8 : 00004000
> > r7 : 00000001  r6 : 00000006  r5 : 2e116000  r4 : 00000007
> > r3 : 00000007  r2 : 00000001  r1 : 00000006  r0 : c0dc27c0
> > Flags: nZCv  IRQs off  FIQs off  Mode SVC_32  ISA ARM  Segment none
> ...
> > Code: e79e5103 e28c3001 e0833001 e1a04003 (e19440d5)
> 
> This disassembles to:
> 
>    0:   e79e5103        ldr     r5, [lr, r3, lsl #2]
>    4:   e28c3001        add     r3, ip, #1
>    8:   e0833001        add     r3, r3, r1
>    c:   e1a04003        mov     r4, r3
>   10:   e19440d5        ldrsb   r4, [r4, r5]
> 
> I don't have a similarly configured kernel, but here I have for the
> start of this function:
> 
> 00000680 <__mod_node_page_state>:
>      680:       e1a0c00d        mov     ip, sp
>      684:       e92dd870        push    {r4, r5, r6, fp, ip, lr, pc}
>      688:       e24cb004        sub     fp, ip, #4
>      68c:       e590cc00        ldr     ip, [r0, #3072] ; 0xc00
>      690:       e1a0400d        mov     r4, sp
>      694:       ee1d6f90        mrc     15, 0, r6, cr13, cr0, {4}
>      698:       e08c5001        add     r5, ip, r1
>      69c:       e2855001        add     r5, r5, #1
>      6a0:       e1a03005        mov     r3, r5
>      6a4:       e196c0dc        ldrsb   ip, [r6, ip]
>      6a8:       e19630d3        ldrsb   r3, [r6, r3]
> 
> r5 in your code is the equivalent of r6, r4 => r3, r3 -> r5.
> lr is the __per_cpu_offset array, so the first instruction is
> trying to load the percpu offset.
> 
> The faulting code is:
> 
>         x = delta + __this_cpu_read(*p);
> 
> specifically "__this_cpu_read(*p)".
> 
> "ip" holds "pcp" from:
> 
>         struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
> 
> and you may notice that it's zero in the register dump.  So,
> pgdat->per_cpu_nodestats is NULL here.
> 
> This seems to be setup in setup_per_cpu_pageset(), which in the init
> order, happens way after mm_init() (which contains kmem_cache_init()).

Thanks for the analysis, Russell.

I think it's NULL because the slab allocation happens before even the
root_mem_cgroup is set up, and so root_mem_cgroup -> lruvec -> pgdat
gives us garbage.

Tony, Josef, since the patches are dropped from -next, could you test
the -mm tree at git://git.cmpxchg.org/linux-mmots.git and verify that
this patch below fixes the issue?

---

>From 47007dfcd7873cb93d11466a93b1f41f6a7a434f Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@...xchg.org>
Date: Sun, 4 Jun 2017 07:02:44 -0400
Subject: [PATCH] mm: memcontrol: per-lruvec stats infrastructure fix 2

Even with the previous fix routing !page->mem_cgroup stats to the root
cgroup, we still see crashes in certain configurations as the root is
not initialized for the earliest possible accounting sites in certain
configurations.

Don't track uncharged pages at all, not even in the root. This takes
care of early accounting as well as special pages that aren't tracked.

Because we still need to account at the pgdat level, we can no longer
implement the lruvec_page_state functions on top of the lruvec_state
ones. But that's okay, it was a little silly to look up the nodeinfo
and descend to the lruvec, only to container_of() back to the nodeinfo
where the lruvec_stat structure is sitting.

Signed-off-by: Johannes Weiner <hannes@...xchg.org>
---
 include/linux/memcontrol.h | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bea6f08e9e16..da9360885260 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -585,27 +585,27 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
 static inline void __mod_lruvec_page_state(struct page *page,
 					   enum node_stat_item idx, int val)
 {
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	/* Special pages in the VM aren't charged, use root */
-	memcg = page->mem_cgroup ? : root_mem_cgroup;
+	struct mem_cgroup_per_node *pn;
 
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-	__mod_lruvec_state(lruvec, idx, val);
+	__mod_node_page_state(page_pgdat(page), idx, val);
+	if (mem_cgroup_disabled() || !page->mem_cgroup)
+		return;
+	__mod_memcg_state(page->mem_cgroup, idx, val);
+	pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
+	__this_cpu_add(pn->lruvec_stat->count[idx], val);
 }
 
 static inline void mod_lruvec_page_state(struct page *page,
 					 enum node_stat_item idx, int val)
 {
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	/* Special pages in the VM aren't charged, use root */
-	memcg = page->mem_cgroup ? : root_mem_cgroup;
+	struct mem_cgroup_per_node *pn;
 
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
-	mod_lruvec_state(lruvec, idx, val);
+	mod_node_page_state(page_pgdat(page), idx, val);
+	if (mem_cgroup_disabled() || !page->mem_cgroup)
+		return;
+	mod_memcg_state(page->mem_cgroup, idx, val);
+	pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
+	this_cpu_add(pn->lruvec_stat->count[idx], val);
 }
 
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-- 
2.13.0