linux-kernel - Re: RIP: mem_cgroup_move

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131029142616.GA1548@cmpxchg.org>
Date:	Tue, 29 Oct 2013 10:26:16 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	Greg Thelen <gthelen@...gle.com>
Cc:	含黛 <handai.szj@...baba-inc.com>,
	akpm@...ux-foundation.org, fbl@...hat.com,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On Sat, Oct 26, 2013 at 08:41:11AM -0700, Greg Thelen wrote:
> On Sat, Oct 26 2013, 含黛 wrote:
> 
> > On 10/26/2013 11:39 AM, Johannes Weiner wrote:
> >> On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
> >>> While playing with guests and net-next kernel, I've triggered
> >>> this with some frequency.  Even Fedora 19 kernel reproduces.
> >>>
> >>> It it a known issue?
> >>>
> >>> Thanks,
> >>> fbl
> >>>
> >>> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
> >>> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
> >>> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> >>> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
> >>> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
> >>> [ 7691.056281] device vnet0 left promiscuous mode
> >>> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
> >>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> >>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>> [ 7691.178574] PGD 0
> >>> [ 7691.181042] Oops: 0000 [#1] SMP
> >>> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth
> >>> openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast
> >>> ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp
> >>> llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> >>> nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security
> >>> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> >>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> >>> iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek
> >>> snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd
> >>> snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer
> >>> crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core
> >>> iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core
> >>> mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi
> >>> video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi
> >>> [last unloaded: openvswitch]
> >>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G          I  3.12.0-rc6-01188-gb45bd46 #1
> >>> [ 7691.295779] Hardware name:                  /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> >>> [ 7691.306066] Workqueue: events css_killed_work_fn
> >>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> >>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>]  [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>> [ 7691.329728] RSP: 0018:ffff88042957bcc8  EFLAGS: 00010002
> >>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> >>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> >>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> >>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> >>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> >>> [ 7691.375738] FS:  0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> >>> [ 7691.384755] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> >>> [ 7691.399235] Stack:
> >>> [ 7691.401672]  ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> >>> [ 7691.409968]  ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> >>> [ 7691.418264]  ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> >>> [ 7691.426578] Call Trace:
> >>> [ 7691.429513]  [<ffffffff81312b6d>] ? list_del+0xd/0x30
> >>> [ 7691.435250]  [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> >>> [ 7691.442874]  [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> >>> [ 7691.449942]  [<ffffffff810da877>] offline_css+0x27/0x50
> >>> [ 7691.455874]  [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> >>> [ 7691.462466]  [<ffffffff810821f5>] process_one_work+0x175/0x430
> >>> [ 7691.469041]  [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> >>> [ 7691.475345]  [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> >>> [ 7691.481919]  [<ffffffff81089860>] kthread+0xc0/0xd0
> >>> [ 7691.487478]  [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >>> [ 7691.494352]  [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> >>> [ 7691.500464]  [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00
> >>> 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c>
> >>> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
> >> This is
> >>
> >> All code
> >> ========
> >>     0:   85 f6                   test   %esi,%esi
> >>     2:   48 8b 55 d0             mov    -0x30(%rbp),%rdx
> >>     6:   44 8b 4d c8             mov    -0x38(%rbp),%r9d
> >>     a:   4c 8b 45 c0             mov    -0x40(%rbp),%r8
> >>     e:   0f 85 b3 00 00 00       jne    0xc7
> >>    14:   41 8b 4c 24 18          mov    0x18(%r12),%ecx
> >>    19:   85 c9                   test   %ecx,%ecx
> >>    1b:   0f 88 a6 00 00 00       js     0xc7
> >>    21:   48 8b b2 30 02 00 00    mov    0x230(%rdx),%rsi
> >>    28:   45 89 ca                mov    %r9d,%r10d
> >>    2b:*  4c 39 56 18             cmp    %r10,0x18(%rsi)          <-- trapping instruction
> >>    2f:   0f 8c 36 01 00 00       jl     0x16b
> >>    35:   44 89 c9                mov    %r9d,%ecx
> >>    38:   f7 d9                   neg    %ecx
> >>    3a:   89 cf                   mov    %ecx,%edi
> >>    3c:   65                      gs
> >>    3d:   48                      rex.W
> >>    3e:   01                      .byte 0x1
> >>    3f:   7e                      .byte 0x7e
> >>
> >> which corresponds to
> >>
> >> 	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> >>
> >> Humm.  from->stat is a percpu pointer...  This patch should fix it:
> >>
> >> ---
> >>  From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
> >> From: Johannes Weiner <hannes@...xchg.org>
> >> Date: Fri, 25 Oct 2013 23:23:31 -0400
> >> Subject: [patch] mm: memcg: fix percpu variable access crash
> >>
> >> 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
> >> added a WARN_ON_ONCE() to sanity check the page statistics counter
> >> when moving charges.  Unfortunately, it dereferences the percpu
> >> counter directly, which may result in a crash like this:
> >>
> >> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> >> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >> [ 7691.178574] PGD 0
> >> [ 7691.181042] Oops: 0000 [#1] SMP
> >> [...]
> >> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G          I  3.12.0-rc6-01188-gb45bd46 #1
> >> [ 7691.295779] Hardware name:                  /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> >> [ 7691.306066] Workqueue: events css_killed_work_fn
> >> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> >> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>]  [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >> [ 7691.329728] RSP: 0018:ffff88042957bcc8  EFLAGS: 00010002
> >> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> >> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> >> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> >> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> >> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> >> [ 7691.375738] FS:  0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> >> [ 7691.384755] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> >> [ 7691.399235] Stack:
> >> [ 7691.401672]  ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> >> [ 7691.409968]  ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> >> [ 7691.418264]  ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> >> [ 7691.426578] Call Trace:
> >> [ 7691.429513]  [<ffffffff81312b6d>] ? list_del+0xd/0x30
> >> [ 7691.435250]  [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> >> [ 7691.442874]  [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> >> [ 7691.449942]  [<ffffffff810da877>] offline_css+0x27/0x50
> >> [ 7691.455874]  [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> >> [ 7691.462466]  [<ffffffff810821f5>] process_one_work+0x175/0x430
> >> [ 7691.469041]  [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> >> [ 7691.475345]  [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> >> [ 7691.481919]  [<ffffffff81089860>] kthread+0xc0/0xd0
> >> [ 7691.487478]  [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >> [ 7691.494352]  [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> >> [ 7691.500464]  [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
> >> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
> >> f7 d9 89 cf 65 48 01 7e
> >> [ 7691.528638] RIP  [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>
> >> Add the required __this_cpu_read().
> >
> > Sorry for my mistake and thanks for the fix up, it looks good to me.
> >
> > Reviewed-by: Sha Zhengju <handai.szj@...bao.com>
> >
> >
> > Thanks,
> > Sha
> >>
> >> Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> >> ---
> >>   mm/memcontrol.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 4097a78..a4864b6 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> >>   {
> >>   	/* Update stat data for mem_cgroup */
> >>   	preempt_disable();
> >> -	WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> >> +	WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
> >>   	__this_cpu_add(from->stat->count[idx], -nr_pages);
> >>   	__this_cpu_add(to->stat->count[idx], nr_pages);
> >>   	preempt_enable();
> >
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@...ck.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>
> 
> I was just polishing up this following patch which I think is better
> because it should avoid spurious warnings.
> 
> ---8<---
> 
> >From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
> From: Greg Thelen <gthelen@...gle.com>
> Date: Fri, 25 Oct 2013 21:59:57 -0700
> Subject: [PATCH] memcg: remove incorrect underflow check
> 
> When a memcg is deleted mem_cgroup_reparent_charges() moves charged
> memory to the parent memcg.  As of v3.11-9444-g3ea67d0 "memcg: add per
> cgroup writeback pages accounting" there's bad pointer read.  The goal
> was to check for counter underflow.  The counter is a per cpu counter
> and there are two problems with the code:
> (1) per cpu access function isn't used, instead a naked pointer is
>     used which easily causes panic.
> (2) the check doesn't sum all cpus
> 
> Test:
>   $ cd /sys/fs/cgroup/memory
>   $ mkdir x
>   $ echo 3 > /proc/sys/vm/drop_caches
>   $ (echo $BASHPID >> x/tasks && exec cat) &
>   [1] 7154
>   $ grep ^mapped x/memory.stat
>   mapped_file 53248
>   $ echo 7154 > tasks
>   $ rmdir x
>   <PANIC>
> 
> The fix is to remove the check.  It's currently dangerous and isn't
> worth fixing it to use something expensive, such as
> percpu_counter_sum(), for each reparented page.  __this_cpu_read()
> isn't enough to fix this because there's no guarantees of the current
> cpus count.  The only guarantees is that the sum of all per-cpu
> counter is >= nr_pages.
> 
> Signed-off-by: Greg Thelen <gthelen@...gle.com>

Makes sense to me.  Thanks, Greg!

Acked-by: Johannes Weiner <hannes@...xchg.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/