netdev - Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180201225542.GA13072@castle>
Date:   Thu, 1 Feb 2018 22:55:50 +0000
From:   Roman Gushchin <guro@...com>
To:     Eric Dumazet <edumazet@...gle.com>
CC:     "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...com>,
        Johannes Weiner <hannes@...xchg.org>, Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH net] net: memcontrol: charge allocated memory after
 mem_cgroup_sk_alloc()

On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote:
> On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <guro@...com> wrote:
> > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
> >> From: Roman Gushchin <guro@...com>
> >> Date: Wed, 31 Jan 2018 21:54:08 +0000
> >>
> >> > So I really start thinking that reverting 9f1c2674b328
> >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
> >> > and fixing the original issue differently might be easier
> >> > and a proper way to go. Does it makes sense?
> >>
> >> You'll need to work that out with Eric Dumazet who added the
> >> change in question which you think we should revert.
> >
> > Eric,
> >
> > can you, please, provide some details about the use-after-free problem
> > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
> > to mem_cgroup_sk_alloc()" ? Do you know how to reproduce it?
> >
> > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
> > and makes it much more fragile in general. So, I wonder, if there are
> > solutions for the use-after-free problem.
> >
> > Thank you!
> >
> > Roman
> 
> Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers
> following this thread )
> 
> Our kernel has a debug feature on percpu_ref_get_many() which detects
> the typical use-after-free problem of
> doing atomic_long_add(nr, &ref->count); while ref->count is 0, or
> memory already freed.
> 
> Bug was serious because css_put() will release the css a second time.
> 
> Stack trace looked like :
> 
> Oct  8 00:23:14 lphh23 kernel: [27239.568098]  <IRQ>
> [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c
> Oct  8 00:23:14 lphh23 kernel: [27239.568108]  [<ffffffff906df6e3>] ?
> cgroup_get+0x43/0x50
> Oct  8 00:23:14 lphh23 kernel: [27239.568114]  [<ffffffff906f2f35>]
> warn_slowpath_common+0xac/0xc8
> Oct  8 00:23:14 lphh23 kernel: [27239.568117]  [<ffffffff906f2f6b>]
> warn_slowpath_null+0x1a/0x1c
> Oct  8 00:23:14 lphh23 kernel: [27239.568120]  [<ffffffff906df6e3>]
> cgroup_get+0x43/0x50
> Oct  8 00:23:14 lphh23 kernel: [27239.568123]  [<ffffffff906e07a4>]
> cgroup_sk_alloc+0x64/0x90

Hm, that looks strange... It's cgroup_sk_alloc(),
not mem_cgroup_sk_alloc(), which was removed by 9f1c2674b328.

I thought, that it's css_get() in mem_cgroup_sk_alloc(), which
you removed, but the stacktrace you've posted is different.

void mem_cgroup_sk_alloc(struct sock *sk) {
	/*
	 * Socket cloning can throw us here with sk_memcg already
	 * filled. It won't however, necessarily happen from
	 * process context. So the test for root memcg given
	 * the current task's memcg won't help us in this case.
	 *
	 * Respecting the original socket's memcg is a better
	 * decision in this case.
	 */
	if (sk->sk_memcg) {
		BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
>>>		css_get(&sk->sk_memcg->css);
		return;
	}

Is it possible to reproduce the issue on an upstream kernel?
Any ideas of what can trigger it?

Btw, with the following patch applied (below) and cgroup v2 enabled,
the issue, which I'm talking about, can be reproduced in seconds after reboot
by doing almost any network activity. Just sshing to a machine is enough.
The corresponding warning will be printed to dmesg.

What is a proper way to fix the socket memory accounting in this case,
what do you think?

Thank you!

Roman

--

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c51c589..55fb890 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -276,6 +276,8 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 
+	atomic_t tcpcnt;
+
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19eea69..c69ff04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	seq_printf(m, "workingset_nodereclaim %lu\n",
 		   stat[WORKINGSET_NODERECLAIM]);
 
+	seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt));
+
 	return 0;
 }
 
@@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	gfp_t gfp_mask = GFP_KERNEL;
 
+	atomic_add(nr_pages, &memcg->tcpcnt);
+
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		struct page_counter *fail;
 
@@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
+	int v = atomic_sub_return(nr_pages, &memcg->tcpcnt);
+	if (v < 0) {
+		pr_info("@@@ %p %d \n", memcg, v);
+	}
+
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		page_counter_uncharge(&memcg->tcpmem, nr_pages);
 		return;