Message-ID: <CANn89i+mTRsYDikt9VhfbdHo1f5F5tXbZj54679nY_Fy1QS1Vg@mail.gmail.com>
Date:   Thu, 1 Feb 2018 15:27:14 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Roman Gushchin <guro@...com>
Cc:     "David S. Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...com>,
        Johannes Weiner <hannes@...xchg.org>, Tejun Heo <tj@...nel.org>
Subject: Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc()

Well, this memcg stuff is so confusing.

My recollection is that we had:


https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=75cb070960ade40fba5de32138390f3c85c90941

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=c0576e3975084d4699b7bfef578613fb8e1144f6

And commit a590b90d472f2c176c140576ee3ab44df7f67839 as well

Honestly, the bug was closed months ago for us, based on stack traces seen in the wild.

No C repro or whatever, but reproducing it would be a matter of
having a TCP listener constantly doing
socket()/setsockopt(SO_REUSEADDR)/bind()/listen()/close() in a loop
while connections are attempted to the listening port.
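
Something along these lines would do it (a rough, untested sketch; the
loopback address, port number and backlog are arbitrary choices):

/* Reproducer sketch: tear a TCP listener down in a tight loop while a
 * forked child hammers the same port with connect() attempts. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define TEST_PORT 8888	/* arbitrary port, nothing special about it */

static void fill_addr(struct sockaddr_in *addr)
{
	memset(addr, 0, sizeof(*addr));
	addr->sin_family = AF_INET;
	addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr->sin_port = htons(TEST_PORT);
}

static void listener_loop(void)
{
	for (;;) {
		int one = 1;
		struct sockaddr_in addr;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;
		setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
		fill_addr(&addr);
		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
			listen(fd, 128);
		/* Close the listener while connects are still in flight. */
		close(fd);
	}
}

static void connect_loop(void)
{
	for (;;) {
		struct sockaddr_in addr;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;
		fill_addr(&addr);
		connect(fd, (struct sockaddr *)&addr, sizeof(addr));
		close(fd);
	}
}

int main(void)
{
	if (fork() == 0)
		connect_loop();
	listener_loop();
	return 0;
}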




On Thu, Feb 1, 2018 at 2:55 PM, Roman Gushchin <guro@...com> wrote:
> On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote:
>> On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <guro@...com> wrote:
>> > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
>> >> From: Roman Gushchin <guro@...com>
>> >> Date: Wed, 31 Jan 2018 21:54:08 +0000
>> >>
>> >> > So I'm really starting to think that reverting 9f1c2674b328
>> >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
>> >> > and fixing the original issue differently might be easier
>> >> > and a proper way to go. Does it make sense?
>> >>
>> >> You'll need to work that out with Eric Dumazet, who added the
>> >> change in question that you think we should revert.
>> >
>> > Eric,
>> >
>> > can you please provide some details about the use-after-free problem
>> > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
>> > to mem_cgroup_sk_alloc()")? Do you know how to reproduce it?
>> >
>> > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
>> > and makes it much more fragile in general. So I wonder whether there
>> > are other solutions to the use-after-free problem.
>> >
>> > Thank you!
>> >
>> > Roman
>>
>> Unfortunately the bug is not public (Google-Bug-Id 67556600 for Googlers
>> following this thread).
>>
>> Our kernel has a debug feature in percpu_ref_get_many() which detects
>> the typical use-after-free pattern of doing
>> atomic_long_add(nr, &ref->count); while ref->count is already 0, or
>> while the memory has already been freed.
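>>
>> Conceptually the check amounts to something like the sketch below (an
>> illustration of the idea only, not the actual debug patch; percpu_ref
>> internals vary between kernel versions):
>>
>> static inline void percpu_ref_get_many(struct percpu_ref *ref,
>>                                        unsigned long nr)
>> {
>>         unsigned long __percpu *percpu_count;
>>
>>         rcu_read_lock_sched();
>>         if (__ref_is_percpu(ref, &percpu_count)) {
>>                 this_cpu_add(*percpu_count, nr);
>>         } else {
>>                 /*
>>                  * Taking a reference after the atomic count has already
>>                  * dropped to zero means the object may have been freed:
>>                  * the classic use-after-free pattern described above.
>>                  */
>>                 WARN_ON_ONCE(!atomic_long_read(&ref->count));
>>                 atomic_long_add(nr, &ref->count);
>>         }
>>         rcu_read_unlock_sched();
>> }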
>>
>> The bug was serious because css_put() would then release the css a second time.
>>
>> The stack trace looked like:
>>
>> Oct  8 00:23:14 lphh23 kernel: [27239.568098]  <IRQ>  [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c
>> Oct  8 00:23:14 lphh23 kernel: [27239.568108]  [<ffffffff906df6e3>] ? cgroup_get+0x43/0x50
>> Oct  8 00:23:14 lphh23 kernel: [27239.568114]  [<ffffffff906f2f35>] warn_slowpath_common+0xac/0xc8
>> Oct  8 00:23:14 lphh23 kernel: [27239.568117]  [<ffffffff906f2f6b>] warn_slowpath_null+0x1a/0x1c
>> Oct  8 00:23:14 lphh23 kernel: [27239.568120]  [<ffffffff906df6e3>] cgroup_get+0x43/0x50
>> Oct  8 00:23:14 lphh23 kernel: [27239.568123]  [<ffffffff906e07a4>] cgroup_sk_alloc+0x64/0x90
>
> Hm, that looks strange... It's cgroup_sk_alloc() in the trace,
> not mem_cgroup_sk_alloc(), which was removed by 9f1c2674b328.
>
> I thought it was the css_get() in mem_cgroup_sk_alloc() that
> you removed, but the stack trace you've posted is different.
>
> void mem_cgroup_sk_alloc(struct sock *sk) {
>         /*
>          * Socket cloning can throw us here with sk_memcg already
>          * filled. It won't however, necessarily happen from
>          * process context. So the test for root memcg given
>          * the current task's memcg won't help us in this case.
>          *
>          * Respecting the original socket's memcg is a better
>          * decision in this case.
>          */
>         if (sk->sk_memcg) {
>                 BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
>>>>             css_get(&sk->sk_memcg->css);
>                 return;
>         }
>
> Is it possible to reproduce the issue on an upstream kernel?
> Any ideas of what can trigger it?
>
> Btw, with the debug patch below applied and cgroup v2 enabled, the issue
> I'm talking about can be reproduced within seconds after reboot by doing
> almost any network activity. Just sshing to the machine is enough.
> The corresponding warning will be printed to dmesg.
>
> What do you think would be a proper way to fix the socket memory
> accounting in this case?
>
> Thank you!
>
> Roman
>
> --
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c51c589..55fb890 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -276,6 +276,8 @@ struct mem_cgroup {
>         struct list_head event_list;
>         spinlock_t event_list_lock;
>
> +       atomic_t tcpcnt;
> +
>         struct mem_cgroup_per_node *nodeinfo[0];
>         /* WARNING: nodeinfo must be the last member here */
>  };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19eea69..c69ff04 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
>         seq_printf(m, "workingset_nodereclaim %lu\n",
>                    stat[WORKINGSET_NODERECLAIM]);
>
> +       seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt));
> +
>         return 0;
>  }
>
> @@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>         gfp_t gfp_mask = GFP_KERNEL;
>
> +       atomic_add(nr_pages, &memcg->tcpcnt);
> +
>         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
>                 struct page_counter *fail;
>
> @@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> +       int v = atomic_sub_return(nr_pages, &memcg->tcpcnt);
> +       if (v < 0) {
> +               pr_info("@@@ %p %d \n", memcg, v);
> +       }
> +
>         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
>                 page_counter_uncharge(&memcg->tcpmem, nr_pages);
>                 return;
