Message-ID: <20200708153327.GA193647@roeck-us.net>
Date:   Wed, 8 Jul 2020 08:33:27 -0700
From:   Guenter Roeck <linux@...ck-us.net>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     netdev@...r.kernel.org, Cameron Berkenpas <cam@...-zeon.de>,
        Peter Geis <pgwipeout@...il.com>,
        Lu Fengqi <lufq.fnst@...fujitsu.com>,
        Daniël Sonck <dsonck92@...il.com>,
        Zhang Qiang <qiang.zhang@...driver.com>,
        Thomas Lamprecht <t.lamprecht@...xmox.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Zefan Li <lizefan@...wei.com>, Tejun Heo <tj@...nel.org>,
        Roman Gushchin <guro@...com>
Subject: Re: [Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

Hi,

On Thu, Jul 02, 2020 at 11:52:56AM -0700, Cong Wang wrote:
> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> copied, so the cgroup refcnt must be taken too. And, unlike the
> sk_alloc() path, sock_update_netprioidx() is not called here.
> Therefore, it is safe and necessary to grab the cgroup refcnt
> even when cgroup_sk_alloc is disabled.
> 
> sk_clone_lock() is in BH context anyway, the in_interrupt()
> would terminate this function if called there. And for sk_alloc()
> skcd->val is always zero. So it's safe to factor out the code
> to make it more readable.
> 
> The global variable 'cgroup_sk_alloc_disabled' is used to determine
> whether to take these reference counts. It is impossible to make
> the reference counting correct unless we save this bit of information
> in skcd->val. So, add a new bit there to record whether the socket
> has already taken the reference counts. This obviously relies on
> kmalloc() to align cgroup pointers to at least 4 bytes,
> ARCH_KMALLOC_MINALIGN is certainly larger than that.
> 
> This bug seems to be introduced since the beginning, commit
> d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
> tried to fix it but not completely. It seems not easy to trigger until
> the recent commit 090e28b229af
> ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
> 
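
(For readers following the quoted description: the idea of recording a
"no refcnt" bit in skcd->val depends on the low bits of an aligned pointer
being zero. The sketch below only illustrates that general technique in
user-space C. The names struct group, TAG_IS_DATA, TAG_NO_REFCNT, tag_pack(),
tag_ptr(), tag_no_refcnt() and tag_put() are hypothetical and are not the
kernel's definitions; the real layout of sock_cgroup_data may differ.)

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits kept in the two low bits of an aligned pointer. */
#define TAG_IS_DATA    ((uintptr_t)1 << 0)
#define TAG_NO_REFCNT  ((uintptr_t)1 << 1)
#define TAG_MASK       (TAG_IS_DATA | TAG_NO_REFCNT)

/* Stand-in for a reference-counted object such as a cgroup. */
struct group {
	long refcnt;
};

/* Pack a pointer and flags into one word; needs >= 4-byte alignment. */
static uintptr_t tag_pack(struct group *g, uintptr_t flags)
{
	assert(((uintptr_t)g & TAG_MASK) == 0);
	return (uintptr_t)g | (flags & TAG_MASK);
}

/* Recover the original pointer by masking off the flag bits. */
static struct group *tag_ptr(uintptr_t val)
{
	return (struct group *)(val & ~TAG_MASK);
}

/* Nonzero if the word records that no reference was taken. */
static int tag_no_refcnt(uintptr_t val)
{
	return (val & TAG_NO_REFCNT) != 0;
}

/* Release path: drop a reference only if one was taken earlier. */
static void tag_put(uintptr_t val)
{
	if (tag_no_refcnt(val))
		return;
	tag_ptr(val)->refcnt--;
}
```

(The point of the flag is that the release path can decide from the stored
word alone whether a matching reference exists, independent of any global
setting that may have changed in the meantime.)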

This patch causes all my s390 boot tests to crash. Reverting it fixes
the problem. Please see the bisect results and crash log below.

Guenter

---
bisect results (from the pending-fixes branch in the -next repository):

# bad: [1432f824c2db44ef35b26caa9f81dd05211a75fc] Merge remote-tracking branch 'drm-misc-fixes/for-linux-next-fixes'
# good: [dcb7fd82c75ee2d6e6f9d8cc71c52519ed52e258] Linux 5.8-rc4
git bisect start 'HEAD' 'v5.8-rc4'
# bad: [fe12f8184e7265e2d24e5ed5b255275dfe4c1c04] Merge remote-tracking branch 'net/master'
git bisect bad fe12f8184e7265e2d24e5ed5b255275dfe4c1c04
# good: [474112d57c70520ebd81a5ca578fee1d93fafd07] Documentation: networking: ipvs-sysctl: drop doubled word
git bisect good 474112d57c70520ebd81a5ca578fee1d93fafd07
# good: [6d12075ddeedc38d25c5b74e929e686158da728c] Merge tag 'mtd/fixes-for-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
git bisect good 6d12075ddeedc38d25c5b74e929e686158da728c
# good: [74478ea4ded519db35cb1f059948b1e713bb4abf] net: ipa: fix QMI structure definition bugs
git bisect good 74478ea4ded519db35cb1f059948b1e713bb4abf
# bad: [9c29e36152748fd623fcff6cc8f538550f9eeafc] mptcp: fix DSS map generation on fin retransmission
git bisect bad 9c29e36152748fd623fcff6cc8f538550f9eeafc
# good: [aea23c323d89836bcdcee67e49def997ffca043b] ipv6: Fix use of anycast address with loopback
git bisect good aea23c323d89836bcdcee67e49def997ffca043b
# bad: [28b18e4eb515af7c6661c3995c6e3c34412c2874] net: sky2: initialize return of gm_phy_read
git bisect bad 28b18e4eb515af7c6661c3995c6e3c34412c2874
# bad: [ad0f75e5f57ccbceec13274e1e242f2b5a6397ed] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
git bisect bad ad0f75e5f57ccbceec13274e1e242f2b5a6397ed
# first bad commit: [ad0f75e5f57ccbceec13274e1e242f2b5a6397ed] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

---
Crash log:

[   22.390674] Run /sbin/init as init process
[   22.497551] Unable to handle kernel pointer dereference in virtual kernel address space
[   22.497738] Failing address: 5010f0b45fa93000 TEID: 5010f0b45fa93803
[   22.497813] Fault in home space mode while using kernel ASCE.
[   22.497958] AS:0000000001774007 R3:0000000000000024
[   22.498300] Oops: 0038 ilc:3 [#1] SMP
[   22.498405] Modules linked in:
[   22.499027] CPU: 0 PID: 153 Comm: init Not tainted 5.8.0-rc4-00328-g1432f824c2db4 #1
[   22.499112] Hardware name: QEMU 2964 QEMU (KVM/Linux)
[   22.499261] Krnl PSW : 0704e00180000000 0000000000259be0 (cgroup_sk_free+0xa8/0x1e8)
[   22.499405]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[   22.499506] Krnl GPRS: 0000000048a38585 5010f0b45fa93094 0000000000000002 000000001c228bd8
[   22.499585]            000000001c228bb0 0000000000000000 0000000000000000 00000000011c2eda
[   22.499665]            fffffffffffff000 00000000011c1f72 00000000011deef0 0000000000014040
[   22.499744]            000000001c228100 0000000000e76bf0 0000000000259c82 000003e0002c3c00
[   22.500270] Krnl Code: 0000000000259bd2: a72a0001		ahi	%r2,1
[   22.500270]            0000000000259bd6: 502003a8		st	%r2,936
[   22.500270]           #0000000000259bda: e31003b80008	ag	%r1,952
[   22.500270]           >0000000000259be0: e32010000004	lg	%r2,0(%r1)
[   22.500270]            0000000000259be6: a7f40004		brc	15,0000000000259bee
[   22.500270]            0000000000259bea: b9040023		lgr	%r2,%r3
[   22.500270]            0000000000259bee: b9040032		lgr	%r3,%r2
[   22.500270]            0000000000259bf2: b9040042		lgr	%r4,%r2
[   22.500635] Call Trace:
[   22.500748]  [<0000000000259be0>] cgroup_sk_free+0xa8/0x1e8
[   22.500835] ([<0000000000259bb4>] cgroup_sk_free+0x7c/0x1e8)
[   22.500914]  [<0000000000b24e16>] __sk_destruct+0x196/0x260
[   22.500999]  [<0000000000cadc18>] unix_release_sock+0x358/0x460
[   22.501073]  [<0000000000cadd5a>] unix_release+0x3a/0x60
[   22.501149]  [<0000000000b1a63a>] __sock_release+0x62/0xf8
[   22.501223]  [<0000000000b1a6f8>] sock_close+0x28/0x38
[   22.501299]  [<000000000045101e>] __fput+0x126/0x2a8
[   22.501374]  [<000000000017e088>] task_work_run+0x78/0xc8
[   22.501449]  [<000000000010a596>] do_notify_resume+0x9e/0xa8
[   22.501526]  [<0000000000de555a>] system_call+0xe6/0x2d4
[   22.501657] INFO: lockdep is turned off.
[   22.501736] Last Breaking-Event-Address:
[   22.501814]  [<0000000000259c86>] cgroup_sk_free+0x14e/0x1e8
[   22.502169] Kernel panic - not syncing: Fatal exception: panic_on_oops