linux-kernel - Re: [PATCH] capabilities: add capability cgroup controller

Message-ID: <20160708091332.GD3556@pathway.suse.cz>
Date:	Fri, 8 Jul 2016 11:13:32 +0200
From:	Petr Mladek <pmladek@...e.com>
To:	Topi Miettinen <toiwoton@...il.com>
Cc:	"Serge E. Hallyn" <serge@...lyn.com>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Tejun Heo <tj@...nel.org>, lkml <linux-kernel@...r.kernel.org>,
	luto@...nel.org, Kees Cook <keescook@...omium.org>,
	Jonathan Corbet <corbet@....net>,
	Li Zefan <lizefan@...wei.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Serge Hallyn <serge.hallyn@...onical.com>,
	James Morris <james.l.morris@...cle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	David Howells <dhowells@...hat.com>,
	David Woodhouse <David.Woodhouse@...el.com>,
	Ard Biesheuvel <ard.biesheuvel@...aro.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
	"open list:CAPABILITIES" <linux-security-module@...r.kernel.org>
Subject: Re: [PATCH] capabilities: add capability cgroup controller

On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
> On 07/07/16 09:16, Petr Mladek wrote:
> > On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
> >> The attached patch would make any uses of capabilities generate audit
> >> messages. It works for simple tests as you can see from the commit
> >> message, but unfortunately the call to audit_cgroup_list() deadlocks the
> >> system when booting a full blown OS. There's no deadlock when the call
> >> is removed.
> >>
> >> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
> >> already held earlier before entering audit_cgroup_list(). Holding the
> >> locks is however required by task_cgroup_from_root(). Is there any way
> >> to avoid this? For example, only print some kind of cgroup ID numbers
> >> (are there unique and stable IDs, available without locks?) for those
> >> cgroups where the task is registered in the audit message?
> > 
> > I am not sure if anyone knows what really happens here. I suggest
> > enabling lockdep. It might detect a possible deadlock even before it
> > really happens, see Documentation/locking/lockdep-design.txt
> > 
> > It can be enabled by
> > 
> >    CONFIG_PROVE_LOCKING=y
> > 
> > It depends on
> > 
> >     CONFIG_DEBUG_KERNEL=y
> > 
> > and maybe some more options, see lib/Kconfig.debug
> 
> Thanks a lot! I caught this stack dump:
> 
> starting version 230
> [    3.416647] ------------[ cut here ]------------
> [    3.417310] WARNING: CPU: 0 PID: 95 at /home/topi/d/linux.git/kernel/locking/lockdep.c:2871 lockdep_trace_alloc+0xb4/0xc0
> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [    3.417923] Modules linked in:
> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00 ffffffff813c9c45
> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40 ffffffff81091e9b
> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410 00000000ffffffff
> [    3.419374] Call Trace:
> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
> [    3.420170] ---[ end trace fb586899fb556a5e ]---
> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy available
> [    4.014078] clocksource: Switched to clocksource tsc
> Begin: Loading essential drivers ... done.
> 
> This is with qemu and the boot continues normally. With a real computer,
> there's no such output and the system just seems to freeze.
> 
> Could it be possible that the deadlock happens because there's some IO
> towards /sys/fs/cgroup, which causes a capability check and that in turn
> causes locking problems when we try to print cgroup list?

The above warning is printed by the following code at
kernel/locking/lockdep.c:2871:

static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
{
[...]
	/* We're only interested __GFP_FS allocations for now */
	if (!(gfp_mask & __GFP_FS))
		return;

	/*
	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
	 */
	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
		return;


The backtrace shows that your new audit_log_cap_use() is reached from
vfs_write(), via oom_score_adj_write() and capable(). You might try
calling audit_log_start() with GFP_NOFS instead of GFP_KERNEL.

Note that this advice is based on intuition. I still need to learn a lot
about memory management, and the kernel in general, to be more certain
about the correct solution.
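
To illustrate why the allocation flags matter for this particular warning,
here is a minimal userspace sketch of the two checks quoted above. It is a
model under stated assumptions, not kernel code: every name prefixed with
MODEL_ or model_ is hypothetical, the flag values are made up for
readability, and whatever the excerpt elides with "[...]" is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. The flag values below are hypothetical and chosen
 * for readability; they are not the kernel's real bit assignments.
 */
typedef unsigned int model_gfp_t;

#define MODEL_GFP_RECLAIM 0x1u
#define MODEL_GFP_IO      0x2u
#define MODEL_GFP_FS      0x4u

/* Same composition as the kernel masks: GFP_NOFS is GFP_KERNEL minus __GFP_FS. */
#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_RECLAIM | MODEL_GFP_IO)

/*
 * Models only the two checks quoted above.
 * Returns true when the modelled check would print the warning.
 */
static bool model_trace_alloc_warns(model_gfp_t gfp_mask, bool irqs_disabled)
{
	/* Allocations without the FS bit are not checked at all. */
	if (!(gfp_mask & MODEL_GFP_FS))
		return false;

	/* An FS-capable allocation with IRQs disabled triggers the warning. */
	return irqs_disabled;
}

/* The cases discussed in the thread, written as assertions. */
static void model_demo(void)
{
	assert(model_trace_alloc_warns(MODEL_GFP_KERNEL, true));   /* reported case */
	assert(!model_trace_alloc_warns(MODEL_GFP_NOFS, true));    /* suggested change */
	assert(!model_trace_alloc_warns(MODEL_GFP_KERNEL, false)); /* IRQs enabled */
}
```

The model only shows that a mask without the FS bit returns before the
IRQ test, so this specific warning would not fire. Whether GFP_NOFS is
actually appropriate in this call path is a separate question that the
sketch does not answer.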

Best Regards,
Petr