linux-kernel - Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87ipfp5au2.fsf@xmission.com>
Date:	Mon, 21 May 2012 18:16:37 -0600
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Pavel Emelyanov <xemul@...allels.com>,
	Cyrill Gorcunov <gorcunov@...nvz.org>,
	Louis Rilling <louis.rilling@...labs.com>,
	Mike Galbraith <efault@....de>
Subject: Re: [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped.

Oleg Nesterov <oleg@...hat.com> writes:

> On 05/18, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@...hat.com> writes:
>>
>> >> I think there is something very compelling about your solution,
>> >> we do need my bit about making the init process ignore SIGCHLD
>> >> so all of init's children self reap.
>> >
>> > Not sure I understand. This can work with or without 3/3 which
>> > changes zap_pid_ns_processes() to ignore SIGCHLD. And just in
>> > case, I think 3/3 is fine.
>>
>> The only issue I see is that without 3/3 we might have processes that
>> on one wait(2)s for and so will never have release_task called on.
>>
>> We do have the wait loop
>
> Yes, and we need this loop anyway, even if SIGCHLD is ignored.
> It is possible that we already have a EXIT_ZOMBIE child(s) when
> zap_pid_ns_processes().
>
>> but I think there is a race possible there.
>
> Hmm. I do not see any race, but perhaps I missed something.
> I think we can trust -ECHILD, or do_wait() is buggy.

Think about it some more you are right.  For some reason
I had forgotten that without WNOHANG we don't block forever
until a child exits.

> Hmm. But there is another (off-topic) problem, security_task_wait()
> can return an error if there are some security policy problems...
> OK, this shouldn't happen I hope.

Agreed.  We might be able to address that problem but that is indeed
another issue.

>> > And once again, this wait_event() + __wake_up_parent() is very
>> > simple and straightforward, we can cleanup this code later if
>> > needed.
>>
>> Yes, and it doesn't when you do an UNINTERRUPTIBLE sleep with
>> an INTERRUPTIBLE wake up unless I misread the code.
>
> Yes. so we need wait_event_interruptible() or __unhash_process()
> should use __wake_up_sync_key(wait_chldexit).
>
>> > Yes. This is the known oddity. We always notify the tracer if the
>> > leader exits, even if !thread_group_empty(). But after that the
>> > tracer can't detach, and it can't do do_wait(WEXITED).
>> >
>> > The problem is not that we can't "fix" this. Just any discussed
>> > fix adds the subtle/incompatible user-visible change.
>>
>> Yes and that is nasty.
>
> Agreed. ptrace API is nasty ;)
>
>> and moving detach_pid so we don't have to be super careful about
>> where we call task_active_pid_ns.
>
> Yes, I was thinking about this change too,
>
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>>  	} while (rc != -ECHILD);
>>
>> +	read_lock(&tasklist_lock);
>> +	for (;;) {
>> +		__set_current_state(TASK_INTERRUPTIBLE);
>> +		if (list_empty(&current->children))
>> +			break;
>> +		read_unlock(&tasklist_lock);
>> +		schedule();
>
> OK, but then it makes sense to add clear_thread_flag(TIF_SIGPENDING)
> before schedule, to avoid the busy-wait loop (like the sys_wait4 loop
> does). Or simply use TASK_UNINTERRUPTIBLE, I do not think it is that
> important to "fool" /proc/loadavg. But I am fine either way.

It can get darn strange when you hold a thread in stopped with ptrace
and your load mysteriously jumps.  But we already have this problem
with de_thread and people aren't yelling so shrug.

So at a practical level Idon't think it is fooling /proc/loadavg but at
this point if we want more accuraccy from /proc/loadavg we need to fix
the computation and distinguish short term disk sleeps from other
uninterruptible sleeps and thus fix how /proc/loadavg is computed,
rather than hacking around with code like this.

> Maybe you can also add "ifdef CONFIG_PID_NS" into __unhash_process(),
> but this is minor too.

An #ifdef just leads to weird build failures that in weird rare
configurations.  If we can hide it all away in a header fine, but
putting a bare #ifdef in the core of the code simply as a performance
optimization is ugly and a a major testing challenge.  Keeping track of
all of the flying pieces with this patch has been tricky enough as it
is.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/