lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20091009215446.GB27627@count0.beaverton.ibm.com>
Date:	Fri, 9 Oct 2009 14:54:46 -0700
From:	Matt Helsley <matthltc@...ibm.com>
To:	Daniel Lezcano <dlezcano@...ibm.com>
Cc:	Sukadev Bhattiprolu <sukadev@...ux.vnet.ibm.com>,
	Linux Containers <containers@...ts.osdl.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Pavel Emelianov <xemul@...nvz.org>
Subject: Re: pidns memory leak

On Fri, Oct 09, 2009 at 03:18:23PM +0200, Daniel Lezcano wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [dlezcano@...ibm.com] wrote:
>>> Sukadev Bhattiprolu wrote:
>>>> Still digging through some traces, but below I have some questions 
>>>> that I am still trying to answer.
>>>>
>>>>> I am not sure what you mean by 'struct pids' but what I observed is:
>>>> Ok, I see that too. If pids leak, then pid-namespace will leak too.
>>>> Do you see any leaks in proc_inode_cache ?
>>> Yes, right. It leaks too.
>>
>> Ok, some progress...
>>
>> Can you please verify these observations:
>>
>> - If the container exits normally, the leak does not seem to happen.
>>   (i.e reduce your sleep 3600 to say sleep 3 and remove the lxc-stop).
>>
>> - Revert the following commit and check if the leak happens:
>>
>> 	commit 7766755a2f249e7e0dabc5255a0a3d151ff79821
>> 	Author: Andrea Arcangeli <andrea@...e.de>
>> 	Date:   Mon Feb 4 22:29:21 2008 -0800
>>
>> (this commit added the check for PF_EXITING in proc_flush_task_mnt  
>> loosely explained below).
>
>
>
>> Incomplete analysis :-)
>>
>> If the container-init is terminated (by the lxc-stop), the container zaps
>> other processes in the container and waits for them. The leak happens in
>> this case.
>>
>> Following sequence of events occur:
>>
>> 	- container-init calls do_exit and sets PF_EXITING (in exit_signals())
>>
>> 	- container init calls zaps_pid_ns_processes() (exit_notify /
>> 	  forget_orignal_parent() / find_new_reaper())
>>
>> 	- In zap_pid_ns_processes() container-init sends SIGKILL to
>> 	  descendants and calls sys_wait().
>>
>> 	- The sys_wait() is expected to call release_task() which calls
>> 	  proc_flush_task_mnt().
>>
>> 	- proc_flush_task_mnt() looks up the dentry for the pid (2 in
>> 	  our example) and finds the dentry.
>>
>> 	  But since container-init is itself exiting (i.e PF_EXITING is
>> 	  set) it does NOT call the shrink_dcache_parent(), but,
>> 	  interestingly calls d_drop() and dput().
>>
>> 	  Now the d_drop() unhashes the dentry for the pid 2.
>>
>> 	- proc_flush_task_mnt() then tries to find the dentry for the
>> 	  tgid of the process. In our case, the tgid == pid == 2 and
>> 	  we just unhashed the dentry for "2".
>>
>> 	  So, we don't find the dentry for the leader either (and hence
>> 	  don't make the second shrink_dcache_parent() call in
>> 	  proc_flush_task_mnt() either).
>>
>> 	  Without a call to shrink_dcache_parent(), the proc inode
>> 	  for the process that was terminated by container init is
>> 	  not deleted (i.e we don't call proc_delete_inode() or
>> 	  the put_pid() inside it) causing us to leak proc_inodes,
>> 	  struct pid and hence struct pid_namespace.
>
> Ouch !
>
> Nice analysis :)
>
> Following your explanation I was able to reproduce a simple program  
> added in attachment. But there is something I do not understand is why  
> the leak does not appear if I do the 'lstat' (cf. test program) in the  
> pid 2 context.

I would suspect that lstat may cause the proc inode to be relinked such
that later cleanup paths can reach the struct pid and the pidns.

When it runs, lstate will lookup the dentry, fail, look up the process
tgid via tasklist, and relinks the proc inode back to a new dentry,
gets the stat info, and exits back to userspace. Since the proc inode is
properly relinked a different path can now reach the struct pid and pidns
again and properly cleans those up.

That's just my guess. And of course I hand waved the "different path" --
no idea what path that is...

Cheers,
	-Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ