linux-kernel - Re: fork: Resource temporarily unavailable / cant start new threads

Message-Id: <20080521163503.000afd61.randy.dunlap@oracle.com>
Date:	Wed, 21 May 2008 16:35:03 -0700
From:	Randy Dunlap <randy.dunlap@...cle.com>
To:	mark <markkicks@...il.com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: fork: Resource temporarily unavailable / cant start new threads

On Wed, 21 May 2008 15:51:55 -0700 mark wrote:

> On Wed, May 21, 2008 at 2:32 PM, Randy Dunlap <randy.dunlap@...cle.com> wrote:
> > On Wed, 21 May 2008 14:08:53 -0700 mark wrote:
> >
> >> On Wed, May 21, 2008 at 1:50 PM, Randy Dunlap <randy.dunlap@...cle.com> wrote:
> >> > mark wrote:
> >> >>
> >> >> On Wed, May 21, 2008 at 1:28 PM, Randy Dunlap <randy.dunlap@...cle.com>
> >> >> wrote:
> >> >>>
> >> >>> On Tue, 20 May 2008 11:26:47 -0700 mark wrote:
> >> >>>>
> >> >>>> I upgraded to Fedora 9 (kernel 2.6.25.3-18.fc9.x86_64), and now I get
> >> >>>> this error when I try to log in to the box, kill or start a Python app,
> >> >>>> or do anything on a regular basis.
> >> >>>>
> >> >>>> fork: Resource temporarily unavailable

[snip]

> >> >>> The only place where fork() returns EAGAIN is when the number of
> >> >>> processes is >= its limit.  Does this user already have >= 1024
> >> >>> processes?
> >> >>
> >> >> No, it is around 400
> >> >
> >> > Well, my comment was wrong anyway.  There are several other tests just
> >> > below the number-of-user-processes check that also return EAGAIN, such as:
> >> >
> >> > - total number of threads being too large
> >
> > The total number of threads that currently exist is in /proc/loadavg:
> >
> >> cat /proc/loadavg
> > 1.56 0.58 0.27 2/203 28500
> >
> > It's the number following the '/', e.g., 203 on my desktop system.
> >
> > The max_threads limit is a sysctl, so you can tune it if needed.
> > It's in /proc/sys/kernel/threads-max:
> >
> >> cat /proc/sys/kernel/threads-max
> > 32624
> > I sort of doubt that one is the problem, but you can tell us.
> 
> cat /proc/loadavg
> 0.39 0.45 0.57 1/1412 12032
> cat /proc/sys/kernel/threads-max
> 274432
> You are right; I guess this is not the problem.
> 
> 
> >> > - error on grabbing a module reference count (?)
> >> > - error on grabbing a binfmt module reference
> >>
> >> As a user, how do I identify what is wrong and fix this? For the total
> >> number of threads, is there any way I can find out if this is causing
> >> the problem? My system is running around 80 multi-threaded Python web
> >> apps.
> >
> > I can send you some debug patches that will print out the specific
> > problem area.  Do you want to do that?  Can you rebuild and install
> > a new kernel?
> Is it possible to get these debug messages by turning on some flags?
> If not, then yes, please send the debug patches. It's a live box, and I will try to do it!
> 
> This is my system / kernel info:
> uname -a
> Linux XXX 2.6.25.3-18.fc9.x86_64 #1 SMP Tue May 13 04:54:47 EDT 2008
> x86_64 x86_64 x86_64 GNU/Linux
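On the question of checking the thread count without a debug kernel: a minimal userspace sketch in C might read the two /proc files discussed above and compute how much room is left under threads-max. It assumes the five-field /proc/loadavg layout shown in the thread ("1.56 0.58 0.27 2/203 28500"); the helper names `loadavg_total_threads`, `read_first_line`, and `thread_headroom` are hypothetical.

```c
#include <stdio.h>

/* Extract the number of threads that currently exist (the number after
 * '/') from the contents of /proc/loadavg, e.g. "1.56 0.58 0.27 2/203 28500".
 * Returns 0 on success, -1 if the line does not have that shape. */
int loadavg_total_threads(const char *line, long *total)
{
    double avg1, avg5, avg15;
    long runnable, existing;

    if (sscanf(line, "%lf %lf %lf %ld/%ld",
               &avg1, &avg5, &avg15, &runnable, &existing) != 5)
        return -1;
    *total = existing;
    return 0;
}

/* Read the first line of a small text file such as a /proc entry. */
int read_first_line(const char *path, char *buf, int len)
{
    FILE *f = fopen(path, "r");
    char *ok;

    if (f == NULL)
        return -1;
    ok = fgets(buf, len, f);
    fclose(f);
    return ok != NULL ? 0 : -1;
}

/* Store threads-max minus the current thread count in *headroom.
 * Returns 0 on success, -1 if either /proc file cannot be read or parsed. */
int thread_headroom(long *headroom)
{
    char line[256];
    long existing, max_threads;

    if (read_first_line("/proc/loadavg", line, (int)sizeof line) != 0 ||
        loadavg_total_threads(line, &existing) != 0)
        return -1;
    if (read_first_line("/proc/sys/kernel/threads-max", line,
                        (int)sizeof line) != 0 ||
        sscanf(line, "%ld", &max_threads) != 1)
        return -1;
    *headroom = max_threads - existing;
    return 0;
}
```

With mark's figures (1412 existing threads, threads-max 274432) the headroom would be 273020, which agrees with the conclusion above that threads-max is not the limit being hit.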

I made a small patch to a vanilla kernel.org 2.6.25.3 kernel tree.
Hopefully it applies cleanly to that fc9 kernel source, but check/verify
that first before going any further.

After building and booting with this patch, there will be kernel
messages whenever fork's "copy_process" function fails with -EAGAIN (-11),
which is reported to userspace as errno = 11 (Resource temporarily
unavailable).  Hopefully this will identify which test is failing,
but there's a chance that something else is going on and that this
patch does not find the problem.
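Errno 11 is EAGAIN on Linux. As a sketch only, an application could report that itself, together with its RLIMIT_NPROC soft limit, when fork() fails. Per getrlimit(2), on Linux this limit is counted per real user ID and covers threads as well as processes. The function names `describe_fork_error` and `fork_or_explain` are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Format a diagnostic for a failed fork() into buf. For EAGAIN, also
 * append the caller's RLIMIT_NPROC soft limit. Returns snprintf's count. */
int describe_fork_error(int err, char *buf, size_t len)
{
    struct rlimit rl;
    int n = snprintf(buf, len, "fork: %s (errno %d)", strerror(err), err);

    /* Only EAGAIN points at the process/thread limits discussed here. */
    if (err != EAGAIN || n < 0 || (size_t)n >= len ||
        getrlimit(RLIMIT_NPROC, &rl) != 0)
        return n;

    if (rl.rlim_cur == RLIM_INFINITY)
        n += snprintf(buf + n, len - (size_t)n,
                      "; RLIMIT_NPROC soft limit: unlimited");
    else
        n += snprintf(buf + n, len - (size_t)n,
                      "; RLIMIT_NPROC soft limit: %llu",
                      (unsigned long long)rl.rlim_cur);
    return n;
}

/* fork() wrapper that prints the diagnostic on failure and leaves errno
 * as fork() set it, so callers can still inspect it. */
pid_t fork_or_explain(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        int saved = errno;
        char msg[256];

        describe_fork_error(saved, msg, sizeof msg);
        fprintf(stderr, "%s\n", msg);
        errno = saved;
    }
    return pid;
}
```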

Anyway, good luck and please report back on it.

---
 kernel/fork.c |   16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

--- linux-2.6.25.3.orig/kernel/fork.c
+++ linux-2.6.25.3/kernel/fork.c
@@ -1049,8 +1049,10 @@ static struct task_struct *copy_process(
 	if (atomic_read(&p->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
-		    p->user != current->nsproxy->user_ns->root_user)
+		    p->user != current->nsproxy->user_ns->root_user) {
+			printk(KERN_INFO "%s: error on #processes\n", __func__);
 			goto bad_fork_free;
+		}
 	}
 
 	atomic_inc(&p->user->__count);
@@ -1062,14 +1064,20 @@ static struct task_struct *copy_process(
 	 * triggers too late. This doesn't hurt, the check is only there
 	 * to stop root fork bombs.
 	 */
-	if (nr_threads >= max_threads)
+	if (nr_threads >= max_threads) {
+		printk(KERN_INFO "%s: error on #threads\n", __func__);
 		goto bad_fork_cleanup_count;
+	}
 
-	if (!try_module_get(task_thread_info(p)->exec_domain->module))
+	if (!try_module_get(task_thread_info(p)->exec_domain->module)) {
+		printk(KERN_INFO "%s: error on exec_domain->module\n", __func__);
 		goto bad_fork_cleanup_count;
+	}
 
-	if (p->binfmt && !try_module_get(p->binfmt->module))
+	if (p->binfmt && !try_module_get(p->binfmt->module)) {
+		printk(KERN_INFO "%s: error on binfmt->module\n", __func__);
 		goto bad_fork_cleanup_put_domain;
+	}
 
 	p->did_exec = 0;
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
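The patch's messages are printed at KERN_INFO level, so they end up in the kernel log and can be read with dmesg. A hypothetical helper for tallying them might look like the following sketch; it assumes the log contains the exact strings from the printk() calls, with __func__ expanding to "copy_process".

```c
#include <stdio.h>
#include <string.h>

#define NCHECKS 4

/* Human-readable meaning of each check instrumented by the patch. */
static const char *const check_names[NCHECKS] = {
    "per-user process count vs RLIMIT_NPROC",
    "system-wide threads-max",
    "exec_domain module reference",
    "binfmt module reference",
};

/* Return the index of the check a log line reports, or -1 if the line
 * is not one of the patch's messages. Timestamps before it are ignored. */
int copy_process_check(const char *line)
{
    static const char *const needles[NCHECKS] = {
        "copy_process: error on #processes",
        "copy_process: error on #threads",
        "copy_process: error on exec_domain->module",
        "copy_process: error on binfmt->module",
    };

    for (int i = 0; i < NCHECKS; i++)
        if (strstr(line, needles[i]) != NULL)
            return i;
    return -1;
}

/* Count how often each check fired in a stream of log lines and print
 * one summary line per check. */
void tally_copy_process_errors(FILE *in, FILE *out)
{
    char line[512];
    unsigned long counts[NCHECKS] = {0};
    int i;

    while (fgets(line, sizeof line, in) != NULL)
        if ((i = copy_process_check(line)) >= 0)
            counts[i]++;
    for (i = 0; i < NCHECKS; i++)
        fprintf(out, "%-40s %lu\n", check_names[i], counts[i]);
}
```

A small main() calling tally_copy_process_errors(stdin, stdout) could then be fed with `dmesg | ./tally`; the non-zero row would show which test is returning EAGAIN.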