Message-Id: <20080521163503.000afd61.randy.dunlap@oracle.com>
Date: Wed, 21 May 2008 16:35:03 -0700
From: Randy Dunlap <randy.dunlap@...cle.com>
To: mark <markkicks@...il.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: fork: Resource temporarily unavailable / cant start new threads
On Wed, 21 May 2008 15:51:55 -0700 mark wrote:
> On Wed, May 21, 2008 at 2:32 PM, Randy Dunlap <randy.dunlap@...cle.com> wrote:
> > On Wed, 21 May 2008 14:08:53 -0700 mark wrote:
> >
> >> On Wed, May 21, 2008 at 1:50 PM, Randy Dunlap <randy.dunlap@...cle.com> wrote:
> >> > mark wrote:
> >> >>
> >> >> On Wed, May 21, 2008 at 1:28 PM, Randy Dunlap <randy.dunlap@...cle.com>
> >> >> wrote:
> >> >>>
> >> >>> On Tue, 20 May 2008 11:26:47 -0700 mark wrote:
> >> >>>>
> >> >>>> I upgraded to 2.6.25.3-18.fc9.x86_64 fedora core 9, now I get this
> >> >>>> error when I try to login to the box, kill a pr start a python app, or
> >> >>>> do anything on a regular basis.
> >> >>>>
> >> >>>> fork: Resource temporarily unavailable
[snip]
> >> >>> The only place that fork() returns EAGAIN is for number of
> >> >>> processes being >= its limit. Does this user already have >= 1024
> >> >>> processes?
> >> >>
> >> >> No, it is around 400
> >> >
> >> > Well, my comment was wrong anyway. There are several other tests just
> >> > below number of user processes that also return EAGAIN, like:
> >> >
> >> > - total number of threads being too large
> >
> > Total number of threads currently running is in /proc/loadavg:
> >
> >> cat /proc/loadavg
> > 1.56 0.58 0.27 2/203 28500
> >
> > It's the number following the '/', e.g., 203 on my desktop system.
> >
> > max_threads allowed is a sysctl, so you can tune it if needed.
> > It's in /proc/sys/kernel/threads-max:
> >
> >> cat /proc/sys/kernel/threads-max
> > 32624
> > I sort of doubt that one is the problem, but you can tell us.
>
> cat /proc/loadavg
> 0.39 0.45 0.57 1/1412 12032
> cat /proc/sys/kernel/threads-max
> 274432
> you are right, i guess this is not the problem.
>
>
> >> > - error on grabbing a module reference count (?)
> >> > - error on grabbing a binfmt module reference
> >>
> >> as a user how do i identify what is wrong, and fix this? for total
> >> number of threads -> is there anyway i can find out if this is causing
> >> the problem? my system is running around 80 multi-threaded python web
> >> apps.
> >
> > I can send you some debug patches that will print out the specific
> > problem area. Do you want to do that? Can you rebuild and install
> > a new kernel?
> Is it possible to get this debug messages by turning on some flags?
> If not yes, pl. send debug patches. its a live box and I will try to do it!
>
> This is my system / kernel info:
> uname -a
> Linux XXX 2.6.25.3-18.fc9.x86_64 #1 SMP Tue May 13 04:54:47 EDT 2008
> x86_64 x86_64 x86_64 GNU/Linux
I made a small patch to a vanilla kernel.org 2.6.25.3 kernel tree.
Hopefully it applies cleanly to that fc9 kernel source, but check/verify
that first before going any further.
After building and booting with this patch, there will be kernel
messages whenever fork's "copy_process" function fails with -EAGAIN (-11),
which is reported to userspace as errno = 11 (Resource temporarily
unavailable). Hopefully this will identify which test is failing,
but there's a chance that something else is going on and that this
patch does not find the problem.
Anyway, good luck and please report back on it.
---
 kernel/fork.c |   16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

--- linux-2.6.25.3.orig/kernel/fork.c
+++ linux-2.6.25.3/kernel/fork.c
@@ -1049,8 +1049,10 @@ static struct task_struct *copy_process(
 	if (atomic_read(&p->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
-		    p->user != current->nsproxy->user_ns->root_user)
+		    p->user != current->nsproxy->user_ns->root_user) {
+			printk(KERN_INFO "%s: error on #processes\n", __func__);
 			goto bad_fork_free;
+		}
 	}
 
 	atomic_inc(&p->user->__count);
@@ -1062,14 +1064,20 @@ static struct task_struct *copy_process(
 	 * triggers too late. This doesn't hurt, the check is only there
 	 * to stop root fork bombs.
 	 */
-	if (nr_threads >= max_threads)
+	if (nr_threads >= max_threads) {
+		printk(KERN_INFO "%s: error on #threads\n", __func__);
 		goto bad_fork_cleanup_count;
+	}
 
-	if (!try_module_get(task_thread_info(p)->exec_domain->module))
+	if (!try_module_get(task_thread_info(p)->exec_domain->module)) {
+		printk(KERN_INFO "%s: error on exec_domain->module\n", __func__);
 		goto bad_fork_cleanup_count;
+	}
 
-	if (p->binfmt && !try_module_get(p->binfmt->module))
+	if (p->binfmt && !try_module_get(p->binfmt->module)) {
+		printk(KERN_INFO "%s: error on binfmt->module\n", __func__);
 		goto bad_fork_cleanup_put_domain;
+	}
 
 	p->did_exec = 0;
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */