Message-ID: <m1zmdcty4i.fsf@ebiederm.dsl.xmission.com>
Date: Wed, 06 Sep 2006 16:43:09 -0600
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Jean Delvare <jdelvare@...e.de>
Cc: Andrew Morton <akpm@...l.org>, Oleg Nesterov <oleg@...sign.ru>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
linux-kernel@...r.kernel.org, ak@...e.de
Subject: Re: [PATCH] proc: readdir race fix (take 3)
Jean Delvare <jdelvare@...e.de> writes:
> On Wednesday 6 September 2006 11:01, Jean Delvare wrote:
>> Eric, Kame, thanks a lot for working on this. I'll be giving some good
>> testing to this patch today, and will return back to you when I'm done.
>
> The original issue is indeed fixed, but there's a problem with the patch.
> When stressing /proc (to verify the bug was fixed), my test machine ended
> up crashing. Here are the 2 traces I found in the logs:
Ugh.
So the death in __put_task_struct() is from:
	WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE)));
So it appears we have something that is decrementing but not
incrementing the count on the task struct.
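
To make that failure mode concrete, here is a minimal user-space sketch of
an unbalanced put. It is only a model, not the kernel code: the struct, the
model_* helpers, and the flag values are all assumptions made up for
illustration. The point is that a put without a matching get drops the last
reference while the task is still alive, so the exit_state check fires.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative flag values for this model only. */
enum { MODEL_EXIT_ZOMBIE = 16, MODEL_EXIT_DEAD = 32 };

/* Hypothetical stand-in for the parts of task_struct that matter here. */
struct model_task {
	int usage;      /* reference count */
	int exit_state; /* 0 while the task is still alive */
	int freed;      /* set once the last reference is dropped */
};

static int model_warnings; /* how many times the check below fired */

static void model_get_task(struct model_task *t)
{
	t->usage++;
}

/* Models the check quoted above: warn if the task has not exited. */
static void model_free_task(struct model_task *t)
{
	if (!(t->exit_state & (MODEL_EXIT_DEAD | MODEL_EXIT_ZOMBIE))) {
		model_warnings++;
		fprintf(stderr, "warning: freeing a task that has not exited\n");
	}
	t->freed = 1;
}

static void model_put_task(struct model_task *t)
{
	assert(t->usage > 0);
	if (--t->usage == 0)
		model_free_task(t);
}

/* Balanced get/put on a live task: returns the number of warnings. */
static int model_balanced(void)
{
	struct model_task t = { .usage = 1, .exit_state = 0, .freed = 0 };

	model_warnings = 0;
	model_get_task(&t);
	model_put_task(&t);
	return model_warnings;
}

/* A put with no matching get drops the live task's own reference. */
static int model_unbalanced(void)
{
	struct model_task t = { .usage = 1, .exit_state = 0, .freed = 0 };

	model_warnings = 0;
	model_put_task(&t);
	return model_warnings;
}
```

In this model, model_balanced() leaves the task alone and reports no
warnings, while model_unbalanced() frees a live task and trips the check
once.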
Now what is interesting is that there are a couple of other failure modes
present here.
free_uid(), called from __put_task_struct(), is failing.
And you seem to have a recursive page fault going on somewhere.
I suspect the triggering of this bug is the result of an earlier oops,
that left some process half cleaned up.
Have you tested 2.6.18-rc6 without my patch?
If not, can you please test the same 2.6.18-rc6 configuration without my patch?
> Sometimes the machine just hung, with nothing in the logs. The machine is
> a Sony laptop (i686).
>
> I have been testing the patch on another machine (x86_64) and had no
> problem at all, so the reproducibility of the bug might depend on the
> arch or some config option. I'll help nailing down this issue if I can,
> just tell me what to do.
So I don't know what is going on with your laptop. It feels nasty.
I think my patch is just tripping on the problem, rather than causing
it. The previous version of fs/proc/base.c should have tripped over
this problem as well if it happened to have hit the same process.
I'm staring at the patch and I cannot think of anything that would
explain your problem. The reference counting is simple and the only
bug I had in a posted version was a failure to decrement the count
on the task_struct.
I guess the practical question is: what was your test methodology for
reproducing this problem? A couple more people running the same
test on a few more machines might at least give us confidence in what
is going on.
Eric