Message-ID: <alpine.DEB.2.11.1411232235591.6439@nanos>
Date:	Sun, 23 Nov 2014 22:38:03 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Chris Mason <clm@...com>
cc:	Borislav Petkov <bp@...en8.de>, torvalds@...ux-foundation.org,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...nel.org>,
	Stanislaw Gruszka <sgruszka@...hat.com>
Subject: Re: New crashes walking proc with Saturday's git
On Sun, 23 Nov 2014, Chris Mason wrote:
> On Sun, Nov 23, 2014 at 4:05 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> > On Sun, 23 Nov 2014, Chris Mason wrote:
> > >  On Sun, Nov 23, 2014 at 11:32 AM, Borislav Petkov <bp@...en8.de> wrote:
> > >  > On Sun, Nov 23, 2014 at 11:16:51AM -0500, Chris Mason wrote:
> > >  > >  It must be:
> > >  > >
> > >  > >  commit 6e998916dfe327e785e7c2447959b2c1a3ea4930
> > >  > >  Author: Stanislaw Gruszka <sgruszka@...hat.com>
> > >  > >  Date:   Wed Nov 12 16:58:44 2014 +0100
> > >  > >
> > >  > >     sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency
> > >  > >
> > >  > >  I'll do two runs to confirm, but it's the only related patch between
> > >  > >  rc5 and now.
> > > 
> > >  I've adding Ingo and Stanislaw to the cc.  With
> > >  6e998916dfe327e785e7c2447959b2c1a3ea4930 reverted, I'm no longer
> > >  crashing.
> > > 
> > >  Repeating the stack trace for the new cc list.  I see the crash with atop
> > >  or similar walkers of /proc racing against exiting programs.  Given the
> > >  NULL rip, this line from the patch is probably broken, but it really feels
> > >  like we should be falling over on p->sched_class and not on the
> > >  update_curr func.
> > > 
> > >  +               p->sched_class->update_curr(rq);
> > > 
> > >  I'm leaving my fork bomb running on two machines with the patch reverted
> > >  to make sure.
> > 
> > The sched_class instances which do not have update_curr are stop_task
> > and idle. Patch below.
> > 
> > I'm sure nobody thought about the stats read code path here.
> > 
> > [ 1053.759741]  [<ffffffff81208348>] do_task_stat+0x8b8/0xb00
> > 
> > do_task_stat()
> >  thread_group_cputime_adjusted()
> >    thread_group_cputime()
> >      task_cputime()
> >        task_sched_runtime()
> >          if (task_current(rq, p) && task_on_rq_queued(p)) {
> >                  update_rq_clock(rq);
> >                  p->sched_class->update_curr(rq);
> >          }
> > 
> > Now if the stats are read for a stomp machine task, aka 'migration/N'
> > and that task is current on its cpu. Ooops.
> > 
> > I added the callback for idle tasks as well for completeness sake.
> 
> This does make sense, but it doesn't match with the crash being much more
> likely during the fork bomb.  The difference is crashing within a few hours vs
> crashing within 5 minutes.

The fork bomb will kick the migration task into life pretty often, so
the probability of do_task_stat() hitting a running migration thread is
higher than on a normally loaded machine.

Thanks,

	tglx
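
For readers following the thread: the failure mode discussed above can be
modeled in user space. The sketch below is not kernel code and not the patch
from this thread. It only illustrates why a class that leaves update_curr
unset breaks a stats reader that calls the hook unconditionally, and how a
no-op callback avoids that. All names ending in _model, the struct layouts
and the -1 return value (which stands in for the NULL call) are assumptions
made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct rq;  /* forward declaration; defined below */

/* Simplified stand-in for the scheduler class: one optional hook. */
struct sched_class_model {
    void (*update_curr)(struct rq *rq);  /* NULL if the class lacks it */
};

struct task_model {
    const struct sched_class_model *sched_class;
    int on_rq;            /* task is queued on its runqueue */
    uint64_t runtime_ns;  /* runtime accounted so far */
};

struct rq {
    struct task_model *curr;  /* task currently running on this CPU */
    uint64_t pending_ns;      /* runtime not yet folded into curr */
};

/* A class that does real accounting. */
static void update_curr_fair_model(struct rq *rq)
{
    rq->curr->runtime_ns += rq->pending_ns;
    rq->pending_ns = 0;
}

/* Hypothetical no-op hook for classes that have nothing to account. */
static void update_curr_noop_model(struct rq *rq)
{
    (void)rq;
}

static const struct sched_class_model fair_model = { update_curr_fair_model };
static const struct sched_class_model stop_broken_model = { NULL };
static const struct sched_class_model stop_fixed_model = { update_curr_noop_model };

/*
 * Models the stats read path: if p is current and queued, refresh its
 * runtime through the class hook before reading it. Returns 0 on success
 * and -1 where an unconditional call would have jumped through NULL.
 */
int task_runtime_model(struct rq *rq, struct task_model *p, uint64_t *out)
{
    if (rq->curr == p && p->on_rq) {
        if (!p->sched_class->update_curr)
            return -1;  /* the crash site in this model */
        p->sched_class->update_curr(rq);
    }
    *out = p->runtime_ns;
    return 0;
}
```

With a task using stop_broken_model as the current task of its rq,
task_runtime_model() returns -1; with stop_fixed_model it returns 0 and
leaves the runtime unchanged. A task that is not current never reaches the
hook at all, which fits the observation that the crash needs the stats
reader to catch the migration thread while it is running.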