linux-kernel - Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unable to handle kernel NULL pointer dereference at 0000000000000598 [ 1.478005] IP: [<ffffffff8107a6c4>] queue_work

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20120104122040.5a0fa6a3@notabene.brown>
Date:	Wed, 4 Jan 2012 12:20:40 +1100
From:	NeilBrown <neilb@...e.de>
To:	John Stultz <john.stultz@...aro.org>
Cc:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Sander Eikelenboom <linux@...elenboom.it>,
	stefan.bader@...onical.com, rjw@...k.pl,
	Thomas Gleixner <tglx@...utronix.de>,
	linux-kernel@...r.kernel.org
Subject: Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unable
 to handle kernel NULL pointer dereference at 0000000000000598 [   
 1.478005] IP: [<ffffffff8107a6c4>] queue_work_on+0x4/0x30

On Tue, 03 Jan 2012 16:53:00 -0800 John Stultz <john.stultz@...aro.org> wrote:

> On Wed, 2012-01-04 at 11:31 +1100, NeilBrown wrote:
> > On Tue, 03 Jan 2012 15:09:48 -0800 John Stultz <john.stultz@...aro.org> wrote:
> > > >From the stack trace, we've kicked off a rtc_timer_do_work, probably
> > > from the rtc_initialize_alarm() schedule_work call added in Neil's
> > > patch.  From there, we call __rtc_set_alarm -> cmos_set_alarm ->
> > > cmos_rq_disable -> cmos_checkintr -> rtc_update_irq -> schedule_work.
> > > 
> > > So, what it looks to me is that in cmos_checkintr, we grab the cmos->rtc
> > > and pass that along. Unfortunately, since the cmos->rtc value isn't set
> > > until after rtc_device_register() returns its null at that point. So
> > > your patch isn't really fixing the issue, but just reducing the race
> > > window for the second cpu to schedule the work.
> > > 
> > > Sigh. I'd guess dropping the schedule_work call from
> > > rtc_initialize_alarm() is the right approach (see below). When reviewing
> > > Neil's patch it seemed like a good idea there, but it seems off to me
> > > now.
> > > 
> > > Neil, any thoughts on the following? Can you expand on the condition you
> > > were worried about in around that call?
> > 
> > If you set an alarm in the future, then shutdown and boot again after that
> > time, then you will end up with a timer_queue node which is in the past.
> 
> Thanks for explaining this again.
> 
> Hrm. It seems the easy answer is to simply not add alarms that are in
> the past.  Further, I'm a bit perplexed, as if they are in the past, the
> enabled flag shouldn't be set.   __rtc_read_alarm() does check the
> current time, so maybe we can make sure we don't return old values? I
> guess I assumed __rtc_read_alarm() avoided returning stale values, but
> apparently not.

That would probably be a more robust approach.  Also it might make sense to
clean out old alarms whenever we are about to add a new one.

> 
> > When this happens the queue gets stuck.  That entry-in-the-past won't get
> > removed until and interrupt happens and an interrupt won't happen because the
> > RTC only triggers an interrupt when the alarm is "now".
> > 
> > So you'll find that e.g. "hwclock" will always tell you that 'select' timed
> > out.
> > 
> > So we force the interrupt work to happen at the start just in case.
> 
> Unfortunately its too early. 
> 
> > Did you see my proposed patch which converted those calls to do the work
> > in-process rather than passing it to a worker-thread?  I think that is a
> > clean fix.
> 
> I don't think I saw it today. Was it from before the holidays?

About 4 hours ago:
  Subject: Re: Patch Upstream: rtc: Expire alarms after the time is set.

> 
> Even so, at this point, I don't know if we have enough time for testing,
> so I'm thinking we either just drop the problematic sched_work call or
> revert the whole thing and try again for 3.3

I wouldn't object to that.  The bug only triggers in unusual circumstances
and is quite easy to work around so it is safer  to wait until we have a
really good fix.

Thanks,
NeilBrown


> 
> thanks
> -john
> 


Download attachment "signature.asc" of type "application/pgp-signature" (829 bytes)