linux-kernel - Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unable to handle kernel NULL pointer dereference at 0000000000000598 1.478005] IP: [<ffffffff8107a6c4>] queue_work

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4F046C42.30903@canonical.com>
Date:	Wed, 04 Jan 2012 16:12:02 +0100
From:	Stefan Bader <stefan.bader@...onical.com>
To:	NeilBrown <neilb@...e.de>
CC:	John Stultz <john.stultz@...aro.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	Sander Eikelenboom <linux@...elenboom.it>, rjw@...k.pl,
	Thomas Gleixner <tglx@...utronix.de>,
	linux-kernel@...r.kernel.org
Subject: Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unable
 to handle kernel NULL pointer dereference at 0000000000000598  1.478005]
 IP: [<ffffffff8107a6c4>] queue_work_on+0x4/0x30

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 04.01.2012 02:20, NeilBrown wrote:
> On Tue, 03 Jan 2012 16:53:00 -0800 John Stultz <john.stultz@...aro.org>
> wrote:
> 
>> On Wed, 2012-01-04 at 11:31 +1100, NeilBrown wrote:
>>> On Tue, 03 Jan 2012 15:09:48 -0800 John Stultz <john.stultz@...aro.org>
>>> wrote:
>>>>> From the stack trace, we've kicked off a rtc_timer_do_work,
>>>>> probably
>>>> from the rtc_initialize_alarm() schedule_work call added in Neil's 
>>>> patch.  From there, we call __rtc_set_alarm -> cmos_set_alarm -> 
>>>> cmos_rq_disable -> cmos_checkintr -> rtc_update_irq ->
>>>> schedule_work.
>>>> 
>>>> So, what it looks to me is that in cmos_checkintr, we grab the
>>>> cmos->rtc and pass that along. Unfortunately, since the cmos->rtc
>>>> value isn't set until after rtc_device_register() returns its null at
>>>> that point. So your patch isn't really fixing the issue, but just
>>>> reducing the race window for the second cpu to schedule the work.
>>>> 
>>>> Sigh. I'd guess dropping the schedule_work call from 
>>>> rtc_initialize_alarm() is the right approach (see below). When
>>>> reviewing Neil's patch it seemed like a good idea there, but it seems
>>>> off to me now.
>>>> 
>>>> Neil, any thoughts on the following? Can you expand on the condition
>>>> you were worried about in around that call?
>>> 
>>> If you set an alarm in the future, then shutdown and boot again after
>>> that time, then you will end up with a timer_queue node which is in the
>>> past.
>> 
>> Thanks for explaining this again.
>> 
>> Hrm. It seems the easy answer is to simply not add alarms that are in the
>> past.  Further, I'm a bit perplexed, as if they are in the past, the 
>> enabled flag shouldn't be set.   __rtc_read_alarm() does check the 
>> current time, so maybe we can make sure we don't return old values? I 
>> guess I assumed __rtc_read_alarm() avoided returning stale values, but 
>> apparently not.
> 
> That would probably be a more robust approach.  Also it might make sense
> to clean out old alarms whenever we are about to add a new one.
> 
>> 
>>> When this happens the queue gets stuck.  That entry-in-the-past won't
>>> get removed until and interrupt happens and an interrupt won't happen
>>> because the RTC only triggers an interrupt when the alarm is "now".
>>> 
>>> So you'll find that e.g. "hwclock" will always tell you that 'select'
>>> timed out.
>>> 
>>> So we force the interrupt work to happen at the start just in case.
>> 
>> Unfortunately its too early.
>> 
>>> Did you see my proposed patch which converted those calls to do the
>>> work in-process rather than passing it to a worker-thread?  I think
>>> that is a clean fix.
>> 
>> I don't think I saw it today. Was it from before the holidays?
> 
> About 4 hours ago: Subject: Re: Patch Upstream: rtc: Expire alarms after
> the time is set.
> 
>> 
>> Even so, at this point, I don't know if we have enough time for testing, 
>> so I'm thinking we either just drop the problematic sched_work call or 
>> revert the whole thing and try again for 3.3
> 
> I wouldn't object to that.  The bug only triggers in unusual circumstances 
> and is quite easy to work around so it is safer  to wait until we have a 
> really good fix.
> 
> Thanks, NeilBrown
> 
Neil, this variation would also keep the kernel from crashing in my test
environment. What do you think of that?

- -Stefan

> 
>> 
>> thanks -john
>> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBCgAGBQJPBGw4AAoJEOhnXe7L7s6jg5YQAJ1PrNGhuXJSAevurIorc+l5
Xbyj96MRk2Z7y7c0f/OQCCKzmReNMOQ91pgok2FIamHBh55JtbgdldyTWG80Jr0M
N1jwkus1SJTIcLAeIrzoXSLU5uTvmcCcW0GBUuK98m0LDwMMJr0P1a5ilsyorRFY
ekdFIL732Cs6hn3BAlg6zVucG/kMOi3OEOtHlbKdw83VdxJ1OmjIkW9tlj/6268j
ZImuXh1XZmCJiTuCwdvdi3RVU3qZvQJi1AW0G0zsQcc2r13OQkSSoi/Wo5+6dNit
7WxSbYqeZ4hPmk67yDKjOK67kDU094end++ePtQNke0d9IePB3fRvDIi+eEkrjc5
0rJi2gLkpSyE8MJ3pBO9Bt8JCVjpY1QZnQsDT68fIHUApMnrwRK2N371JzbSim52
8VYgTfubRRdwiLlMAL7ACilFV2FKc9PcTf7FrrR2j+GZfhKTD5iy+bNvqpWaJVVT
jUrfCS79z+Z9v9/bjZ55n23gIJQn3jLk3B79plnFlwmxp9CDaAITmfzKsSB0EoZB
RF0ndDkd1Oii/XuFdqaVaMeZojPCT2rVHas2H26mH+Ihb0n9EvSBgx2L64DbRrAt
duLngf+IAO6T9ngaqnhmHwJXdwKiKGxUhWVdo92YTd16jdANd/711/B0kzHHWw4H
FT46+Os2UQJR+HsedKZg
=Kgpc
-----END PGP SIGNATURE-----

View attachment "0001-rtc-Do-not-schedule-work-in-rtc_initialize_alarm.patch" of type "text/x-diff" (2312 bytes)