lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <alpine.LFD.2.00.0812222319260.20902@localhost.localdomain>
Date:	Mon, 22 Dec 2008 23:28:26 -0500 (EST)
From:	Len Brown <lenb@...nel.org>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	Frans Pop <elendil@...net.nl>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Yinghai Lu <yinghai@...nel.org>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	"Maciej W. Rozycki" <macro@...ux-mips.org>,
	"Pallipadi, Venkatesh" <venkatesh.pallipadi@...el.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>, Greg KH <greg@...ah.com>,
	jbarnes@...tuousgeek.org,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	tiwai@...e.de, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: "APIC error on CPU1: 00(40)" during resume

On Sun, 21 Dec 2008, Ingo Molnar wrote:

> 
> * Frans Pop <elendil@...net.nl> wrote:
> 
> > On Wednesday 10 December 2008, Ingo Molnar wrote:
> > > regarding those APIC error messages:
> > > >        ACPI: Waking up from system sleep state S3
> > > >        APIC error on CPU1: 00(40)
> > > >        ACPI: EC: non-query interrupt received, switching to interrupt
> > >
> > > that does suggest that the APIC was re-enabled (we dont get any APIC
> > > error exceptions otherwise!), and its LVT was programmed as well, but
> > > somehow we got an erroneous APIC message from an illegal vector.
> > 
> > I wonder if this may help tracing the cause. Today I got a KERN_ERR in the 
> > middle of those messages:
> > 
> > ACPI: Waking up from system sleep state S3
> > BUG: sleeping function called from invalid context at kernel/sched.c:5571
> > in_atomic(): 0, irqs_disabled(): 1, pid: 70, name: kacpid
> > Pid: 70, comm: kacpid Not tainted 2.6.28-rc7-rjw #77
> > Call Trace:
> >  [<ffffffff80345013>] ? acpi_os_release_object+0x9/0xd
> >  [<ffffffff8022d588>] __might_sleep+0xcf/0xd1
> >  [<ffffffff80234e32>] __cond_resched+0x15/0x4b
> >  [<ffffffff8043b1d7>] _cond_resched+0x2d/0x38
> >  [<ffffffff80357bc9>] acpi_ps_complete_op+0x235/0x24b
> >  [<ffffffff803582de>] acpi_ps_parse_loop+0x6ff/0x859
> >  [<ffffffff803574db>] acpi_ps_parse_aml+0x7c/0x2bb
> >  [<ffffffff80358a35>] acpi_ps_execute_method+0x144/0x213
> >  [<ffffffff80354e9e>] acpi_ns_evaluate+0x152/0x230
> >  [<ffffffff803452d0>] ? acpi_os_execute_deferred+0x0/0x39
> >  [<ffffffff8034c6a6>] acpi_ev_asynch_execute_gpe_method+0xc1/0x119
> >  [<ffffffff803452fc>] acpi_os_execute_deferred+0x2c/0x39
> >  [<ffffffff802480c7>] run_workqueue+0x95/0x12a
> >  [<ffffffff80248251>] worker_thread+0xf5/0x109
> >  [<ffffffff8024b9cb>] ? autoremove_wake_function+0x0/0x38
> >  [<ffffffff8024815c>] ? worker_thread+0x0/0x109
> >  [<ffffffff8024b678>] kthread+0x49/0x76
> >  [<ffffffff8020d009>] child_rip+0xa/0x11
> >  [<ffffffff8022da41>] ? pick_next_task_fair+0x8b/0x93
> >  [<ffffffff8024b62f>] ? kthread+0x0/0x76
> >  [<ffffffff8020cfff>] ? child_rip+0x0/0x11
> > APIC error on CPU1: 00(40)
> > ACPI: EC: non-query interrupt received, switching to interrupt mode
> > 
> > This is the first time I've seen this error. Kernel is based on commit 
> > f6f7b52e2f61 (just after -rc7) and includes the final versions of the 
> > patches Rafael posted in this thread [1].
> > 
> > More complete log available on request.
> 
> hm, that warning seems to show an ACPI bug (Len Cc:-ed): we preempt in an 
> atomic section - right during executing an AML scriptlet. Executing ACPI 
> AMLs is a rather fragile moment of the kernel: they are used by the BIOS 
> to indirectly instruct the kernel to tweak lowlevel chipset registers and 
> other platform details.
> 
> The kernel executes AMLs 'blindly' - they tweak details that Linux 
> typically has no knowledge about via any driver - so these things must 
> absolutely run atomic, and scheduling away in the wrong moment (which 
> means implicitly re-enabling interrupts) can leave the system in an 
> inconsistent state.
> 
> This 'blindness' and opaqueness of AML execution is perhaps the nastiest 
> aspect of the whole ACPI engine (because their opacity makes them 
> undebuggable and unfixable in essence). Nevertheless, it still might be 
> some unrelated phenomenon to your APIC illegal vector errors.

I believe that it is a bug that run_workqueue() is called
with interrupts off.  I've seen this reported on one other machine,
and Rui debugged it down to the BIOS returning from an SMM entry
with interrupts off:-(  So who knows, maybe run_workqueue()
had interrupts enabled and some AML triggered an SMM
to have the same failure here?

Re: executing AML's blindly.
Yes, and no.  We do know exactly what we interpret --
but yes, it is the BIOS writer that wrote the AML:-(

Above a ACPI interrupt has been deferred to non-interrupt
context for the OS to run its AML handler.  Unfortunately,
we somehow found ourselves running the work-queue with
interrupts off...

Also, the assert above is about to be hidden by a real
bug fix which I sent upstream today -- we'll not call cond_resched()
when  interrupts are off.
This is for the benefit of irqrouter_resume() 
which runs AML with interrupts off.

-Len

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ