[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <alpine.LFD.2.00.0812222319260.20902@localhost.localdomain>
Date: Mon, 22 Dec 2008 23:28:26 -0500 (EST)
From: Len Brown <lenb@...nel.org>
To: Ingo Molnar <mingo@...e.hu>
Cc: Frans Pop <elendil@...net.nl>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Yinghai Lu <yinghai@...nel.org>,
Suresh Siddha <suresh.b.siddha@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
"Maciej W. Rozycki" <macro@...ux-mips.org>,
"Pallipadi, Venkatesh" <venkatesh.pallipadi@...el.com>,
"Rafael J. Wysocki" <rjw@...k.pl>, Greg KH <greg@...ah.com>,
jbarnes@...tuousgeek.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
tiwai@...e.de, Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: "APIC error on CPU1: 00(40)" during resume
On Sun, 21 Dec 2008, Ingo Molnar wrote:
>
> * Frans Pop <elendil@...net.nl> wrote:
>
> > On Wednesday 10 December 2008, Ingo Molnar wrote:
> > > regarding those APIC error messages:
> > > > ACPI: Waking up from system sleep state S3
> > > > APIC error on CPU1: 00(40)
> > > > ACPI: EC: non-query interrupt received, switching to interrupt
> > >
> > > that does suggest that the APIC was re-enabled (we dont get any APIC
> > > error exceptions otherwise!), and its LVT was programmed as well, but
> > > somehow we got an erroneous APIC message from an illegal vector.
> >
> > I wonder if this may help tracing the cause. Today I got a KERN_ERR in the
> > middle of those messages:
> >
> > ACPI: Waking up from system sleep state S3
> > BUG: sleeping function called from invalid context at kernel/sched.c:5571
> > in_atomic(): 0, irqs_disabled(): 1, pid: 70, name: kacpid
> > Pid: 70, comm: kacpid Not tainted 2.6.28-rc7-rjw #77
> > Call Trace:
> > [<ffffffff80345013>] ? acpi_os_release_object+0x9/0xd
> > [<ffffffff8022d588>] __might_sleep+0xcf/0xd1
> > [<ffffffff80234e32>] __cond_resched+0x15/0x4b
> > [<ffffffff8043b1d7>] _cond_resched+0x2d/0x38
> > [<ffffffff80357bc9>] acpi_ps_complete_op+0x235/0x24b
> > [<ffffffff803582de>] acpi_ps_parse_loop+0x6ff/0x859
> > [<ffffffff803574db>] acpi_ps_parse_aml+0x7c/0x2bb
> > [<ffffffff80358a35>] acpi_ps_execute_method+0x144/0x213
> > [<ffffffff80354e9e>] acpi_ns_evaluate+0x152/0x230
> > [<ffffffff803452d0>] ? acpi_os_execute_deferred+0x0/0x39
> > [<ffffffff8034c6a6>] acpi_ev_asynch_execute_gpe_method+0xc1/0x119
> > [<ffffffff803452fc>] acpi_os_execute_deferred+0x2c/0x39
> > [<ffffffff802480c7>] run_workqueue+0x95/0x12a
> > [<ffffffff80248251>] worker_thread+0xf5/0x109
> > [<ffffffff8024b9cb>] ? autoremove_wake_function+0x0/0x38
> > [<ffffffff8024815c>] ? worker_thread+0x0/0x109
> > [<ffffffff8024b678>] kthread+0x49/0x76
> > [<ffffffff8020d009>] child_rip+0xa/0x11
> > [<ffffffff8022da41>] ? pick_next_task_fair+0x8b/0x93
> > [<ffffffff8024b62f>] ? kthread+0x0/0x76
> > [<ffffffff8020cfff>] ? child_rip+0x0/0x11
> > APIC error on CPU1: 00(40)
> > ACPI: EC: non-query interrupt received, switching to interrupt mode
> >
> > This is the first time I've seen this error. Kernel is based on commit
> > f6f7b52e2f61 (just after -rc7) and includes the final versions of the
> > patches Rafael posted in this thread [1].
> >
> > More complete log available on request.
>
> hm, that warning seems to show an ACPI bug (Len Cc:-ed): we preempt in an
> atomic section - right during executing an AML scriptlet. Executing ACPI
> AMLs is a rather fragile moment of the kernel: they are used by the BIOS
> to indirectly instruct the kernel to tweak lowlevel chipset registers and
> other platform details.
>
> The kernel executes AMLs 'blindly' - they tweak details that Linux
> typically has no knowledge about via any driver - so these things must
> absolutely run atomic, and scheduling away in the wrong moment (which
> means implicitly re-enabling interrupts) can leave the system in an
> inconsistent state.
>
> This 'blindness' and opaqueness of AML execution is perhaps the nastiest
> aspect of the whole ACPI engine (because their opacity makes them
> undebuggable and unfixable in essence). Nevertheless, it still might be
> some unrelated phenomenon to your APIC illegal vector errors.
I believe that it is a bug that run_workqueue() is called
with interrupts off. I've seen this reported on one other machine,
and Rui debugged it down to the BIOS returning from an SMM entry
with interrupts off:-( So who knows, maybe run_workqueue()
had interrupts enabled and some AML triggered an SMM
to have the same failure here?
Re: executing AML's blindly.
Yes, and no. We do know exactly what we interpret --
but yes, it is the BIOS writer that wrote the AML:-(
Above a ACPI interrupt has been deferred to non-interrupt
context for the OS to run its AML handler. Unfortunately,
we somehow found ourselves running the work-queue with
interrupts off...
Also, the assert above is about to be hidden by a real
bug fix which I sent upstream today -- we'll not call cond_resched()
when interrupts are off.
This is for the benefit of irqrouter_resume()
which runs AML with interrupts off.
-Len
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists