Message-ID: <20090130140620.GD17401@elte.hu>
Date: Fri, 30 Jan 2009 15:06:20 +0100
From: Ingo Molnar <mingo@...e.hu>
To: "Rafael J. Wysocki" <rjw@...k.pl>
Cc: Frédéric Weisbecker <fweisbec@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Maciej Rutecki <maciej.rutecki@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [Linux 2.6.29-rc2] BUG: using smp_processor_id() in preemptible
* Rafael J. Wysocki <rjw@...k.pl> wrote:
> On Thursday 29 January 2009, Ingo Molnar wrote:
> >
> > * Rafael J. Wysocki <rjw@...k.pl> wrote:
> >
> > > On Tuesday 27 January 2009, Ingo Molnar wrote:
> > > >
> > > > * Rafael J. Wysocki <rjw@...k.pl> wrote:
> > > >
> > > > > > In fact whatever check you put in it's _always_ going to be
> > > > > > fundamentally more fragile than direct instrumentation: you cannot
> > > > > > possibly check all possible places that enable interrupts. (they could
> > > > > > be disabling interrupts as a _restore_irqs() sequence for example)
> > > > >
> > > > > In this particular case, I'm not really interested in that. What I'm
> > > > > interested in is which driver's ->suspend_late() or ->resume_early() (or
> > > > > the equivalents for sysdevs) has enabled interrupts, which is quite easy
> > > > > to check directly.
> > > >
> > > > But this is exactly what it does - without any need for debug checks
> > > > spread around!
> > > >
> > > > You'll get a _full stack dump_ from the very driver that is enabling
> > > > interrupts! You dont get a trace - you get a stack dump of the very place
> > > > that is buggy. It does not get any better than that.
> > >
> > > I'm not going to argue.
> > >
> > > Nevertheless, IMO something like the patch below should be sufficient to catch
> > > these bugs.
> > >
> > > Thanks,
> > > Rafael
> > >
> > >
> > > ---
> > > drivers/base/power/main.c | 12 ++++++++++++
> > > drivers/base/sys.c | 21 ++++++++++++++++-----
> > > include/linux/pm.h | 18 ++++++++++++++++++
> > > 3 files changed, 46 insertions(+), 5 deletions(-)
> >
> > hm, so now you sprinkle debug checks all around the code, instead of
> > putting in a single pair of:
> >
> > force_irqs_off_start();
> > ...
> > force_irqs_off_end();
>
> And what debug options exactly would that require to be set to work?

hm, if you worry about that aspect: we could make it seamlessly enabled if
PM_DEBUG is enabled.

> > which would catch everything that your checks would catch - and it
> > would catch more.
>
> Except that the checks trigger in specific places, so if a check
> triggers you know precisely where the bug happened regardless of what
> garbage is in the call trace.

This argument is a 100% mystery to me. Do you really not see the quality
difference between a stack trace generated _right at the buggy piece of
code_ and a warning later on that might (or might not) trigger?

Especially considering that your approach won't catch such bugs:

    ...
    spin_unlock_irq(&lock);
    ...
    spin_lock_irq(&lock);
    ...

Or such bugs:

    local_irq_enable();
    ...
    local_irq_disable();

Or such bugs:

    spin_lock_irqsave(&lock1, flags1);
    ...
    spin_lock_irqsave(&lock2, flags2);
    ...
    spin_unlock_irq(&lock2); /* accidental bug */
    ...
    spin_unlock_irqrestore(&lock1, flags1);

Such bugs can be especially hard to find in practice if the window where
irqs are enabled is small. There is no guarantee at all that an accidental
irq enable survives the critical section - it can very easily be
re-disabled in the normal flow of things.
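
To make the point about short windows concrete, here is a minimal
user-space sketch. It is a toy model and not kernel code: a single flag
stands in for the CPU interrupt state, and every name in it
(model_irq_enable(), check_after_return(), the two callbacks) is
hypothetical and invented for illustration.

```c
#include <stdbool.h>

/* Toy model: one flag stands in for the CPU's interrupt-enable state. */
static bool irqs_enabled;

static void model_irq_enable(void)  { irqs_enabled = true; }
static void model_irq_disable(void) { irqs_enabled = false; }

/* Hypothetical callback that leaves irqs enabled when it returns. */
static void leaky_callback(void)
{
    model_irq_enable();
}

/* Hypothetical callback that opens a short irqs-on window and closes it. */
static void transient_callback(void)
{
    model_irq_enable();     /* the bug: irqs must stay off here */
    /* ... some work with irqs unexpectedly enabled ... */
    model_irq_disable();    /* re-disabled in the normal flow of things */
}

/*
 * Check made only after the callback has returned.
 * Returns true if the check would flag the callback.
 */
static bool check_after_return(void (*callback)(void))
{
    bool flagged;

    irqs_enabled = false;   /* callbacks are entered with irqs off */
    callback();
    flagged = irqs_enabled; /* only the final state is visible here */
    irqs_enabled = false;   /* restore the expected state */
    return flagged;
}
```

In this model check_after_return(leaky_callback) reports the problem, while
check_after_return(transient_callback) reports nothing, although both
callbacks enabled irqs.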

And even if we are lucky and irqs are still enabled by the time the
callback returns, what if your warning flags some big and complex driver,
one line of which is buggy?

If you had the choice, what would you prefer - a stack dump done at the
point of incident (pinpointing the driver, the subsystem and the buggy
function with its full callframe), or your "this driver is buggy" generic
warning with no specificity about where that bug might be?

Stacktraces _at the point of incident_, and a _guaranteed_ facility that
_enforces_ that irqs are off during the whole resume cycle are just about
the highest quality debug info and debug protection we can get in such
situations.
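
A rough user-space sketch of what such direct instrumentation could look
like follows. The names force_irqs_off_start() and force_irqs_off_end()
come from this thread, but their bodies here are assumptions, as is
everything else in the block: the model_irq_enable() macro, the violation_*
variables and the callback are hypothetical. Recording __func__ and
__LINE__ stands in for the stack dump a kernel implementation would
presumably print at that spot.

```c
#include <stdbool.h>
#include <stddef.h>

static bool irqs_enabled;       /* toy stand-in for the CPU irq state */
static bool irqs_forced_off;    /* true inside the enforced region */

/* Where the first violation happened; stands in for a stack dump. */
static const char *violation_func;
static int violation_line;
static int violation_count;

static void force_irqs_off_start(void)
{
    irqs_enabled = false;
    irqs_forced_off = true;
    violation_func = NULL;
    violation_line = 0;
    violation_count = 0;
}

static void force_irqs_off_end(void)
{
    irqs_forced_off = false;
}

/* The enable primitive itself is instrumented, so it sees every caller. */
static void model_irq_enable_at(const char *func, int line)
{
    if (irqs_forced_off) {
        /* Assumption: warn at the call site and refuse to enable. */
        if (violation_count++ == 0) {
            violation_func = func;
            violation_line = line;
        }
        return;
    }
    irqs_enabled = true;
}

#define model_irq_enable() model_irq_enable_at(__func__, __LINE__)

static void model_irq_disable(void)
{
    irqs_enabled = false;
}

/* Hypothetical driver callback with a short, transient irqs-on window. */
static void buggy_resume_early(void)
{
    model_irq_enable();     /* caught here, at the point of incident */
    model_irq_disable();
}
```

Calling buggy_resume_early() between force_irqs_off_start() and
force_irqs_off_end() records the function and line of the offending
enable, even though irqs are off again by the time the callback returns.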

I really don't understand your points about this.

Ingo