[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.63.0707231858420.9759@localhost.localdomain>
Date: Mon, 23 Jul 2007 19:07:37 -0700 (PDT)
From: Jeremy Katz <jeremy.katz@...driver.com>
To: Oleg Nesterov <oleg@...sign.ru>
cc: Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...l.org>, Ingo Molnar <mingo@...e.hu>,
Stable Team <stable@...nel.org>
Subject: Re: [PATCH] posix-timer: fix deletion race
On Fri, 20 Jul 2007, Oleg Nesterov wrote:
> On 07/18, Jeremy Katz wrote:
>>
>> On Wed, 18 Jul 2007, Oleg Nesterov wrote:
>>
>>> Jeremy, I agree with Thomas that your patch should not be right, but it
>>> does make a difference. Perhaps this is just the timing, but who knows.
>>> Could you add some printk's to be sure that lock_timer() actually fails
>>> while it never should?
>>
>> Agreed.
>>
>> Unfortunately, adding any significant output appears to alter the
>> situation to the point where the issue either does not occur, or takes
>> significantly longer to surface.
>
> No, no, I didn't mean any significant output. You changed itimer_delete()
>
> > - spin_lock_irqsave(&timer->it_lock, flags);
> > + /* timer already deleted? */
> > + if (lock_timer(timer->it_id, &flags) == NULL)
> > + return;
>
> This change should not help, lock_timer() should always succeed here.
> But since it makes a difference, we can make something like
>
> if (lock_timer(timer->it_id, &flags) == NULL) {
> printk("Impossible! but it happened.\n");
> return;
> }
>
> The same for posix_timer_fn().
Ahh, of course. I did try that at some point, and remember seeing at
least the occasional failure. This time, taking the spinlock and then
checking for a valid timer ID, I did not see the locking fail. I did see
the attempt to use a freed sigqueue, further suggesting my 'fix' merely
altered the timing sufficiently to hide the issue.
> I still can't believe we have a double-free problem, this looks imposiible.
> Do you see the
>
> "idr_remove called for id=%d which is not allocated.\n"
>
> in syslog?
No. I also added some accounting with atomic counters, and don't see
evidence of a second call to release_posix_timer.
> Could you try the patch below? Perhaps we have some wierd problem with
> ->sigq corruption.
Tried, with apparent effect.
To add to the body of data: Turning off hyperthreading in the hardware
does not resolve the issue. Limiting the system to one CPU does appear to
work.
Jeremy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists