linux-kernel - System lockup with 2.6.26.8-rt16 on ARM9

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <3efb10970908230744l4cfaf2el61abce6229796194@mail.gmail.com>
Date:	Sun, 23 Aug 2009 16:44:54 +0200
From:	Remy Bohmer <linux@...mer.net>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	linux-rt-users <linux-rt-users@...r.kernel.org>
Subject: System lockup with 2.6.26.8-rt16 on ARM9

Hello Thomas,

We are running into a bug on a ARM926 processor with 2.6.26.8-rt16
where the system stops scheduling any thread while the hard-irq
handlers keep on running (and waking up tasks with wake_up_process())
What happens is that the system is idle and is continuously polling on
need_resched() in the cpu_idle() routine (arch/arm/kernel/process.c)
But, need_resched() keeps on returning false, so the system stays in
the idle loop and __schedule is never called.

While debugging this we noticed/tried several things already:
* default_idle() wakes up properly when an interrupt occurs.
* Interrupt handling keeps on running, but no interrupt threads are
scheduled either.
* Triggering a different interrupt source sometimes get the system out
of the lockup condition.
* removing the loop with need_resched() to force the system to call
__schedule() after waking up from default_idle() has no effect, the
scheduler does not schedule any threads that are clearly runnable. See
logging below.
* preempt_schedule_irq() is not called after an irq during the lockup.
Probably it is because the check on preempt_count!=0 in __irq_svc (in
arch/arm/kernel/entry-armv.S).
* We have excluded any relation with any user application by adding
root_delay=<1-month-in-seconds> to the kernel commandline.
* We stripped down the kernel as much as possible, excluding almost
all drivers, including removing the networking layer. (but with
networking and a ping flood it seems easier to reproduce)
* Problem exists with NO_HZ, HRT as well as without it.

Is this problem already known, and is there maybe/hopefully already a
fix for it?
Or do you have any other good suggestion for debugging/tracing this?

void cpu_idle(void)
{
        local_fiq_enable();

        /* endless idle loop with no priority at all */
        while (1) {
                void (*idle)(void) = pm_idle;

#ifdef CONFIG_HOTPLUG_CPU
                if (cpu_is_offline(smp_processor_id())) {
                        leds_event(led_idle_start);
                        cpu_die();
                }
#endif

                if (!idle)
                        idle = default_idle;
                leds_event(led_idle_start);
                tick_nohz_stop_sched_tick(1);
--> System keeps looping in this need_resched() loop.
                while (!need_resched())
                        idle();
                leds_event(led_idle_end);
                tick_nohz_restart_sched_tick();
                local_irq_disable();
                __preempt_enable_no_resched();
--> Even calling this (after removing the above 'while
(!need_resched())') has no effect.
                __schedule();
                preempt_disable();
                local_irq_enable();
        }
}


We call 'print_cpu()' when a lockup occurs. We call it directly from a
hard-irq handler.
(canary_check is a kernel thread that is part of our debugging code.
It is a thread that is directly scheduled via wake_up_process() from
the system timer irq. We check from the hard-irq handler if this
thread is being scheduled, if not the system is locked up and we dump
this logging)

[ 8931.475755]
[ 8931.475769] cpu#0
[ 8931.479186]   .nr_running                    : 3
[ 8931.483809]   .load                          : 532566
[ 8931.488862]   .nr_switches                   : 37438443
[ 8931.494089]   .nr_load_updates               : 860943
[ 8931.499142]   .nr_uninterruptible            : 1
[ 8931.503761]   .jiffies                       : 830943
[ 8931.508816]   .next_balance                  : 0.000000
[ 8931.514043]   .curr->pid                     : 0
[ 8931.518664]   .clock                         : 8931475.737362
[ 8931.524411]   .cpu_load[0]                   : 532566
[ 8931.529464]   .cpu_load[1]                   : 532420
[ 8931.534516]   .cpu_load[2]                   : 516298
[ 8931.539569]   .cpu_load[3]                   : 433653
[ 8931.544621]   .cpu_load[4]                   : 310689
[ 8931.549674]   .rt.rt_nr_running              : 3
[ 8931.554293]   .rt.rt_nr_uninterruptible      : 4294967293
[ 8931.559694]
[ 8931.559703] cfs_rq[0]:
[ 8931.563548]   .exec_clock                    : 0.000000
[ 8931.568778]   .MIN_vruntime                  : 0.000001
[ 8931.574007]   .min_vruntime                  : 1254.953603
[ 8931.579496]   .max_vruntime                  : 0.000001
[ 8931.584724]   .spread                        : 0.000000
[ 8931.589952]   .spread0                       : 0.000000
[ 8931.595178]   .nr_running                    : 0
[ 8931.599796]   .load                          : 0
[ 8931.604414]   .nr_spread_over                : 0
[ 8931.609048]
[ 8931.609056] runnable tasks:
[ 8931.609067]             task   PID         tree-key  switches  prio
    exec-runtime         sum-exec        sum-sleep
[ 8931.609094] ----------------------------------------------------------------------------------------------------------
[ 8931.634702]     sirq-timer/0     5         6.652438   1003780    59
              0               0               0.000000
0.000000               0.000000
[ 8931.649795]     canary_check    33        78.141669    860433     0
              0               0               0.000000
0.000000               0.000000
[ 8931.664887]           IRQ-57   243       832.211283  13874273    59
              0               0               0.000000
0.000000               0.000000
[ 8931.764591]


Kind regards,

Remy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/