linux-kernel - Availability issues during kernel load and kernel error handling

Message-ID: <21264295.1104761357661734261.JavaMail.defaultUser@defaultHost>
Date:	Tue, 8 Jan 2013 17:15:34 +0100 (CET)
From:	"larsson.leif@...ia.com" <larsson.leif@...ia.com>
To:	linux-kernel@...r.kernel.org
Subject: Availability issues during kernel load and kernel error handling

Hi,

I am sending this mail as information and as a suggestion for improving
the Linux kernel. I identified and solved the problem below some years
ago. I think it would be great if the changes could become part of
generic distributions in the future, as they affect availability when
Linux is used in embedded systems:

1. PROBLEM: The low-level kernel functions exit(), panic(), poweroff()
and halt() are implemented with endless loops.
2. DESCRIPTION: When a problem occurs during kernel initialisation or
during operation and one of these functions is called, the only way to
resume operation is to power the unit off and on again. This works when
the unit is installed where personnel are present, but at remote
locations it is a problem.
3. KEYWORDS: exit(), halt(), panic(), poweroff(), error()
4. KERNEL VERSION: 2.4.20 to at least 2.6.9 
5. OUTPUT: -
6. EXAMPLE: Give the command "halt" or "shutdown -h" at the prompt, or
introduce a problem during the boot process, such as a checksum error
in the packed kernel load image.
7. ENVIRONMENT: All
8. WORKAROUND: In our system, the board watchdog is activated and
reboots the unit, because the CPU reset pin wasn't routed to a circuit
that could be written from the CPU.
My changes in arch/ppc/boot/common/misc-common.c:

void reset_eb860(void)
{
#ifdef CONFIG_EB860
    puts("\n Rebooting in 5 seconds \n");
    *(unsigned char *) EB860_REG_WDT_CTL    |=  0x09;
    *(unsigned char *) EB860_REG_WDT_CTL    &= ~0x60;
    *(unsigned char *) EB860_REG_CTRL_STAT1 |=  EB860_TR_WDG | EB860_EN_WDG;
#else
    puts("EB860 HW watchdog reboot not available! \n");
#endif
}

void exit(void)
{
    puts("exit \n");
    reset_eb860();
    while(1);
}

9. SUGGESTION: Make it possible to configure alternative error
handling instead of waiting in an endless loop for a power-off. This
could be to make a core dump and then wait a configurable time before
calling restart(), or simply to call restart() directly.
10. CODING EXAMPLES: from the linux-2.6.9 distribution.
arch/ppc/boot/common/misc-common.c

void exit(void)
{
    puts("exit\n");
    while(1);
}

void error(char *x)
{
    puts("\n\n");
    puts(x);
    puts("\n\n -- System halted");

    while(1);    /* Halt */
}

from linux-2.6.9/kernel/exit.c

asmlinkage NORET_TYPE void do_exit(long code)
{
    struct task_struct *tsk = current;

    profile_task_exit(tsk);

    if (unlikely(in_interrupt()))
        panic("Aiee, killing interrupt handler!");
    if (unlikely(!tsk->pid))
        panic("Attempted to kill the idle task!");
    if (unlikely(tsk->pid == 1))
        panic("Attempted to kill init!");
    if (tsk->io_context)
        exit_io_context();
    tsk->flags |= PF_EXITING;
    del_timer_sync(&tsk->real_timer);

    if (unlikely(in_atomic()))
        printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
                current->comm, current->pid,
                preempt_count());

    if (unlikely(current->ptrace & PT_TRACE_EXIT)) {
        current->ptrace_message = code;
        ptrace_notify((PTRACE_EVENT_EXIT << 8) | SIGTRAP);
    }

    acct_process(code);
    __exit_mm(tsk);

    exit_sem(tsk);
    __exit_files(tsk);
    __exit_fs(tsk);
    exit_namespace(tsk);
    exit_thread();

    if (tsk->signal->leader)
        disassociate_ctty(1);

    module_put(tsk->thread_info->exec_domain->module);
    if (tsk->binfmt)
        module_put(tsk->binfmt->module);

    tsk->exit_code = code;
    exit_notify(tsk);
#ifdef CONFIG_NUMA
    mpol_free(tsk->mempolicy);
    tsk->mempolicy = NULL;
#endif
    schedule();
    BUG();
    /* Avoid "noreturn function does return".  */
    for (;;) ;
}
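
The suggestion in item 9 could be sketched roughly as follows. This is
only an illustration in plain C, not kernel code: the names fail_policy,
fail_plan, fail_hooks, plan_fatal() and run_fatal() are hypothetical and
do not exist in the sources quoted above, and the hooks stand in for
whatever dump, delay and restart routines a given board provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy values; not part of the quoted kernel sources. */
enum fail_action {
    FAIL_HALT,           /* current behaviour: loop until power is cycled */
    FAIL_RESTART,        /* call the restart routine directly */
    FAIL_DELAY_RESTART   /* wait a configurable time, then restart */
};

struct fail_policy {
    enum fail_action action;
    unsigned int delay_s;   /* used only by FAIL_DELAY_RESTART */
    int core_dump;          /* non-zero: dump state before acting */
};

/* What the error path should do, in order: dump, wait, restart. */
struct fail_plan {
    int dump;
    unsigned int wait_s;
    int restart;            /* 0 means fall back to the endless loop */
};

/* Pure decision step: translate the configured policy into a plan. */
struct fail_plan plan_fatal(const struct fail_policy *p)
{
    struct fail_plan plan = { 0, 0, 0 };

    assert(p != NULL);
    plan.dump = (p->core_dump != 0);
    switch (p->action) {
    case FAIL_DELAY_RESTART:
        plan.wait_s = p->delay_s;
        plan.restart = 1;
        break;
    case FAIL_RESTART:
        plan.restart = 1;
        break;
    case FAIL_HALT:
    default:
        break;              /* keep the original halt behaviour */
    }
    return plan;
}

/* Board-specific routines (assumed interface); any hook may be NULL. */
struct fail_hooks {
    void (*dump)(void);
    void (*wait)(unsigned int seconds);
    void (*restart)(void);
};

/*
 * Carry out the plan. Returns 1 if the restart hook was called,
 * 0 if the caller should fall back to halting.
 */
int run_fatal(const struct fail_policy *p, const struct fail_hooks *h)
{
    struct fail_plan plan = plan_fatal(p);

    assert(h != NULL);
    if (plan.dump && h->dump)
        h->dump();
    if (plan.wait_s && h->wait)
        h->wait(plan.wait_s);
    if (plan.restart && h->restart) {
        h->restart();
        return 1;
    }
    return 0;
}
```

With such a split, exit() and error() would call run_fatal() and enter
while(1) only when it returns 0, so a board without a restart hook keeps
the behaviour it has today.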

I have also seen an issue where the kernel code loaded into RAM was
affected by a random memory error, which caused a unit to hang. The
problem is that there is no board-supported supervision that can reboot
a unit between the moment it is powered on and the moment the kernel
has enabled the timer interrupt handler. Once the timer interrupt is
operational, a SW watchdog can be used to check that the kernel and
applications do not get stuck or enter endless loops. To solve this
boot issue, an external timer relay is triggered at power-on. If the
unit boots correctly to run level 0, an output signal is used to
deactivate the timer relay. If it is not deactivated, the relay
triggers a power off/on reset.
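
The behaviour of the timer relay can be modelled in a few lines of C.
This is a simulation only, written to make the timing rule explicit; the
struct boot_relay and the relay_*() functions are hypothetical names,
and the real relay is external hardware, not software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Software model of the external timer relay (illustrative only). */
struct boot_relay {
    unsigned int timeout_s;     /* time allowed for a complete boot */
    unsigned int elapsed_s;     /* time since the last power-on */
    bool armed;                 /* true until the unit deactivates it */
    unsigned int power_cycles;  /* resets forced by the relay so far */
};

/* Power-on triggers the relay and starts the boot timer. */
void relay_power_on(struct boot_relay *r, unsigned int timeout_s)
{
    assert(r != NULL);
    r->timeout_s = timeout_s;
    r->elapsed_s = 0;
    r->armed = true;
    r->power_cycles = 0;
}

/* Output signal from a correctly booted unit: stop the timer. */
void relay_deactivate(struct boot_relay *r)
{
    assert(r != NULL);
    r->armed = false;
}

/*
 * Advance time by `seconds`. Returns true if the relay forces a
 * power off/on reset, in which case the boot timer starts over.
 */
bool relay_tick(struct boot_relay *r, unsigned int seconds)
{
    assert(r != NULL);
    if (!r->armed)
        return false;           /* boot completed, relay is idle */
    r->elapsed_s += seconds;
    if (r->elapsed_s < r->timeout_s)
        return false;           /* still within the boot window */
    r->power_cycles++;
    r->elapsed_s = 0;           /* power cycle restarts the timer */
    return true;
}
```

The point of the model is that the relay needs no cooperation from the
CPU to fire: a unit that hangs before the timer interrupt is enabled
never calls relay_deactivate(), so the reset happens regardless.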

Best regards,
Leif Larsson
