[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57D9A1A9.8050506@hpe.com>
Date: Wed, 14 Sep 2016 15:14:49 -0400
From: Waiman Long <waiman.long@....com>
To: Theodore Ts'o <tytso@....edu>
CC: Waiman Long <Waiman.Long@....com>, Arnd Bergmann <arnd@...db.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
<linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Scott J Norton <scott.norton@....com>,
Douglas Hatch <doug.hatch@....com>
Subject: Re: [PATCH] random: Fix kernel panic due to system_wq use before
init
On 09/14/2016 03:03 PM, Waiman Long wrote:
> While booting a 4.8-rc6 kernel on a 16-socket 768-thread Broadwell-EX
> system, the kernel panic'ed with the following log:
>
> [ 51.837010] BUG: unable to handle kernel NULL pointer dereference at 0000000000000102
> [ 51.845635] IP: [<ffffffff810a49d2>] __queue_work+0x32/0x420
> [ 52.004366] Call Trace:
> [ 52.007053]<IRQ>
> [ 52.009171] [<ffffffff810a4de7>] queue_work_on+0x27/0x40
> [ 52.015306] [<ffffffff8146ebd7>] credit_entropy_bits+0x1d7/0x2a0
> [ 52.022002] [<ffffffff8146f339>] ? add_interrupt_randomness+0x1b9/0x210
> [ 52.029366] [<ffffffff8146f339>] add_interrupt_randomness+0x1b9/0x210
> [ 52.036544] [<ffffffff810ea670>] handle_irq_event_percpu+0x40/0x80
> [ 52.043430] [<ffffffff810ea6eb>] handle_irq_event+0x3b/0x60
> [ 52.049655] [<ffffffff810ede48>] handle_level_irq+0x88/0x100
> [ 52.055968] [<ffffffff8103004b>] handle_irq+0xab/0x130
> [ 52.061708] [<ffffffff81092861>] ? _local_bh_enable+0x21/0x50
> [ 52.068125] [<ffffffff81037bf5>] ? __exit_idle+0x5/0x30
> [ 52.073965] [<ffffffff816fc9ed>] do_IRQ+0x4d/0xd0
> [ 52.079229] [<ffffffff816fa88c>] common_interrupt+0x8c/0x8c
> [ 52.085444]<EOI>
> [ 52.087568] [<ffffffff8106bea9>] ? try_to_free_pmd_page+0x9/0x40
> [ 52.094462] [<ffffffff8106bea5>] ? try_to_free_pmd_page+0x5/0x40
> [ 52.101157] [<ffffffff8106bf2a>] ? __unmap_pmd_range.part.5+0x4a/0x70
> [ 52.108330] [<ffffffff8106c330>] unmap_pmd_range+0x130/0x250
> [ 52.114644] [<ffffffff8106cb7b>] __cpa_process_fault+0x47b/0x5a0
> [ 52.121339] [<ffffffff8106d7cb>] __change_page_attr+0x78b/0x9e0
> [ 52.127946] [<ffffffff81065e45>] ? __raw_callee_save___native_queued_spin_unlock+0x15/0x30
> [ 52.137124] [<ffffffff8106da98>] __change_page_attr_set_clr+0x78/0x300
> [ 52.144404] [<ffffffff81229180>] ? __slab_alloc+0x4d/0x5c
> [ 52.150436] [<ffffffff8106f15f>] kernel_map_pages_in_pgd+0x8f/0xd0
> [ 52.157333] [<ffffffff81dc0fb0>] efi_setup_page_tables+0xcc/0x1d9
> [ 52.164124] [<ffffffff81dc0a8d>] efi_enter_virtual_mode+0x35e/0x4af
> [ 52.171117] [<ffffffff81d9e109>] start_kernel+0x41f/0x4c8
> [ 52.177142] [<ffffffff81d9dad8>] ? set_init_arg+0x55/0x55
> [ 52.183168] [<ffffffff81d9d120>] ? early_idt_handler_array+0x120/0x120
> [ 52.190440] [<ffffffff81d9d5d6>] x86_64_start_reservations+0x2a/0x2c
> [ 52.197516] [<ffffffff81d9d724>] x86_64_start_kernel+0x14c/0x16f
> [ 52.204214] Code: 89 e5 41 57 41 56 49 89 f6 41 55 41 89 fd 41 54 49 89 d4 53 48 83 ec 10 89 7d d4 ff 14 25 a0 a7 c2 81 f6 c4 02 0f 85 0d 03 00 00<41> f6 86 02 01 00 00 01 0f 85 ae 02 00 00 49 c7 c7 18 41 01 00
> [ 52.225516] RIP [<ffffffff810a49d2>] __queue_work+0x32/0x420
> [ 52.231838] RSP<ffff88bd7f403de8>
> [ 52.235667] CR2: 0000000000000102
> [ 52.239667] ---[ end trace 2ee7ea9d2908eb72 ]---
> [ 52.244743] Kernel panic - not syncing: Fatal exception in interrupt
>
> Looking at the panic'ed instruction indicated that system_wq was
> used by credit_entropy_bits() before it it was initialized in an
> early_initcall.
>
> This patch prevents the schedule_work() call from being made before
> system_wq is initialized.
>
> Signed-off-by: Waiman Long<Waiman.Long@....com>
> ---
> drivers/char/random.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 3efb3bf..3afc519 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -730,8 +730,12 @@ retry:
> r->entropy_total>= 2*random_read_wakeup_bits) {
> struct entropy_store *other =&blocking_pool;
>
> - if (other->entropy_count<=
> - 3 * other->poolinfo->poolfracbits / 4) {
> + /*
> + * We cannot call schedule_work() before system_wq
> + * is initialized.
> + */
> + if (system_wq&& (other->entropy_count<=
> + 3 * other->poolinfo->poolfracbits / 4)) {
> schedule_work(&other->push_work);
> r->entropy_total = 0;
> }
This patch fixed the kernel panic, but the test system still seemed to
hang after the following log messages:
[ 0.276735] random: fast init done
[ 6.230775] random: crng init done
In the stack backtrace above, the kernel hadn't even reached SMP boot
after about 50s. That was extremely slow. I tried the 4.7.3 kernel and
it booted up fine. So I suspect that there may be too many interrupts
going on and it consumes most of the CPU cycles. The prime suspect is
the random driver, I think.
I would like to hear your thought on that.
Cheers,
Longman
Powered by blists - more mailing lists