Message-ID: <877dnxm5ju.fsf@jogness.linutronix.de>
Date: Thu, 28 Jan 2021 16:48:29 +0106
From: John Ogness <john.ogness@...utronix.de>
To: Petr Mladek <pmladek@...e.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
kernel test robot <oliver.sang@...el.com>,
LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
lkp@...el.com, zhengjun.xing@...ux.intel.com
Subject: Re: [printk] b031a684bf: INFO:rcu_tasks_detected_stalls_on_tasks
On 2021-01-28, Petr Mladek <pmladek@...e.com> wrote:
> If you are able to reproduce this, it might be interesting to see
> backtraces from all CPUs when RCU stall gets detected. Or even
> printing all processes, even the sleeping ones.
OK, I now have lkp running on my Intel Core i5 machine and it reliably
reproduces this. First let me say a word about lkp. The instructions in
the report email are a bit misleading; there are lots of steps that they
don't mention.
First I needed to install lkp and its dependencies. I did all of this as
root because lkp seems to get confused when sudo is involved.
# git clone https://github.com/intel/lkp-tests.git
# cd lkp-tests
# make install
# lkp install
Then I built the kernel and modules using a helper script to set up the
environment for me:
----- BEGIN /tmp/mk.sh -----
#!/bin/sh
# Install the modules under the current source tree instead of /.
export INSTALL_MOD_PATH=`pwd`
# Build the host tools and the kernel with gcc-9 for x86_64.
export HOSTCC=gcc-9
export CC=gcc-9
export ARCH=x86_64
# Pass all arguments straight through to make.
exec make "$@"
----- END /tmp/mk.sh -----
# cd linux
# git checkout -b lkp b031a684bfd01d633c79d281bd0cf11c2f834ada
# cp /tmp/config-5.10.0-rc5-gb031a684bfd0 .config
# /tmp/mk.sh -j `nproc` olddefconfig prepare modules_prepare bzImage modules modules_install
Then I created the modules.cgz:
# find lib/modules | cpio -H newc -o | gzip -9c > modules.cgz
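(Not part of the original steps, just a quick sanity check: the archive
contents can be listed to confirm the lib/modules tree is in there.)
# gzip -dc modules.cgz | cpio -it | head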
I put the kernel and modules in /tmp/:
# cp arch/x86/boot/bzImage modules.cgz /tmp/
And ran lkp inside script(1) so I have a logfile:
# script
# lkp qemu -k /tmp/bzImage -m /tmp/modules.cgz /tmp/job-script
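(For reference: script(1) writes its capture to ./typescript by default,
so the console output can be searched afterwards with something like:)
# grep torture_onoff typescript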
And I reliably see the first stall at about 925 seconds:
[ 926.386441][ T22] INFO: rcu_tasks detected stalls on tasks:
[ 926.387310][ T22] 0000000041eb5240: .. nvcsw: 0/0 holdout: 1 idle_cpu: -1/1
[ 926.388196][ T22] task:dmesg state:R running task stack: 0 pid: 1752 ppid: 511 flags:0x20020000
[ 926.389509][ T22] Call Trace:
[ 926.389962][ T22] __schedule+0xa12/0xab5
[ 926.390534][ T22] ? firmware_map_remove+0xd1/0xd1
[ 926.391190][ T22] ? ksys_read+0x116/0x150
[ 926.391773][ T22] schedule+0x16c/0x1df
[ 926.392317][ T22] exit_to_user_mode_loop+0x28/0x84
[ 926.392977][ T22] exit_to_user_mode_prepare+0x1d/0x4a
[ 926.393671][ T22] syscall_exit_to_user_mode+0x41/0x4f
[ 926.394357][ T22] entry_INT80_compat+0x71/0x76
[ 926.394982][ T22] RIP: 0023:0xf7f0fa02
[ 926.395528][ T22] RSP: 002b:00000000ff850364 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 926.396590][ T22] RAX: ffffffffffffffe0 RBX: 0000000000000004 RCX: 00000000565c8234
[ 926.397607][ T22] RDX: 0000000000001fff RSI: 00000000000001e0 RDI: 00000000565c8234
[ 926.398617][ T22] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
[ 926.399629][ T22] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 926.400643][ T22] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
But actually it is not the RCU stall that interests me so much. I am
more interested in some bizarre console output I am seeing. Here is a
snippet:
[ 903.008486][ T356] [ 778.732883] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.008496][ T356]
[ 903.014733][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.030232][ T356] [ 778.735197] [ 655.203106] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.030242][ T356]
[ 903.040809][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.044327][ T356] [ 778.746898] [ 655.214214] [ 531.584847] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.044337][ T356]
[ 903.048732][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.066039][ T356] [ 778.751162] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.066050][ T356]
[ 903.070123][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.077137][ T356] [ 778.768744] [ 655.227204] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.077147][ T356]
[ 903.092792][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.095313][ T356] [ 778.773836] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.095323][ T356]
[ 903.106788][ T356] [ 778.781780] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.106798][ T356]
[ 903.121986][ T356] [ 778.791183] [ 655.230287] [ 531.589973] [ 407.116952] [ 283.068815] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.121997][ T356]
[ 903.134120][ T356] [ 778.797950] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.134130][ T356]
[ 903.148741][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.151626][ T356] [ 778.805164] [ 655.241173] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.151637][ T356]
[ 903.160126][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.166936][ T356] [ 778.818740] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.166946][ T356]
[ 903.175785][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.185955][ T528] tasks-torture:torture_onoff task: online 0 failed: errno -5
[ 903.189448][ T356] [ 778.825864] [ 655.250559] [ 531.607066] [ 407.120936] tasks-torture:torture_onoff task: online 0 failed: errno -5
These multi-timestamp lines are odd. And the extra timestamps nested
within a single line are almost exactly 124 seconds apart (a quick
arithmetic check follows below). What is going on there? The
investigation continues...
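For reference, here are the deltas, using the timestamps copied from
the [ 903.121986] line above (awk is only used as a calculator here,
this is not part of the reproduction steps):
# awk 'BEGIN { split("903.121986 778.791183 655.230287 531.589973 407.116952 283.068815", t); for (i = 2; i in t; i++) printf "%.3f\n", t[i-1] - t[i] }'
That prints deltas of roughly 123.5 to 124.5 seconds.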
John Ogness