linux-kernel - hung task in 4.14 (syzbot bug from 2018 April 17)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAOTh5U3srTK_-OeESDL_-=M8petsorTDfzr6Ycte9f7p+5ZLCQ@mail.gmail.com>
Date:   Wed, 3 Oct 2018 17:29:23 -0600
From:   Sebastian Kuzminsky <seb.kuzminsky@...il.com>
To:     linux-kernel@...r.kernel.org
Subject: hung task in 4.14 (syzbot bug from 2018 April 17)

I think i've run into the bug described here:

https://lkml.org/lkml/2018/4/18/188

I've got a binder-free system that reports a hung task with this backtrace:

[76800.726654] INFO: task systemd:1 blocked for more than 60 seconds.
[76800.726657]       Tainted: G           OE   4.14.67-solidfire1 #1
[76800.726657] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[76800.726659] systemd         D    0     1      0 0x00000000
[76800.726662] Call Trace:
[76800.726673]  ? __schedule+0x27f/0x870
[76800.726676]  schedule+0x28/0x80
[76800.726679]  schedule_timeout+0x1e7/0x340
[76800.726685]  ? check_preempt_wakeup+0x102/0x230
[76800.726687]  ? wait_for_completion+0xb0/0x120
[76800.726689]  wait_for_completion+0xb0/0x120
[76800.726693]  ? wake_up_q+0x70/0x70
[76800.726698]  flush_work+0x10d/0x1c0
[76800.726700]  ? worker_detach_from_pool+0xa0/0xa0
[76800.726706]  fsnotify_destroy_group+0x34/0xa0
[76800.726708]  ? SyS_epoll_ctl+0x1d4/0xe50
[76800.726710]  inotify_release+0x1a/0x50
[76800.726714]  __fput+0xd8/0x220
[76800.726717]  task_work_run+0x8a/0xb0
[76800.726721]  exit_to_usermode_loop+0xb9/0xc0
[76800.726723]  do_syscall_64+0x10b/0x120
[76800.726727]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[76800.726730] RIP: 0033:0x7fb6957ff900
[76800.726731] RSP: 002b:00007ffc685fdd60 EFLAGS: 00000293 ORIG_RAX:
0000000000000003
[76800.726733] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 00007fb6957ff900
[76800.726735] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
[76800.726736] RBP: 00007fb697167088 R08: 000055ae6c9224c0 R09: 000055ae6ace92ad
[76800.726737] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[76800.726738] R13: 0000000000000000 R14: 0000000000079de4 R15: 0000000000000000
[76800.727130] INFO: task kworker/u113:1:29214 blocked for more than 60 seconds.
[76800.727132]       Tainted: G           OE   4.14.67-solidfire1 #1
[76800.727132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[76800.727134] kworker/u113:1  D    0 29214      2 0x80000000
[76800.727139] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[76800.727141] Call Trace:
[76800.727144]  ? __schedule+0x27f/0x870
[76800.727146]  schedule+0x28/0x80
[76800.727148]  schedule_timeout+0x1e7/0x340
[76800.727151]  ? __switch_to_asm+0x40/0x70
[76800.727153]  ? update_curr+0xe1/0x1a0
[76800.727156]  ? wait_for_completion+0xb0/0x120
[76800.727157]  wait_for_completion+0xb0/0x120
[76800.727160]  ? wake_up_q+0x70/0x70
[76800.727164]  __synchronize_srcu.part.13+0x76/0x90
[76800.727167]  ? trace_raw_output_rcu_utilization+0x40/0x40
[76800.727169]  ? try_to_wake_up+0x44/0x460
[76800.727172]  ? fsnotify_mark_destroy_workfn+0x67/0xb0
[76800.727174]  fsnotify_mark_destroy_workfn+0x67/0xb0
[76800.727177]  process_one_work+0x1da/0x3d0
[76800.727179]  worker_thread+0x21f/0x3f0
[76800.727181]  ? process_one_work+0x3d0/0x3d0
[76800.727184]  kthread+0x119/0x130
[76800.727186]  ? kthread_create_on_node+0x40/0x40
[76800.727188]  ret_from_fork+0x35/0x40

The kernel is a stock 4.14.67, plus some minor local patches mostly
related to fibre channel, which i believe is not implicated here.

I have a crash dump of this failure, the reaper_work struct has these contents:

        crash> print reaper_work
        $2 = {
          work = {
            data = {
              counter = -108686497013755
            },
            entry = {
              next = 0xffff9d2671a395f0,
              prev = 0xffffb2624006fdf8
            },
            func = 0xffffffffb5249df0
          },
          timer = {
            entry = {
              next = 0xdead000000000200,
              pprev = 0x0
            },
            expires = 4302608557,
            function = 0xffffffffb50778c0,
            data = 18446744072468618080,
            flags = 195035137
          },
          wq = 0xffff9d2680411400,
          cpu = 128
        }

I'd appreciate help or pointers on how to debug and fix this problem.


-- 
Sebastian Kuzminsky