lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sat, 30 Jun 2018 22:12:33 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:     Chris Boot <bootc@....tc>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
        linux-block@...r.kernel.org
Subject: Re: Hard lockup in blk_mq_free_request() / wbt_done() / wake_up_all()

On 2018/06/20 21:45, Chris Boot wrote:
> Hi Jens,
> 
> I got an opportunity yesterday to do some testing. I can't get this
> system to crash with blk-mq disabled, or with blk-mq enabled but wbt
> disabled. I have a reproducer workload I can launch against the system
> and it seems to crash reliably with this, but I doubt I can share it
> with you.
> 
> I do, however, have a task state dump (SysRq+T) that I managed to get
> out of the server once it started locking up. It's pretty large, so I
> uploaded it to my Dropbox for now:
> 
> https://www.dropbox.com/s/fyo1ab6mmcqk8fq/crash-1.log.gz?dl=0
> 
> Hope this helps!

I'm not familiar with block layer, but checking Workqueue entries in SysRq-t.
blk_mq_timeout_work was stuck at RCU synchronization and wb_workfn was
stuck at wbt_wait() ?

----------------------------------------
[ 5376.011390] kworker/73:1H   D    0  1803      2 0x80000000
[ 5376.017418] Workqueue: kblockd blk_mq_timeout_work
[ 5376.022770] Call Trace:
[ 5376.025805]  ? __schedule+0x3dc/0x870
[ 5376.030009]  schedule+0x32/0x80
[ 5376.033737]  schedule_timeout+0x1e5/0x350
[ 5376.038298]  ? scsi_dispatch_cmd+0xe5/0x220 [scsi_mod]
[ 5376.043976]  ? scsi_queue_rq+0x4d2/0x600 [scsi_mod]
[ 5376.049425]  wait_for_completion+0x123/0x190
[ 5376.054270]  ? wake_up_q+0x70/0x70
[ 5376.058262]  __wait_rcu_gp+0x10f/0x140
[ 5376.062736]  synchronize_sched+0x59/0x80
[ 5376.067208]  ? call_rcu_bh+0x20/0x20
[ 5376.071350]  ? trace_raw_output_rcu_utilization+0x40/0x40
[ 5376.077312]  blk_mq_timeout_work+0x138/0x200
[ 5376.082152]  process_one_work+0x177/0x350
[ 5376.086692]  worker_thread+0x4d/0x3c0
[ 5376.090903]  kthread+0xf8/0x130
[ 5376.094588]  ? rescuer_thread+0x350/0x350
[ 5376.099128]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 5376.104745]  ? SyS_exit_group+0x10/0x10
[ 5376.109189]  ret_from_fork+0x35/0x40
----------------------------------------

----------------------------------------
[ 5391.381393] kworker/u225:0  D    0  9248      2 0x80000000
[ 5391.387404] Workqueue: writeback wb_workfn (flush-254:6)
[ 5391.393337] Call Trace:
[ 5391.396347]  ? __schedule+0x3dc/0x870
[ 5391.400533]  schedule+0x32/0x80
[ 5391.404253]  io_schedule+0x12/0x40
[ 5391.408188]  wbt_wait+0x277/0x350
[ 5391.412045]  ? remove_wait_queue+0x60/0x60
[ 5391.416706]  blk_mq_make_request+0x107/0x5f0
[ 5391.421510]  ? __split_and_process_bio+0x131/0x190 [dm_mod]
[ 5391.427687]  generic_make_request+0x123/0x2f0
[ 5391.432627]  ? submit_bio+0x6c/0x140
[ 5391.436734]  ? _cond_resched+0x16/0x40
[ 5391.441051]  submit_bio+0x6c/0x140
[ 5391.444987]  ? kmem_cache_alloc+0xda/0x1b0
[ 5391.449645]  ext4_io_submit+0x48/0x60 [ext4]
[ 5391.454489]  ext4_writepages+0x49c/0xe60 [ext4]
[ 5391.459594]  ? scsi_queue_rq+0x4d2/0x600 [scsi_mod]
[ 5391.465015]  ? update_load_avg+0x64f/0x6e0
[ 5391.469651]  ? do_writepages+0x1a/0x60
[ 5391.473932]  do_writepages+0x1a/0x60
[ 5391.478059]  __writeback_single_inode+0x3d/0x320
[ 5391.483226]  writeback_sb_inodes+0x221/0x4b0
[ 5391.488060]  __writeback_inodes_wb+0x87/0xb0
[ 5391.492897]  wb_writeback+0x288/0x320
[ 5391.497110]  ? cpumask_next+0x16/0x20
[ 5391.501346]  ? wb_workfn+0x37c/0x450
[ 5391.505461]  wb_workfn+0x37c/0x450
[ 5391.509461]  process_one_work+0x177/0x350
[ 5391.514002]  worker_thread+0x4d/0x3c0
[ 5391.518197]  kthread+0xf8/0x130
[ 5391.521870]  ? rescuer_thread+0x350/0x350
[ 5391.526403]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 5391.532029]  ? SyS_exit_group+0x10/0x10
[ 5391.536405]  ret_from_fork+0x35/0x40
----------------------------------------

Impossible to tell whether these threads made progress over time.
Please try https://akari.osdn.jp/capturing-kernel-messages.html#Tips5 .

But synchronize_rcu() in blk_mq_timeout_work() seems to be removed by
commit 12f5b93145450c75 ("blk-mq: Remove generation seqeunce") which went to
4.18-rc1. Thus, trying to reproduce with latest linux.git would be helpful.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ