Date:   Thu, 8 Sep 2022 11:21:59 +0300
From:   Alexey Lyahkov <alexey.lyashkov@...il.com>
To:     "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc:     linux-ext4 <linux-ext4@...r.kernel.org>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger@...ger.ca>,
        Artem Blagodarenko <artem.blagodarenko@...il.com>,
        Andrew Perepechko <anserper@...ru>
Subject: Re: [PATCH] jbd2: wake up journal waiters in FIFO order, not LIFO



> On 8 Sep 2022, at 09:11, Ritesh Harjani (IBM) <ritesh.list@...il.com> wrote:
> 
> On 22/09/08 08:51AM, Alexey Lyahkov wrote:
>> Hi Ritesh,
>> 
>> This was hit on a Lustre OSS node when we had tons of short writes with sync (journal commit) in parallel.
>> Each write was done from its own thread (around 1k-2k threads in parallel).
>> This caused a situation where only a few threads were woken up and entered the transaction before it became T_LOCKED.
>> In our observation, all the handles woken were at the head of the list, and those are the recently added handles, while old handles were still in the list and
> 
> Thanks Alexey for providing the details.
> 
>> It caused soft lockup messages on the console.
> 
> Did you mean hung task timeout? I was wondering why there would be a soft lockup
> warning, since these old handles are in a waiting state anyway, right?
> Am I missing something?
> 
Oh. I asked a colleague about the details. It was the internal Lustre hang detector, not a kernel-side one:

[ 2221.036503] Lustre: ll_ost_io04_080: service thread pid 55122 was inactive for 80.284 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2221.036677] Pid: 55212, comm: ll_ost_io05_074 4.18.0-305.10.2.x6.1.010.19.x86_64 #1 SMP Thu Jun 30 13:42:51 MDT 2022
[ 2221.056673] Lustre: Skipped 2 previous similar messages
[ 2221.067821] Call Trace TBD:
[ 2221.067855] [<0>] wait_transaction_locked+0x89/0xc0 [jbd2]
[ 2221.099175] [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[ 2221.105266] [<0>] start_this_handle+0x10a/0x520 [jbd2]
[ 2221.110904] [<0>] jbd2__journal_start+0xea/0x1f0 [jbd2]
[ 2221.116679] [<0>] __ldiskfs_journal_start_sb+0x6e/0x130 [ldiskfs]
[ 2221.123316] [<0>] osd_trans_start+0x13b/0x4f0 [osd_ldiskfs]
[ 2221.129417] [<0>] ofd_commitrw_write+0x620/0x1830 [ofd]
[ 2221.135147] [<0>] ofd_commitrw+0x731/0xd80 [ofd]
[ 2221.140420] [<0>] obd_commitrw+0x1ac/0x370 [ptlrpc]
[ 2221.145858] [<0>] tgt_brw_write+0x1913/0x1d50 [ptlrpc]
[ 2221.151561] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[ 2221.157622] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[ 2221.164454] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[ 2221.169860] [<0>] kthread+0x116/0x130
[ 2221.174033] [<0>] ret_from_fork+0x1f/0x40


Other logs show that this thread could not get a handle, while other threads were able to do so many times.
The kernel detector did not trigger because the thread was woken up many times, but each time it saw T_LOCKED and went back to sleep.
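
The starvation pattern can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not jbd2 code: the function `simulate` and its parameters (`slots`, `arrivals`, `initial`) are hypothetical. It assumes that in each round a fixed number of new waiters join the list, and that only the first `slots` waiters at the head get a handle before the transaction becomes T_LOCKED again. It reports the longest time any waiter spent in the list.

```python
from collections import deque


def simulate(order, rounds=50, slots=4, arrivals=4, initial=8):
    """Toy model of a wait list; returns the longest wait in rounds.

    order   -- "lifo": new waiters are added at the head of the list
               "fifo": new waiters are added at the tail of the list
    slots   -- waiters that obtain a handle per round before the
               transaction is locked again (assumed constant)
    """
    queue = deque()      # left end is the head, woken first
    arrived = {}         # waiter id -> round in which it started waiting
    next_id = 0
    max_wait = 0

    def enqueue(round_no):
        nonlocal next_id
        arrived[next_id] = round_no
        if order == "lifo":
            queue.appendleft(next_id)
        else:
            queue.append(next_id)
        next_id += 1

    for _ in range(initial):
        enqueue(0)

    for round_no in range(1, rounds + 1):
        for _ in range(arrivals):
            enqueue(round_no)
        # Waiters at the head win the race for a handle; everyone else
        # sees the locked transaction and stays in the list.
        for _ in range(min(slots, len(queue))):
            waiter = queue.popleft()
            max_wait = max(max_wait, round_no - arrived[waiter])

    # Account for waiters that never obtained a handle.
    for waiter in queue:
        max_wait = max(max_wait, rounds - arrived[waiter])
    return max_wait


if __name__ == "__main__":
    print("lifo:", simulate("lifo"))
    print("fifo:", simulate("fifo"))
```

With these default parameters, the LIFO variant leaves the initial waiters in the list for all 50 rounds, while the FIFO variant bounds the wait at 2 rounds. Real wakeup and scheduling behaviour is far less regular than this model.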

Alex



> -ritesh
