Message-ID: <f2cbe809-4337-bf9b-d003-7e47ff893248@redhat.com>
Date: Fri, 24 Jun 2022 22:23:23 -0400
From: Waiman Long <longman@...hat.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
Juri Lelli <juri.lelli@...hat.com>,
Mike Stowell <mstowell@...hat.com>
Subject: Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT
waiters
On 6/24/22 02:42, Sebastian Andrzej Siewior wrote:
> On 2022-06-23 10:41:17 [-0400], Waiman Long wrote:
>> On 6/23/22 09:32, Sebastian Andrzej Siewior wrote:
>>> Do you have more insight on how this was tested/ created? Based on that,
>>> systemd and a random kworker waited on a lock for more than 10 minutes.
>> The hang happens when our QE team runs their kernel tier 1 test which, I
>> think, lasts several hours. The hang happens in some runs but not all of
>> them. So it is kind of opportunistic. Mike should be able to provide a
>> better idea about frequency and so on.
> So we talk here about 64+ CPU or more than that?
The bug report that I saw showed 2 hangs in two different one-socket
4-core 8-thread systems, so they are not 64+ CPU machines. Of course, the
more CPUs we have, the greater the chance of running into trouble.
>
>>> I added a trace-printk each time a non-RT waiter got the lock stolen,
>>> kicked a kernel build and a package upgrade and took a look at the stats
>>> an hour later:
>>> - sh got its lock stolen 3416 times. I didn't lock the pid so I can't
>>> look back and check how long it waited since the first time.
>>> - the median average of stolen locks is 173.
>> Maybe we should also allow more lock stealing per waiter than the 10 that I
>> used in the patch. I am open to suggestions as to what a good value would be.
> I have no idea either. I just looked at a run to see what the numbers
> actually are; I have no numbers in terms of performance. So what most
> likely happens is that on an unlock operation the waiter gets a wake-up,
> but before it gets a chance to acquire the lock, the lock is already
> taken and it goes back to sleep again. While this looks painful, it might
> be better performance-wise because the other task was able to acquire the
> lock without waiting. But then it is not fair, and this is what happens.
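To recap the scheme in the patch: each waiter keeps a count of how many times
it has had the lock stolen after being woken, and once that count reaches a
limit (10 in the posted version) further stealing by non-RT tasks is refused,
so the waiter cannot be starved forever. A rough, self-contained sketch of the
idea, with illustrative names rather than the actual rtmutex code:

#include <stdbool.h>

/* Threshold from the patch under discussion; the right value is still open. */
#define MAX_LOCK_STEALS	10

struct waiter_sketch {
        int prio;       /* lower value == higher priority */
        int nr_steals;  /* times this waiter lost the lock after a wake-up */
};

/*
 * A would-be stealer asks whether it may take the lock ahead of the
 * current top waiter.
 */
static bool may_steal_lock(struct waiter_sketch *stealer,
                           struct waiter_sketch *top)
{
        /* A strictly higher-priority task may always steal. */
        if (stealer->prio < top->prio)
                return true;

        /*
         * Same priority (e.g. two SCHED_OTHER tasks): allow the steal for
         * throughput, but only a bounded number of times so the sleeping
         * waiter eventually gets the lock instead of being starved.
         */
        if (stealer->prio == top->prio && top->nr_steals < MAX_LOCK_STEALS) {
                top->nr_steals++;
                return true;
        }

        return false;
}

The exact policy in the real code may differ; the point is only that the
steal count is per waiter and bounded, which turns the unfairness described
above from unbounded starvation into a bounded delay.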
> One thing that I'm curious about is which lock it is (one or two global
> hot spots, or many), and how to benchmark this…
The hang can happen in different places, for example:
[ 1873.205398] INFO: task pool-restraintd:122206 blocked for more than
622 seconds.
[ 1873.205400] Tainted: G W OE X --------- ---
5.14.0-104.rt21.104.el9.x86_64 #1
[ 1873.205401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1873.205402] task:pool-restraintd state:D stack: 0 pid:122206
ppid: 1 flags:0x00000000
[ 1873.205405] Call Trace:
[ 1873.205407] __schedule+0x21c/0x5b0
[ 1873.205410] ? migrate_enable+0xcd/0x140
[ 1873.205413] schedule+0x46/0xc0
[ 1873.205415] rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1873.205418] __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1873.205421] __rwbase_read_lock+0x9c/0xe0
[ 1873.205425] do_madvise+0x109/0x9c0
[ 1873.205429] ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205432] ? do_syscall_64+0x69/0x80
[ 1873.205434] __x64_sys_madvise+0x28/0x30
[ 1873.205436] do_syscall_64+0x5c/0x80
[ 1873.205438] ? fpregs_restore_userregs+0x53/0x100
[ 1873.205442] ? exit_to_user_mode_prepare+0xda/0xe0
[ 1873.205445] ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205447] ? do_syscall_64+0x69/0x80
[ 1873.205449] ? do_syscall_64+0x69/0x80
[ 1873.205452] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1873.205455] RIP: 0033:0x7f35f5a93f5b
[ 1873.205457] RSP: 002b:00007f35f4a48e18 EFLAGS: 00000206 ORIG_RAX:
000000000000001c
[ 1873.205460] RAX: ffffffffffffffda RBX: 00007f35f4a49640 RCX:
00007f35f5a93f5b
[ 1873.205461] RDX: 0000000000000004 RSI: 00000000007fb000 RDI:
00007f35f4249000
[ 1873.205463] RBP: 00007f35f4249000 R08: 00007f35ec0026c0 R09:
0000000000000005
[ 1873.205464] R10: 0000000000000008 R11: 0000000000000206 R12:
00007f35f4a49640
[ 1873.205466] R13: 0000000000000000 R14: 00007f35f5af4560 R15:
0000000000000000
or
[ 1872.559562] INFO: task auditd:926 blocked for more than 622 seconds.
[ 1872.559563] Tainted: G W --------- ---
5.14.0-103.rt21.103.el9.x86_64 #1
[ 1872.559563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1872.559564] task:auditd state:D stack: 0 pid: 926
ppid: 1 flags:0x00000000
[ 1872.559565] Call Trace:
[ 1872.559566] __schedule+0x21c/0x5b0
[ 1872.559568] schedule+0x46/0xc0
[ 1872.559569] rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1872.559570] __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1872.559572] __rwbase_read_lock+0x9c/0xe0
[ 1872.559574] xlog_cil_commit+0x30/0x190 [xfs]
[ 1872.559631] __xfs_trans_commit+0xb0/0x320 [xfs]
[ 1872.559686] xfs_vn_update_time+0x12c/0x240 [xfs]
[ 1872.559741] file_update_time+0xea/0x140
[ 1872.559744] ? rwbase_write_lock+0xde/0x1b0
[ 1872.559746] xfs_file_write_checks+0x247/0x310 [xfs]
[ 1872.559799] xfs_file_buffered_write+0x76/0x2c0 [xfs]
[ 1872.559852] ? update_load_avg+0x80/0x7d0
[ 1872.559855] new_sync_write+0x11f/0x1b0
[ 1872.559858] vfs_write+0x25a/0x2f0
[ 1872.559860] ksys_write+0x65/0xe0
[ 1872.559862] do_syscall_64+0x5c/0x80
[ 1872.559864] ? fpregs_restore_userregs+0x12/0x100
[ 1872.559866] ? exit_to_user_mode_prepare+0xda/0xe0
[ 1872.559868] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1872.559870] RIP: 0033:0x7f4bf7545d9f
[ 1872.559872] RSP: 002b:00007ffce95c2920 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[ 1872.559874] RAX: ffffffffffffffda RBX: 0000000000000111 RCX:
00007f4bf7545d9f
[ 1872.559874] RDX: 0000000000000111 RSI: 0000559015de18f0 RDI:
0000000000000005
[ 1872.559875] RBP: 0000559015de18f0 R08: 0000000000000000 R09:
00007f4bf75b74e0
[ 1872.559876] R10: 00007f4bf75b73e0 R11: 0000000000000293 R12:
0000000000000111
[ 1872.559877] R13: 0000559015dd9310 R14: 0000000000000111 R15:
00007f4bf75fb9e0
Since it is causing 25-50% of our test jobs to fail, we really want to
get it fixed ASAP.
Regards,
Longman