Date:   Fri, 24 Jun 2022 22:23:23 -0400
From:   Waiman Long <longman@...hat.com>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mike Stowell <mstowell@...hat.com>
Subject: Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT
 waiters

On 6/24/22 02:42, Sebastian Andrzej Siewior wrote:
> On 2022-06-23 10:41:17 [-0400], Waiman Long wrote:
>> On 6/23/22 09:32, Sebastian Andrzej Siewior wrote:
>>> Do you have more insight on how this was tested/ created? Based on that,
>>> systemd and a random kworker waited on a lock for more than 10 minutes.
>> The hang happens when our QE team runs their kernel tier 1 test which, I
>> think, lasts several hours. The hang happens in some runs but not all of
>> them, so it is somewhat sporadic. Mike should be able to provide a better
>> idea about the frequency and so on.
> So we talk here about 64+ CPU or more than that?

The bug report that I saw showed two hangs on two different one-socket,
4-core/8-thread systems, so they are not 64+ CPU machines. Of course, the
more CPUs we have, the greater the chance we will run into trouble.


>
>>> I added a trace-printk each time a non-RT waiter got the lock stolen,
>>> kicked a kernel build and a package upgrade and took a look at the stats
>>> an hour later:
>>> - sh got its lock stolen 3416 times. I didn't log the pid so I can't
>>>     look back and check how long it waited since the first time.
>>> - the median number of stolen locks is 173.
>> Maybe we should also allow more lock stealing per waiter than the 10 that
>> I used in the patch. I am open to suggestions as to what a good value
>> would be.
> I have no idea either. I just looked at a run to see what the numbers
> actually are. I have no numbers in terms of performance. So what most
> likely happens is that on an unlock operation the waiter gets a wake-up,
> but before it gets a chance to acquire the lock, the lock is already taken
> and the waiter goes back to sleep again. While this looks painful, it might
> be better performance-wise because the other task was able to acquire the
> lock without waiting. But then it is not fair, and that is what happens
> here.
> One thing that I'm curious about is what lock it is (one or two global
> hot spots, or many), and how to benchmark this…
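
To make the mechanism concrete, here is a minimal userspace sketch of what
"limit the number of lock steals for non-RT waiters" means. This is not the
rtmutex code from the patch, just a mutex/condvar model; STEAL_LIMIT,
head_steals, trylock_steal() and the other names are made up for
illustration only:

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical illustration only -- not the rtmutex implementation.
 * STEAL_LIMIT plays the role of the per-waiter steal cap (10 in the patch).
 */
#define STEAL_LIMIT	10

struct stealable_lock {
	pthread_mutex_t m;		/* protects the fields below */
	pthread_cond_t	cv;
	bool		held;
	int		nwaiters;	/* tasks sleeping in the slow path */
	int		head_steals;	/* times sleeping waiters have been
					 * passed over since the last
					 * slow-path acquisition */
};

/* Fast path: take the lock ahead of sleeping waiters ("steal" it), but only
 * while the sleeping waiters have not yet been passed over STEAL_LIMIT
 * times. */
static bool trylock_steal(struct stealable_lock *l)
{
	bool got = false;

	pthread_mutex_lock(&l->m);
	if (!l->held && (l->nwaiters == 0 || l->head_steals < STEAL_LIMIT)) {
		l->held = true;
		if (l->nwaiters)
			l->head_steals++;	/* stolen from a waiter */
		got = true;
	}
	pthread_mutex_unlock(&l->m);
	return got;
}

/* Slow path: sleep until the lock is free. A wake-up does not guarantee
 * acquisition because a stealer may have grabbed the lock first, which is
 * exactly the pass-over behavior described above. */
static void lock_slow(struct stealable_lock *l)
{
	pthread_mutex_lock(&l->m);
	l->nwaiters++;
	while (l->held)
		pthread_cond_wait(&l->cv, &l->m);
	l->held = true;
	l->nwaiters--;
	l->head_steals = 0;	/* the next waiter gets a fresh budget */
	pthread_mutex_unlock(&l->m);
}

static void lock(struct stealable_lock *l)
{
	if (!trylock_steal(l))
		lock_slow(l);
}

static void unlock(struct stealable_lock *l)
{
	pthread_mutex_lock(&l->m);
	l->held = false;
	pthread_cond_broadcast(&l->cv);
	pthread_mutex_unlock(&l->m);
}

The point is simply that the fast path stops succeeding once sleeping
waiters have been passed over STEAL_LIMIT times, so a waiter cannot be woken
and beaten to the lock indefinitely.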

The hang can happen in different places. For example:

[ 1873.205398] INFO: task pool-restraintd:122206 blocked for more than 622 seconds.
[ 1873.205400]       Tainted: G        W  OE  X --------- --- 5.14.0-104.rt21.104.el9.x86_64 #1
[ 1873.205401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.205402] task:pool-restraintd state:D stack:    0 pid:122206 ppid:     1 flags:0x00000000
[ 1873.205405] Call Trace:
[ 1873.205407]  __schedule+0x21c/0x5b0
[ 1873.205410]  ? migrate_enable+0xcd/0x140
[ 1873.205413]  schedule+0x46/0xc0
[ 1873.205415]  rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1873.205418]  __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1873.205421]  __rwbase_read_lock+0x9c/0xe0
[ 1873.205425]  do_madvise+0x109/0x9c0
[ 1873.205429]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205432]  ? do_syscall_64+0x69/0x80
[ 1873.205434]  __x64_sys_madvise+0x28/0x30
[ 1873.205436]  do_syscall_64+0x5c/0x80
[ 1873.205438]  ? fpregs_restore_userregs+0x53/0x100
[ 1873.205442]  ? exit_to_user_mode_prepare+0xda/0xe0
[ 1873.205445]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205447]  ? do_syscall_64+0x69/0x80
[ 1873.205449]  ? do_syscall_64+0x69/0x80
[ 1873.205452]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1873.205455] RIP: 0033:0x7f35f5a93f5b
[ 1873.205457] RSP: 002b:00007f35f4a48e18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 1873.205460] RAX: ffffffffffffffda RBX: 00007f35f4a49640 RCX: 00007f35f5a93f5b
[ 1873.205461] RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007f35f4249000
[ 1873.205463] RBP: 00007f35f4249000 R08: 00007f35ec0026c0 R09: 0000000000000005
[ 1873.205464] R10: 0000000000000008 R11: 0000000000000206 R12: 00007f35f4a49640
[ 1873.205466] R13: 0000000000000000 R14: 00007f35f5af4560 R15: 0000000000000000

or

[ 1872.559562] INFO: task auditd:926 blocked for more than 622 seconds.
[ 1872.559563]       Tainted: G        W        --------- --- 5.14.0-103.rt21.103.el9.x86_64 #1
[ 1872.559563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1872.559564] task:auditd          state:D stack:    0 pid:  926 ppid:     1 flags:0x00000000
[ 1872.559565] Call Trace:
[ 1872.559566]  __schedule+0x21c/0x5b0
[ 1872.559568]  schedule+0x46/0xc0
[ 1872.559569]  rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1872.559570]  __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1872.559572]  __rwbase_read_lock+0x9c/0xe0
[ 1872.559574]  xlog_cil_commit+0x30/0x190 [xfs]
[ 1872.559631]  __xfs_trans_commit+0xb0/0x320 [xfs]
[ 1872.559686]  xfs_vn_update_time+0x12c/0x240 [xfs]
[ 1872.559741]  file_update_time+0xea/0x140
[ 1872.559744]  ? rwbase_write_lock+0xde/0x1b0
[ 1872.559746]  xfs_file_write_checks+0x247/0x310 [xfs]
[ 1872.559799]  xfs_file_buffered_write+0x76/0x2c0 [xfs]
[ 1872.559852]  ? update_load_avg+0x80/0x7d0
[ 1872.559855]  new_sync_write+0x11f/0x1b0
[ 1872.559858]  vfs_write+0x25a/0x2f0
[ 1872.559860]  ksys_write+0x65/0xe0
[ 1872.559862]  do_syscall_64+0x5c/0x80
[ 1872.559864]  ? fpregs_restore_userregs+0x12/0x100
[ 1872.559866]  ? exit_to_user_mode_prepare+0xda/0xe0
[ 1872.559868]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1872.559870] RIP: 0033:0x7f4bf7545d9f
[ 1872.559872] RSP: 002b:00007ffce95c2920 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 1872.559874] RAX: ffffffffffffffda RBX: 0000000000000111 RCX: 00007f4bf7545d9f
[ 1872.559874] RDX: 0000000000000111 RSI: 0000559015de18f0 RDI: 0000000000000005
[ 1872.559875] RBP: 0000559015de18f0 R08: 0000000000000000 R09: 00007f4bf75b74e0
[ 1872.559876] R10: 00007f4bf75b73e0 R11: 0000000000000293 R12: 0000000000000111
[ 1872.559877] R13: 0000559015dd9310 R14: 0000000000000111 R15: 00007f4bf75fb9e0

Since it is causing 25-50% of our test jobs to fail, we really want to 
get it fixed ASAP.

Regards,
Longman
