linux-kernel - Re: more hangs in the tty layer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAKU3ayUOb33moxyXZKxMHRyo4PRLHJBUxrYuyAiZKigtNYNuog@mail.gmail.com>
Date:   Thu, 3 Nov 2016 18:38:25 -0600
From:   Peter Hurley <peter@...leysoftware.com>
To:     Nikolay Borisov <kernel@...p.com>
Cc:     "Linux-Kernel@...r. Kernel. Org" <linux-kernel@...r.kernel.org>
Subject: Re: more hangs in the tty layer

On Tue, Oct 4, 2016 at 2:10 AM, Nikolay Borisov <kernel@...p.com> wrote:
> Hello Peter,
>
> I've emailed you before re. spurious hangs in the TTY layer, but
> at that time I was running a rather old (but LTS) 3.12 kernel. Now,
> I'm running a 4.4.10 and I still observe the following lock-ups.
> I have multiple processes which hang with the following callstack:
>
> [4592064.967445] sulogin         D ffff8839bca53b28     0 38279  38274 0x00000006
> [4592064.967448]  ffff8839bca53b28 ffff881ff2ed44c0 ffff883748d85280 ffff8839bca54000
> [4592064.967452]  ffff883748d85280 ffff881ff07c988c 00000000ffffffff ffff881ff07c9890
> [4592064.967454]  ffff8839bca53b40 ffffffff8161020c ffff881ff07c9888 ffff8839bca53b50
> [4592064.967458] Call Trace:
> [4592064.967460]  [<ffffffff8161020c>] schedule+0x3c/0x90
> [4592064.967463]  [<ffffffff81610505>] schedule_preempt_disabled+0x15/0x20
> [4592064.967466]  [<ffffffff816121e0>] __mutex_lock_slowpath+0xb0/0x130
> [4592064.967468]  [<ffffffff8161228d>] mutex_lock+0x2d/0x40
> [4592064.967471]  [<ffffffff81614388>] tty_lock+0x38/0x90
> [4592064.967474]  [<ffffffff8138e6a5>] tty_release+0x55/0x590
> [4592064.967478]  [<ffffffff8119a8ea>] __fput+0xea/0x1f0
> [4592064.967481]  [<ffffffff8119aa2e>] ____fput+0xe/0x10
> [4592064.967485]  [<ffffffff8106fc6f>] task_work_run+0x7f/0xb0
> [4592064.967488]  [<ffffffff810554d3>] do_exit+0x373/0xb70
> [4592064.967490]  [<ffffffff816120c2>] ? __mutex_unlock_slowpath+0x32/0x40
> [4592064.967493]  [<ffffffff816120eb>] ? mutex_unlock+0x1b/0x20
> [4592064.967495]  [<ffffffff81390e5b>] ? n_tty_read+0x31b/0x9e0
> [4592064.967498]  [<ffffffff81055d53>] do_group_exit+0x43/0xd0
> [4592064.967500]  [<ffffffff81061bd1>] get_signal+0x2e1/0x710
> [4592064.967503]  [<ffffffff810053c8>] do_signal+0x28/0x770
> [4592064.967505]  [<ffffffff81394d36>] ? tty_ldisc_deref+0x16/0x20
> [4592064.967508]  [<ffffffff8138cfbc>] ? tty_read+0x9c/0xf0
> [4592064.967511]  [<ffffffff811987a8>] ? __vfs_read+0x28/0xd0
> [4592064.967514]  [<ffffffff812a7d53>] ? security_file_permission+0xa3/0xc0
> [4592064.967517]  [<ffffffff810021da>] exit_to_usermode_loop+0x8a/0xd0
> [4592064.967520]  [<ffffffff81002a90>] syscall_return_slowpath+0x80/0xf0
> [4592064.967523]  [<ffffffff816145f4>] int_ret_from_sys_call+0x25/0x9f
>
> After debugging the memory dump of the server it turned out the
> following task is holding the mutex on which the sulogin is stuck:
>
> PID: 38274  TASK: ffff88395a109b80  CPU: 13  COMMAND: "sh"
>  #0 [ffff8837aba27ac8] __schedule at ffffffff8160fa9a
>  #1 [ffff8837aba27b10] schedule at ffffffff8161020c
>  #2 [ffff8837aba27b28] schedule_timeout at ffffffff81613472
>  #3 [ffff8837aba27ba8] ldsem_down_write at ffffffff81613cc5
>  #4 [ffff8837aba27c10] tty_ldisc_lock at ffffffff816142e6
>  #5 [ffff8837aba27c28] tty_ldisc_hangup at ffffffff8139561a
>  #6 [ffff8837aba27c50] __tty_hangup at ffffffff8138cd03
>  #7 [ffff8837aba27ca0] disassociate_ctty at ffffffff8138d2b9
>  #8 [ffff8837aba27cd0] disassociate_ctty at ffffffff8138d5d9
>  #9 [ffff8837aba27ce0] do_exit at ffffffff81055908
> #10 [ffff8837aba27d48] do_group_exit at ffffffff81055d53
> #11 [ffff8837aba27d70] get_signal at ffffffff81061bd1
> #12 [ffff8837aba27de8] do_signal at ffffffff810053c8
> #13 [ffff8837aba27ee0] exit_to_usermode_loop at ffffffff810021da
> #14 [ffff8837aba27f18] syscall_return_slowpath at ffffffff81002a90
> #15 [ffff8837aba27f50] int_ret_from_sys_call at ffffffff816145f4
>     RIP: 00007f18fde88aba  RSP: 00007ffc47a1fa78  RFLAGS: 00000246
>     RAX: fffffffffffffe00  RBX: 0000561242611d20  RCX: 00007f18fde88aba
>     RDX: 0000000000000000  RSI: 00007ffc47a1fa9c  RDI: 00000000ffffffff
>     RBP: 000056124260f948   R8: 0000000000000000   R9: 00007f18fe3a5740
>     R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc47a1fb20
>     R13: 00007ffc47a1faa0  R14: 0000000000000000  R15: 0000000000000001
>     ORIG_RAX: 000000000000003d  CS: 0033  SS: 002b
>
> But I also observed also the following hung-up task,
> waiting for read access:
>
> [4592064.968193] index.pl        D ffff881430387ce8     0 47168      1 0x00000004
> [4592064.968196]  ffff881430387ce8 ffff881ff2ed0dc0 ffff8808a0960000 ffff881430388000
> [4592064.968200]  ffff881ff07c9828 ffff881ff07c9830 ffff8808a0960000 7fffffffffffffff
> [4592064.968203]  ffff881430387d00 ffffffff8161020c ffffffff00000004 ffff881430387d80
> [4592064.968205] Call Trace:
> [4592064.968208]  [<ffffffff8161020c>] schedule+0x3c/0x90
> [4592064.968211]  [<ffffffff81613472>] schedule_timeout+0x202/0x260
> [4592064.968216]  [<ffffffff812a6a48>] ? security_capable+0x48/0x60
> [4592064.968219]  [<ffffffff81613ab2>] ldsem_down_read+0x122/0x240
> [4592064.968222]  [<ffffffff813992cc>] ? vt_ioctl+0x6c/0x1460
> [4592064.968225]  [<ffffffff81394c90>] tty_ldisc_ref_wait+0x20/0x50
> [4592064.968228]  [<ffffffff8138da37>] tty_ioctl+0x417/0xd40
> [4592064.968232]  [<ffffffff8115a976>] ? handle_mm_fault+0x19b6/0x19c0
> [4592064.968236]  [<ffffffff811abe8f>] do_vfs_ioctl+0x30f/0x560
> [4592064.968239]  [<ffffffff811af8f2>] ? dput+0x192/0x230
> [4592064.968242]  [<ffffffff811ac159>] SyS_ioctl+0x79/0x90
> [4592064.968245]  [<ffffffff8161449b>] entry_SYSCALL_64_fastpath+0x16/0x6e
>
> I checked and all the tasks are working on the same tty_struct. Here
> is the state of the struct: http://paste.ubuntu.com/23273973/
>
> To me this seems like a lost wakeup event, especially that I've
> seen patches flying around fixing such issues. Unfortunately I
> cannot reliably reproduce this and booting a production machine
> with lockdep would be problematic.

Too bad because lockdep would definitively tell you what process is
holding the ldsem for this tty.
If you paste *all the tasks*, I can try to see if I recognize the task
stack that is holding the ldsem.

Regards,
Peter Hurley