linux-kernel - Re: [BUG] 4.11.0-rc3 xterm hung in D state on exit, wchan is tty_release

Message-ID: <20170323073018.GI802@shells.gnugeneration.com>
Date:   Thu, 23 Mar 2017 00:30:18 -0700
From:   lkml@...garu.com
To:     lkml@...garu.com
Cc:     linux-kernel <linux-kernel@...r.kernel.org>, robh@...nel.org
Subject: Re: [BUG] 4.11.0-rc3 xterm hung in D state on exit, wchan is
 tty_release_struct

On Wed, Mar 22, 2017 at 11:44:18PM -0700, lkml@...garu.com wrote:
> On Wed, Mar 22, 2017 at 07:08:46PM -0700, lkml@...garu.com wrote:
> > Hello list,
> > 
> > After approximately one day of running 4.11.0-rc3 with 7e54d9d reverted to
> > enable regular use, this happened upon destroying an xterm:
> > 
> > [80817.525112] BUG: unable to handle kernel paging request at 0000000000002260
> > [80817.525239] IP: n_tty_receive_buf_common+0x68/0xab0
> > [80817.525312] PGD 0 
> > 
> > [80817.525387] Oops: 0000 [#1] PREEMPT SMP
> > [80817.525452] CPU: 0 PID: 9532 Comm: kworker/u4:3 Not tainted 4.11.0-rc3-00001-gc56a355 #53
> > [80817.525564] Hardware name: LENOVO 7668CTO/7668CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
> > [80817.525673] Workqueue: events_unbound flush_to_ldisc
> > [80817.525752] task: ffff967d91d80000 task.stack: ffff9add81f40000
> > [80817.525839] RIP: 0010:n_tty_receive_buf_common+0x68/0xab0
> > [80817.525917] RSP: 0018:ffff9add81f43d38 EFLAGS: 00010297
> > [80817.525992] RAX: 0000000000000000 RBX: ffff967d91c98c00 RCX: 0000000000000001
> > [80817.526035] RDX: ffff967e73bba58d RSI: ffff967e73bba48d RDI: ffff967d91c98cc0
> > [80817.526035] RBP: ffff9add81f43dd0 R08: 0000000000000001 R09: 0000000000000000
> > [80817.526035] R10: 00004980cbe001e0 R11: 0000000000000000 R12: ffff967d87aacf20
> > [80817.526035] R13: ffff967e73bba58d R14: 0000000000000001 R15: ffff967e74aa8008
> > [80817.526035] FS:  0000000000000000(0000) GS:ffff967e7bc00000(0000) knlGS:0000000000000000
> > [80817.526035] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [80817.526035] CR2: 0000000000002260 CR3: 0000000099009000 CR4: 00000000000006f0
> > [80817.526035] Call Trace:
> > [80817.526035]  ? update_curr+0xbb/0x1a0
> > [80817.526035]  n_tty_receive_buf2+0xf/0x20
> > [80817.526035]  tty_ldisc_receive_buf+0x1d/0x50
> > [80817.526035]  tty_port_default_receive_buf+0x40/0x60
> > [80817.526035]  flush_to_ldisc+0x94/0xa0
> > [80817.526035]  process_one_work+0x13b/0x3e0
> > [80817.526035]  worker_thread+0x64/0x4a0
> > [80817.526035]  kthread+0x10f/0x150
> > [80817.526035]  ? process_one_work+0x3e0/0x3e0
> > [80817.526035]  ? __kthread_create_on_node+0x150/0x150
> > [80817.526035]  ret_from_fork+0x29/0x40
> > [80817.526035] Code: 85 70 ff ff ff e8 59 75 57 00 48 8d 83 00 02 00 00 c7 45 c8 00 00 00 00 48 89 45 98 48 8d 83 28 02 00 00 48 89 45 90 48 8b 45 b8 <48> 8b b0 60 22 00 00 48 8b 08 89 f0 29 c8 f6 83 10 01 00 00 08 
> > [80817.526035] RIP: n_tty_receive_buf_common+0x68/0xab0 RSP: ffff9add81f43d38
> > [80817.526035] CR2: 0000000000002260
> > [80817.526035] ---[ end trace 640aec4765d350f2 ]---
> > 
> > 
> > That xterm process is stuck, and I am unable to start any new xterms. Switching to virtual consoles proves useless; presumably an important lock is held.
> > 
> <snip>
> 
> At a casual glance at the v4.10..v4.11-rc3 changes affecting drivers/tty,
> commit c3485ee looks suspicious to me, these hunks in particular:
> 
> @@ -465,16 +465,6 @@ static void flush_to_ldisc(struct work_struct *work)
>  {
>         struct tty_port *port = container_of(work, struct tty_port, buf.work);
>         struct tty_bufhead *buf = &port->buf;
> -       struct tty_struct *tty;
> -       struct tty_ldisc *disc;
> -
> -       tty = READ_ONCE(port->itty);
> -       if (tty == NULL)
> -               return;
> -
> -       disc = tty_ldisc_ref(tty);
> -       if (disc == NULL)
> -               return;
>  
>         mutex_lock(&buf->lock);
>  
> @@ -504,7 +494,7 @@ static void flush_to_ldisc(struct work_struct *work)
>                         continue;
>                 }
>  
> -               count = receive_buf(disc, head, count);
> +               count = receive_buf(port, head, count);
>                 if (!count)
>                         break;
>                 head->read += count;
> @@ -512,7 +502,6 @@ static void flush_to_ldisc(struct work_struct *work)
>  
>         mutex_unlock(&buf->lock);
>  
> -       tty_ldisc_deref(disc);
>  }
>  
>  /**
> 
> <snip>
> 
> I'm not familiar with this code at all, but port->buf is part of port, and if
> the port is destroyed as part of the tty, then perhaps port->buf (and
> port->buf.lock) may become invalid on us without these:
> 
> -       tty = READ_ONCE(port->itty);
> -       if (tty == NULL)
> -               return;
> -
> -       disc = tty_ldisc_ref(tty);
> -       if (disc == NULL)
> -               return;
> 
> Added Rob Herring, author of c3485ee to CC list.
> 

I suspect this part was a mistake:

 -       tty = READ_ONCE(port->itty);
 -       if (tty == NULL)
 -               return;

Note that in release_tty(), tty->port->itty is assigned NULL before
tty_buffer_cancel_work() is called:

static void release_tty(struct tty_struct *tty, int idx)
{
        /* This should always be true but check for the moment */
        WARN_ON(tty->index != idx);
        WARN_ON(!mutex_is_locked(&tty_mutex));
        if (tty->ops->shutdown)
                tty->ops->shutdown(tty);
        tty_free_termios(tty);
        tty_driver_remove_tty(tty->driver, tty);
        tty->port->itty = NULL;
        if (tty->link)
                tty->link->port->itty = NULL;
        tty_buffer_cancel_work(tty->port);

        tty_kref_put(tty->link);
        tty_kref_put(tty);
}

I'm also unfamiliar with the kernel work queues, but this looks like an
intentional barrier of sorts, with the READ_ONCE atomic read of port->itty.
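
To make the suspected role of that check concrete, here is a minimal userspace
sketch of the pattern, not kernel code. The names fake_tty, fake_port,
fake_flush and fake_release are hypothetical stand-ins, and C11 atomics are
assumed in place of READ_ONCE. It only shows that a worker which reads the
pointer once and bails out on NULL becomes a no-op after teardown has cleared
it. It does not cover a worker that already passed the check, which is
presumably what the tty_buffer_cancel_work() call after the assignment is for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's tty_struct/tty_port. */
struct fake_tty {
    int received;                      /* bytes delivered so far */
};

struct fake_port {
    _Atomic(struct fake_tty *) itty;   /* set to NULL at teardown */
};

/*
 * Worker side: read the pointer once and bail out if teardown has already
 * cleared it. Returns the number of bytes delivered (0 if skipped).
 */
int fake_flush(struct fake_port *port, int count)
{
    assert(port != NULL);
    struct fake_tty *tty = atomic_load(&port->itty);
    if (tty == NULL)
        return 0;
    tty->received += count;
    return count;
}

/* Teardown side: clear the pointer so later flushes become no-ops. */
void fake_release(struct fake_port *port)
{
    assert(port != NULL);
    atomic_store(&port->itty, (struct fake_tty *)NULL);
}
```

With the early return removed from fake_flush(), a flush after fake_release()
would dereference a NULL pointer, which is at least consistent with the low
faulting address in the oops above.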

Maybe just an oversight while shuffling the ldisc stuff around?

Regards,
Vito Caputo