Message-ID: <YWQX0SyDBDhG9wdR@kroah.com>
Date: Mon, 11 Oct 2021 12:54:09 +0200
From: Greg KH <gregkh@...uxfoundation.org>
To: "guanghui.fgh" <guanghuifeng@...ux.alibaba.com>
Cc: jirislaby@...nel.org, baolin.wang@...ux.alibaba.com,
linux-kernel@...r.kernel.org, zhuo.song@...ux.alibaba.com,
zhangliguang@...ux.alibaba.com
Subject: Re: [PATCH] tty: tty_buffer: Fix the softlockup issue in
flush_to_ldisc
On Mon, Oct 11, 2021 at 06:18:41PM +0800, guanghui.fgh wrote:
>
> On 2021/10/11 15:50, Greg KH wrote:
> > On Mon, Oct 11, 2021 at 03:42:38PM +0800, guanghui.fgh wrote:
> > > On 2021/10/10 21:18, Greg KH wrote:
> > > > On Fri, Oct 08, 2021 at 03:50:15PM +0800, guanghui.fgh wrote:
> > > > > On 2021/9/30 13:38, Greg KH wrote:
> > > > > > On Thu, Sep 30, 2021 at 11:11:38AM +0800, Guanghui Feng wrote:
> > > > > > > When I run the ltp testcase (ltp/testcases/kernel/pty/pty04.c) on arm64, there is a soft lockup
> > > > > > > which looks like this:
> > > > > > >
> > > > > > > watchdog: BUG: soft lockup - CPU#41 stuck for 67s! [kworker/u192:2:106867]
> > > > > > > CPU: 41 PID: 106867 Comm: kworker/u192:2 Kdump: loaded Tainted: G OE 5.10.23 #1
> > > > > > > Hardware name: H3C R4960 G3/BC82AMDDA, BIOS 1.70 01/07/2021
> > > > > > > Workqueue: events_unbound flush_to_ldisc
> > > > > > > pstate: 00c00009 (nzcv daif +PAN +UAO -TCO BTYPE=--)
> > > > > > > pc : slip_unesc+0x80/0x214 [slip]
> > > > > > > lr : slip_receive_buf+0x84/0x100 [slip]
> > > > > > > sp : ffff80005274bce0
> > > > > > > x29: ffff80005274bce0 x28: 0000000000000000
> > > > > > > x27: ffff00525626fcc8 x26: ffff800011921078
> > > > > > > x25: 0000000000000000 x24: 0000000000000004
> > > > > > > x23: ffff00402b4059c0 x22: ffff00402b405940
> > > > > > > x21: ffff205d87b81e21 x20: ffff205d87b81b9b
> > > > > > > x19: 0000000000000000 x18: 0000000000000000
> > > > > > > x17: 0000000000000000 x16: 0000000000000000
> > > > > > > x15: 0000000000000000 x14: 5f5f5f5f5f5f5f5f
> > > > > > > x13: 5f5f5f5f5f5f5f5f x12: 5f5f5f5f5f5f5f5f
> > > > > > > x11: 5f5f5f5f5f5f5f5f x10: 5f5f5f5f5f5f5f5f
> > > > > > > x9 : ffff8000097d7628 x8 : ffff205d87b85e20
> > > > > > > x7 : 0000000000000000 x6 : 0000000000000001
> > > > > > > x5 : ffff8000097dc008 x4 : ffff8000097d75a4
> > > > > > > x3 : ffff205d87b81e1f x2 : 0000000000000005
> > > > > > > x1 : 000000000000005f x0 : ffff00402b405940
> > > > > > > Call trace:
> > > > > > > slip_unesc+0x80/0x214 [slip]
> > > > > > > tty_ldisc_receive_buf+0x64/0x80
> > > > > > > tty_port_default_receive_buf+0x50/0x90
> > > > > > > flush_to_ldisc+0xbc/0x110
> > > > > > > process_one_work+0x1d4/0x4b0
> > > > > > > worker_thread+0x180/0x430
> > > > > > > kthread+0x11c/0x120
> > > > > > > Kernel panic - not syncing: softlockup: hung tasks
> > > > > > > CPU: 41 PID: 106867 Comm: kworker/u192:2 Kdump: loaded Tainted: G OEL 5.10.23 #1
> > > > > > > Hardware name: H3C R4960 G3/BC82AMDDA, BIOS 1.70 01/07/2021
> > > > > > > Workqueue: events_unbound flush_to_ldisc
> > > > > > > Call trace:
> > > > > > > dump_backtrace+0x0/0x1ec
> > > > > > > show_stack+0x24/0x30
> > > > > > > dump_stack+0xd0/0x128
> > > > > > > panic+0x15c/0x374
> > > > > > > watchdog_timer_fn+0x2b8/0x304
> > > > > > > __run_hrtimer+0x88/0x2c0
> > > > > > > __hrtimer_run_queues+0xa4/0x120
> > > > > > > hrtimer_interrupt+0xfc/0x270
> > > > > > > arch_timer_handler_phys+0x40/0x50
> > > > > > > handle_percpu_devid_irq+0x94/0x220
> > > > > > > __handle_domain_irq+0x88/0xf0
> > > > > > > gic_handle_irq+0x84/0xfc
> > > > > > > el1_irq+0xc8/0x180
> > > > > > > slip_unesc+0x80/0x214 [slip]
> > > > > > > tty_ldisc_receive_buf+0x64/0x80
> > > > > > > tty_port_default_receive_buf+0x50/0x90
> > > > > > > flush_to_ldisc+0xbc/0x110
> > > > > > > process_one_work+0x1d4/0x4b0
> > > > > > > worker_thread+0x180/0x430
> > > > > > > kthread+0x11c/0x120
> > > > > > > SMP: stopping secondary CPUs
> > > > > > >
> > > > > > > In the testcase pty04 there are multiple processes, and only the first three actually matter here.
> > > > > > > The first process calls the write syscall to send data to the pty master as fast as it can
> > > > > > > (the tty_write->file_tty_write->do_tty_write->n_tty_write call chain).
> > > > > > > The second process calls the read syscall to receive data from the pty slave (with a PF_PACKET socket).
> > > > > > > The third process waits a moment, during which the first two processes do their work, and then
> > > > > > > calls ioctl to hang up the pty pair, which stops the first two processes from reading/writing the pty.
> > > > > > > Before the hangup, the first process sends data to the pty buffer head at high speed. If the
> > > > > > > workqueue is woken up at the same time, it runs flush_to_ldisc, which pops data from the pty
> > > > > > > master's buffer head to the line discipline in a loop until there is no data left, without ever
> > > > > > > yielding the CPU, so flush_to_ldisc can keep working for a long time. As the kernel is configured
> > > > > > > without CONFIG_PREEMPT, a softlockup may occur in flush_to_ldisc. So I add cond_resched
> > > > > > > in the flush_to_ldisc while loop to avoid it.
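> > > > > > >
> > > > > > > For illustration, the drain loop looks roughly like this (a
> > > > > > > simplified sketch of flush_to_ldisc in drivers/tty/tty_buffer.c,
> > > > > > > not the exact upstream code):
> > > > > > >
> > > > > > > 	mutex_lock(&buf->lock);
> > > > > > >
> > > > > > > 	while (1) {
> > > > > > > 		struct tty_buffer *head = buf->head;
> > > > > > > 		int count;
> > > > > > >
> > > > > > > 		count = smp_load_acquire(&head->commit) - head->read;
> > > > > > > 		/* (buffer switching and the empty case elided) */
> > > > > > >
> > > > > > > 		count = receive_buf(port, head, count);
> > > > > > > 		if (!count)
> > > > > > > 			break;
> > > > > > > 		head->read += count;
> > > > > > > 		/*
> > > > > > > 		 * No reschedule point here: if the writer refills
> > > > > > > 		 * the buffer as fast as we drain it, this loop never
> > > > > > > 		 * exits and the kworker monopolizes this CPU.
> > > > > > > 		 */
> > > > > > > 	}
> > > > > > >
> > > > > > > 	mutex_unlock(&buf->lock);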
> > > > > > Please properly wrap your changelog text at 72 columns.
> > > > > When I run the ltp testcase (ltp/testcases/kernel/pty/pty04.c) on arm64, there is a soft lockup
> > > > > which looks like this:
> > > > > Call trace:
> > > > > dump_backtrace+0x0/0x1ec
> > > > > show_stack+0x24/0x30
> > > > > dump_stack+0xd0/0x128
> > > > > panic+0x15c/0x374
> > > > > watchdog_timer_fn+0x2b8/0x304
> > > > > __run_hrtimer+0x88/0x2c0
> > > > > __hrtimer_run_queues+0xa4/0x120
> > > > > hrtimer_interrupt+0xfc/0x270
> > > > > arch_timer_handler_phys+0x40/0x50
> > > > > handle_percpu_devid_irq+0x94/0x220
> > > > > __handle_domain_irq+0x88/0xf0
> > > > > gic_handle_irq+0x84/0xfc
> > > > > el1_irq+0xc8/0x180
> > > > > slip_unesc+0x80/0x214 [slip]
> > > > > tty_ldisc_receive_buf+0x64/0x80
> > > > > tty_port_default_receive_buf+0x50/0x90
> > > > > flush_to_ldisc+0xbc/0x110
> > > > > process_one_work+0x1d4/0x4b0
> > > > > worker_thread+0x180/0x430
> > > > > kthread+0x11c/0x120
> > > > >
> > > > > In the testcase pty04, the first process calls the write syscall to send data to the pty master.
> > > > > If the workqueue is woken up at the same time, it runs flush_to_ldisc, which pops data in a loop
> > > > > until there is no data left, so flush_to_ldisc can keep working for a long time. As the kernel is
> > > > > configured without CONFIG_PREEMPT, a softlockup may occur in flush_to_ldisc.
> > > > Is this a "real" test for something that you have seen in a normal
> > > > workload? ltp has been known to have buggy/confusing tests in the
> > > > past; you might wish to consult the authors of that test.
> > > Firstly, thanks for your response.
> > >
> > > I have checked the ltp pty testcase. I have also seen the pty
> > > softlockup on arm64, and it is similar to other reports:
> > >
> > > https://github.com/victronenergy/venus/issues/350
> > >
> > > https://groups.google.com/g/syzkaller-lts-bugs/c/SpkH8yH26js/m/3aifBl_GAwAJ
> > Hm, ok, can you please resubmit this based on the changes discussed in
> > this thread and I will re-review it.
> >
> > thanks,
> >
> > greg k-h
>
> When I run the ltp testcase (ltp/testcases/kernel/pty/pty04.c) on arm64, there is a soft lockup
> which looks like this:
> Call trace:
> dump_backtrace+0x0/0x1ec
> show_stack+0x24/0x30
> dump_stack+0xd0/0x128
> panic+0x15c/0x374
> watchdog_timer_fn+0x2b8/0x304
> __run_hrtimer+0x88/0x2c0
> __hrtimer_run_queues+0xa4/0x120
> hrtimer_interrupt+0xfc/0x270
> arch_timer_handler_phys+0x40/0x50
> handle_percpu_devid_irq+0x94/0x220
> __handle_domain_irq+0x88/0xf0
> gic_handle_irq+0x84/0xfc
> el1_irq+0xc8/0x180
> slip_unesc+0x80/0x214 [slip]
> tty_ldisc_receive_buf+0x64/0x80
> tty_port_default_receive_buf+0x50/0x90
> flush_to_ldisc+0xbc/0x110
> process_one_work+0x1d4/0x4b0
> worker_thread+0x180/0x430
> kthread+0x11c/0x120
>
> In the testcase pty04, the first process calls the write syscall to send data to the pty master.
> At the same time, the workqueue runs flush_to_ldisc to pop data in a loop until there is no data
> left. When the sender and the workqueue run on different cores, the sender keeps sending data at
> full speed, which makes the workqueue loop for a long time and, on a kernel configured without
> CONFIG_PREEMPT, causes a softlockup in flush_to_ldisc. So I add a need_resched check and
> cond_resched in the flush_to_ldisc loop to avoid it.
>
> Signed-off-by: Guanghui Feng<guanghuifeng@...ux.alibaba.com>
>
>
> diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
> index bd2d915..77b92f9 100644
> --- a/drivers/tty/tty_buffer.c
> +++ b/drivers/tty/tty_buffer.c
> @@ -534,6 +534,8 @@ static void flush_to_ldisc(struct work_struct *work)
>  		if (!count)
>  			break;
>  		head->read += count;
> +		if (need_resched())
> +			cond_resched();
>  	}
>  	mutex_unlock(&buf->lock);
> --
> 1.8.3.1
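>
> (A side note on the design choice: my understanding is that on kernels
> without CONFIG_PREEMPT, cond_resched() already checks internally
> whether a reschedule is pending before it switches, so the hunk above
> could also be written as an unconditional
>
> 	cond_resched();
>
> inside the loop; the explicit need_resched() guard only saves the
> function-call overhead on the hot path.)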
>
This is not in a format that I can apply. You ignored my request to
properly wrap your changelog text lines; your signed-off-by needs a
space after the name; it is sent in reply to another patch instead of
standing alone; and it is a v2 patch with no list of changes from the
previous one.
Please fix up.
thanks,
greg k-h