linux-kernel - Re: [PATCH resend v2] tty: n_tty -- Add new TIOCPEEKRAW ioctl to peek unread data

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <570BDCC8.5060104@hurleysoftware.com>
Date:	Mon, 11 Apr 2016 10:20:08 -0700
From:	Peter Hurley <peter@...leysoftware.com>
To:	Cyrill Gorcunov <gorcunov@...il.com>
Cc:	LKML <linux-kernel@...r.kernel.org>, Jiri Slaby <jslaby@...e.com>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Andrey Vagin <avagin@...tuozzo.com>,
	Pavel Emelianov <xemul@...tuozzo.com>,
	Vladimir Davydov <vdavydov@...tuozzo.com>,
	Konstantin Khorenko <khorenko@...tuozzo.com>
Subject: Re: [PATCH resend v2] tty: n_tty -- Add new TIOCPEEKRAW ioctl to peek
 unread data

On 04/11/2016 02:01 AM, Cyrill Gorcunov wrote:
> On Sun, Apr 10, 2016 at 10:43:28PM -0700, Peter Hurley wrote:
>> Hi Cyrill,
>>
>> On 04/07/2016 05:53 AM, Cyrill Gorcunov wrote:
>>> Currently when we checkpoint PTY connections we ignore data queued inside
>>> read buffer of ldisk, such data is barely lost after the restore. So we
>>> would like to have an ability to dump and restore it.
> 
> Hi Peter! First of all, thanks a huge for comments!
> 
>> Do you want to be able to restore the exact state (like column and
>> newline markers)?
>>
>> I'm wondering if a for-purpose checkpoint/restore ioctl would be more
>> appropriate?

My thoughts here are two-fold.

1. I'd like whatever functionality is to be added to support checkpoint/
restore to be a restrictive as possible while still accomplishing your
objectives _exactly because I don't want it to grow other uses_.
That's why I suggested asserting the task state (and maybe adding
ptrace state checks) so that the interface can only be used for this
purpose.

2. If we want to make changes because the existing read()/write()
interface is inadequate, let's add a complete interface.

> Strictly speaking saving complete state would be ideal but this require
> exporting internal n_tty structure into user space as api (sure we
> can name the members in any way we like, but this would force then
> to keep backward compatibility when n_tty code get changed in future).
> I think loosing some information like column and marker is acceptable
> and fetching unread buffers is more general. I can try to implement
> c/r'ing ioctl though which would not only fetch buffers but column
> and such, and see how it goes, sounds ok?

My thinking here was to use an opaque buffer; it could be sized in the
same manner as the peek interface; ie., provide a buffer address with
size, and the call fails with an updated size required if inadequate
(or whatever, but I think you get the idea).


>> A generic peek() may find other uses which could make future improvements
>> difficult.
>>
>> Plus this ioctl() is only reading the 4k line discipline read buffer, and
>> not the tty buffers which could contain up to another 8k of unprocessed
>> input.
> 
> Which is scheduled for flush into port and then landed into ld when
> write() called. True, I managed to miss this moment, thanks! So the
> proper general peek() handling should be implemented on tty code level
> rather than ld, and fetch both -- start with tty buffer and proceed with
> ld buffer then, right?

To safely copy the tty buffers, it needs to be implemented like
tty_buffer_flush():

1. get an ldisc reference (ioctls don't automatically do this)

	ld = tty_ldisc_ref(tty);
	if (!ld)
		return -EIO;

2. lock out any other consumers of tty buffers:

	tty_buffer_lock_exclusive(tty->port);

   This will abort any in-progress input kworker and ensures neither
   buf->head nor head->read advance.

   * This does not lock out parallel producers so it's important to   *
   * guarantee there are no parallel writers (see my lock discussion) *

3. walk the list of tty buffers, copying each one:

	struct tty_buffer *head = port->buf->head;
	while (head) {
		count = smp_load_acquire(&head->commit) - head->read)
		if (count)
			/* copy tty buffer */
		head = smp_load_acquire(head->next);
	}

4. Now call into the line discipline to copy its data:

	if (ld->ops->checkpoint)
		ld->ops->checkpoint(....)

5. Restart kworker

	tty_buffer_unlock_exclusive(tty->port);

6. Release ldisc reference

	tty_ldisc_deref(ld);



>>> Here is a new ioctl code which simply copies data from read buffer into
>>> the userspace without any additional processing (just like terminal is
>>> sitting in a raw mode).
>>>
>>> The idea behind is when we checkpointing PTY pairs we peek this unread
>>> data into images on disk and on restore phase we find the linked peer
>>> and write it back thus reading peer will see it in read buffer.
>>
>> Have you frozen every process with either end open? I could see how this would
>> be easy for an entire process group but what if the processes are unrelated?
>> How do you ensure you have the correct linked peer with multiple devpts instances?
> 
> Currently we freeze all processes which are requested for dumping and then
> we walk over all files. Once tty peer is met (actually we support pty
> and vt only at the moment) we remember its index via TIOCGPTN (for master
> peers) and expect to find appropriate slave peer. You know, having multiple
> devpts mounted may be a problem actually in current criu code, I track indices
> via one global array but rather should do that via per-sb basis.

Ok.


> Still we support that named external peers, where one of the ends do not
> belong process tree we're dumping. In such scenario it's allowed to not
> dump queued data because we don't control external peer anyhow.

Ok.


>>> I tried several approaches (including simply reuse of canon_copy_from_read_buf
>>> and copy_from_read_buf) but it makes code more complicated, so I end up
>>> in plain copying of buffer data.
>>>
>>> Strictly speaking, we could try handle it inside CRIU itself, as
>>> Peter Hurley proposed (switch terminal to raw mode, read data,
>>> and return original termios back) but in this case we may hit
>>> the scenario when we read data and failed to restore it back
>>> (due to any error) leaving the task we're dumping without
>>> read buffer, thus peeking data looks like be the only
>>> safe way.
>>
>> Code comments below.
>>
>>
>>> v2:
>>>  - Use @char in ioctl definition.
>>>
>>> Signed-off-by: Cyrill Gorcunov <gorcunov@...tuozzo.com>
>>> CC: Jiri Slaby <jslaby@...e.com>
>>> CC: Peter Hurley <peter@...leysoftware.com>
>>> CC: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
>>> CC: Andrey Vagin <avagin@...tuozzo.com>
>>> CC: Pavel Emelianov <xemul@...tuozzo.com>
>>> CC: Vladimir Davydov <vdavydov@...tuozzo.com>
>>> CC: Konstantin Khorenko <khorenko@...tuozzo.com>
>>> ---
>>>  drivers/tty/n_tty.c               |   22 ++++++++++++++++++++++
>>>  include/uapi/asm-generic/ioctls.h |    1 +
>>>  2 files changed, 23 insertions(+)
>>>
>>> Index: linux-ml.git/drivers/tty/n_tty.c
>>> ===================================================================
>>> --- linux-ml.git.orig/drivers/tty/n_tty.c
>>> +++ linux-ml.git/drivers/tty/n_tty.c
>>> @@ -2420,6 +2420,26 @@ static unsigned long inq_canon(struct n_
>>>  	return nr;
>>>  }
>>>  
>>> +static ssize_t n_tty_peek_raw(struct tty_struct *tty, unsigned char __user *buf)
>>
>> function name is redundant; n_tty_peek() is sufficient.
>>
>> Needs a buffer-length parameter; could return the size required to completely
>> copy the ldisc read buffer, if smaller than required.
> 
> Seems better to provide some kind of structure with data address
> and size would be kept, on read we can update @size and put there
> the number of bytes unread if buffer is too small.
> 
>>> +{
>>> +	struct n_tty_data *ldata = tty->disc_data;
>>> +	ssize_t retval;
>>> +
>>> +	if (!mutex_trylock(&ldata->atomic_read_lock))
>>> +		return -EAGAIN;
>>> +
>>> +	down_read(&tty->termios_rwsem);
>>
>> Why take these two locks if parallel operations are not expected?
>> Conversely, I would expect this to assert current state is TASK_TRACED.
> 
> If someone is already reading the data it will either exit soon with
> data read or will sit here waiting for data to appear. So I thought
> if we're fetching unread data we should not interfere with any other
> readers? If someone is sitting in read procedure it implies that there
> either no data on buffer and reader is waiting for it, or the read
> procedure will complete soon, but indeed I rather should be checking
> for read_cnt locklessly and return -EAGAIN if > 0 and can't lock
> the atomic_read_lock. I implied that task can be in non-traced state.
> If require traced indeed we won't need the lock. For criu lockless
> + tracing state is fine but i thought someone else might be needing
> to peek data without tracing the task. Hm?

Well, I'd rather not have other uses for peeking data. In fact, I need
to check if this interface needs to be CAP_SYSADMIN; I'm pretty sure
that ioctl() ignores file permissions and write-only file permissions
(used by /usr/bin/write to send messages to other users) should not
allow tty reads.

Ok, so we want to defend against parallel operations in case not
all tty users are currently stopped by ptrace (such as when unrelated
tasks are potentially reading or writing to either end and were not
specified for dumping)?

In that case, I think just write trylocking the termios rwsem should
prevent any parallel i/o, at least for the N_TTY line discipline.

This should only interfere with processes not being dumped because
ptrace signalling should have ejected any process to be dumped out
of their i/o loops (readers or writers).


Also, I think we should further limit the interface based on what is
supported currently; IOW, check that the driver is either pty or vt,
assert that the line discipline is N_TTY at both ends, etc.


>> Assuming the other pty end is halted, this ioctl would also need to
>> wait for the input kworker to stop:
> 
> Yes, thanks!
> 
>>
>> --- >% ---
>> diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
>> index a946e49..744cb87 100644
>> --- a/drivers/tty/tty_buffer.c
>> +++ b/drivers/tty/tty_buffer.c
>> @@ -614,3 +614,8 @@ bool tty_buffer_cancel_work(struct tty_port *port)
>>  {
>>  	return cancel_work_sync(&port->buf.work);
>>  }
>> +
>> +void tty_buffer_flush_work(struct tty_port *port)
>> +{
>> +	flush_work(&port->buf.work);
>> +}
>> diff --git a/include/linux/tty.h b/include/linux/tty.h
>> index bf1bcdb..a541132 100644
>> --- a/include/linux/tty.h
>> +++ b/include/linux/tty.h
>> @@ -480,6 +480,7 @@ extern void tty_buffer_init(struct tty_port *port);
>>  extern void tty_buffer_set_lock_subclass(struct tty_port *port);
>>  extern bool tty_buffer_restart_work(struct tty_port *port);
>>  extern bool tty_buffer_cancel_work(struct tty_port *port);
>> +extern void tty_buffer_flush_work(struct tty_port *port);
>>  extern speed_t tty_termios_baud_rate(struct ktermios *termios);
>>  extern speed_t tty_termios_input_baud_rate(struct ktermios *termios);
>>  extern void tty_termios_encode_baud_rate(struct ktermios *termios,
>>
>> 	tty_buffer_flush_work(tty->port);
>>
>>> +	retval = read_cnt(ldata);
>>> +	if (retval) {
>>> +		const unsigned char *from = read_buf_addr(ldata, ldata->read_tail);
>>> +		if (copy_to_user(buf, from, retval))
>>> +			retval = -EFAULT;
>>> +	}
>>> +	up_read(&tty->termios_rwsem);
>>> +	mutex_unlock(&ldata->atomic_read_lock);
>>> +	return retval;
>>> +}
>>> +
>>>  static int n_tty_ioctl(struct tty_struct *tty, struct file *file,
>>>  		       unsigned int cmd, unsigned long arg)
>>>  {
>>> @@ -2437,6 +2457,8 @@ static int n_tty_ioctl(struct tty_struct
>>>  			retval = read_cnt(ldata);
>>>  		up_write(&tty->termios_rwsem);
>>>  		return put_user(retval, (unsigned int __user *) arg);
>>> +	case TIOCPEEKRAW:
>>
>> As with the function name, the ioctl symbol is redundant; simply TIOCPEEK
>> is ok.
> 
> OK
>