linux-kernel - Re: [PATCH tty-next 0/4] tty: Fix ^C echo

Date:	Thu, 5 Dec 2013 00:13:15 +0000
From:	One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>
To:	Peter Hurley <peter@...leysoftware.com>
Cc:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Jiri Slaby <jslaby@...e.cz>, linux-kernel@...r.kernel.org,
	linux-serial@...r.kernel.org
Subject: Re: [PATCH tty-next 0/4] tty: Fix ^C echo

> Not so much confused as simply merged. Input processing is inherently
> single-threaded; it makes sense to rely on that at the highest level
> possible.

I would disagree entirely. You want to minimise the areas affected by a
given lock. You also want to lock data, not code. Correctness comes before
speed. You optimise once it's right; otherwise you end up in a nasty
mess when you discover you've optimised to assumptions that are flawed.

> On smp, locked instructions and cache-line contention on the tty_buffer
> list ptrs and read_buf indices account for more than 90% of the time cost
> in the read path for real hardware (and over 95% for ptys).

Yes, I'm uncomfortably aware of that for modern SMP hardware, and also
that simply ripping out the buffering will screw the real low-end people
(e.g. M68K and friends).

> Firewire, which is capable of sustained throughput in excess of 40MB/sec,
> struggles to get over 5MB/sec through the tty layer. [And drm output
> is orders-of-magnitude slower than that, which is just sad...]

And for which protocols do you care about 5MB/second: n_tty, no? For the
high-speed protocols you are trying to fix a lost cause. By the time
we've gone piddling around with tty buffers and serialized tty queues,
firing bytes through tasks and the like, you've already lost.

For drm I assume you mean the framebuffer console logic? Last time I
benchmarked that, it was bottlenecked on the GPU on everything except the
Poulsbo. Not that I can type at 5MB/second anyway. Fixing the performance
of the various bits would still be a good thing, especially on the output
end.

> While that would work, it's expensive extra locking in a path that 99.999%
> of the time doesn't need it. I'd rather explore other solutions.

How about getting the high-speed paths out of the whole tty buffer
layer? Almost every line discipline can be a fastpath directly to the
network layer. If optimisation is the new obsession then we can cut the
crap entirely by optimising for networking, not making it a slave of n_tty.

Starting at the beginning: we have locks on rx because
- we want serialized rx
- we have buffer lifetimes
- we have buffer queues
- we have loads of flow control parameters

Only n_tty needs the buffers (maybe some of irda, but irda hasn't worked
for years afaik). IRQ receive paths are serialized (and as a bonus can be
pinned to a CPU). Flow control is n_tty stuff; everyone else simply fires
data at their network layer as fast as possible, and net already does the
work.

Keep a single tty_buf in the tty for batching at any given time, and keep
it private so it needs no locks at all.

Have a wrapper via
ld->receive(tty, buf)

which fires the tty_buf at the ldisc and allocates a new empty one

tty_queue_bytes(tty, buf, flags, len)

which adds to the buffer, and if full calls ld->queue and then carries on
the copying cycle

and

ld->receive_direct(tty, buf, flags, len)

which allows block-mode devices to blast bytes directly at the queue (i.e.
all the USB 3G stuff, firewire, etc.) without going via any additional
copies.

For almost all ldiscs

ld->receive would be

ld->receive_direct(tty, buf->buf, buf->flags, buf->len);
free buffer

For n_tty type stuff

ld->receive is basically much of tty_flip_buffer_push

ld->receive_direct allocates tty_buffers and copies into it

We may even be able to optimise some of the n_tty cases into the
fastpath afterwards (notably raw, no echo)

For anything receiving in blocks, that puts us close to (but not quite at)
Ethernet kinds of cleanness for network buffer delivery.
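
A minimal userspace sketch of the scheme outlined above, for illustration
only. It is not kernel code and not an existing tty API. Only the names
ld->receive, ld->receive_direct and tty_queue_bytes come from the proposal;
the struct layouts, TTY_BUF_SIZE, tty_buf_alloc(), tty_push(), the per-byte
flags array and the delivered/direct_calls counters are assumptions made for
the example. It also assumes that the "ld->queue" call made when the batch
fills is the same hook as ld->receive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TTY_BUF_SIZE 8          /* hypothetical; tiny so batches fill quickly */

struct tty;

/* One batching buffer: data bytes plus an assumed per-byte flag array. */
struct tty_buf {
    unsigned char buf[TTY_BUF_SIZE];
    unsigned char flags[TTY_BUF_SIZE];
    size_t len;
};

struct ldisc {
    /* Takes ownership of a batch buffer handed over by the tty core. */
    void (*receive)(struct tty *tty, struct tty_buf *b);
    /* Block-mode entry point: bytes arrive without an intermediate copy. */
    void (*receive_direct)(struct tty *tty, const unsigned char *buf,
                           const unsigned char *flags, size_t len);
};

struct tty {
    const struct ldisc *ld;
    struct tty_buf *batch;      /* private to the rx path, hence no lock */
    size_t delivered;           /* demo counter: bytes seen by the ldisc */
    size_t direct_calls;        /* demo counter: receive_direct calls */
};

struct tty_buf *tty_buf_alloc(void)
{
    struct tty_buf *b = calloc(1, sizeof(*b));
    assert(b != NULL);          /* demo only; real code would handle failure */
    return b;
}

/* The wrapper: fire the current batch at the ldisc, start an empty one. */
void tty_push(struct tty *tty)
{
    struct tty_buf *full = tty->batch;

    tty->batch = tty_buf_alloc();
    tty->ld->receive(tty, full);
}

/* Byte-at-a-time drivers: copy into the batch, pushing whenever it fills. */
void tty_queue_bytes(struct tty *tty, const unsigned char *buf,
                     const unsigned char *flags, size_t len)
{
    assert(tty->batch != NULL);
    while (len > 0) {
        struct tty_buf *b = tty->batch;
        size_t room = TTY_BUF_SIZE - b->len;
        size_t n = len < room ? len : room;

        memcpy(b->buf + b->len, buf, n);
        if (flags) {            /* NULL flags means "all bytes normal" */
            memcpy(b->flags + b->len, flags, n);
            flags += n;
        }
        b->len += n;
        buf += n;
        len -= n;
        if (b->len == TTY_BUF_SIZE)
            tty_push(tty);
    }
}

/* "Almost all ldiscs": stand-in for handing bytes to a network layer. */
void simple_receive_direct(struct tty *tty, const unsigned char *buf,
                           const unsigned char *flags, size_t len)
{
    (void)buf;
    (void)flags;
    tty->delivered += len;
    tty->direct_calls++;
}

/* "Almost all ldiscs": receive forwards to receive_direct, then frees. */
void simple_receive(struct tty *tty, struct tty_buf *b)
{
    tty->ld->receive_direct(tty, b->buf, b->flags, b->len);
    free(b);
}

const struct ldisc simple_ldisc = {
    .receive = simple_receive,
    .receive_direct = simple_receive_direct,
};
```

In this sketch a character-at-a-time driver would call tty_queue_bytes() and
finish with tty_push(), while a block-mode driver would call
tty->ld->receive_direct() on its own buffer. An n_tty style ldisc would
supply different hooks that allocate and queue tty_buffers, which is not
shown here.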

Worth me looking into?

> The clock/generation method seems like it might yield a lockless solution
> for this problem, but maybe creates another one because the driver-side
> would need to stamp the buffer (in essence, a flush could affect data
> that has not yet been copied from the driver).

But it has arrived in the driver so might not matter. That requires a
little thought!

Alan