Message-ID: <56B15603.80807@hurleysoftware.com>
Date: Tue, 2 Feb 2016 17:21:07 -0800
From: Peter Hurley <peter@...leysoftware.com>
To: John Ogness <john.ogness@...utronix.de>
Cc: gregkh@...uxfoundation.org, vinod.koul@...el.com,
dan.j.williams@...el.com, bigeasy@...utronix.de, tony@...mide.com,
nsekhar@...com, peter.ujfalusi@...com, dmaengine@...r.kernel.org,
linux-serial@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/4] serial: omap: robustify for high speed transfers
On 01/29/2016 08:35 AM, John Ogness wrote:
> Hi Peter,
>
> On 2016-01-25, Peter Hurley <peter@...leysoftware.com> wrote:
>>> The DMA-enabled OMAP UART driver in its current form queues 48 bytes
>>> for a DMA-RX transfer. After the transfer is complete, a new transfer
>>> of 48 bytes is queued. The DMA completion callback runs in tasklet
>>> context, so a reschedule with context switch is required for the
>>> completion to be processed and the next 48 bytes to be queued.
>>>
>>> When running at a high speed such as 3Mbit, the CPU has 128us between
>>> when the DMA hardware transfer completes and when the DMA hardware
>>> must be fully prepared for the next transfer. For an embedded board
>>> running applications, this does not give the CPU much time. If the
>>> UART is using hardware flow control, this situation results in a
>>> dramatic decrease in real transfer speeds. If flow control is not
>>> used, the CPU will almost certainly be forced to drop data.
>>
>> I'm not convinced by this logic at all.
>> Tasklets are not affected by the scheduler because they run in softirq.
>> Or is this -RT?
>
> Softirq runs as SCHED_OTHER. It is quite easy to create a scenario where
> DMA completion tasklets for this driver are not being serviced fast
> enough.
>
>> I'm not seeing this problem on other platforms at this baud rate,
>
> Do you run 3Mbit on other platforms without hardware flow control?
Yes, but only unidirectionally.
> I mention this because turning on hardware flow control can cover up the
> driver shortcomings by slowing down the transfers. What good is 3Mbit
> hardware if the driver never lets it get above 500Kbit on bulk
> transfers?
That's interesting. I wonder why there's a 6x hit when using h/w flow control.
Any thoughts on that?
>> and on this platform, all I see is lockups with DMA.
>
> I have seen (and fixed) interesting issues with the AM335x eDMA, but I
> have not experienced lockups in any of my testing. I'd be curious how
> you trigger that.
I haven't tested it since 4.1. I'll go back and re-enable DMA and retest.
>> What is the test setup to reproduce these results?
>
> Two Beaglebone boards connected via ttyS1. ttyS1's are set to raw mode
> at 3Mbit.
>
> sender: cat bigfile > /dev/ttyS1
> receiver: cat /dev/ttyS1 > bigfile
Ok, I can repro something similar.
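For anyone else reproducing: something along these lines should put the
ports into raw mode at 3Mbit before running the cat's (assuming stty on
the target accepts the 3000000 rate):

  stty -F /dev/ttyS1 3000000 raw -echo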
> I am working on creating concrete examples that demonstrate not only
> that this patch series reduces system load (and thus can increase
> throughput on heavy system loads with hardware flow control), but also
> that it is able to handle baud rates without data loss well beyond the
> current implementation when no flow control is used.
>
> I wanted to wait until I had all the results before answering your
> email. But I'm getting caught up in other tasks right now, so it may
> take a few more weeks.
Ok. So just to be clear here: this patchset is really all about
performance improvement and not correct operation?
>>> This patch series modifies the UART driver to use cyclic DMA transfers
>>> with a growable ring buffer to accommodate baud rates. The ring buffer is
>>> large enough to hold at least 1s of RX-data.
>>> (At 3Mbit that is 367KiB.)
>>
>> Math slightly off because the frame is typically 10 bits, not 8.
>
> I was thinking 8 was the minimal frame size. Thanks for pointing that
> out. A frame can contain 7-12 bits so I will modify the code to create a
> buffer appropriate for the UART settings. At 3Mbit with 5n1 the driver
> would require a 419KiB ring buffer (8929 DMA periods of 48 bytes).
More about this below.
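For reference, the sizing arithmetic above works out as below -- a
minimal sketch only; the helper name and the DIV_ROUND_UP rounding are
mine, not the patch's:

	/*
	 * Bytes of RX ring needed to hold ~1s of data, rounded up to
	 * whole 48-byte DMA periods.  frame_bits = start + data +
	 * parity + stop bits: 10 for 8n1, 7 for 5n1.
	 */
	static size_t rx_ring_size(unsigned int baud, unsigned int frame_bits)
	{
		size_t bytes_per_sec = baud / frame_bits;

		return DIV_ROUND_UP(bytes_per_sec, 48) * 48;
	}

At 3000000 baud and 7-bit (5n1) frames that gives 8929 periods, i.e. the
~419KiB you quote; at 10-bit (8n1) frames it is 6250 periods (~293KiB).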
>>> In order to ensure that data in the ring buffer is not overwritten before
>>> being processed by the tty layer, a hrtimer is used as a watchdog.
>>
>> How'd it go from "We're just missing 128us window" to "This holds 1s
>> of data"?
>
> First, you need to recognize that DMA completion tasklets can be delayed
> significantly due to interrupt loads or rtprio processes (even on non-RT
> systems). And at 3Mbit we are talking about >12000 interrupts per
> second!
Not sure I see 12000 ints/sec unless you're talking about full-duplex
at max rate in both directions: 3Mbit/sec / 10-bit frame / 48 bytes/dma =
6250 ints/sec per direction.
But again, interrupt load is not going to result in 100ms service intervals.
So I think we're really talking about a (misbehaved) rtprio process that's
starving i/o.
> When using cyclic transfers, the only real concern is that the DMA
> overwrites data in the ring buffer before the CPU has processed it due
> to tasklet delays. That is what the hrtimer watchdog is for.
>
> Assuming the DMA is zooming along at full speed, the watchdog must be
> able to trigger before the ring buffer can fill up. If the watchdog sees
> the ring buffer is getting full, it pauses the DMA engine. But with
> cyclic DMA we never know if the DMA is zooming or sitting idle. So even
> on an idle system, the watchdog must assume DMA zooming and continually
> fire to check the status.
>
> I chose 1 second buffer sizes and set the watchdog to fire at half
> that. On an idle system you will see at most 2 new interrupts per second
> due to this patch series. I thought that would be an acceptable trade
> off. Whether the watchdog should fire at 50% buffer full or say 90%
> buffer full is something that could be debated. But to answer your
> question, the big ring buffer is really to keep the watchdog interrupts
> low frequency.
Ok, but your fundamental premise is that all of this is an acceptable
space-time tradeoff for everyone using this platform, when it's not.
So I'm trying to understand the actual use case you're trying to address.
I doubt that's 5n1, full-duplex.
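Just so we're talking about the same mechanism, here is the shape of it
as I read the series -- a rough sketch only; the struct, the field names
and the 50% threshold are made up for illustration, and locking and
restart-on-drain are omitted:

	#include <linux/dmaengine.h>
	#include <linux/hrtimer.h>

	struct rx_wd {
		struct hrtimer	timer;
		ktime_t		period;		/* e.g. half the buffer time */
		struct dma_chan	*chan;
		dma_cookie_t	cookie;
		size_t		buf_size;	/* cyclic ring size in bytes */
		size_t		read_pos;	/* advanced by the completion path */
	};

	static enum hrtimer_restart rx_wd_fn(struct hrtimer *t)
	{
		struct rx_wd *wd = container_of(t, struct rx_wd, timer);
		struct dma_tx_state state;
		size_t write_pos, fill;

		/* residue assumed to report bytes left to the end of the ring */
		dmaengine_tx_status(wd->chan, wd->cookie, &state);
		write_pos = wd->buf_size - state.residue;
		fill = write_pos >= wd->read_pos ?
			write_pos - wd->read_pos :
			wd->buf_size - (wd->read_pos - write_pos);

		/* pause before the DMA can wrap over unread data */
		if (fill > wd->buf_size / 2)
			dmaengine_pause(wd->chan);

		hrtimer_forward_now(t, wd->period);
		return HRTIMER_RESTART;
	}

So on an idle line the timer still has to fire unconditionally, which is
where the 2 wakeups/sec come from.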
>> And with a latency hit this bad, you'll never get the data to the
>> process because the tty buffer kworker will buffer-overflow too and
>> its much more susceptible to timing latency (although not as bad now
>> that it's exclusively on the unbounded workqueue).
>
> Yes, you are correct. But I think that is a problem that should be
> addressed at the tty layer.
I disagree. I think you should fix the source of 500ms latency.
Regards,
Peter Hurley