Message-ID: <Zj49cYCdFuCGCXdU@pengutronix.de>
Date: Fri, 10 May 2024 17:29:53 +0200
From: Michael Grzeschik <mgr@...gutronix.de>
To: Thinh Nguyen <Thinh.Nguyen@...opsys.com>
Cc: Wesley Cheng <quic_wcheng@...cinc.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"michael.riesch@...fvision.net" <michael.riesch@...fvision.net>,
"kernel@...gutronix.de" <kernel@...gutronix.de>,
"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] usb: dwc3: gadget: create per ep interrupts
On Thu, May 09, 2024 at 12:23:09AM +0000, Thinh Nguyen wrote:
>On Thu, May 09, 2024, Michael Grzeschik wrote:
>> On Wed, May 08, 2024 at 11:20:03PM +0000, Thinh Nguyen wrote:
>> > On Wed, May 08, 2024, Michael Grzeschik wrote:
>> > > On Tue, May 07, 2024 at 11:57:36AM -0700, Wesley Cheng wrote:
>> > > > Hi Michael,
>> > > >
>> > > > On 5/6/2024 4:06 PM, Michael Grzeschik wrote:
>> > > > > This patch is splitting up the interrupt event handling from one
>> > > > > interrupt thread to separate per endpoint interrupt threads.
>> > > > >
>> > > >
>> > > > I assume that the incentive for doing this is to improve overall
>> > > > throughput numbers. Would you be able to share some data on the
>> > > > benefits of moving to per EP event management?
>> > >
>> > > The main benefit is to make it possible to use high-demand usb
>> > > endpoints simultaneously. In our special case we saw that streaming
>> > > via uac and streaming via uvc was producing noise in the audio
>> > > stream. This was due to the fact that the isoc feedback endpoint
>> > > that would adjust the samplerate was not being called fast enough
>> > > when there was a lot of traffic in the uvc endpoint context.
>> > >
>> > > By moving the endpoints into their own thread handlers, the short
>> > > feedback requests are at least able to be scheduled in between the
>> > > bursts of the uvc packets. The next step is to have all threads
>> > > running on different cpu cores, without interfering with each other.
>> > > However, as we still have no matrix irq allocator for arm, there is
>> > > no direct benefit from that yet.
>> > >
>> > >
>> > > > > To achieve this we create a new dwc3 interrupt domain in which
>> > > > > we map all claimed interrupts to individual interrupt threads.
>> > > > >
>> > > > > Although the gadget layer is preparing the claimed parameter
>> > > > > of each usb_ep, which could be checked to see whether the endpoint
>> > > > > is to be used or not, the claimed value was 0 for each ep in gadget_start.
>> > > > > This was tested when describing some composite gadget using configfs.
>> > > > >
>> > > >
>> > > > yeah... the claimed flag is cleared by the USB gadget, ie USB configfs
>> > > > (not sure if you're using this) whenever it adds a USB config. This is
>> > > > to handle multi config situations, so subsequent USB configs can be
assigned (reuse) endpoints, since only one config is active at a time
>> > > > for a USB device.
>> > > >
>> > > > This was a struggle for me as well when adding the TXFIFO resizing
>> > > > logic. We won't actually know which EPs are going to be used until the
>> > > > host issues the set configuration packet to select a config, and the
>> > > > set_alt() callback issues usb_ep_enable(). So the implementation
>> > > > (TXFIFO resizing) is currently based on the maximum potential endpoints
>> > > > used by any USB configuration.
>> > > >
>> > > > Not sure if having 31 (potentially) different IRQ entries would be ok,
>> > > > but maybe it would be simpler to just to request IRQ for dwc->num_eps
>> > > > always?
>> > > >
>> > > > Have you tried this on a multi config device?
>> > >
>> > > No, I didn't. I doubt that this will work after your explanation. So
>> > > thanks for the insights!
>> > >
>> > > I tried putting the request_threaded_irq into the ep_enable function,
>> > > but this does not work, as I see a lot of "scheduling while atomic"
>> > > errors. This makes sense, as ep_enable is called from a set_alt
>> > > request coming from the ep0 interrupt thread context.
>> > >
>> > > So there is probably no other option left to get exact per-endpoint
>> > > interrupt threads. I will rework this to request a kthread for each
>> > > endpoint, even though we will probably not be using all of them.
>> > >
>> >
>> > Do you have any data on latency here?
>>
>> I don't have the exact numbers for the uac feedback isoc endpoint
>> at the moment. But without the patch applied, it was reproducibly
>> returning with EXDEV when we started uvc streaming and thereby
>> increased the amount of events per interrupt thread cycle.
>>
>> With the patch applied however, we are able to only route the events to
>> the corresponding soft irqs and leave the moment of truth to the
>> scheduler.
>
>Basically you're trying to increase the priority of dwc3 via the greater
>number of soft interrupts.
Possible. Never thought about this.
>> > I don't see how introducing more soft interrupts would improve on
>> > latency, if anything, it should be worse?
>>
>> Why should explicit handling of coherent ep events on one cpu core
>> introduce more latency than interleaving different events for
>> arbitrary eps all in one thread?
>
>Because we are only using a single interrupt line, the sequence of
>events needs to be handled one set at a time. The amount of time saved
>from handling interrupts of different endpoints should be minuscule.
>There's latency to context switching and locking, which I think would
>offset and introduce more latency than what you can potentially save.
>
>I'd like to see data on the improvement on the net latency here.
If this is the case, then we are currently spending way too much time
in the complete handlers of the endpoints. I can't really tell for the
uac endpoints, but the uvc endpoint's completion path is going through
this roundtrip.
With no_interrupt = 0 on every 16th request:
dwc3_endpoint_interrupt
  dwc3_gadget_endpoint_trbs_complete
    dwc3_gadget_ep_cleanup_completed_requests
    ~16 * {
      dwc3_gadget_ep_cleanup_completed_request
        dwc3_gadget_giveback
          usb_gadget_giveback_request
            usb_ep_queue
              __dwc3_gadget_ep_queue
                dwc3_prepare_trbs
                ~ * {
                  dwc3_prepare_trbs_sg/dwc3_prepare_trbs_linear
                }
                dwc3_send_gadget_ep_cmd
    }
I think this is a lot of stack for an interrupt thread to handle if you
really want to pipeline this in one irqthread run and still make sure
that the other endpoints will also be handled soon enough.
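For reference, this kind of per-completion call depth can be observed
on a running system with the function_graph tracer; the snippet below
assumes tracefs is mounted at the usual /sys/kernel/tracing:

```
cd /sys/kernel/tracing
echo function_graph > current_tracer
echo dwc3_endpoint_interrupt > set_graph_function
echo 1 > tracing_on
cat trace
```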
>>
>> > This is making the driver way more complicated and potentially
>> > introduce many bugs.
>>
>> Possible, but not unsolvable.
>>
>> > I may be wrong here, but I suspect that by multiplying the interrupt
>> > handlings, you _may_ see improvement due to a higher chance of being
>> > selected by the scheduler. However, the overall latency will probably
>> > be worse. (correct me if I'm wrong).
>>
>> I doubt that it will be worse if each softirq can be handled on
>> different cpus at the same time.
>
>See comment above.
To solve this issue I see two options:
We either have separate interrupt threads per ep, as in this patch,
or we ensure that the complete handlers do not run that long.
The latter could be done by providing an interface similar to the
threaded interrupt interface, where the complete handler only wakes
up a corresponding completion thread.
This policy of short-running complete handlers should also be
documented somewhere in the kernel.
Which brings me back to the open discussion with Avichal, where I
already mentioned that it should be possible to completely remove the
usb_ep_queue call from the complete handler. There we should only
update the buffer state and make sure that the pump worker takes care
of queueing the right buffers to the dwc3 driver. I will go into more
detail in this thread:
https://lore.kernel.org/all/17192e0f-7f18-49ae-96fc-71054d46f74a@google.com/
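To illustrate the split, here is a minimal sketch (all names are made
up for illustration, this is not an existing interface): the complete
callback only marks the request as done and kicks a worker, and the
worker does the buffer handling and requeues from process context:

```
/*
 * Hypothetical sketch only: demo_* names and the exact layout are
 * invented to illustrate the idea, not existing kernel code.
 */
struct demo_ep_ctx {
	struct usb_ep		*ep;
	struct list_head	done_list;	/* completed requests */
	spinlock_t		lock;
	struct work_struct	pump;		/* process-context half */
};

/* Short hot path: runs in the dwc3 interrupt thread. */
static void demo_complete(struct usb_ep *ep, struct usb_request *req)
{
	struct demo_ep_ctx *ctx = req->context;
	unsigned long flags;

	spin_lock_irqsave(&ctx->lock, flags);
	list_add_tail(&req->list, &ctx->done_list);
	spin_unlock_irqrestore(&ctx->lock, flags);

	/* No usb_ep_queue() here, just wake the worker. */
	queue_work(system_highpri_wq, &ctx->pump);
}

/* Long tail: runs in process context, outside the event handler. */
static void demo_pump(struct work_struct *work)
{
	struct demo_ep_ctx *ctx = container_of(work, struct demo_ep_ctx, pump);
	struct usb_request *req, *tmp;
	unsigned long flags;
	LIST_HEAD(done);

	spin_lock_irqsave(&ctx->lock, flags);
	list_splice_init(&ctx->done_list, &done);
	spin_unlock_irqrestore(&ctx->lock, flags);

	list_for_each_entry_safe(req, tmp, &done, list) {
		list_del(&req->list);
		/* update buffer state, refill req->buf, then requeue */
		usb_ep_queue(ctx->ep, req, GFP_KERNEL);
	}
}
```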
>> > This will affect other applications.
>>
>> Let's make sure we will not break anything on the way. Okay? :)
>>
>> > Let's not do this.
>>
>> I actually thought that this is even requested:
>>
>> https://docs.kernel.org/usb/dwc3.html
>>
>
>That's a very old and outdated TODO list.
We should ensure that this chapter will be removed then.
>We don't use wait_for_completion_timeout in the commands. During
>transfers, we're using Update Transfer command, and it completes almost
>immediately. The only time where a command may take a longer time is
>when we need to bring the device down for reset/disconnect and stop
>transfers, but that's not what contributes to the problem here.
>
>Internal tests show that we can achieve very close theoretical USB
>speeds with the current dwc3 implementation.
Granted, but only if we ensure that the complete() callback is not
blowing up the runtime and that usb_ep_queue is never called from the
complete callback.
Michael
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |