linux-kernel - Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b872fd98-102b-09ef-d43e-0a6aa2a40696@gmail.com>
Date:   Thu, 17 Jun 2021 21:54:39 +0200
From:   Ferry Toth <fntoth@...il.com>
To:     Wesley Cheng <wcheng@...eaurora.org>,
        Andy Shevchenko <andy.shevchenko@...il.com>,
        Heikki Krogerus <heikki.krogerus@...ux.intel.com>
Cc:     Felipe Balbi <balbi@...nel.org>,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Greg KH <gregkh@...uxfoundation.org>,
        Andy Gross <agross@...nel.org>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Rob Herring <robh+dt@...nel.org>,
        USB <linux-usb@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-arm-msm@...r.kernel.org,
        devicetree <devicetree@...r.kernel.org>,
        Jack Pham <jackp@...eaurora.org>,
        Thinh Nguyen <Thinh.Nguyen@...opsys.com>,
        John Youn <John.Youn@...opsys.com>
Subject: Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting

Hi

Op 17-06-2021 om 10:30 schreef Wesley Cheng:
>
> On 6/17/2021 12:47 AM, Ferry Toth wrote:
>> Hi
>>
>> Op 17-06-2021 om 06:25 schreef Wesley Cheng:
>>> On 6/15/2021 12:53 PM, Ferry Toth wrote:
>>>> Hi
>>>>
>>>> Op 15-06-2021 om 06:22 schreef Wesley Cheng:
>>>>> On 6/14/2021 12:30 PM, Ferry Toth wrote:
>>>>>> Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>>>>>>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>>>>>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>>>>>>>> <heikki.krogerus@...ux.intel.com> wrote:
>>>>>>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Wesley Cheng <wcheng@...eaurora.org> writes:
>>>>>>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>>>>>>>> the build
>>>>>>>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>>>>>>>> core with
>>>>>>>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>>>>>>>> endpoints with
>>>>>>>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>>>>>>>> some Intel
>>>>>>>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>>>>>>>> tracing.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>>>>>>>> and takes
>>>>>>>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>>>>>>>> processor tracing
>>>>>>>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>>>>>>>> vendors,
>>>>>>>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>>>>>>>> property, so that
>>>>>>>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>>>>>>>> disabled.  The
>>>>>>>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>>>>>>>> for each EP
>>>>>>>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>>>>>>>> internal
>>>>>>>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>>>>>>>> like we can
>>>>>>>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>>>>>>>> looking
>>>>>>>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>>>>>>>> likely
>>>>>>>>>>>>> to break things left and right :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>>>>>>
>>>>>>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some
>>>>>>>>>>>> pretty
>>>>>>>>>>>> big improvements on the TX path with this.
>>>>>>>>>>> fingers crossed
>>>>>>>>>>>
>>>>>>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>>>>>>>> were
>>>>>>>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>>>>>>>> If that
>>>>>>>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>>>>>>>> feature flags paired with ->validate_endpoint() or whatever
>>>>>>>>>>>>> before
>>>>>>>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC
>>>>>>>>>>>>> core to
>>>>>>>>>>>>> skip that endpoint on that particular platform without
>>>>>>>>>>>>> interefering with
>>>>>>>>>>>>> everything else.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>>>>>>>> different
>>>>>>>>>>>>> dwc3 instantiations.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> was the change which ended up upstream for the Intel tracer
>>>>>>>>>>>>>> then we
>>>>>>>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>>>>>>>> will be
>>>>>>>>>>>>> coming up with a solution that's elegant and works for all
>>>>>>>>>>>>> different
>>>>>>>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>>>>>>
>>>>>>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be
>>>>>>>>>>>> needing to
>>>>>>>>>>>> focus on the DWC3 implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>>>>>>>> taking a look at, which is a nice way we can handle vendor
>>>>>>>>>>>> specific
>>>>>>>>>>>> endpoints and how they can co-exist with other "normal"
>>>>>>>>>>>> endpoints.  We
>>>>>>>>>>>> have a few special HW eps as well, which we try to maintain
>>>>>>>>>>>> separately
>>>>>>>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or
>>>>>>>>>>>> most
>>>>>>>>>>>> pretty method :).
>>>>>>>>>>> Awesome, as mentioned, the endpoint feature flags were added
>>>>>>>>>>> exactly to
>>>>>>>>>>> allow for these vendor-specific features :-)
>>>>>>>>>>>
>>>>>>>>>>> I'm more than happy to help testing now that I finally got our
>>>>>>>>>>> SM8150
>>>>>>>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>>>>>>>
>>>>>>>>>>>>>> However, I'm not sure how the changes would look like in the
>>>>>>>>>>>>>> end,
>>>>>>>>>>>>>> so I
>>>>>>>>>>>>>> would like to wait later down the line to include that :).
>>>>>>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>>>>>>>> though?
>>>>>>>>>>>>> Did you test $subject with upstream too? Which gadget drivers
>>>>>>>>>>>>> did you
>>>>>>>>>>>>> use? How did you test
>>>>>>>>>>>>>
>>>>>>>>>>>> The results that I included in the cover page was tested with the
>>>>>>>>>>>> pure
>>>>>>>>>>>> upstream kernel on our device.  Below was using the ConfigFS
>>>>>>>>>>>> gadget
>>>>>>>>>>>> w/ a
>>>>>>>>>>>> mass storage only composition.
>>>>>>>>>>>>
>>>>>>>>>>>> Test Parameters:
>>>>>>>>>>>>      - Platform: Qualcomm SM8150
>>>>>>>>>>>>      - bMaxBurst = 6
>>>>>>>>>>>>      - USB req size = 256kB
>>>>>>>>>>>>      - Num of USB reqs = 16
>>>>>>>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>>>>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>>>>>>>
>>>>>>>>>>>>      - USB Speed = Super-Speed
>>>>>>>>>>>>      - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>>>>>>      - Test Application: CrystalDiskMark
>>>>>>>>>>>>
>>>>>>>>>>>> Results:
>>>>>>>>>>>>
>>>>>>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>>>>>>
>>>>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>>>>> Read      |9 loops    | 193.60
>>>>>>>>>>>>               |           | 195.86
>>>>>>>>>>>>               |           | 184.77
>>>>>>>>>>>>               |           | 193.60
>>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>>>>>>
>>>>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>>>>> Read      |9 loops    | 287.35
>>>>>>>>>>>>             |           | 304.94
>>>>>>>>>>>>               |           | 289.64
>>>>>>>>>>>>               |           | 293.61
>>>>>>>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>>>>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024,
>>>>>>>>>>> though my
>>>>>>>>>>> memory could be failing.
>>>>>>>>>>>
>>>>>>>>>>> Then again, I never ran with CrystalDiskMark, I was using my own
>>>>>>>>>>> tool
>>>>>>>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>>>>>>>
>>>>>>>>>>>> We also have internal numbers which have shown similar
>>>>>>>>>>>> improvements as
>>>>>>>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>>>>>>>> IPERF
>>>>>>>>>>>> loopback over TCP/UDP.
>>>>>>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>>>>>>
>>>>>>>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>>>>>>>> beneficial to
>>>>>>>>>>>>>> us, because we can only change the TX threshold to 2 at max,
>>>>>>>>>>>>>> and at
>>>>>>>>>>>>>> least in my observations, once we have to go out to system
>>>>>>>>>>>>>> memory to
>>>>>>>>>>>>>> fetch the next data packet, that latency takes enough time
>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>> controller to end the current burst.
>>>>>>>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>>>>>>>> cost of
>>>>>>>>>>>>> fetching data from memory, with a deeper request queue.
>>>>>>>>>>>>> Whenever I
>>>>>>>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>>>>>>>> that was
>>>>>>>>>>>>> enough to give me very good performance. Never had to poke at TX
>>>>>>>>>>>>> FIFO
>>>>>>>>>>>>> resizing. Did you try something like this too?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>>>>>>>> generic
>>>>>>>>>>>>> method that changing FIFO sizes :)
>>>>>>>>>>>>>
>>>>>>>>>>>> I wish I had a USB bus trace handy to show you, which would
>>>>>>>>>>>> make it
>>>>>>>>>>>> very
>>>>>>>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>>>>>>>> 6.  So
>>>>>>>>>>>> by increasing the number of USB requests, that will help if there
>>>>>>>>>>>> was a
>>>>>>>>>>>> bottleneck at the SW level where the application/function driver
>>>>>>>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>>>>>>>> processing them.
>>>>>>>>>>>>
>>>>>>>>>>>> So yes, this method of increasing the # of USB reqs will
>>>>>>>>>>>> definitely
>>>>>>>>>>>> help
>>>>>>>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't
>>>>>>>>>>>> used.
>>>>>>>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes
>>>>>>>>>>>> endpoint
>>>>>>>>>>>> bursting.
>>>>>>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>>>>>>>> role
>>>>>>>>>>> here too. I have clear memories of testing this very scenario of
>>>>>>>>>>> bursting (using g_mass_storage at the time) because I was curious
>>>>>>>>>>> about
>>>>>>>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>>>>>>>
>>>>>>>>>>> It could be nice if Heikki could test Intel parts with and without
>>>>>>>>>>> your
>>>>>>>>>>> changes on g_mass_storage with 250 requests.
>>>>>>>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>>>>>>>> right? Can you help out here?
>>>>>>>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>>>>>>>> more test cases (I have only one or two) and maybe can help. But I'll
>>>>>>>>> keep this in mind.
>>>>>>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>>>>>>>> apply. Switching between host/gadget works, no connections
>>>>>>>> dropping, no
>>>>>>>> errors in dmesg.
>>>>>>>>
>>>>>>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>>>>>>>> composite device created from configfs with gser / eem /
>>>>>>>> mass_storage /
>>>>>>>> uac2.
>>>>>>>>
>>>>>>>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>>>>>>>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>>>>>>>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>>>>>>>
>>>>>>>> Gadget seems to be a little faster with the patches, but that might
>>>>>>>> also
>>>>>>>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>>>>>>>> between 207 and 199.
>>>>>>>>
>>>>>>>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec.
>>>>>>>> With
>>>>>>>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>>>>>>>
>>>>>>>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>>>>>>>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>>>>>>>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>>>>>>>
>>>>>>> Hi Ferry,
>>>>>>>
>>>>>>> Thanks for the testing.  Just to double check, did you also enable the
>>>>>>> property, which enabled the TXFIFO resize feature on the platform?  For
>>>>>>> example, for the QCOM SM8150 platform, we're adding the following to
>>>>>>> our
>>>>>>> device tree node:
>>>>>>>
>>>>>>> tx-fifo-resize
>>>>>>>
>>>>>>> If not, then your results at least confirms that w/o the property
>>>>>>> present, the changes won't break anything :).  Thanks again for the
>>>>>>> initial testing!
>>>> I applied the patch now to 5.13.0-rc5 + the following:
>>>>
>>> Hi Ferry,
>>>
>>> Quick question...there was a compile error with the V9 patch series, as
>>> it was using the dwc3_mwidth() incorrectly.  I will update this with the
>>> proper use of the mdwidth, but which patch version did you use?
> Hi Ferry,
>
>> The V9 set gets applied to 5.13.0-rc5 by Yocto, if it doesn't apply it
>> stops the build. I didn't notice any compile errors, they stop the whole
>> build process too. But warnings are ignored. I'll check the logs to be sure.
> Ah, ok.  I think the incorrect usage of the API will result as a warning
> as seen in Greg's compile log:
>
> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of
> ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>    653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
I found exactly this warning in my logs. Sorry for this noise, I was 
watching for an error.
> This is probably why the page fault occurs, as we're not passing in the
> DWC3 struct.  I will send out a V10 shortly after testing it on my device.
>
> Thanks
> Wesley Cheng
>
>>> Thanks
>>> Wesley Cheng
>>>
>>>> --- a/drivers/usb/dwc3/dwc3-pci.c
>>>> +++ b/drivers/usb/dwc3/dwc3-pci.c
>>>> @@ -124,6 +124,7 @@ static const struct property_entry
>>>> dwc3_pci_mrfld_properties[] = {
>>>>       PROPERTY_ENTRY_BOOL("snps,dis_u3_susphy_quirk"),
>>>>       PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"),
>>>>       PROPERTY_ENTRY_BOOL("snps,usb2-gadget-lpm-disable"),
>>>> +    PROPERTY_ENTRY_BOOL("tx-fifo-resize"),
>>>>       PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"),
>>>>       {}
>>>>   };
>>>>
>>>>   and when switching to gadget mode unfortunately received the following
>>>> oops:
>>>>
>>>> BUG: unable to handle page fault for address: 00000000202043f2
>>>> #PF: supervisor read access in kernel mode
>>>> #PF: error_code(0x0000) - not-present page
>>>> PGD 0 P4D 0
>>>> Oops: 0000 [#1] SMP PTI
>>>> CPU: 0 PID: 617 Comm: conf-gadget.sh Not tainted
>>>> 5.13.0-rc5-edison-acpi-standard #1
>>>> Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542
>>>> 2015.01.21:18.19.48
>>>> RIP: 0010:dwc3_gadget_check_config+0x33/0x80
>>>> Code: 59 04 00 00 04 74 61 48 c1 ee 10 48 89 f7 f3 48 0f b8 c7 48 89 c7
>>>> 39 81 60 04 00 00 7d 4a 89 81 60 04 00 00 8b 81 08 04 00 00 <81> b8 e8
>>>> 03 00 00 32 33 00 00 0f b6 b0 09 04 00 00 75 0d 8b 80 20
>>>> RSP: 0018:ffffb5550038fda0 EFLAGS: 00010297
>>>> RAX: 000000002020400a RBX: ffffa04502627348 RCX: ffffa04507354028
>>>> RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000004
>>>> RBP: ffffa04508ac0550 R08: ffffa04503a75b2c R09: 0000000000000000
>>>> R10: 0000000000000216 R11: 000000000002eba0 R12: ffffa04508ac0550
>>>> R13: dead000000000100 R14: ffffa04508ac0600 R15: ffffa04508ac0520
>>>> FS:  00007f7471e2f740(0000) GS:ffffa0453e200000(0000)
>>>> knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 00000000202043f2 CR3: 0000000003f38000 CR4: 00000000001006f0
>>>> Call Trace:
>>>>   configfs_composite_bind+0x2f4/0x430 [libcomposite]
>>>>   udc_bind_to_driver+0x64/0x180
>>>>   usb_gadget_probe_driver+0x114/0x150
>>>>   gadget_dev_desc_UDC_store+0xbc/0x130 [libcomposite]
>>>>   configfs_write_file+0xcd/0x140
>>>>   vfs_write+0xbb/0x250
>>>>   ksys_write+0x5a/0xd0
>>>>   do_syscall_64+0x40/0x80
>>>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> RIP: 0033:0x7f7471f1ff53
>>>> Code: 8b 15 21 cf 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f
>>>> 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00
>>>> f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
>>>> RSP: 002b:00007fffa3dcd328 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7471f1ff53
>>>> RDX: 000000000000000c RSI: 00005614d615a770 RDI: 0000000000000001
>>>> RBP: 00005614d615a770 R08: 000000000000000a R09: 00007f7471fb20c0
>>>> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>>>> R13: 00007f7471fee520 R14: 000000000000000c R15: 00007f7471fee720
>>>> Modules linked in: usb_f_uac2 u_audio usb_f_mass_storage usb_f_eem
>>>> u_ether usb_f_serial u_serial libcomposite rfcomm iptable_nat bnep
>>>> snd_sof_nocodec spi_pxa2xx_platform dw_dmac smsc snd_sof_pci_intel_tng
>>>> snd_sof_pci snd_sof_acpi_intel_byt snd_sof_intel_ipc snd_sof_acpi
>>>> smsc95xx snd_sof pwm_lpss_pci pwm_lpss snd_sof_xtensa_dsp
>>>> snd_intel_dspcfg snd_soc_acpi_intel_match snd_soc_acpi dw_dmac_pci
>>>> intel_mrfld_pwrbtn intel_mrfld_adc dw_dmac_core spi_pxa2xx_pci brcmfmac
>>>> brcmutil leds_gpio hci_uart btbcm ti_ads7950
>>>> industrialio_triggered_buffer kfifo_buf ledtrig_timer ledtrig_heartbeat
>>>> mmc_block extcon_intel_mrfld sdhci_pci cqhci sdhci led_class
>>>> intel_soc_pmic_mrfld mmc_core btrfs libcrc32c xor zstd_compress
>>>> zlib_deflate raid6_pq
>>>> CR2: 00000000202043f2
>>>> ---[ end trace 5c11fe50dca92ad4 ]---
>>>>
>>>>>> No I didn't. Afaik we don't have a devicetree property to set.
>>>>>>
>>>>>> But I'd be happy to test that as well. But where to set the property?
>>>>>>
>>>>>> dwc3_pci_mrfld_properties[] in dwc3-pci?
>>>>>>
>>>>> Hi Ferry,
>>>>>
>>>>> Not too sure which DWC3 driver is used for the Intel platform, but I
>>>>> believe that should be the one. (if that's what is normally used)  We'd
>>>>> just need to add an entry w/ the below:
>>>>>
>>>>> PROPERTY_ENTRY_BOOL("tx-fifo-resize")
>>>>>
>>>>> Thanks
>>>>> Wesley Cheng
>>>>>
>>>>>>> Thanks
>>>>>>> Wesley Cheng
>>>>>>>
>>>>>>>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>>>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>>>>>>>> Packet,
>>>>>>>>>>>> it should have a NumP value equal to the bMaxBurst reported in
>>>>>>>>>>>> the EP
>>>>>>>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>>>>>>>> based
>>>>>>>>>>> on bMaxBurst, but it was changed later in commit
>>>>>>>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because
>>>>>>>>>>> of a
>>>>>>>>>>> problem reported by John Youn.
>>>>>>>>>>>
>>>>>>>>>>> And now we've come full circle. Because even if I believe more
>>>>>>>>>>> requests
>>>>>>>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This
>>>>>>>>>>> ends
>>>>>>>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>>>>>>>> squeeze more throughput out of the controller.
>>>>>>>>>>>
>>>>>>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In
>>>>>>>>>>> fact,
>>>>>>>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about
>>>>>>>>>>> NumP
>>>>>>>>>>> (emphasis is mine):
>>>>>>>>>>>
>>>>>>>>>>>           "Number of Packets (NumP). This field is used to
>>>>>>>>>>> indicate the
>>>>>>>>>>>           number of Data Packet buffers that the **receiver** can
>>>>>>>>>>>           accept. The value in this field shall be less than or
>>>>>>>>>>> equal to
>>>>>>>>>>>           the maximum burst size supported by the endpoint as
>>>>>>>>>>> determined
>>>>>>>>>>>           by the value in the bMaxBurst field in the Endpoint
>>>>>>>>>>> Companion
>>>>>>>>>>>           Descriptor (refer to Section 9.6.7)."
>>>>>>>>>>>
>>>>>>>>>>> So, NumP is for the receiver, not the transmitter. Could you
>>>>>>>>>>> clarify
>>>>>>>>>>> what you mean here?
>>>>>>>>>>>
>>>>>>>>>>> /me keeps reading
>>>>>>>>>>>
>>>>>>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>>>>>>
>>>>>>>>>>>           "Number of Packets (NumP).
>>>>>>>>>>>
>>>>>>>>>>>           For an OUT endpoint, refer to Table 8-13 for the
>>>>>>>>>>> description of
>>>>>>>>>>>           this field.
>>>>>>>>>>>
>>>>>>>>>>>           For an IN endpoint this field is set by the endpoint to
>>>>>>>>>>> the
>>>>>>>>>>>           number of packets it can transmit when the host resumes
>>>>>>>>>>>           transactions to it. This field shall not have a value
>>>>>>>>>>> greater
>>>>>>>>>>>           than the maximum burst size supported by the endpoint as
>>>>>>>>>>>           indicated by the value in the bMaxBurst field in the
>>>>>>>>>>> Endpoint
>>>>>>>>>>>           Companion Descriptor. Note that the value reported in this
>>>>>>>>>>> field
>>>>>>>>>>>           may be treated by the host as informative only."
>>>>>>>>>>>
>>>>>>>>>>> However, if I remember correctly (please verify dwc3 databook),
>>>>>>>>>>> NUMP in
>>>>>>>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3
>>>>>>>>>>> compute
>>>>>>>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>>>>>>>> DCFG.NUMP or
>>>>>>>>>>> TxFIFO size?
>>>>>>>>>>>
>>>>>>>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>>>>>>>> seen is
>>>>>>>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>>>>>>>> need
>>>>>>>>>>>> to send an ERDY once data is available within the FIFO, and the
>>>>>>>>>>>> same
>>>>>>>>>>>> sequence happens until the USB request is complete.  With this
>>>>>>>>>>>> constant
>>>>>>>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is
>>>>>>>>>>>> under
>>>>>>>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>>>>>>>> constant
>>>>>>>>>>>> bursts for a request, until the request is done, or if the host
>>>>>>>>>>>> runs out
>>>>>>>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during
>>>>>>>>>>>> USB
>>>>>>>>>>>> request data transfer)
>>>>>>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Good points.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>>>>>>>> different devices?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>>>>>>>> testing :).
>>>>>>>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>>>>>>>> Bluetooth, anyway
>>>>>>>>>>>>>>> :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>>>>>>>> flash dev
>>>>>>>>>>>>>>> images heh.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I used to be a customer facing engineer, so honestly I did see
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> really interesting and crazy designs.  Again, we do have
>>>>>>>>>>>>>> non-Android
>>>>>>>>>>>>>> products that use the same code, and it has been working in
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>> for a
>>>>>>>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>>>>>>>> multimedia
>>>>>>>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>>>>>>>> end CPU
>>>>>>>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>>>>>>>> missed
>>>>>>>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>>>>>>>> This is good background information. Thanks for bringing this
>>>>>>>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm
>>>>>>>>>>>>> interested in
>>>>>>>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>>>>>>>> endpoint. If
>>>>>>>>>>>>> our gadget driver uses a low number of requests, we're never
>>>>>>>>>>>>> really
>>>>>>>>>>>>> using the TRB ring in our benefit.
>>>>>>>>>>>>>
>>>>>>>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>>>>>>>> resizing. :).
>>>>>>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX
>>>>>>>>>>> Burst
>>>>>>>>>>> behavior.
>>>>>>>>>> -- 
>>>>>>>>>> heikki