linux-kernel - RE: pcm|dmaengine|imx-sdma race condition on i.MX6

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <VE1PR04MB6638AC2A3AE852C3047E7B97895F0@VE1PR04MB6638.eurprd04.prod.outlook.com>
Date:   Mon, 17 Aug 2020 09:22:46 +0000
From:   Robin Gong <yibin.gong@....com>
To:     Benjamin Bara - SKIDATA <Benjamin.Bara@...data.com>
CC:     "timur@...nel.org" <timur@...nel.org>,
        "nicoleotsuka@...il.com" <nicoleotsuka@...il.com>,
        "vkoul@...nel.org" <vkoul@...nel.org>,
        "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
        dl-linux-imx <linux-imx@....com>,
        "shawnguo@...nel.org" <shawnguo@...nel.org>,
        "kernel@...gutronix.de" <kernel@...gutronix.de>,
        "dmaengine@...r.kernel.org" <dmaengine@...r.kernel.org>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "alsa-devel@...a-project.org" <alsa-devel@...a-project.org>,
        Richard Leitner - SKIDATA <Richard.Leitner@...data.com>
Subject: RE: pcm|dmaengine|imx-sdma race condition on i.MX6

On 2020/08/17 15:29 Benjamin Bara - SKIDATA <Benjamin.Bara@...data.com> wrote:
> We think this is not an i.MX6-specific problem, but a problem of the
> DMAengine usage from the PCM.
> In case of a XRUN, the DMA channel is never closed but first a
> SNDRV_PCM_TRIGGER_STOP next a SNDRV_PCM_TRIGGER_START is triggered.
> The SNDRV_PCM_TRIGGER_STOP simply executes a
> dmaengine_terminate_async() [1] but does not await the termination by calling
> dmaengine_synchronize(), which is required as stated by the docu [2].
> Anyways, we are not able to fix it in the pcm_dmaengine layer either at the
> end of SNDRV_PCM_TRIGGER_STOP (called from the DMA on complete
> interrupt handler) or at the beginning of SNDRV_PCM_TRIGGER_START (called
> from a PCM ioctl), since the dmaengine_synchronize() requires a non-atomic
> context.
> 
> Based on my understanding, most of the DMA implementations don't even
> implement device_synchronize and if they do, it might not really be necessary
> since the terminate_all operation is synchron.
> 
> With the i.MX6, it looks a bit different:
> Since [4], the terminate_all operation really schedules a worker which waits
> the required ~1ms and then does the context freeing.
> Now, the ioctl(SNDRV_PCM_IOCTL_PREPARE) and the following
> ioctl(SNDRV_PCM_IOCTL_READI_FRAMES),
> which are called from US to handle/recover from a XRUN, are in a race with the
> terminate_worker.
> If the terminate_worker finishes earlier, everything is fine.
> Otherwise, the sdma_prep_dma_cyclic() is called, sets up everything and as
> soon as it is scheduled out to wait for data, the terminate_worker is scheduled
> and kills it.
> In this case, we wait in [5] until the timeout is reached and return with -EIO.
> 
> Based on our understanding, there exist two different fixing approaches:
> We thought that the pcm_dmaengine should handle this by either
> synchronizing the DMA on a trigger or terminating it synchronously.
> However, as we are in an atomic context, we either have to give up the atomic
> context of the PCM to finish the termination or we have to design a
> synchronous terminate variant which is callable from an atomic context.
> 
> For the first option, which is potentially more performant, we have to leave the
> atomic PCM context and we are not sure if we are allowed to.
> For the second option, we would have to divide the dma_device terminate_all
> into an atomic sync and an async one, which would align with the dmaengine
> API, giving it the option to ensure termination in an atomic context.
> Based on my understanding, most of them are synchronous anyways, for the
> currently async ones we would have to implement busy waits.
> However, with this approach, we reach the WARN_ON [6] inside of an atomic
> context, indicating we might not do the right thing.
> 
> Ad Failure Log (at bottom):
> I haven't added the ioctl syscalls, but this is basically the output with additional
> prints to be able to follow the code execution path.
> A XRUN (buffer size is 480 but 960 available) leads to a
> SNDRV_PCM_TRIGGER_STOP.
> This leads to terminate_async, starting the terminate_worker.
> Next, the XRUN recovery triggers SNDRV_PCM_TRIGGER_START, calling
> sdma_prep_dma_cyclic and then waits for the DMA in wait_for_avail().
> Next we see the two freeings, first the old, then the newly added one; so the
> terminate_worker is back at work.
> Now the DMA is terminated, while we are still waiting on data from it.
> 
> What do you think about it? Is any of the provided solutions practicable?
> If you need further information or additional logging, feel free to ask.
busy_wait is not good I think, would you please have a try with the attached patch
which is based on https://lkml.org/lkml/2020/8/11/111? The basic idea is
to keep the freed descriptor into another list for freeing in later terminate_worker
instead of freeing directly all in terminate_worker by vchan_get_all_descriptors
which may break next descriptor coming soon


Download attachment "0001-dmaengine-imx-sdma-add-terminated-list-for-freed-des.patch" of type "application/octet-stream" (2732 bytes)