linux-kernel - Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=NsqxXP+SjH-ud8sjHD5y_LxZGUDnwHNPbzr_0RPwqVrwpPw@mail.gmail.com>
Date:   Wed, 1 Nov 2023 12:31:01 +0100
From:   Kornel Dulęba <korneld@...omium.org>
To:     Adrian Hunter <adrian.hunter@...el.com>
Cc:     linux-mmc@...r.kernel.org, linux-kernel@...r.kernel.org,
        Ulf Hansson <ulf.hansson@...aro.org>,
        Radoslaw Biernacki <biernacki@...gle.com>,
        Gwendal Grignou <gwendal@...omium.org>
Subject: Re: [PATCH 1/2] mmc: cqhci: Add a quirk to clear stale TC

On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <adrian.hunter@...el.com> wrote:
>
> On 27/10/23 17:56, Kornel Dulęba wrote:
> > This fix addresses a stale task completion event issued right after the
> > CQE recovery. As it's a hardware issue the fix is done in form of a
> > quirk.
> >
> > When error interrupt is received the driver runs recovery logic is run.
> > It halts the controller, clears all pending tasks, and then re-enables
> > it. On some platforms a stale task completion event is observed,
> > regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> >
> > This results in either:
> > a) Spurious TC completion event for an empty slot.
> > b) Corrupted data being passed up the stack, as a result of premature
> >    completion for a newly added task.
> >
> > To fix that re-enable the controller, clear task completion bits,
> > interrupt status register and halt it again.
> > This is done at the end of the recovery process, right before interrupts
> > are re-enabled.
> >
> > Signed-off-by: Kornel Dulęba <korneld@...omium.org>
> > ---
> >  drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> >  drivers/mmc/host/cqhci.h      |  1 +
> >  2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > index b3d7d6d8d654..e534222df90c 100644
> > --- a/drivers/mmc/host/cqhci-core.c
> > +++ b/drivers/mmc/host/cqhci-core.c
> > @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> >  /* CQHCI could be expected to clear it's internal state pretty quickly */
> >  #define CQHCI_CLEAR_TIMEOUT          20
> >
> > +/*
> > + * During CQE recovery all pending tasks are cleared from the
> > + * controller and its state is being reset.
> > + * On some platforms the controller sets a task completion bit for
> > + * a stale(previously cleared) task right after being re-enabled.
> > + * This results in a spurious interrupt at best and corrupted data
> > + * being passed up the stack at worst. The latter happens when
> > + * the driver enqueues a new request on the problematic task slot
> > + * before the "spurious" task completion interrupt is handled.
> > + * To fix it:
> > + * 1. Re-enable controller by clearing the halt flag.
> > + * 2. Clear interrupt status and the task completion register.
> > + * 3. Halt the controller again to be consistent with quirkless logic.
> > + *
> > + * This assumes that there are no pending requests on the queue.
> > + */
> > +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> > +{
> > +     u32 reg;
> > +
> > +     WARN_ON(cq_host->qcnt);
> > +     cqhci_writel(cq_host, 0, CQHCI_CTL);
> > +     if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> > +             pr_err("%s: cqhci: CQE failed to exit halt state\n",
> > +                     mmc_hostname(cq_host->mmc));
> > +     }
> > +     reg = cqhci_readl(cq_host, CQHCI_TCN);
> > +     cqhci_writel(cq_host, reg, CQHCI_TCN);
> > +     reg = cqhci_readl(cq_host, CQHCI_IS);
> > +     cqhci_writel(cq_host, reg, CQHCI_IS);
> > +
> > +     /*
> > +      * Halt the controller again.
> > +      * This is only needed so that we're consistent across quirk
> > +      * and quirkless logic.
> > +      */
> > +     cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> > +}
>
> Thanks a lot for tracking this down!
>
> It could be that the "un-halt" starts a task, so it would be
> better to force the "clear" to work if possible, which
> should be the case if CQE is disabled.
>
> Would you mind trying the code below?  Note the increased
> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> when CQE has not halted.

Sure, I'll try it out tomorrow, as I don't have access to the DUT today.
BTW do we even need to halt the controller in the recovery_finish logic?
It has already been halted in recovery_start, I guess it could be
there in case the recovery_start halt didn't work.
But in that case shouldn't we do this disable/re-enable dance in recovery_start?

>
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..534c13069833 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -987,7 +987,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
>   * layers will need to send a STOP command), so we set the timeout based on a
>   * generous command timeout.
>   */
> -#define CQHCI_START_HALT_TIMEOUT       5
> +#define CQHCI_START_HALT_TIMEOUT       500
>
>  static void cqhci_recovery_start(struct mmc_host *mmc)
>  {
> @@ -1075,28 +1075,27 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)
>
>         ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
>
> -       if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> -               ok = false;
> -
>         /*
>          * The specification contradicts itself, by saying that tasks cannot be
>          * cleared if CQHCI does not halt, but if CQHCI does not halt, it should
>          * be disabled/re-enabled, but not to disable before clearing tasks.
>          * Have a go anyway.
>          */
> -       if (!ok) {
> -               pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> -               cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> -               cqcfg &= ~CQHCI_ENABLE;
> -               cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> -               cqcfg |= CQHCI_ENABLE;
> -               cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> -               /* Be sure that there are no tasks */
> -               ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> -               if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> -                       ok = false;
> -               WARN_ON(!ok);
> -       }
> +       if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> +               ok = false;
> +
> +       cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> +       cqcfg &= ~CQHCI_ENABLE;
> +       cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> +       cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> +       cqcfg |= CQHCI_ENABLE;
> +       cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> +       cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> +
> +       if (!ok)
> +               cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT);
>
>         cqhci_recover_mrqs(cq_host);
>
>