linux-kernel - Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel Baytrail SoCs

Message-ID: <b5c03c57-87ea-e4a1-950c-5e7892bfd2ca@intel.com>
Date:   Tue, 10 Jul 2018 15:51:28 +0300
From:   Adrian Hunter <adrian.hunter@...el.com>
To:     Kurt Kanzenbach <kurt.kanzenbach@...utronix.de>
Cc:     ulf.hansson@...aro.org, linux-mmc@...r.kernel.org,
        linux-kernel@...r.kernel.org, tglx@...utronix.de
Subject: Re: [PATCH 1/1] mmc: sdhci-pci: fix eMMC controller issue on Intel
 Baytrail SoCs

On 25/06/18 17:36, Kurt Kanzenbach wrote:
>> On 06/20/2018 04:15 PM, Kurt Kanzenbach wrote:
>>> Hi,
>>>
>>> thanks for your response.
>>>
>>> On Tue, Jun 19, 2018 at 10:03:01AM +0300, Adrian Hunter wrote:
>>>> On 19/06/18 09:31, Kurt Kanzenbach wrote:
>>>>> Sometimes the eMMC controller doesn't respond anymore on Intel Baytrail
>>>>> SoCs. The resulting error looks like:
>>>>>
>>>>> |mmc1: Reset 0x1 never completed.
>>>>> |sdhci: =========== REGISTER DUMP (mmc1)===========
>>>>> |sdhci: Sys addr: 0xffffffff | Version:  0x0000ffff
>>>>> |sdhci: Blk size: 0x0000ffff | Blk cnt:  0x0000ffff
>>>>> |sdhci: Argument: 0xffffffff | Trn mode: 0x0000ffff
>>>>> |sdhci: Present:  0xffffffff | Host ctl: 0x000000ff
>>>>> |sdhci: Power:    0x000000ff | Blk gap:  0x000000ff
>>>>> |sdhci: Wake-up:  0x000000ff | Clock:    0x0000ffff
>>>>> |sdhci: Timeout:  0x000000ff | Int stat: 0xffffffff
>>>>> |sdhci: Int enab: 0xffffffff | Sig enab: 0xffffffff
>>>>> |sdhci: AC12 err: 0x0000ffff | Slot int: 0x0000ffff
>>>>> |sdhci: Caps:     0xffffffff | Caps_1:   0xffffffff
>>>>> |sdhci: Cmd:      0x0000ffff | Max curr: 0xffffffff
>>>>> |sdhci: Host ctl2: 0x0000ffff
>>>>> |sdhci: ADMA Err: 0xffffffff | ADMA Ptr: 0xffffffff
>>>>>
>>>>> The behavior was observed on an Intel Atom E3825 performing lots of reboots. The
>>>>
>>>> So you are saying this only happens at boot time?  And only when
>>>> re-booting?
>>>
>>> Well, exactly. This issue was only observed when rebooting, not on cold
>>> boots.
>>>
>>>> Can you send all the kernel messages?  Can you send an acpidump?
>>>
>>> The kernel log is straightforward. The system is booting and starting a
>>> few applications. Afterwards the issue happens. The root filesystem is
>>> located on the eMMC.
>>
>> The full messages can be more revealing such as showing what else was
>> happening and the order of events, so I would still like to see them.
>>
>>>
>>> The error message above is from the Linux v4.9 boot log.
>>>
>>> On v4.17 the same issue happens, but the error messages are different:
>>>
>>> |mmc1: Timeout waiting for hardware interrupt.
>>> |mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> |mmc1: sdhci: Sys addr:  0x00000002 | Version:  0x00001002
>>> |mmc1: sdhci: Blk size:  0x00007200 | Blk cnt:  0x00000000
>>> |mmc1: sdhci: Argument:  0x00040fd4 | Trn mode: 0x0000003b
>>> |mmc1: sdhci: Present:   0x1fff0000 | Host ctl: 0x00000035
>>> |mmc1: sdhci: Power:     0x0000000b | Blk gap:  0x00000080
>>> |mmc1: sdhci: Wake-up:   0x00000000 | Clock:    0x00000207
>>> |mmc1: sdhci: Timeout:   0x00000000 | Int stat: 0x00000003
>>> |mmc1: sdhci: Int enab:  0x02ff000b | Sig enab: 0x02ff000b
>>> |mmc1: sdhci: AC12 err:  0x00000000 | Slot int: 0x00000001
>>> |mmc1: sdhci: Caps:      0x446cc801 | Caps_1:   0x00000005
>>> |mmc1: sdhci: Cmd:       0x0000123a | Max curr: 0x00000000
>>> |mmc1: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0xffffffff
>>> |mmc1: sdhci: Resp[2]:   0x320f5913 | Resp[3]:  0x00000900
>>> |mmc1: sdhci: Host ctl2: 0x0000000c
>>> |mmc1: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x34ee5208
>>> |mmc1: sdhci: ============================================
>>> |[...]
>>
>> Those messages show that the interrupt did happen but the driver did not see
>> it.  Are you doing anything unusual like using threadirqs?
> 
> No, I'm not doing anything unusual. The mmc core uses threaded irqs by
> default. But, most of the work is performed in the primary handler. So,
> that shouldn't be a problem.
> 
> But in the v4.9 case, we use preempt rt. I took a few scheduler traces

PREEMPT_RT is unusual.  SDHCI uses synchronize_hardirq(), and that might
explain the difference between the v4.9 case with PREEMPT_RT and the v4.17
case without it.
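
As a side note on the v4.17 dump quoted above: Int stat is 0x00000003 and
Sig enab is 0x02ff000b, so bits 0 and 1 (Command Complete and Transfer
Complete in the SDHCI Normal Interrupt Status register) were both raised and
enabled for signalling when the timeout fired.  A minimal user-space sketch of
that check follows.  The EX_ macros and ex_ helpers are names made up for
illustration; they are not the driver's own identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bits 0 and 1 of the SDHCI Normal Interrupt Status register.
 * Illustrative local names, not taken from the kernel sources.
 */
#define EX_INT_CMD_COMPLETE	0x00000001u
#define EX_INT_XFER_COMPLETE	0x00000002u

/* Interrupt bits that are raised and also enabled to signal the CPU. */
static uint32_t ex_pending_signalled(uint32_t int_stat, uint32_t sig_enab)
{
	return int_stat & sig_enab;
}

/* True if a command or transfer completion is pending and signalled. */
static bool ex_completion_pending(uint32_t int_stat, uint32_t sig_enab)
{
	uint32_t pending = ex_pending_signalled(int_stat, sig_enab);

	return (pending & (EX_INT_CMD_COMPLETE | EX_INT_XFER_COMPLETE)) != 0;
}
```

With the values from the dump, ex_pending_signalled(0x00000003, 0x02ff000b)
evaluates to 0x3, which is why the dump reads as "interrupt raised but not
handled" rather than "interrupt never raised".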

> in order to see if there might be any task blocking or preempting the
> mmc irqs. However, that's not the case.
> 
> The common pattern is: mmc1 is suspended, afterwards some applications
> use mmc0 and finally a different application accesses mmc1. The suspend
> function is called and during initialization the reset doesn't work
> anymore.
> 
> Anyway, I'll perform more tests.
> 
> Thanks, Kurt
> 
>>
>>>
>>> Both issues disappear when disabling runtime pm.
>>>
>>> Anyway I'll prepare an acpidump for you.
>>>
>>>>
>>>>> issue seems to occur if runtime power management is used. Found by utilizing
>>>>> ftrace.
>>>>>
>>>>> The erratum VLI10 for the Intel E3825 states that the eMMC controller
>>>>> incorrectly announces that it supports suspend/resume. However, that shouldn't
>>>>> be used, as the controller may incorrectly transfer data between memory and the
>>>>> SD device.
>>>>
>>>> That erratum is not related to this problem.  The suspend/resume that is
>>>> documented is an internal SDHCI feature, not the kernel's suspend/resume.
>>>> The SDHCI Suspend/Resume Mechanism is not supported in the driver, so it is
>>>> not being used anyway.
>>>
>>> Thanks for the clarification.
>>>
>>> Do you have any idea why this issue might happen?
>>
>> No, but it seems like the runtime pm callbacks aren't happening when they
>> are supposed to.
>>
>>>
>>> Thanks, Kurt
>>>
>>>>
>>>>>
>>>>> Therefore, disallowing runtime pm resolves the issue. Tested on the E3825.
>>>>>
>>>>> Signed-off-by: Kurt Kanzenbach <kurt@...utronix.de>
>>>>> ---
>>>>>  drivers/mmc/host/sdhci-pci-core.c | 17 ++++++++++++++++-
>>>>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c
>>>>> index 77dd3521daae..df89381944cd 100644
>>>>> --- a/drivers/mmc/host/sdhci-pci-core.c
>>>>> +++ b/drivers/mmc/host/sdhci-pci-core.c
>>>>> @@ -870,6 +870,21 @@ static const struct sdhci_pci_fixes sdhci_intel_byt_emmc = {
>>>>>  	.priv_size	= sizeof(struct intel_host),
>>>>>  };
>>>>>
>>>>> +/*
>>>>> + * See Erratum VLI10 from Errata List for Intel Atom E3825, Link:
>>>>> + * https://www.intel.ca/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
>>>>> + */
>>>>> +static const struct sdhci_pci_fixes sdhci_intel_byt_emmc_no_runtime_pm = {
>>>>> +	.allow_runtime_pm = false,
>>>>> +	.probe_slot	= byt_emmc_probe_slot,
>>>>> +	.quirks		= SDHCI_QUIRK_NO_ENDATTR_IN_NOPDESC,
>>>>> +	.quirks2	= SDHCI_QUIRK2_PRESET_VALUE_BROKEN |
>>>>> +			  SDHCI_QUIRK2_CAPS_BIT63_FOR_HS400 |
>>>>> +			  SDHCI_QUIRK2_STOP_WITH_TC,
>>>>> +	.ops		= &sdhci_intel_byt_ops,
>>>>> +	.priv_size	= sizeof(struct intel_host),
>>>>> +};
>>>>> +
>>>>>  static const struct sdhci_pci_fixes sdhci_intel_glk_emmc = {
>>>>>  	.allow_runtime_pm	= true,
>>>>>  	.probe_slot		= glk_emmc_probe_slot,
>>>>> @@ -1470,7 +1485,7 @@ static const struct pci_device_id pci_ids[] = {
>>>>>  	SDHCI_PCI_SUBDEVICE(INTEL, BYT_SDIO, NI, 7884, ni_byt_sdio),
>>>>>  	SDHCI_PCI_DEVICE(INTEL, BYT_SDIO,  intel_byt_sdio),
>>>>>  	SDHCI_PCI_DEVICE(INTEL, BYT_SD,    intel_byt_sd),
>>>>> -	SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc),
>>>>> +	SDHCI_PCI_DEVICE(INTEL, BYT_EMMC2, intel_byt_emmc_no_runtime_pm),
>>>>>  	SDHCI_PCI_DEVICE(INTEL, BSW_EMMC,  intel_byt_emmc),
>>>>>  	SDHCI_PCI_DEVICE(INTEL, BSW_SDIO,  intel_byt_sdio),
>>>>>  	SDHCI_PCI_DEVICE(INTEL, BSW_SD,    intel_byt_sd),
>>>>>
>>>>
>>>
>>
>