linux-kernel - Re: [PCI] BUG: unable to handle kernel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <54FDD995.1080000@roeck-us.net>
Date:	Mon, 09 Mar 2015 10:34:13 -0700
From:	Guenter Roeck <linux@...ck-us.net>
To:	Murali Karicheri <m-karicheri2@...com>
CC:	Bjorn Helgaas <bhelgaas@...gle.com>,
	Fengguang Wu <fengguang.wu@...el.com>, LKP <lkp@...org>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PCI] BUG: unable to handle kernel

On 03/09/2015 10:03 AM, Murali Karicheri wrote:
> On 03/09/2015 12:07 PM, Guenter Roeck wrote:
>> On 03/09/2015 08:53 AM, Murali Karicheri wrote:
>>> On 03/09/2015 10:44 AM, Bjorn Helgaas wrote:
>>>> On Mon, Mar 9, 2015 at 9:17 AM, Murali Karicheri<m-karicheri2@...com>
>>>> wrote:
>>>>> On 03/06/2015 12:58 PM, Murali Karicheri wrote:
>>>>>>
>>>>>> On 03/06/2015 11:55 AM, Guenter Roeck wrote:
>>>>>>>
>>>>>>> On Fri, Mar 06, 2015 at 10:48:59AM -0500, Murali Karicheri wrote:
>>>>>>> [ ... ]
>>>>>>>
>>>>>>>>> From 098b4f5e4ab9407fbdbfcca3a91785c17e25cf03 Mon Sep 17
>>>>>>>>> 00:00:00 2001
>>>>>>>> From: Murali Karicheri<m-karicheri2@...com>
>>>>>>>> Date: Fri, 6 Mar 2015 10:23:08 -0500
>>>>>>>> Subject: [PATCH] pci: of : fix kernel crash
>>>>>>>>
>>>>>>>> This is a debug patch to root cause the kernel crash
>>>>>>>>
>>>>>>>> commit 0b2af171520e5d5e7d5b5f479b90a6a5014d9df6
>>>>>>>>
>>>>>>>> PCI: Update DMA configuration from DT
>>>>>>>>
>>>>>>>> Signed-off-by: Murali Karicheri<m-karicheri2@...com>
>>>>>>>> ---
>>>>>>>> drivers/of/of_pci.c | 8 ++++++++
>>>>>>>> drivers/pci/host-bridge.c | 5 +++++
>>>>>>>> 2 files changed, 13 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
>>>>>>>> index 86d3c38..5a59fb8 100644
>>>>>>>> --- a/drivers/of/of_pci.c
>>>>>>>> +++ b/drivers/of/of_pci.c
>>>>>>>> @@ -129,6 +129,14 @@ void of_pci_dma_configure(struct pci_dev
>>>>>>>> *pci_dev)
>>>>>>>> struct device *dev =&pci_dev->dev;
>>>>>>>> struct device *bridge = pci_get_host_bridge_device(pci_dev);
>>>>>>>>
>>>>>>>> + if (!bridge || !bridge->parent) {
>>>>>>>> + if (!bridge)
>>>>>>>> + pr_err("PCI bridge not found\n");
>>>>>>>> + if (!bridge->parent)
>>>>>>>> + pr_err("PCI bridge parent not found\n");
>>>>>>>
>>>>>>>
>>>>>>> You'll see a crash here if bridge is NULL. Maybe add an else before
>>>>>>> the second
>>>>>>> if statement ? Also, dev_err might be a bit more useful and would be
>>>>>>> available.
>>>>>>>
>>>>>> Fixed and attached.
>>>>>>
>>>>>> Murali
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Guenter
>>>>>>>
>>>>>>>> + return;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> of_dma_configure(dev, bridge->parent->of_node);
>>>>>>>> pci_put_host_bridge_device(bridge);
>>>>>>>> }
>>>>>>>> diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
>>>>>>>> index 3e5bbf9..ef2ab51 100644
>>>>>>>> --- a/drivers/pci/host-bridge.c
>>>>>>>> +++ b/drivers/pci/host-bridge.c
>>>>>>>> @@ -28,6 +28,11 @@ struct device *pci_get_host_bridge_device(struct
>>>>>>>> pci_dev *dev)
>>>>>>>> struct pci_bus *root_bus = find_pci_root_bus(dev->bus);
>>>>>>>> struct device *bridge = root_bus->bridge;
>>>>>>>>
>>>>>>>> + if (!bridge) {
>>>>>>>> + pr_err("PCI: bridge not found\n");
>>>>>>>> + return NULL;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> kobject_get(&bridge->kobj);
>>>>>>>> return bridge;
>>>>>>>> }
>>>>>>>> --
>>>>>>>> 1.7.9.5
>>>>>>>>
>>>>>>>
>>>>>>
>>>>> BJorn,
>>>>>
>>>>> Any chance of applying the attached debug patch to see if this fixes
>>>>> and
>>>>> provide some additional information on this BUG? Not sure who will
>>>>> pick this
>>>>> one and apply.
>>>>
>>>> The change that caused the oops (0b2af171520e ("PCI: Update DMA
>>>> configuration from DT")) only exists on my pci/iommu branch, so I'm
>>>> the one to apply it.
>>>>
>>>> It's much easier for me to deal with plain text patches (not
>>>> attachments).
>>>>
>>>> I'm hesitating because I don't want to encourage use of the 0-day
>>>> testing robot as a tool at which we can just throw debug patches. The
>>>> robot is a service that costs somebody real money, and I want to be a
>>>> good neighbor when using it.
>>>
>>> Thanks for the clarification as I don't have much information on the
>>> testing robot. At the same time the question is how similar incidence
>>> in the past have been handled. I am a newbie w.r.t to this. This is
>>> first time I have introduced a patch that impacts multiple arch/machines.
>>>
>>>>
>>>> Was the information in the robot's report enough to reproduce the
>>>> oops? If not, is there additional information we could add to the
>>>> report that would enable you to reproduce it? Even if we can't
>>>> reproduce the oops, the report seems detailed enough that we should be
>>>> able to deduce the problem and produce a fix in which we have high
>>>> confidence.
>>>
>>> The BUG report essentially indicates the crash happened in
>>> of_pci_dma_configure(). The machine specific log make sense to a
>>> person familiar with this arch and I am not familiar with the same. So
>>> anyone can help narrow down the root cause of this?
>>>
>>> Looking at the code, there are two ptr variables that are accessed
>>> without checking for NULL as initial thinking was that these can never
>>> be NULL. So the debug patch is just adding addition check before
>>> accessing the ptr. I can send this patch without debug prints if that
>>> make sense. I was thinking to get confirmation that this is indeed the
>>> case before adding the check. What do you think the right approach
>>> here? Send a patch for this to the ML for adding the check as a
>>> potential fix? Or someone can help me investigate the crash dump and
>>> root cause it? or if we can use test robot to confirm this, I can
>>> re-send the patch ASIS to the list. Please clarify.
>>>
>> If the assumption is that the pointers can never be NULL,
>> wouldn't it be important to see a call trace and to find out
>> if the NULL pointers can actually be seen by design,
>> or if there is some other bug ?
>
> Call trace shows
>
>   [    0.576666]  [<7976c1ac>] pci_device_add+0xbc/0x820
> [    0.576666]  [<7976c1ac>] pci_device_add+0xbc/0x820
>
>
> And BUG seems to be in of_pci_dma_configure() as shown in the BUG report.
>
> of_pci_dma_configure() calls newly added API call to pci_get_host_bridge_device(). Seems like this has succeeded which means bridge is non NULL IMO. However in this function it passes bridge->parent->of_node to of_dma_configure(). So I suspect bridge->parent is NULL for some reason. Is there a chance for parent being NULL in this or any other platform?
>

Can bridge be the root bridge ?

Guenter

>>
>> I am a bit concerned that adding those NULL pointer checks
>> might end up hiding some other bug, ie that they just hide
>> the real bug without fixing it.
>>
>
> I agree with you as well. That is the reason I had added the debug prints in the attached patch to check what is NULL here and that can help us dig into this more.
>
> If BJorn can accept this debug patch for finding this, that will help. I can re-send it to the list as a debug patch if everyone agrees. Otherwise I don't know how to proceed from here.
>
> Thanks and regards,
>
> Murali
>> Thanks,
>> Guenter
>>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/