Message-ID: <6f50a3d7-0aca-e1a8-423f-75bc5cb6e744@amd.com>
Date: Wed, 14 Aug 2024 13:24:09 -0700
From: Lizhi Hou <lizhi.hou@....com>
To: Jeffrey Hugo <quic_jhugo@...cinc.com>, <ogabbay@...nel.org>,
<dri-devel@...ts.freedesktop.org>
CC: <linux-kernel@...r.kernel.org>, <min.ma@....com>, <max.zhen@....com>,
<sonal.santan@....com>, <king.tam@....com>, Narendra Gutta
<VenkataNarendraKumar.Gutta@....com>, George Yang <George.Yang@....com>
Subject: Re: [PATCH V2 01/10] accel/amdxdna: Add a new driver for AMD AI
Engine
On 8/14/24 11:46, Jeffrey Hugo wrote:
> On 8/14/2024 12:16 PM, Lizhi Hou wrote:
>>
>> On 8/9/24 09:11, Jeffrey Hugo wrote:
>>> On 8/5/2024 11:39 AM, Lizhi Hou wrote:
>>>> diff --git a/drivers/accel/amdxdna/aie2_pci.c
>>>> b/drivers/accel/amdxdna/aie2_pci.c
>>>> new file mode 100644
>>>> index 000000000000..3660967c00e6
>>>> --- /dev/null
>>>> +++ b/drivers/accel/amdxdna/aie2_pci.c
>>>> @@ -0,0 +1,182 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (C) 2023-2024, Advanced Micro Devices, Inc.
>>>> + */
>>>> +
>>>> +#include <linux/amd-iommu.h>
>>>> +#include <linux/errno.h>
>>>> +#include <linux/firmware.h>
>>>> +#include <linux/iommu.h>
>>>
>>> You are clearly missing linux/pci.h and I suspect many more.
>> The other headers are included indirectly through "aie2_pci.h".
>
> aie2_pci.h also does not directly include linux/pci.h
The chain is aie2_pci.h --> amdxdna_pci_drv.h --> linux/pci.h,
so it compiles without any issue.
>
>>>> +
>>>> + ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
>>>> + if (ret) {
>>>> + XDNA_ERR(xdna, "Failed to set DMA mask: %d", ret);
>>>> + goto release_fw;
>>>> + }
>>>> +
>>>> + nvec = pci_msix_vec_count(pdev);
>>>
>>> This feels weird. Can your device advertise variable number of
>>> MSI-X vectors? It only works if all of the vectors are used?
>> That is possible. The driver supports different hardware, and the fw
>> assigns vectors to hardware contexts dynamically, so the driver needs
>> to allocate all the vectors up front.
>
> So, if the device requests N MSIs, but the host is only able to
> satisfy 1 (or some number less than N), the fw is completely unable to
> function?
The fw may report that, e.g., interrupt 2 is assigned to a hardware
context, and the driver would not be able to handle that interrupt in
this case. I think it is ok to fail if the system has very limited
resources.
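To make the all-or-nothing policy concrete, here is a standalone sketch in
plain C. The struct and both fake_* helpers are hypothetical stand-ins for
the real kernel objects and for pci_msix_vec_count()/pci_alloc_irq_vectors();
only the min == max == nvec request shape mirrors what the driver needs:

```c
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; the two fields model what the
 * device advertises vs. what the host can actually grant. */
struct fake_pdev {
	int msix_cap_vectors;	/* vectors advertised in the MSI-X capability */
	int host_available;	/* vectors the host is able to provide */
};

/* Stand-in for pci_msix_vec_count(). */
static int fake_msix_vec_count(struct fake_pdev *pdev)
{
	return pdev->msix_cap_vectors;
}

/* Stand-in for pci_alloc_irq_vectors(pdev, min, max, PCI_IRQ_MSIX):
 * succeeds only if at least 'min' vectors can be granted. */
static int fake_alloc_irq_vectors(struct fake_pdev *pdev, int min, int max)
{
	int granted = pdev->host_available < max ? pdev->host_available : max;

	return granted >= min ? granted : -1;	/* -1 stands in for -ENOSPC */
}

/* The all-or-nothing policy discussed above: request min == max == nvec,
 * so probing fails unless every advertised vector is available. */
static int setup_msix_all_or_nothing(struct fake_pdev *pdev)
{
	int nvec = fake_msix_vec_count(pdev);

	if (nvec <= 0)
		return -1;
	return fake_alloc_irq_vectors(pdev, nvec, nvec);
}
```

With min == max, a host that can only satisfy fewer vectors makes the
allocation (and hence probe) fail outright, which is the trade-off being
discussed here.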
>
>
>>>> +struct psp_device *aie2m_psp_create(struct device *dev, struct
>>>> psp_config *conf)
>>>> +{
>>>> + struct psp_device *psp;
>>>> + u64 offset;
>>>> +
>>>> + psp = devm_kzalloc(dev, sizeof(*psp), GFP_KERNEL);
>>>> + if (!psp)
>>>> + return NULL;
>>>> +
>>>> + psp->dev = dev;
>>>> + memcpy(psp->psp_regs, conf->psp_regs, sizeof(psp->psp_regs));
>>>> +
>>>> + psp->fw_buf_sz = ALIGN(conf->fw_size, PSP_FW_ALIGN) +
>>>> PSP_FW_ALIGN;
>>>> + psp->fw_buffer = devm_kmalloc(psp->dev, psp->fw_buf_sz,
>>>> GFP_KERNEL);
>>>
>>> Feels like this (and a bunch of other instances I haven't commented
>>> on) should be drmm_* allocs.
>>
>> The PSP code is fairly low level and interacts directly with the
>> hardware. All the PSP interfaces take a struct device * instead of a
>> drm_device; I think that makes sense because the PSP is not related
>> to drm.
>>
>> I will scan all other allocs and change them to drmm_* allocs for the
>> code related to drm_device. Does this sound ok to you?
>
> I don't think so. Look up
> drm/todo: Add TODO entry for "lints"
> on the dri-devel list, and its history.
Ok, I will replace them with drmm_* allocs.
>
>>
>>>
>>>> + if (!psp->fw_buffer) {
>>>> + dev_err(psp->dev, "no memory for fw buffer");
>>>> + return NULL;
>>>> + }
>>>> +
>>>> + psp->fw_paddr = virt_to_phys(psp->fw_buffer);
>>>
>>> I'm pretty sure virt_to_phys() is always wrong
>>
>> The hardware exposes several registers for communicating with the
>> platform PSP (AMD Platform Security Processor) to load the NPU
>> firmware, and the PSP only accepts host physical addresses on current
>> hardware.
>>
>> I understand that virt_to_phys() should not normally be needed in a
>> device driver. Maybe it is ok to use when the hardware requires it? I
>> can see some other drivers use it as well.
>
> Eh. I guess the PSP would never have an IOMMU in front of it or
> anything like that.
>
> This feels similar to what Qualcomm MSM platforms do, which uses the
> remoteproc framework. Not sure if that helps you here.
>
> This still feels not good, but you might have a valid exception here.
> I'd suggest putting a justification comment in the code, though.
> Someone looking at this in X months might raise the same question.
Sure. I will add a justification.
>
>>
>>>
>>>> + offset = ALIGN(psp->fw_paddr, PSP_FW_ALIGN) - psp->fw_paddr;
>>>> + psp->fw_paddr += offset;
>>>> + memcpy(psp->fw_buffer + offset, conf->fw_buf, conf->fw_size);
>>>> +
>>>> + return psp;
>>>> +}
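For reference, the over-allocation and offset computation quoted above is
plain alignment arithmetic. A standalone sketch of the same math follows;
ALIGN_UP mirrors the kernel's ALIGN() for power-of-two alignments, and the
PSP_FW_ALIGN value here is illustrative, not necessarily the driver's:

```c
#include <stdint.h>

/* Mirrors the kernel's ALIGN() for power-of-two 'a'. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define PSP_FW_ALIGN	0x10000ULL	/* illustrative alignment value */

/* The buffer is over-allocated by one extra alignment unit so that an
 * aligned region of fw_size bytes always fits inside it, wherever the
 * allocator happens to place the buffer. */
static uint64_t fw_buf_size(uint64_t fw_size)
{
	return ALIGN_UP(fw_size, PSP_FW_ALIGN) + PSP_FW_ALIGN;
}

/* Offset from the (arbitrary) buffer physical address to the first
 * aligned address inside the buffer; the fw image is copied there. */
static uint64_t fw_align_offset(uint64_t paddr)
{
	return ALIGN_UP(paddr, PSP_FW_ALIGN) - paddr;
}
```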
>>>> diff --git a/drivers/accel/amdxdna/amdxdna_drm.c
>>>> b/drivers/accel/amdxdna/amdxdna_drm.c
>>>> new file mode 100644
>>>> index 000000000000..91e4f9c9dac9
>>>> --- /dev/null
>>>> +++ b/drivers/accel/amdxdna/amdxdna_drm.c
>>>
>>> What is the point of this file? Seems like all of this could just
>>> be in amdxdna_pci_drv.c
>> A future product may have an NPU that is not a PCI device, so there
>> might be an amdxdna_plat_drv.c sharing the same amdxdna_drm.c in the
>> future.
>
> This seems like a weak justification. "may" is not definitive. If
> such hardware appears, you could refactor the driver at that time.
Ok, I will merge them.
>
>
>>>> diff --git a/drivers/accel/amdxdna/amdxdna_pci_drv.c
>>>> b/drivers/accel/amdxdna/amdxdna_pci_drv.c
>>>> new file mode 100644
>>>> index 000000000000..7d0cfd918b0e
>>>> --- /dev/null
>>>> +++ b/drivers/accel/amdxdna/amdxdna_pci_drv.c
>>>> @@ -0,0 +1,118 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (C) 2022-2024, Advanced Micro Devices, Inc.
>>>> + */
>>>> +
>>>> +#include <linux/module.h>
>>>> +
>>>> +#include "amdxdna_pci_drv.h"
>>>> +
>>>> +/*
>>>> + * There are platforms which share the same PCI device ID
>>>> + * but have different PCI revision IDs. So, let the PCI class
>>>> + * determine the probe and later use the (device_id, rev_id)
>>>> + * pair as a key to select the devices.
>>>> + */
>>>
>>> Huh? So, VID == AMD, DID == 0x17f0, rev == 0x1 is a completely
>>> different device? That feels like a PCIe spec violation...
>> Maybe the comment is misleading. Hardware with the same device ID
>> 0x17f0 uses the same commands, registers, etc. They are the same
>> device with different revisions.
>
> Then I don't understand why you need to do the class matching. Match
> on PCI_VENDOR_ID_AMD with the Device IDs you need to support like a
> "normal" PCI(e) driver?
Ok, I will use the device ID to bind.
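Matching on explicit device IDs rather than the class would look roughly
like this. This is a standalone sketch: the struct is a minimal stand-in
for the kernel's struct pci_device_id, the match loop stands in for
pci_match_id(), and 0x17f0 is the device ID mentioned in this thread:

```c
#include <stdint.h>

#define PCI_VENDOR_ID_AMD	0x1022
#define PCI_ANY_ID		(~0U)

/* Minimal stand-in for the kernel's struct pci_device_id. */
struct id_entry {
	uint32_t vendor;
	uint32_t device;
};

/* Explicit per-device table instead of a vendor + class wildcard; further
 * device IDs would be added here as hardware is supported. */
static const struct id_entry pci_ids[] = {
	{ PCI_VENDOR_ID_AMD, 0x17f0 },
	{ 0, 0 },	/* sentinel */
};

/* Stand-in for the core's ID-table matching at probe time. */
static int id_match(const struct id_entry *tbl, uint32_t vendor,
		    uint32_t device)
{
	for (; tbl->vendor; tbl++)
		if (tbl->vendor == vendor &&
		    (tbl->device == PCI_ANY_ID || tbl->device == device))
			return 1;
	return 0;
}
```

The (device_id, rev_id) pair can then still be used inside probe to pick
per-revision parameters, without claiming every AMD device in the class.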
Thanks,
Lizhi
>
>>>
>>>> +static const struct pci_device_id pci_ids[] = {
>>>> + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_ANY_ID),
>>>> + .class = PCI_CLASS_SP_OTHER << 8,
>>>
>>> Weird. I would have expected the Accelerator class to be used
>> We contacted our hardware team to figure out why the accelerator
>> class is not used here. Some of the hardware is already released.
>> Hopefully the hardware team will consider using the accelerator class
>> for new products.