Message-ID: <6f50a3d7-0aca-e1a8-423f-75bc5cb6e744@amd.com>
Date: Wed, 14 Aug 2024 13:24:09 -0700
From: Lizhi Hou <lizhi.hou@....com>
To: Jeffrey Hugo <quic_jhugo@...cinc.com>, <ogabbay@...nel.org>,
<dri-devel@...ts.freedesktop.org>
CC: <linux-kernel@...r.kernel.org>, <min.ma@....com>, <max.zhen@....com>,
<sonal.santan@....com>, <king.tam@....com>, Narendra Gutta
<VenkataNarendraKumar.Gutta@....com>, George Yang <George.Yang@....com>
Subject: Re: [PATCH V2 01/10] accel/amdxdna: Add a new driver for AMD AI
Engine
On 8/14/24 11:46, Jeffrey Hugo wrote:
> On 8/14/2024 12:16 PM, Lizhi Hou wrote:
>>
>> On 8/9/24 09:11, Jeffrey Hugo wrote:
>>> On 8/5/2024 11:39 AM, Lizhi Hou wrote:
>>>> diff --git a/drivers/accel/amdxdna/aie2_pci.c
>>>> b/drivers/accel/amdxdna/aie2_pci.c
>>>> new file mode 100644
>>>> index 000000000000..3660967c00e6
>>>> --- /dev/null
>>>> +++ b/drivers/accel/amdxdna/aie2_pci.c
>>>> @@ -0,0 +1,182 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (C) 2023-2024, Advanced Micro Devices, Inc.
>>>> + */
>>>> +
>>>> +#include <linux/amd-iommu.h>
>>>> +#include <linux/errno.h>
>>>> +#include <linux/firmware.h>
>>>> +#include <linux/iommu.h>
>>>
>>> You are clearly missing linux/pci.h and I suspect many more.
>> The other headers are included indirectly through "aie2_pci.h".
>
> aie2_pci.h also does not directly include linux/pci.h
The chain is aie2_pci.h --> amdxdna_pci_drv.h --> linux/pci.h,
so it compiles without any issue.
>
>>>> +
>>>> + ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
>>>> + if (ret) {
>>>> + XDNA_ERR(xdna, "Failed to set DMA mask: %d", ret);
>>>> + goto release_fw;
>>>> + }
>>>> +
>>>> + nvec = pci_msix_vec_count(pdev);
>>>
>>> This feels weird. Can your device advertise variable number of
>>> MSI-X vectors? It only works if all of the vectors are used?
>> That is possible. The driver supports different hardware, and the fw
>> assigns vectors to hardware contexts dynamically, so the driver needs
>> to allocate all the vectors up front.
>
> So, if the device requests N MSIs, but the host is only able to
> satisfy 1 (or some number less than N), the fw is completely unable to
> function?
The fw may report that, e.g., interrupt 2 is assigned to a hardware
context, and the driver would not be able to handle that interrupt in
this case. I think it is ok to fail if the system has very limited
resources.
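To make the all-or-nothing policy concrete, here is a standalone sketch in
plain C. The struct and both fake_* helpers are hypothetical stand-ins for
the real kernel objects and for pci_msix_vec_count()/pci_alloc_irq_vectors();
only the min == max == nvec request shape mirrors what the driver needs:

```c
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; the two fields model what the
 * device advertises vs. what the host can actually grant. */
struct fake_pdev {
	int msix_cap_vectors;	/* vectors advertised in the MSI-X capability */
	int host_available;	/* vectors the host is able to provide */
};

/* Stand-in for pci_msix_vec_count(). */
static int fake_msix_vec_count(struct fake_pdev *pdev)
{
	return pdev->msix_cap_vectors;
}

/* Stand-in for pci_alloc_irq_vectors(pdev, min, max, PCI_IRQ_MSIX):
 * succeeds only if at least 'min' vectors can be granted. */
static int fake_alloc_irq_vectors(struct fake_pdev *pdev, int min, int max)
{
	int granted = pdev->host_available < max ? pdev->host_available : max;

	return granted >= min ? granted : -1;	/* -1 stands in for -ENOSPC */
}

/* The all-or-nothing policy discussed above: request min == max == nvec,
 * so probing fails unless every advertised vector is available. */
static int setup_msix_all_or_nothing(struct fake_pdev *pdev)
{
	int nvec = fake_msix_vec_count(pdev);

	if (nvec <= 0)
		return -1;
	return fake_alloc_irq_vectors(pdev, nvec, nvec);
}
```

With min == max, a host that can only satisfy fewer vectors makes the
allocation (and hence probe) fail outright, which is the trade-off being
discussed here.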
>
>
>>>> +struct psp_device *aie2m_psp_create(struct device *dev, struct
>>>> psp_config *conf)
>>>> +{
>>>> + struct psp_device *psp;
>>>> + u64 offset;
>>>> +
>>>> + psp = devm_kzalloc(dev, sizeof(*psp), GFP_KERNEL);
>>>> + if (!psp)
>>>> + return NULL;
>>>> +
>>>> + psp->dev = dev;
>>>> + memcpy(psp->psp_regs, conf->psp_regs, sizeof(psp->psp_regs));
>>>> +
>>>> + psp->fw_buf_sz = ALIGN(conf->fw_size, PSP_FW_ALIGN) +
>>>> PSP_FW_ALIGN;
>>>> + psp->fw_buffer = devm_kmalloc(psp->dev, psp->fw_buf_sz,
>>>> GFP_KERNEL);
>>>
>>> Feels like this (and a bunch of other instances I haven't commented
>>> on) should be drmm_* allocs.
>>
>> The PSP code is fairly low level and interacts directly with the
>> hardware. All the PSP interfaces take a struct device * instead of a
>> drm_device; I think that makes sense because the PSP is not related
>> to drm.
>>
>> I will scan all other allocs and change them to drmm_* allocs for the
>> code related to drm_device. Does this sound ok to you?
>
> I don't think so. Look up
> drm/todo: Add TODO entry for "lints"
> on the dri-devel list, and its history.
Ok, I will replace them with drmm_* allocs.
>
>>
>>>
>>>> + if (!psp->fw_buffer) {
>>>> + dev_err(psp->dev, "no memory for fw buffer");
>>>> + return NULL;
>>>> + }
>>>> +
>>>> + psp->fw_paddr = virt_to_phys(psp->fw_buffer);
>>>
>>> I'm pretty sure virt_to_phys() is always wrong
>>
>> The hardware exposes several registers for communicating with the
>> platform PSP (AMD Platform Security Processor) to load the NPU
>> firmware, and the PSP only accepts host physical addresses on current
>> hardware.
>>
>> I understand that virt_to_phys() should not normally be needed in a
>> device driver. Maybe it is ok to use when the hardware requires it? I
>> can see some other drivers use it as well.
>
> Eh. I guess the PSP would never have an IOMMU in front of it or
> anything like that.
>
> This feels similar to what Qualcomm MSM platforms do, which uses the
> remoteproc framework. Not sure if that helps you here.
>
> This still feels not good, but you might have a valid exception here.
> I'd suggest putting a justification comment in the code, though.
> Someone looking at this in X months might raise the same question.
Sure. I will add a justification.
>
>>
>>>
>>>> + offset = ALIGN(psp->fw_paddr, PSP_FW_ALIGN) - psp->fw_paddr;
>>>> + psp->fw_paddr += offset;
>>>> + memcpy(psp->fw_buffer + offset, conf->fw_buf, conf->fw_size);
>>>> +
>>>> + return psp;
>>>> +}
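For reference, the over-allocation and offset computation quoted above is
plain alignment arithmetic. A standalone sketch of the same math follows;
ALIGN_UP mirrors the kernel's ALIGN() for power-of-two alignments, and the
PSP_FW_ALIGN value here is illustrative, not necessarily the driver's:

```c
#include <stdint.h>

/* Mirrors the kernel's ALIGN() for power-of-two 'a'. */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))
#define PSP_FW_ALIGN	0x10000ULL	/* illustrative alignment value */

/* The buffer is over-allocated by one extra alignment unit so that an
 * aligned region of fw_size bytes always fits inside it, wherever the
 * allocator happens to place the buffer. */
static uint64_t fw_buf_size(uint64_t fw_size)
{
	return ALIGN_UP(fw_size, PSP_FW_ALIGN) + PSP_FW_ALIGN;
}

/* Offset from the (arbitrary) buffer physical address to the first
 * aligned address inside the buffer; the fw image is copied there. */
static uint64_t fw_align_offset(uint64_t paddr)
{
	return ALIGN_UP(paddr, PSP_FW_ALIGN) - paddr;
}
```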
>>>> diff --git a/drivers/accel/amdxdna/amdxdna_drm.c
>>>> b/drivers/accel/amdxdna/amdxdna_drm.c
>>>> new file mode 100644
>>>> index 000000000000..91e4f9c9dac9
>>>> --- /dev/null
>>>> +++ b/drivers/accel/amdxdna/amdxdna_drm.c
>>>
>>> What is the point of this file? Seems like all of this could just
>>> be in amdxdna_pci_drv.c
>> A future product may have an NPU that is not a PCI device, so there
>> might be an amdxdna_plat_drv.c sharing the same amdxdna_drm.c in the
>> future.
>
> This seems like a weak justification. "may" is not definitive. If
> such hardware appears, you could refactor the driver at that time.
Ok, I will merge them.
>
>
>>>> diff --git a/drivers/accel/amdxdna/amdxdna_pci_drv.c
>>>> b/drivers/accel/amdxdna/amdxdna_pci_drv.c
>>>> new file mode 100644
>>>> index 000000000000..7d0cfd918b0e
>>>> --- /dev/null
>>>> +++ b/drivers/accel/amdxdna/amdxdna_pci_drv.c
>>>> @@ -0,0 +1,118 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (C) 2022-2024, Advanced Micro Devices, Inc.
>>>> + */
>>>> +
>>>> +#include <linux/module.h>
>>>> +
>>>> +#include "amdxdna_pci_drv.h"
>>>> +
>>>> +/*
>>>> + * There are platforms which share the same PCI device ID
>>>> + * but have different PCI revision IDs. So, let the PCI class
>>>> + * determine the probe and later use the (device_id, rev_id)
>>>> + * pair as a key to select the devices.
>>>> + */
>>>
>>> Huh? So, VID == AMD, DID == 0x17f0, rev == 0x1 is a completely
>>> different device? That feels like a PCIe spec violation...
>> Maybe the comment is misleading. Hardware with the same device ID
>> 0x17f0 uses the same commands, registers, etc. They are the same
>> device with different revisions.
>
> Then I don't understand why you need to do the class matching. Match
> on PCI_VENDOR_ID_AMD with the Device IDs you need to support like a
> "normal" PCI(e) driver?
Ok, I will use the device ID to bind.
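Matching on explicit device IDs rather than the class would look roughly
like this. This is a standalone sketch: the struct is a minimal stand-in
for the kernel's struct pci_device_id, the match loop stands in for
pci_match_id(), and 0x17f0 is the device ID mentioned in this thread:

```c
#include <stdint.h>

#define PCI_VENDOR_ID_AMD	0x1022
#define PCI_ANY_ID		(~0U)

/* Minimal stand-in for the kernel's struct pci_device_id. */
struct id_entry {
	uint32_t vendor;
	uint32_t device;
};

/* Explicit per-device table instead of a vendor + class wildcard; further
 * device IDs would be added here as hardware is supported. */
static const struct id_entry pci_ids[] = {
	{ PCI_VENDOR_ID_AMD, 0x17f0 },
	{ 0, 0 },	/* sentinel */
};

/* Stand-in for the core's ID-table matching at probe time. */
static int id_match(const struct id_entry *tbl, uint32_t vendor,
		    uint32_t device)
{
	for (; tbl->vendor; tbl++)
		if (tbl->vendor == vendor &&
		    (tbl->device == PCI_ANY_ID || tbl->device == device))
			return 1;
	return 0;
}
```

The (device_id, rev_id) pair can then still be used inside probe to pick
per-revision parameters, without claiming every AMD device in the class.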
Thanks,
Lizhi
>
>>>
>>>> +static const struct pci_device_id pci_ids[] = {
>>>> + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_ANY_ID),
>>>> + .class = PCI_CLASS_SP_OTHER << 8,
>>>
>>> Weird. I would have expected the Accelerator class to be used
>> We contacted our hardware team to figure out why the accelerator
>> class is not used here. Some of the hardware is already released.
>> Hopefully the hardware team will consider using the accelerator class
>> for new products.