[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <29704025-fba4-434a-9a43-ed2184a322f9@habana.ai>
Date: Mon, 24 Jun 2024 08:47:41 +0000
From: Omer Shpigelman <oshpigelman@...ana.ai>
To: Leon Romanovsky <leon@...nel.org>
CC: "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
"ogabbay@...nel.org" <ogabbay@...nel.org>,
Zvika Yehudai <zyehudai@...ana.ai>
Subject: Re: [PATCH 11/15] RDMA/hbl: add habanalabs RDMA driver
On 6/19/24 13:52, Leon Romanovsky wrote:
> On Wed, Jun 19, 2024 at 09:27:54AM +0000, Omer Shpigelman wrote:
>> On 6/18/24 15:58, Leon Romanovsky wrote:
>>> On Tue, Jun 18, 2024 at 11:08:34AM +0000, Omer Shpigelman wrote:
>>>> On 6/17/24 22:04, Leon Romanovsky wrote:
>>>>> [Some people who received this message don't often get email from leon@...nel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>>>
>>>>> On Mon, Jun 17, 2024 at 05:43:49PM +0000, Omer Shpigelman wrote:
>>>>>> On 6/13/24 22:18, Leon Romanovsky wrote:
>>>>>>> [Some people who received this message don't often get email from leon@...nel.org. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>>>>>
>>>>>>> On Thu, Jun 13, 2024 at 11:22:04AM +0300, Omer Shpigelman wrote:
>>>>>>>> Add an RDMA driver of Gaudi ASICs family for AI scaling.
>>>>>>>> The driver itself is agnostic to the ASIC in action, it operates according
>>>>>>>> to the capabilities that were passed on device initialization.
>>>>>>>> The device is initialized by the hbl_cn driver via auxiliary bus.
>>>>>>>> The driver also supports QP resource tracking and port/device HW counters.
>>>>>>>>
>>>>>>>> Signed-off-by: Omer Shpigelman <oshpigelman@...ana.ai>
>>>>>>>> Co-developed-by: Abhilash K V <kvabhilash@...ana.ai>
>>>>>>>> Signed-off-by: Abhilash K V <kvabhilash@...ana.ai>
>>>>>>>> Co-developed-by: Andrey Agranovich <aagranovich@...ana.ai>
>>>>>>>> Signed-off-by: Andrey Agranovich <aagranovich@...ana.ai>
>>>>>>>> Co-developed-by: Bharat Jauhari <bjauhari@...ana.ai>
>>>>>>>> Signed-off-by: Bharat Jauhari <bjauhari@...ana.ai>
>>>>>>>> Co-developed-by: David Meriin <dmeriin@...ana.ai>
>>>>>>>> Signed-off-by: David Meriin <dmeriin@...ana.ai>
>>>>>>>> Co-developed-by: Sagiv Ozeri <sozeri@...ana.ai>
>>>>>>>> Signed-off-by: Sagiv Ozeri <sozeri@...ana.ai>
>>>>>>>> Co-developed-by: Zvika Yehudai <zyehudai@...ana.ai>
>>>>>>>> Signed-off-by: Zvika Yehudai <zyehudai@...ana.ai>
>>>>>>>
>>>>>>> I afraid that you misinterpreted the "Co-developed-by" tag. All these
>>>>>>> people are probably touch the code and not actually sit together at
>>>>>>> the same room and write the code together. So, please remove the
>>>>>>> extensive "Co-developed-by" tags.
>>>>>>>
>>>>>>> It is not full review yet, but simple pass-by-comments.
>>>>>>>
>>>>>>
>>>>>> Actually except of two, all of the mentioned persons sat in the same room
>>>>>> and developed the code together.
>>>>>> The remaining two are located on a different site (but also together).
>>>>>> Isn't that what "Co-developed-by" tag for?
>>>>>> I wanted to give them credit for writing the code but I can remove if it's
>>>>>> not common.
>>>>>
>>>>> Signed-off-by will be enough to give them credit.
>>>>>
>>>>
>>>> Ok, good enough.
>>>>
>>>>>>
>>>>>>>> ---
>>>>>>>> MAINTAINERS | 10 +
>>>>>>>> drivers/infiniband/Kconfig | 1 +
>>>>>>>> drivers/infiniband/hw/Makefile | 1 +
>>>>>>>> drivers/infiniband/hw/hbl/Kconfig | 17 +
>>>>>>>> drivers/infiniband/hw/hbl/Makefile | 8 +
>>>>>>>> drivers/infiniband/hw/hbl/hbl.h | 326 +++
>>>>>>>> drivers/infiniband/hw/hbl/hbl_main.c | 478 ++++
>>>>>>>> drivers/infiniband/hw/hbl/hbl_verbs.c | 2686 ++++++++++++++++++++++
>>>>>>>> include/uapi/rdma/hbl-abi.h | 204 ++
>>>>>>>> include/uapi/rdma/hbl_user_ioctl_cmds.h | 66 +
>>>>>>>> include/uapi/rdma/hbl_user_ioctl_verbs.h | 106 +
>>>>>>>> include/uapi/rdma/ib_user_ioctl_verbs.h | 1 +
>>>>>>>> 12 files changed, 3904 insertions(+)
>>>>>>>> create mode 100644 drivers/infiniband/hw/hbl/Kconfig
>>>>>>>> create mode 100644 drivers/infiniband/hw/hbl/Makefile
>>>>>>>> create mode 100644 drivers/infiniband/hw/hbl/hbl.h
>>>>>>>> create mode 100644 drivers/infiniband/hw/hbl/hbl_main.c
>>>>>>>> create mode 100644 drivers/infiniband/hw/hbl/hbl_verbs.c
>>>>>>>> create mode 100644 include/uapi/rdma/hbl-abi.h
>>>>>>>> create mode 100644 include/uapi/rdma/hbl_user_ioctl_cmds.h
>>>>>>>> create mode 100644 include/uapi/rdma/hbl_user_ioctl_verbs.h
>>>>>>>
>>>>>>> <...>
>>>>>>>
>>>>>>>> +#define hbl_ibdev_emerg(ibdev, format, ...) ibdev_emerg(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_alert(ibdev, format, ...) ibdev_alert(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_crit(ibdev, format, ...) ibdev_crit(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_err(ibdev, format, ...) ibdev_err(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_warn(ibdev, format, ...) ibdev_warn(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_notice(ibdev, format, ...) ibdev_notice(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_info(ibdev, format, ...) ibdev_info(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_dbg(ibdev, format, ...) ibdev_dbg(ibdev, format, ##__VA_ARGS__)
>>>>>>>> +
>>>>>>>> +#define hbl_ibdev_emerg_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_emerg_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_alert_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_alert_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_crit_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_crit_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_err_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_err_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_warn_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_warn_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_notice_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_notice_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_info_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_info_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +#define hbl_ibdev_dbg_ratelimited(ibdev, fmt, ...) \
>>>>>>>> + ibdev_dbg_ratelimited(ibdev, fmt, ##__VA_ARGS__)
>>>>>>>> +
>>>>>>>
>>>>>>> Please don't redefine the existing macros. Just use the existing ones.
>>>>>>>
>>>>>>>
>>>>>>> <...>
>>>>>>>
>>>>>>
>>>>>> That's a leftover from some debug code. I'll remove.
>>>>>>
>>>>>>>> + if (hbl_ib_match_netdev(ibdev, netdev))
>>>>>>>> + ib_port = hbl_to_ib_port_num(hdev, netdev->dev_port);
>>>>>>>> + else
>>>>>>>> + return NOTIFY_DONE;
>>>>>>>
>>>>>>> It is not kernel coding style. Please write:
>>>>>>> if (!hbl_ib_match_netdev(ibdev, netdev))
>>>>>>> return NOTIFY_DONE;
>>>>>>>
>>>>>>> ib_port = hbl_to_ib_port_num(hdev, netdev->dev_port);
>>>>>>>
>>>>>>
>>>>>> I'll fix the code, thanks.
>>>>>>
>>>>>>>> +
>>>>>>>
>>>>>>> <...>
>>>>>>>
>>>>>>>> +static int hbl_ib_probe(struct auxiliary_device *adev, const struct auxiliary_device_id *id)
>>>>>>>> +{
>>>>>>>> + struct hbl_aux_dev *aux_dev = container_of(adev, struct hbl_aux_dev, adev);
>>>>>>>> + struct hbl_ib_aux_ops *aux_ops = aux_dev->aux_ops;
>>>>>>>> + struct hbl_ib_device *hdev;
>>>>>>>> + ktime_t timeout;
>>>>>>>> + int rc;
>>>>>>>> +
>>>>>>>> + rc = hdev_init(aux_dev);
>>>>>>>> + if (rc) {
>>>>>>>> + dev_err(&aux_dev->adev.dev, "Failed to init hdev\n");
>>>>>>>> + return -EIO;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + hdev = aux_dev->priv;
>>>>>>>> +
>>>>>>>> + /* don't allow module unloading while it is attached */
>>>>>>>> + if (!try_module_get(THIS_MODULE)) {
>>>>>>>
>>>>>>> This part makes wonder, what are you trying to do here? What doesn't work for you
>>>>>>> in standard driver core and module load mechanism?
>>>>>>>
>>>>>>
>>>>>> Before auxiliary bus was introduced, we used EXPORT_SYMBOLs for inter
>>>>>> driver communication. That incremented the refcount of the used module so
>>>>>> it couldn't be removed while it is in use.
>>>>>> Auxiliary bus usage doesn't increment the used module refcount and hence
>>>>>> the used module can be removed while it is in use and that's something
>>>>>> we don't want to allow.
>>>>>> We could solve it by some global locking or in_use atomic but the most
>>>>>> simple and clean way is just to increment the used module refcount on
>>>>>> auxiliary device probe and decrement it on auxiliary device removal.
>>>>>
>>>>> No, you was supposed to continue to use EXPORT_SYMBOLs and don't
>>>>> invent auxiliary ops structure (this is why you lost module
>>>>> reference counting).
>>>>>
>>>>
>>>> Sorry, but according to the auxiliary bus doc, a domain-specific ops
>>>> structure can be used.
>>>> We followed the usage example described at drivers/base/auxiliary.c.
>>>> What am I missing?
>>>
>>> Being the one who implemented auxiliary bus in the kernel and converted
>>> number of drivers to use it, I strongly recommend do NOT follow the example
>>> provided there.
>>>
>>> So you are missing "best practice", and "best practice" is to use
>>> EXPORT_SYMBOLs and rely on module reference counting.
>>>
>>
>> It is not just the usage example but also the general feature doc before
>> it:
>> "The generic behavior can be extended and specialized as needed by
>> encapsulating an auxiliary_device within other domain-specific structures
>> and the use of .ops callbacks."
>> It is also mentioned there that the ops structure are used for specific
>> auxiliary device operations while EXPORT_SYMBOLs should be used for common
>> infrastrucure the parent driver exposes:
>> "Note that ops are intended as a way to augment instance behavior within a
>> class of auxiliary devices, it is not the mechanism for exporting common
>> infrastructure from the parent."
>> All of our ops callbacks are meant to provide functionality related to the
>> auxiliary device, they are not just general/common infrastructure.
>
> Of course they are common, otherwise why did you put them in common code?
> For example, you have callbacks to lock and unlock internal HW access,
> how is it not common?
>
As I saw it, the "common" functions are general capabilities the parent
driver exposes, not necessaritly related to the auxiliary device.
But let me revisit this and try to restructure the code so the parent
driver will use EXPORT_SYMBOLs.
>>
>> Why do we have this doc if we should ignore it? why wasn't the doc
>> modified according to the "best practice" you described? the doc is
>> misleading.
>
> Because this is how upstream kernel development works. We are trying to
> come to the agreement and get the best solution for the problem. Sometimes,
> the outcome of the discussion is not "the best solution", but "good
> enough". This doc can be served as an example. Everyone involved in the
> development of auxbus and later usage of it, were focused on implementation,
> documentation was good enough as it didn't limit anyone who actually
> used it.
>
I get your point but still I think that the doc is misleading if it shows
a usage exmaple but practically no one should follow it.
Better to remove this usage exmaple completely IMHO.
>>
>> Adding gregkh here as he requested the auxiliary bus feature IIRC.
>> Greg - isn't the doc legit? should EXPORT_SYMBOLs necessarily be used
>> together with auxiliary bus rather than ops structure?
>
> This is not what you are doing here. You completely ditched EXPORT_SYMBOLs
> and reinvented module reference counting which overcomplicated the code
> just to avoid using standard kernel mechanism.
>
>> As we saw it, auxiliary bus gives us the flexibility to choose which
>> modules will be loaded while EXPORT_SYMBOLs enforces the dependencies
>> which might not be needed in some cases.
>>
>>>> Moreover, we'd like to support the mode where the IB or the ETH driver is
>>>> not loaded at all. But this cannot be achieved if we use EXPORT_SYMBOLs
>>>> exclusively for inter driver communication.
>>>
>>> It is not true and not how the kernel works. You can perfectly load core
>>> driver without IB and ETH, at some extent this is how mlx5 driver works.
>>>
>>
>> mlx5 IB driver doesn't export any symbol that is used by the core driver,
>> that's why the core driver can be loaded without the IB driver (althought
>> you'll get circular dependency if you would export).
>
> Yes, IB and ETH drivers are "users" of core driver. As RDMA maintainer,
> I'm reluctant to accept code that exports symbols from IB drivers to
> other subsystems. We have drivers/infiniband/core/ for that.
>
So we'll need to restructure the code to follow this limitation. We'll
take care of it for the next patch set version.
BTW if you won't allow such driver specific EXPORT_SYMBOLs, I think it is
good to have it documented similarly to other "don't do" guideliens in the
infiniband doc.
That's because in the net/ethernet subsystem for exmaple it is very common
to add such driver specific EXPORT_SYMBOLs.
>> If relying on exported symbols only, then our IB and ETH drivers will need
>> to export symbols too because the core driver accesses them post probing.
>
> So you should fix your core driver. This is exactly what auxbus model
> proposes.
>
>> Hence we won't be able to load the core driver without both of them (or
>> loading anything due to circular dependency).
>> Unless we'll use dynamic symbol lookup and I don't think that's your
>> intention.
>
> No it is not.
>
>>
>>>>
>>>>>>
>>>>>>>> + dev_err(hdev->dev, "Failed to increment %s module refcount\n",
>>>>>>>> + module_name(THIS_MODULE));
>>>>>>>> + rc = -EIO;
>>>>>>>> + goto module_get_err;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + timeout = ktime_add_ms(ktime_get(), hdev->pending_reset_long_timeout * MSEC_PER_SEC);
>>>>>>>> + while (1) {
>>>>>>>> + aux_ops->hw_access_lock(aux_dev);
>>>>>>>> +
>>>>>>>> + /* if the device is operational, proceed to actual init while holding the lock in
>>>>>>>> + * order to prevent concurrent hard reset
>>>>>>>> + */
>>>>>>>> + if (aux_ops->device_operational(aux_dev))
>>>>>>>> + break;
>>>>>>>> +
>>>>>>>> + aux_ops->hw_access_unlock(aux_dev);
>>>>>>>> +
>>>>>>>> + if (ktime_compare(ktime_get(), timeout) > 0) {
>>>>>>>> + dev_err(hdev->dev, "Timeout while waiting for hard reset to finish\n");
>>>>>>>> + rc = -EBUSY;
>>>>>>>> + goto timeout_err;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + dev_notice_once(hdev->dev, "Waiting for hard reset to finish before probing IB\n");
>>>>>>>> +
>>>>>>>> + msleep_interruptible(MSEC_PER_SEC);
>>>>>>>> + }
>>>>>>>
>>>>>>> The code above is unexpected.
>>>>>>>
>>>>>>
>>>>>> We have no control on when the user insmod the IB driver.
>>>>>
>>>>> It is not true, this is controlled through module dependencies
>>>>> mechanism.
>>>>>
>>>>
>>>> Yeah, if we would use EXPORT_SYMBOLs for inter driver communication but
>>>> we don't.
>>>
>>> So please use it and don't add complexity where it is not needed.
>>>
>>>>
>>>>>> As a result it is possible that the IB auxiliary device will be probed
>>>>>> while the compute device is under reset (due to some HW error).
>>>>>
>>>>> No, it is not possible. If you structure your driver right.
>>>>>
>>>>
>>>> Again, it is not possible if we would use EXPORT_SYMBOLs.
>>>> Please let me know if we misunderstood something because AFAIU we followed
>>>> the auxiliary bus doc usage example.
>>>
>>> It is better to follow actual drivers that use auxiliary bus and see how
>>> they implemented it and not rely on examples in the documentation.
>>>
>>
>> But isn't that what the doc for? to explain the guidelines? and it's not
>> that there is a big red note there of "this example should not be taken as
>> is, please look at your subsystem guidelines".
>
> At the beginning that doc was located in Documentation/ folder and no one
> really cared about it. After moving from Documentation/ to drivers/base/auxiliary.c,
> it became more visible, but still no one relied on it. You are first one
> who read.
>
> There is no subsystem rules here. Everyone relied on EXPORT_SYMBOLs and didn't
> use ops structure. Kernel is evolving project, there is no need to find a rule
> for everything.
>
> Thanks
>
>>
>>> Thanks
>>>
>>>>
>>>>> Thanks
Powered by blists - more mailing lists