netdev - Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver


Message-ID: <CAFCwf109t5=GuNvqTqLUCiYbjLC6o2xVoLY5C-SBqbN66f6wxg@mail.gmail.com>
Date:   Fri, 18 Sep 2020 16:49:25 +0300
From:   Oded Gabbay <oded.gabbay@...il.com>
To:     Jason Gunthorpe <jgg@...pe.ca>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc:     izur@...ana.ai, Gal Pressman <galpress@...zon.com>,
        Jakub Kicinski <kuba@...nel.org>,
        "Linux-Kernel@...r. Kernel. Org" <linux-kernel@...r.kernel.org>,
        netdev@...r.kernel.org, SW_Drivers <SW_Drivers@...ana.ai>,
        "David S. Miller" <davem@...emloft.net>,
        Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        linux-rdma@...r.kernel.org, Olof Johansson <olof@...om.net>
Subject: Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver

On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@...pe.ca> wrote:
>
> On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote:
>
> > The problem with MR is that the API doesn't let us return a new VA. It
> > forces us to use the original VA that the Host OS allocated.
>
> If using the common MR API you'd have to assign a unique linear range
> in the single device address map and record both the IOVA and the MMU
> VA in the kernel struct.
>
> Then when submitting work using that MR lkey the kernel will adjust
> the work VA using the equation (WORK_VA - IOVA) + MMU_VA before
> forwarding to HW.
>
We can't do that. That will kill the performance. If for every
submission I need to modify the packet's contents, the throughput will
go downhill.
Also, submissions to our RDMA QMANs are coupled with submissions to
our DMA/Compute QMANs. We can't separate those into different API calls.
That will also kill performance and in addition, will prevent us from
synchronizing all the engines.

I also have to say, it troubles me that you keep referring to our
device as an RDMA device. It is not an RDMA device. It is a
deep-learning accelerator which uses RDMA as a way to interconnect
multiple devices. We don't intend to replace General-Purpose RDMA
devices. We know we don't support that.
Therefore, I still fail to see why we need to support all the above...

Our work submission is not to just "send/receive packets". Sending
packets is part of a general recipe to do DMA, perform compute on data
and send/receive data. All together, in a synchronized fashion.

The way you try to force me to go is to separate that into different
functionality, as if I have different ASICs, which is very
counter-productive in terms of performance and simplicity. i.e. have
one method of submitting work to DMA/compute and another way to RDMA
ports.

I know this is how the kernel is structured now - subsystems for
devices that belong to a single domain (graphics, net, storage). But I
fear that you will soon see this paradigm doesn't work with new
devices in AI, which combine multiple domains into a single ASIC.

Greg, I would love to hear your opinion here. Am I totally wrong? Is
treating a single ASIC that belongs to multiple domains as if it were
multiple ASICs a good thing? Don't you think it will hurt
performance?

Oded

> EFA doesn't support rkeys, so they are not required to be emulated. It
> would have to create rkeys using some guadidv_reg_mr_rkey()
>
> It is important to understand that the usual way we support these
> non-RDMA devices is to insist that they use SW to construct a minimal
> standards based RDMA API, and then allow the device to have a 'dv' API
> to access a faster, highly device specific, SW bypass path.
>
> So for instance you might have some guadidv_post_work(qp) that doesn't
> use lkeys and works directly on the MMU_VA. A guadidv_get_mmu_va(mr)
> would return the required HW VA from the kernel.
>
> Usually the higher level communication library (UCX, MPI, etc) forms
> the dv primitives into something application usable.
>
> > we do if that VA is in the range of our HBM addresses ? The device
> > won't be able to distinguish between them. The transaction that is
> > generated by an engine inside our device will go to the HBM instead of
> > going to the PCI controller and then to the host.
> >
> > That's the crux of the problem and why we didn't use MR.
>
> No, the problem with the device is that it doesn't have a lkey/rkey,
> so it is stuck with a single translation domain. RoCE compliant
> devices are required to have multiple translation domains - each
> lkey/rkey specifies a unique translation.
>
> The MR concept is a region of process VA mapped into the device for
> device access, and this device *clearly* has that.
>
> Jason
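
For reference, the work-VA adjustment quoted above, (WORK_VA - IOVA) + MMU_VA,
can be sketched in C as follows. This is only an illustration under stated
assumptions: the struct mr_map record, its field names, and the mr_translate()
helper are hypothetical and are not taken from the habanalabs driver or from
the RDMA core. The bounds check is an added assumption about how a kernel
would validate the request before forwarding it to HW.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-MR record; names are illustrative only. */
struct mr_map {
    uint64_t iova;    /* start of the linear range assigned to the MR */
    uint64_t mmu_va;  /* start of the matching range in the device MMU */
    uint64_t length;  /* size of the region in bytes */
};

/*
 * Translate a work VA given in IOVA space into the device MMU VA using
 * (WORK_VA - IOVA) + MMU_VA. Returns false, leaving *hw_va untouched,
 * when [work_va, work_va + len) does not lie inside the MR.
 */
static bool mr_translate(const struct mr_map *mr, uint64_t work_va,
                         uint64_t len, uint64_t *hw_va)
{
    uint64_t off;

    if (work_va < mr->iova)
        return false;

    off = work_va - mr->iova;
    /* Written to avoid overflow in off + len. */
    if (off > mr->length || len > mr->length - off)
        return false;

    *hw_va = off + mr->mmu_va;
    return true;
}
```

With an MR whose iova is 0x10000, mmu_va is 0x7f0000000000 and length is
0x4000, a work VA of 0x10800 maps to 0x7f0000000800. The per-submission cost
of this step is the subtraction, the bounds check and the addition, plus
rewriting the address in the work request, which is the overhead the reply
above objects to.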