Date:	Sun, 7 Aug 2016 20:50:48 +0300
From:	Serge Semin <fancer.lancer@...il.com>
To:	Allen Hubbe <Allen.Hubbe@....com>
Cc:	jdmason@...zu.us, dave.jiang@...el.com, Xiangliang.Yu@....com,
	Sergey.Semin@...latforms.ru, linux-ntb@...glegroups.com,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/3] ntb: Add asynchronous devices support to NTB-bus
 interface

Hello Allen.

Thanks for your careful review. I hope that over the course of this thread we'll come up with solutions that improve the driver code and extend Linux kernel support to new devices like the IDT PCIe-switches.

Before getting to the inline comments I need to give some introduction to the IDT NTB-related hardware, so that we speak the same language. I'll also briefly explain how memory window setup works in IDT PCIe-switches.

First of all, before starting on the IDT NTB driver I studied the current NTB kernel API and the AMD/Intel hardware drivers. Since I don't have the hardware manuals, my knowledge may lack some details, but I understand how the AMD/Intel NTB hardware drivers work. At least I understand the concept of memory windowing, which led to the current NTB bus kernel API.

So let's get to the IDT PCIe-switches. IDT produces a whole series of NTB-related switches. I split them into two distinct groups:
1) Two NTB-ported switches (models 89PES8NT2, 89PES16NT2, 89PES12NT3, 89PES124NT3),
2) Multi NTB-ported switches (models 89HPES24NT6AG2, 89HPES32NT8AG2, 89HPES32NT8BG2, 89HPES12NT12G2, 89HPES16NT16G2, 89HPES24NT24G2, 89HPES32NT24AG2, 89HPES32NT24BG2).
Note that all of these switches are part of the IDT PRECISE(TM) family of PCI Express® switching solutions. Why do I split them up? For the following reasons:
1) The number of upstream ports that have access to NTB functions (obviously, yeah? =)). The switches of the first group can connect just two domains over NTB, whereas the switches of the second group allow several PCIe-switch ports with the NT-function activated to interact.
2) The two groups differ significantly in the way the NT-functions are configured.

Before going further, I should note that the uploaded driver supports the second group of devices only. Still, I'll give a comparative explanation, since the first group of switches is very similar to the AMD/Intel NTBs.

Let's dive into the configuration a bit deeper. The NT-functions of the first group of switches can be configured the same way as AMD/Intel NTB-functions. There is a PCIe endpoint configuration space, which fully reflects the cross-coupled local and peer PCIe/NTB settings, so the local root complex can set any of the peer registers by writing directly to mapped memory. Here is an image that illustrates the configuration register mapping:
https://s8.postimg.org/3nhkzqfxx/IDT_NTB_old_configspace.png
Since the first-group switches connect only two root complexes, the race between read/write operations on the cross-coupled registers can easily be resolved by distributing roles. The local root complex writes the translated base address directly to the peer configuration space registers that correspond to the locally mapped BAR0-BAR3 memory windows. Of course, 2-4 memory windows are enough to connect just two domains. That's why you made the NTB bus kernel API the way it is.

Things are different in the second group of switches, where one domain may need access to several others, with up to eight root complexes coupled together. First of all, the hardware doesn't support configuration space cross-coupling anymore. Instead, two Global Address Space Access registers are provided to access a peer's configuration space. In fact this is not a big problem, since there is not much difference between accessing registers over a memory mapped space and over a pair of fixed Address/Data registers. The problem arises when one wants to share memory windows between eight domains. Five BARs are not enough for that, even if they are all configured as 32-bit address BARs. Instead IDT introduces lookup table address translation: BAR2/BAR4 can be configured to translate addresses using 12- or 24-entry lookup tables. Each entry can be initialized with the translated base address of a peer and the IDT switch port that peer is connected to. So when the local root complex maps BAR2/BAR4, it can access a peer's memory just by reading/writing at the offset corresponding to the lookup table entry. That's how more than five peers can be accessed.

The root problem is the way the lookup table is accessed. Alas, it is accessible only through a pair of "Entry index/Data" registers: a root complex must write an entry index to one register, then read/write the data through the other. As you might realise, that weak point leads to a race condition when multiple root complexes access the lookup table of one shared peer. Alas, I could not come up with a simple and robust solution to the race.
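
To make the race concrete, here is a minimal user-space sketch in C. It is not driver code: the structure, the function names and the 24-entry table size are assumptions made for illustration, and the two "root complexes" are simply interleaved by hand rather than run concurrently.

```c
#include <assert.h>
#include <stdint.h>

#define LUT_ENTRIES 24u /* assumed table size, for illustration only */

/* Hypothetical model: a lookup table reachable only through an
 * "entry index" register and a "data" register. */
struct lut_regs {
	uint32_t index;
	uint64_t table[LUT_ENTRIES];
};

/* Step 1 of an access: choose the entry. */
static void lut_select(struct lut_regs *r, uint32_t idx)
{
	assert(idx < LUT_ENTRIES);
	r->index = idx;
}

/* Step 2 of an access: write the translated base address. */
static void lut_write(struct lut_regs *r, uint64_t xlat_base)
{
	r->table[r->index] = xlat_base;
}

/* Root complexes A and B program entries 1 and 2 of the same peer,
 * and their two-step accesses happen to interleave. */
static void lut_interleaved_demo(struct lut_regs *peer)
{
	lut_select(peer, 1);            /* A, step 1 */
	lut_select(peer, 2);            /* B, step 1: clobbers A's index */
	lut_write(peer, 0xA0000000ULL); /* A, step 2: lands in entry 2 */
	lut_write(peer, 0xB0000000ULL); /* B, step 2: overwrites entry 2 */
}
```

After the interleaved sequence entry 1 is still empty and entry 2 holds B's address, so A's translation is lost even though each root complex followed the two-step protocol correctly.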

That's why I've introduced asynchronous hardware support in the NTB bus kernel API. Since the local root complex can't directly write a translated base address to a peer, it must wait until the peer asks it to allocate memory, and then send the address back using some hardware mechanism. It can be anything: Scratchpad registers, Message registers or even "crazy" doorbell bit-banging. For instance, the IDT switches of the first group support:
1) Shared Memory windows. In particular, the local root complex can set a translated base address in the BARs of the local and peer NT-functions using the cross-coupled PCIe/NTB configuration space, the same way as it can be done for AMD/Intel NTBs.
2) One Doorbell register.
3) Two Scratchpads.
4) Four message registers.
As you can see, the switches of the first group can be considered both synchronous and asynchronous. The whole NTB bus kernel API can be implemented for them, including the changes introduced by this patch (I would do it if I had the corresponding hardware). AMD and Intel NTBs can be considered both synchronous and asynchronous as well, although they don't support messaging, so Scratchpads would have to be used to send data to a peer. Finally, the switches of the second group lack the ability to initialize the translated base addresses of peer BARs, due to the race condition I described before.
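
The asynchronous exchange described above (a peer asks for memory, the owner allocates it and sends the address back, and the requester programs only its own translation) can be sketched in user-space C as follows. This is a simplified model under stated assumptions: the message types, the single-slot inbox and all names are hypothetical, and the "allocation" is just an address passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

enum msg_type { MSG_NONE, MSG_MW_REQUEST, MSG_MW_ADDRESS };

/* Hypothetical message, standing in for the hardware message registers
 * (or Scratchpads on devices without messaging). */
struct msg {
	enum msg_type type;
	uint64_t payload;
};

/* Hypothetical per-domain state. */
struct domain {
	struct msg inbox;    /* last message received from the peer */
	uint64_t local_buf;  /* address of the memory shared with the peer */
	uint64_t local_xlat; /* translation set in the local NT-function */
};

static void msg_post(struct domain *to, enum msg_type type, uint64_t payload)
{
	assert(to);
	to->inbox.type = type;
	to->inbox.payload = payload;
}

/* Requester, step 1: ask the peer for a buffer of the given size. */
static void mw_request(struct domain *peer, uint64_t size)
{
	msg_post(peer, MSG_MW_REQUEST, size);
}

/* Owner: on a pending request, record the buffer and send its address
 * back. Returns 0 if a request was served, -1 otherwise. */
static int mw_serve(struct domain *self, struct domain *requester,
		    uint64_t buf_addr)
{
	if (self->inbox.type != MSG_MW_REQUEST)
		return -1;
	self->local_buf = buf_addr;
	self->inbox.type = MSG_NONE;
	msg_post(requester, MSG_MW_ADDRESS, buf_addr);
	return 0;
}

/* Requester, step 2: program only the local translation with the
 * address received from the peer. */
static int mw_complete(struct domain *self)
{
	if (self->inbox.type != MSG_MW_ADDRESS)
		return -1;
	self->local_xlat = self->inbox.payload;
	self->inbox.type = MSG_NONE;
	return 0;
}
```

The point of the design is that neither side ever writes to the other side's translation registers, so the index/data race on a shared lookup table cannot occur.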

To sum up, I've spent a lot of time designing the IDT NTB driver. I've done my best to make it as compatible with the current design as possible; nevertheless, the NTB bus kernel API had to be changed slightly. You can find my answers to the comments below.

On Fri, Aug 05, 2016 at 11:31:58AM -0400, Allen Hubbe <Allen.Hubbe@....com> wrote:
> From: Serge Semin
> > Currently supported AMD and Intel Non-transparent PCIe-bridges are synchronous
> > devices, so translated base address of memory windows can be directly written
> > to peer registers. But there are some IDT PCIe-switches which implement
> > complex interfaces using Lookup Tables of translation addresses. Due to
> > the way the table is accessed, it can not be done synchronously from different
> > RCs, that's why the asynchronous interface should be developed.
> > 
> > For this purpose the Memory Window related interface is correspondingly split
> > as it is for Doorbell and Scratchpad registers. The definition of Memory Window
> > is following: "It is a virtual memory region, which locally reflects a physical
> > memory of peer device." So to speak the "ntb_peer_mw_"-prefixed methods control
> > the peers memory windows, "ntb_mw_"-prefixed functions work with the local
> > memory windows.
> > Here is the description of the Memory Window related NTB-bus callback
> > functions:
> >  - ntb_mw_count() - number of local memory windows.
> >  - ntb_mw_get_maprsc() - get the physical address and size of the local memory
> >                          window to map.
> >  - ntb_mw_set_trans() - set translation address of local memory window (this
> >                         address should be somehow retrieved from a peer).
> >  - ntb_mw_get_trans() - get translation address of local memory window.
> >  - ntb_mw_get_align() - get alignment of translated base address and size of
> >                         local memory window. Additionally one can get the
> >                         upper size limit of the memory window.
> >  - ntb_peer_mw_count() - number of peer memory windows (it can differ from the
> >                          local number).
> >  - ntb_peer_mw_set_trans() - set translation address of peer memory window
> >  - ntb_peer_mw_get_trans() - get translation address of peer memory window
> >  - ntb_peer_mw_get_align() - get alignment of translated base address and size
> >                              of peer memory window.Additionally one can get the
> >                              upper size limit of the memory window.
> > 
> > As one can see current AMD and Intel NTB drivers mostly implement the
> > "ntb_peer_mw_"-prefixed methods. So this patch correspondingly renames the
> > driver functions. IDT NTB driver mostly exposes "ntb_mw_"-prefixed methods,
> > since it doesn't have convenient access to the peer Lookup Table.
> > 
> > In order to pass information from one RC to another NTB functions of IDT
> > PCIe-switch implement Messaging subsystem. They currently support four message
> > registers to transfer DWORD sized data to a specified peer. So two
> > new callback methods are introduced:
> >  - ntb_msg_size() - get the number of DWORDs supported by NTB function to send
> >                     and receive messages
> >  - ntb_msg_post() - send message of size retrieved from ntb_msg_size()
> >                     to a peer
> > Additionally there is a new event function:
> >  - ntb_msg_event() - it is invoked when either a new message was retrieved
> >                      (NTB_MSG_NEW), or last message was successfully sent
> >                      (NTB_MSG_SENT), or the last message failed to be sent
> >                      (NTB_MSG_FAIL).
> > 
> > The last change concerns the IDs (practically names) of NTB-devices on the
> > NTB-bus. It is not good to have the devices with same names in the system
> > and it prevents my IDT NTB driver from being loaded =) So I developed a simple
> > NTB device naming algorithm. In particular, it generates names "ntbS{N}" for
> > synchronous devices, "ntbA{N}" for asynchronous devices, and "ntbAS{N}" for
> > devices supporting both interfaces.
> 
> Thanks for the work that went into writing this driver, and thanks for your patience with the review.  Please read my initial comments inline.  I would like to approach this from a top-down api perspective first, and settle on that first before requesting any specific changes in the hardware driver.  My major concern about these changes is that they introduce a distinct classification for sync and async hardware, supported by different sets of methods in the api, neither is a subset of the other.
> 
> You know the IDT hardware, so if any of my requests below are infeasible, I would like your constructive opinion (even if it means significant changes to existing drivers) on how to resolve the api so that new and existing hardware drivers can be unified under the same api, if possible.

I understand your concern, and I have been thinking about this a lot. In my opinion the alterations proposed in this patch are the best of all the variants I've considered. As for neither API being a subset of the other, I would not agree with that. As I described in the introduction, AMD and Intel devices can be considered both synchronous and asynchronous, since a translated base address can be set directly in both the local and the peer configuration space. Although AMD and Intel devices don't support messaging, they have Scratchpads, which can be used to exchange information between root complexes. What we need to do is implement ntb_mw_set_trans() and ntb_mw_get_align() for them, which isn't much different from the "peer_mw"-prefixed methods. The first method just writes a translated base address to the corresponding local register. The second one does exactly the same as its "peer_mw"-prefixed counterpart. I would do it, but I don't have the hardware to test it on, which is why I left things the way they were, with just slight changes of names.

> 
> > 
> > Signed-off-by: Serge Semin <fancer.lancer@...il.com>
> > 
> > ---
> >  drivers/ntb/Kconfig                 |   4 +-
> >  drivers/ntb/hw/amd/ntb_hw_amd.c     |  49 ++-
> >  drivers/ntb/hw/intel/ntb_hw_intel.c |  59 +++-
> >  drivers/ntb/ntb.c                   |  86 +++++-
> >  drivers/ntb/ntb_transport.c         |  19 +-
> >  drivers/ntb/test/ntb_perf.c         |  16 +-
> >  drivers/ntb/test/ntb_pingpong.c     |   5 +
> >  drivers/ntb/test/ntb_tool.c         |  25 +-
> >  include/linux/ntb.h                 | 600 +++++++++++++++++++++++++++++-------
> >  9 files changed, 701 insertions(+), 162 deletions(-)
> > 
> > diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> > index 95944e5..67d80c4 100644
> > --- a/drivers/ntb/Kconfig
> > +++ b/drivers/ntb/Kconfig
> > @@ -14,8 +14,6 @@ if NTB
> > 
> >  source "drivers/ntb/hw/Kconfig"
> > 
> > -source "drivers/ntb/test/Kconfig"
> > -
> >  config NTB_TRANSPORT
> >  	tristate "NTB Transport Client"
> >  	help
> > @@ -25,4 +23,6 @@ config NTB_TRANSPORT
> > 
> >  	 If unsure, say N.
> > 
> > +source "drivers/ntb/test/Kconfig"
> > +
> >  endif # NTB
> > diff --git a/drivers/ntb/hw/amd/ntb_hw_amd.c b/drivers/ntb/hw/amd/ntb_hw_amd.c
> > index 6ccba0d..ab6f353 100644
> > --- a/drivers/ntb/hw/amd/ntb_hw_amd.c
> > +++ b/drivers/ntb/hw/amd/ntb_hw_amd.c
> > @@ -55,6 +55,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/random.h>
> >  #include <linux/slab.h>
> > +#include <linux/sizes.h>
> >  #include <linux/ntb.h>
> > 
> >  #include "ntb_hw_amd.h"
> > @@ -84,11 +85,8 @@ static int amd_ntb_mw_count(struct ntb_dev *ntb)
> >  	return ntb_ndev(ntb)->mw_count;
> >  }
> > 
> > -static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> > -				phys_addr_t *base,
> > -				resource_size_t *size,
> > -				resource_size_t *align,
> > -				resource_size_t *align_size)
> > +static int amd_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> > +				 phys_addr_t *base, resource_size_t *size)
> >  {
> >  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> >  	int bar;
> > @@ -103,17 +101,40 @@ static int amd_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> >  	if (size)
> >  		*size = pci_resource_len(ndev->ntb.pdev, bar);
> > 
> > -	if (align)
> > -		*align = SZ_4K;
> > +	return 0;
> > +}
> > +
> > +static int amd_ntb_peer_mw_count(struct ntb_dev *ntb)
> > +{
> > +	return ntb_ndev(ntb)->mw_count;
> > +}
> > +
> > +static int amd_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> > +				     resource_size_t *addr_align,
> > +				     resource_size_t *size_align,
> > +				     resource_size_t *size_max)
> > +{
> > +	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> > +	int bar;
> > +
> > +	bar = ndev_mw_to_bar(ndev, idx);
> > +	if (bar < 0)
> > +		return bar;
> > +
> > +	if (addr_align)
> > +		*addr_align = SZ_4K;
> > +
> > +	if (size_align)
> > +		*size_align = 1;
> > 
> > -	if (align_size)
> > -		*align_size = 1;
> > +	if (size_max)
> > +		*size_max = pci_resource_len(ndev->ntb.pdev, bar);
> > 
> >  	return 0;
> >  }
> > 
> > -static int amd_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> > -				dma_addr_t addr, resource_size_t size)
> > +static int amd_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> > +				     dma_addr_t addr, resource_size_t size)
> >  {
> >  	struct amd_ntb_dev *ndev = ntb_ndev(ntb);
> >  	unsigned long xlat_reg, limit_reg = 0;
> > @@ -432,8 +453,10 @@ static int amd_ntb_peer_spad_write(struct ntb_dev *ntb,
> > 
> >  static const struct ntb_dev_ops amd_ntb_ops = {
> >  	.mw_count		= amd_ntb_mw_count,
> > -	.mw_get_range		= amd_ntb_mw_get_range,
> > -	.mw_set_trans		= amd_ntb_mw_set_trans,
> > +	.mw_get_maprsc		= amd_ntb_mw_get_maprsc,
> > +	.peer_mw_count		= amd_ntb_peer_mw_count,
> > +	.peer_mw_get_align	= amd_ntb_peer_mw_get_align,
> > +	.peer_mw_set_trans	= amd_ntb_peer_mw_set_trans,
> >  	.link_is_up		= amd_ntb_link_is_up,
> >  	.link_enable		= amd_ntb_link_enable,
> >  	.link_disable		= amd_ntb_link_disable,
> > diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.c b/drivers/ntb/hw/intel/ntb_hw_intel.c
> > index 40d04ef..fdb2838 100644
> > --- a/drivers/ntb/hw/intel/ntb_hw_intel.c
> > +++ b/drivers/ntb/hw/intel/ntb_hw_intel.c
> > @@ -804,11 +804,8 @@ static int intel_ntb_mw_count(struct ntb_dev *ntb)
> >  	return ntb_ndev(ntb)->mw_count;
> >  }
> > 
> > -static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> > -				  phys_addr_t *base,
> > -				  resource_size_t *size,
> > -				  resource_size_t *align,
> > -				  resource_size_t *align_size)
> > +static int intel_ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> > +				   phys_addr_t *base, resource_size_t *size)
> >  {
> >  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> >  	int bar;
> > @@ -828,17 +825,51 @@ static int intel_ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> >  		*size = pci_resource_len(ndev->ntb.pdev, bar) -
> >  			(idx == ndev->b2b_idx ? ndev->b2b_off : 0);
> > 
> > -	if (align)
> > -		*align = pci_resource_len(ndev->ntb.pdev, bar);
> > +	return 0;
> > +}
> > +
> > +static int intel_ntb_peer_mw_count(struct ntb_dev *ntb)
> > +{
> > +	return ntb_ndev(ntb)->mw_count;
> > +}
> > +
> > +static int intel_ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> > +				       resource_size_t *addr_align,
> > +				       resource_size_t *size_align,
> > +				       resource_size_t *size_max)
> > +{
> > +	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> > +	resource_size_t bar_size, mw_size;
> > +	int bar;
> > +
> > +	if (idx >= ndev->b2b_idx && !ndev->b2b_off)
> > +		idx += 1;
> > +
> > +	bar = ndev_mw_to_bar(ndev, idx);
> > +	if (bar < 0)
> > +		return bar;
> > +
> > +	bar_size = pci_resource_len(ndev->ntb.pdev, bar);
> > +
> > +	if (idx == ndev->b2b_idx)
> > +		mw_size = bar_size - ndev->b2b_off;
> > +	else
> > +		mw_size = bar_size;
> > +
> > +	if (addr_align)
> > +		*addr_align = bar_size;
> > +
> > +	if (size_align)
> > +		*size_align = 1;
> > 
> > -	if (align_size)
> > -		*align_size = 1;
> > +	if (size_max)
> > +		*size_max = mw_size;
> > 
> >  	return 0;
> >  }
> > 
> > -static int intel_ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> > -				  dma_addr_t addr, resource_size_t size)
> > +static int intel_ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> > +				       dma_addr_t addr, resource_size_t size)
> >  {
> >  	struct intel_ntb_dev *ndev = ntb_ndev(ntb);
> >  	unsigned long base_reg, xlat_reg, limit_reg;
> > @@ -2220,8 +2251,10 @@ static struct intel_b2b_addr xeon_b2b_dsd_addr = {
> >  /* operations for primary side of local ntb */
> >  static const struct ntb_dev_ops intel_ntb_ops = {
> >  	.mw_count		= intel_ntb_mw_count,
> > -	.mw_get_range		= intel_ntb_mw_get_range,
> > -	.mw_set_trans		= intel_ntb_mw_set_trans,
> > +	.mw_get_maprsc		= intel_ntb_mw_get_maprsc,
> > +	.peer_mw_count		= intel_ntb_peer_mw_count,
> > +	.peer_mw_get_align	= intel_ntb_peer_mw_get_align,
> > +	.peer_mw_set_trans	= intel_ntb_peer_mw_set_trans,
> >  	.link_is_up		= intel_ntb_link_is_up,
> >  	.link_enable		= intel_ntb_link_enable,
> >  	.link_disable		= intel_ntb_link_disable,
> > diff --git a/drivers/ntb/ntb.c b/drivers/ntb/ntb.c
> > index 2e25307..37c3b36 100644
> > --- a/drivers/ntb/ntb.c
> > +++ b/drivers/ntb/ntb.c
> > @@ -54,6 +54,7 @@
> >  #include <linux/device.h>
> >  #include <linux/kernel.h>
> >  #include <linux/module.h>
> > +#include <linux/atomic.h>
> > 
> >  #include <linux/ntb.h>
> >  #include <linux/pci.h>
> > @@ -72,8 +73,62 @@ MODULE_AUTHOR(DRIVER_AUTHOR);
> >  MODULE_DESCRIPTION(DRIVER_DESCRIPTION);
> > 
> >  static struct bus_type ntb_bus;
> > +static struct ntb_bus_data ntb_data;
> >  static void ntb_dev_release(struct device *dev);
> > 
> > +static int ntb_gen_devid(struct ntb_dev *ntb)
> > +{
> > +	const char *name;
> > +	unsigned long *mask;
> > +	int id;
> > +
> > +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> > +		name = "ntbAS%d";
> > +		mask = ntb_data.both_msk;
> > +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> > +		name = "ntbS%d";
> > +		mask = ntb_data.sync_msk;
> > +	} else if (ntb_valid_async_dev_ops(ntb)) {
> > +		name = "ntbA%d";
> > +		mask = ntb_data.async_msk;
> > +	} else {
> > +		return -EINVAL;
> > +	}
> > +
> > +	for (id = 0; NTB_MAX_DEVID > id; id++) {
> > +		if (0 == test_and_set_bit(id, mask)) {
> > +			ntb->id = id;
> > +			break;
> > +		}
> > +	}
> > +
> > +	if (NTB_MAX_DEVID > id) {
> > +		dev_set_name(&ntb->dev, name, ntb->id);
> > +	} else {
> > +		return -ENOMEM;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void ntb_free_devid(struct ntb_dev *ntb)
> > +{
> > +	unsigned long *mask;
> > +
> > +	if (ntb_valid_sync_dev_ops(ntb) && ntb_valid_async_dev_ops(ntb)) {
> > +		mask = ntb_data.both_msk;
> > +	} else if (ntb_valid_sync_dev_ops(ntb)) {
> > +		mask = ntb_data.sync_msk;
> > +	} else if (ntb_valid_async_dev_ops(ntb)) {
> > +		mask = ntb_data.async_msk;
> > +	} else {
> > +		/* It's impossible */
> > +		BUG();
> > +	}
> > +
> > +	clear_bit(ntb->id, mask);
> > +}
> > +
> >  int __ntb_register_client(struct ntb_client *client, struct module *mod,
> >  			  const char *mod_name)
> >  {
> > @@ -99,13 +154,15 @@ EXPORT_SYMBOL(ntb_unregister_client);
> > 
> >  int ntb_register_device(struct ntb_dev *ntb)
> >  {
> > +	int ret;
> > +
> >  	if (!ntb)
> >  		return -EINVAL;
> >  	if (!ntb->pdev)
> >  		return -EINVAL;
> >  	if (!ntb->ops)
> >  		return -EINVAL;
> > -	if (!ntb_dev_ops_is_valid(ntb->ops))
> > +	if (!ntb_valid_sync_dev_ops(ntb) && !ntb_valid_async_dev_ops(ntb))
> >  		return -EINVAL;
> > 
> >  	init_completion(&ntb->released);
> > @@ -114,13 +171,21 @@ int ntb_register_device(struct ntb_dev *ntb)
> >  	ntb->dev.bus = &ntb_bus;
> >  	ntb->dev.parent = &ntb->pdev->dev;
> >  	ntb->dev.release = ntb_dev_release;
> > -	dev_set_name(&ntb->dev, "%s", pci_name(ntb->pdev));
> > 
> >  	ntb->ctx = NULL;
> >  	ntb->ctx_ops = NULL;
> >  	spin_lock_init(&ntb->ctx_lock);
> > 
> > -	return device_register(&ntb->dev);
> > +	/* No need to wait for completion if failed */
> > +	ret = ntb_gen_devid(ntb);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = device_register(&ntb->dev);
> > +	if (ret)
> > +		ntb_free_devid(ntb);
> > +
> > +	return ret;
> >  }
> >  EXPORT_SYMBOL(ntb_register_device);
> > 
> > @@ -128,6 +193,7 @@ void ntb_unregister_device(struct ntb_dev *ntb)
> >  {
> >  	device_unregister(&ntb->dev);
> >  	wait_for_completion(&ntb->released);
> > +	ntb_free_devid(ntb);
> >  }
> >  EXPORT_SYMBOL(ntb_unregister_device);
> > 
> > @@ -191,6 +257,20 @@ void ntb_db_event(struct ntb_dev *ntb, int vector)
> >  }
> >  EXPORT_SYMBOL(ntb_db_event);
> > 
> > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > +		   struct ntb_msg *msg)
> > +{
> > +	unsigned long irqflags;
> > +
> > +	spin_lock_irqsave(&ntb->ctx_lock, irqflags);
> > +	{
> > +		if (ntb->ctx_ops && ntb->ctx_ops->msg_event)
> > +			ntb->ctx_ops->msg_event(ntb->ctx, ev, msg);
> > +	}
> > +	spin_unlock_irqrestore(&ntb->ctx_lock, irqflags);
> > +}
> > +EXPORT_SYMBOL(ntb_msg_event);
> > +
> >  static int ntb_probe(struct device *dev)
> >  {
> >  	struct ntb_dev *ntb;
> > diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
> > index d5c5894..2626ba0 100644
> > --- a/drivers/ntb/ntb_transport.c
> > +++ b/drivers/ntb/ntb_transport.c
> > @@ -673,7 +673,7 @@ static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
> >  	if (!mw->virt_addr)
> >  		return;
> > 
> > -	ntb_mw_clear_trans(nt->ndev, num_mw);
> > +	ntb_peer_mw_set_trans(nt->ndev, num_mw, 0, 0);
> >  	dma_free_coherent(&pdev->dev, mw->buff_size,
> >  			  mw->virt_addr, mw->dma_addr);
> >  	mw->xlat_size = 0;
> > @@ -730,7 +730,8 @@ static int ntb_set_mw(struct ntb_transport_ctx *nt, int num_mw,
> >  	}
> > 
> >  	/* Notify HW the memory location of the receive buffer */
> > -	rc = ntb_mw_set_trans(nt->ndev, num_mw, mw->dma_addr, mw->xlat_size);
> > +	rc = ntb_peer_mw_set_trans(nt->ndev, num_mw, mw->dma_addr,
> > +				   mw->xlat_size);
> >  	if (rc) {
> >  		dev_err(&pdev->dev, "Unable to set mw%d translation", num_mw);
> >  		ntb_free_mw(nt, num_mw);
> > @@ -1060,7 +1061,11 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >  	int node;
> >  	int rc, i;
> > 
> > -	mw_count = ntb_mw_count(ndev);
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ndev))
> > +		return -EINVAL;
> > +
> > +	mw_count = ntb_peer_mw_count(ndev);
> >  	if (ntb_spad_count(ndev) < (NUM_MWS + 1 + mw_count * 2)) {
> >  		dev_err(&ndev->dev, "Not enough scratch pad registers for %s",
> >  			NTB_TRANSPORT_NAME);
> > @@ -1094,8 +1099,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
> >  	for (i = 0; i < mw_count; i++) {
> >  		mw = &nt->mw_vec[i];
> > 
> > -		rc = ntb_mw_get_range(ndev, i, &mw->phys_addr, &mw->phys_size,
> > -				      &mw->xlat_align, &mw->xlat_align_size);
> > +		rc = ntb_mw_get_maprsc(ndev, i, &mw->phys_addr, &mw->phys_size);
> > +		if (rc)
> > +			goto err1;
> > +
> > +		rc = ntb_peer_mw_get_align(ndev, i, &mw->xlat_align,
> > +					   &mw->xlat_align_size, NULL);
> 
> Looks like ntb_mw_get_range() was simpler before the change.
> 

Even if I hadn't changed the NTB bus kernel API, I would have split them up anyway. First of all, functions with a long argument list look more confusing than ones with a shorter list; a shorter list helps to stick to the "80 characters per line" rule and improves readability. Secondly, splitting the function improves the readability of the code in general. When I first saw the function name "ntb_mw_get_range()", it was not obvious what kind of ranges the function returned. The function did not follow the unofficial "high code coherence" rule: it is better when one function does one coherent thing and returns coherent data. In particular, "ntb_mw_get_range()" returned the mapping address and size of a local memory window as well as the alignment of the memory allocated for a peer. Now the "ntb_mw_get_maprsc()" method returns only the mapping resources. If a local NTB client driver is not going to allocate any memory, it doesn't need to call the "ntb_peer_mw_get_align()" method at all. I understand that a client driver could pass NULL for the unused arguments of "ntb_mw_get_range()", but the new design is still more readable.

Additionally, I've split them up because of the way the asynchronous interface works. The IDT driver cannot safely perform ntb_peer_mw_set_trans(), which is why I had to add ntb_mw_set_trans(). Each of these methods should logically have a related "ntb_*mw_get_align()" method. ntb_mw_get_align() gives a local client driver a hint about how the translated base address retrieved from the peer must be aligned for ntb_mw_set_trans() to succeed. ntb_peer_mw_get_align() gives a hint about how the local memory buffer should be allocated to satisfy the alignment of the peer's translated base address. In other words, it returns the restrictions on the parameters of "ntb_peer_mw_set_trans()".
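
As an illustration of how a client could use the three values reported by a "*_get_align()" style call before attempting to set a translation, here is a small user-space sketch in C. The struct and the helper are hypothetical and not part of the patch; the field names only mirror the addr_align/size_align/size_max arguments shown in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the values a "*_get_align()" call reports. */
struct mw_align {
	uint64_t addr_align; /* alignment of the translated base address */
	uint64_t size_align; /* alignment of the window size */
	uint64_t size_max;   /* upper limit of the window size */
};

/* Returns 0 if (addr, size) satisfy the restrictions, -1 otherwise. */
static int mw_check_trans(const struct mw_align *al, uint64_t addr,
			  uint64_t size)
{
	/* A zero alignment would make the modulo below undefined. */
	assert(al->addr_align != 0 && al->size_align != 0);

	if (addr % al->addr_align)
		return -1;
	if (size % al->size_align)
		return -1;
	if (size > al->size_max)
		return -1;
	return 0;
}
```

With a 4 KiB address alignment, a size alignment of 1 and a 1 MiB limit (similar in shape to the values the AMD callback in the diff reports), an address of 0x10000 passes the check while 0x10010 does not.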

Finally, the IDT driver is designed so that the Primary and Secondary ports can support different numbers of memory windows. Consequently, the methods "ntb_mw_get_maprsc()/ntb_mw_set_trans()/ntb_mw_get_trans()/ntb_mw_get_align()" accept a different range of values for the second argument (the memory window index), bounded by "ntb_mw_count()", than the methods "ntb_peer_mw_set_trans()/ntb_peer_mw_get_trans()/ntb_peer_mw_get_align()", whose index is bounded by "ntb_peer_mw_count()".
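
A tiny sketch of that index rule, again in user-space C with hypothetical names: the same index can be valid for the local set of methods and invalid for the peer set, because each set is bounded by its own count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counts, as ntb_mw_count() and ntb_peer_mw_count()
 * might report them for one device. */
struct mw_counts {
	int local;
	int peer;
};

enum mw_side { MW_LOCAL, MW_PEER };

/* An index is valid only against the count of the side it refers to. */
static bool mw_idx_valid(const struct mw_counts *c, enum mw_side side,
			 int idx)
{
	int limit = (side == MW_LOCAL) ? c->local : c->peer;

	assert(limit >= 0);
	return idx >= 0 && idx < limit;
}
```

For example, with 24 local windows and 4 peer windows (made-up numbers), index 10 is acceptable for the "ntb_mw_"-prefixed methods but not for the "ntb_peer_mw_"-prefixed ones.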

In short, the split was really necessary to make the API more logical.

> >  		if (rc)
> >  			goto err1;
> > 
> > diff --git a/drivers/ntb/test/ntb_perf.c b/drivers/ntb/test/ntb_perf.c
> > index 6a50f20..f2952f7 100644
> > --- a/drivers/ntb/test/ntb_perf.c
> > +++ b/drivers/ntb/test/ntb_perf.c
> > @@ -452,7 +452,7 @@ static void perf_free_mw(struct perf_ctx *perf)
> >  	if (!mw->virt_addr)
> >  		return;
> > 
> > -	ntb_mw_clear_trans(perf->ntb, 0);
> > +	ntb_peer_mw_set_trans(perf->ntb, 0, 0, 0);
> >  	dma_free_coherent(&pdev->dev, mw->buf_size,
> >  			  mw->virt_addr, mw->dma_addr);
> >  	mw->xlat_size = 0;
> > @@ -488,7 +488,7 @@ static int perf_set_mw(struct perf_ctx *perf, resource_size_t size)
> >  		mw->buf_size = 0;
> >  	}
> > 
> > -	rc = ntb_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
> > +	rc = ntb_peer_mw_set_trans(perf->ntb, 0, mw->dma_addr, mw->xlat_size);
> >  	if (rc) {
> >  		dev_err(&perf->ntb->dev, "Unable to set mw0 translation\n");
> >  		perf_free_mw(perf);
> > @@ -559,8 +559,12 @@ static int perf_setup_mw(struct ntb_dev *ntb, struct perf_ctx *perf)
> > 
> >  	mw = &perf->mw;
> > 
> > -	rc = ntb_mw_get_range(ntb, 0, &mw->phys_addr, &mw->phys_size,
> > -			      &mw->xlat_align, &mw->xlat_align_size);
> > +	rc = ntb_mw_get_maprsc(ntb, 0, &mw->phys_addr, &mw->phys_size);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = ntb_peer_mw_get_align(ntb, 0, &mw->xlat_align,
> > +				   &mw->xlat_align_size, NULL);
> 
> Looks like ntb_mw_get_range() was simpler.
> 

See the previous answer.

> >  	if (rc)
> >  		return rc;
> > 
> > @@ -758,6 +762,10 @@ static int perf_probe(struct ntb_client *client, struct ntb_dev *ntb)
> >  	int node;
> >  	int rc = 0;
> > 
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ntb))
> > +		return -EINVAL;
> > +
> >  	if (ntb_spad_count(ntb) < MAX_SPAD) {
> >  		dev_err(&ntb->dev, "Not enough scratch pad registers for %s",
> >  			DRIVER_NAME);
> > diff --git a/drivers/ntb/test/ntb_pingpong.c b/drivers/ntb/test/ntb_pingpong.c
> > index 7d31179..e833649 100644
> > --- a/drivers/ntb/test/ntb_pingpong.c
> > +++ b/drivers/ntb/test/ntb_pingpong.c
> > @@ -214,6 +214,11 @@ static int pp_probe(struct ntb_client *client,
> >  	struct pp_ctx *pp;
> >  	int rc;
> > 
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > +		return -EINVAL;
> > +	}
> > +
> >  	if (ntb_db_is_unsafe(ntb)) {
> >  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
> >  		if (!unsafe) {
> > diff --git a/drivers/ntb/test/ntb_tool.c b/drivers/ntb/test/ntb_tool.c
> > index 61bf2ef..5dfe12f 100644
> > --- a/drivers/ntb/test/ntb_tool.c
> > +++ b/drivers/ntb/test/ntb_tool.c
> > @@ -675,8 +675,11 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t req_size)
> >  	if (mw->peer)
> >  		return 0;
> > 
> > -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &size, &align,
> > -			      &align_size);
> > +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &size);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = ntb_peer_mw_get_align(tc->ntb, idx, &align, &align_size, NULL);
> >  	if (rc)
> >  		return rc;
> 
> Looks like ntb_mw_get_range() was simpler.
> 

See the previous answer.

> > 
> > @@ -689,7 +692,7 @@ static int tool_setup_mw(struct tool_ctx *tc, int idx, size_t req_size)
> >  	if (!mw->peer)
> >  		return -ENOMEM;
> > 
> > -	rc = ntb_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
> > +	rc = ntb_peer_mw_set_trans(tc->ntb, idx, mw->peer_dma, mw->size);
> >  	if (rc)
> >  		goto err_free_dma;
> > 
> > @@ -716,7 +719,7 @@ static void tool_free_mw(struct tool_ctx *tc, int idx)
> >  	struct tool_mw *mw = &tc->mws[idx];
> > 
> >  	if (mw->peer) {
> > -		ntb_mw_clear_trans(tc->ntb, idx);
> > +		ntb_peer_mw_set_trans(tc->ntb, idx, 0, 0);
> >  		dma_free_coherent(&tc->ntb->pdev->dev, mw->size,
> >  				  mw->peer,
> >  				  mw->peer_dma);
> > @@ -751,8 +754,8 @@ static ssize_t tool_peer_mw_trans_read(struct file *filep,
> >  	if (!buf)
> >  		return -ENOMEM;
> > 
> > -	ntb_mw_get_range(mw->tc->ntb, mw->idx,
> > -			 &base, &mw_size, &align, &align_size);
> > +	ntb_mw_get_maprsc(mw->tc->ntb, mw->idx, &base, &mw_size);
> > +	ntb_peer_mw_get_align(mw->tc->ntb, mw->idx, &align, &align_size, NULL);
> > 
> >  	off += scnprintf(buf + off, buf_size - off,
> >  			 "Peer MW %d Information:\n", mw->idx);
> > @@ -827,8 +830,7 @@ static int tool_init_mw(struct tool_ctx *tc, int idx)
> >  	phys_addr_t base;
> >  	int rc;
> > 
> > -	rc = ntb_mw_get_range(tc->ntb, idx, &base, &mw->win_size,
> > -			      NULL, NULL);
> > +	rc = ntb_mw_get_maprsc(tc->ntb, idx, &base, &mw->win_size);
> >  	if (rc)
> >  		return rc;
> > 
> > @@ -913,6 +915,11 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
> >  	int rc;
> >  	int i;
> > 
> > +	/* Synchronous hardware is only supported */
> > +	if (!ntb_valid_sync_dev_ops(ntb)) {
> > +		return -EINVAL;
> > +	}
> > +
> 
> It would be nice if both types could be supported by the same api.
> 

Yes, it would be. Alas, it isn't possible in general. See the introduction to this letter. AMD and Intel devices support the asynchronous interface, although they lack a messaging mechanism.

Getting back to the discussion, we still need to provide a way to determine which type of interface an NTB device supports: synchronous or asynchronous translated base address initialization, Scratchpads and memory windows. Currently it can be determined with the functions ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops(). I understand that it's not the best solution. We could implement the traditional Linux kernel bus device-driver matching, using table_ids and so on. For example, each hardware driver fills in a table with all the functionality it supports, like synchronous/asynchronous memory windows, Doorbells, Scratchpads and Messaging. Then a client driver initializes a table of the functionality it uses. The NTB bus core implements a "match()" callback, which compares those two tables and calls the "probe()" callback method of the driver when the tables match.

On the other hand, we might not have to complicate the NTB bus core. We can just introduce a table_id for an NTB hardware device, which would just describe the device vendor itself, like "ntb,amd", "ntb,intel", "ntb,idt" and so on. A client driver will declare the devices it supports by their table_id. It might look easier, since the client driver developer should have a basic understanding of the device the driver is developed for. Then the NTB bus core will simply match NTB devices with drivers like any other bus (PCI, PCIe, i2c, spi, etc.) does.
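
To make the first variant (matching by functionality tables) a bit more concrete, here is a minimal sketch in plain C. It is not part of the patch: the feature bits, struct ntb_feat_table and ntb_feat_match() are hypothetical names used only for illustration, and the table is reduced to a single bitmask. The idea is that a client matches a device only if every feature the client uses is offered by the hardware driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; none of these names exist in the NTB API. */
enum ntb_feat {
	NTB_FEAT_MW_SYNC  = 1u << 0,	/* sync memory windows */
	NTB_FEAT_MW_ASYNC = 1u << 1,	/* async memory windows */
	NTB_FEAT_DB       = 1u << 2,	/* Doorbells */
	NTB_FEAT_SPAD     = 1u << 3,	/* Scratchpads */
	NTB_FEAT_MSG      = 1u << 4,	/* Messaging */
};

/* Assumed table layout: one mask per hardware driver and per client. */
struct ntb_feat_table {
	uint32_t feats;
};

/*
 * What a bus "match()" callback could do: succeed only if all the
 * features used by the client are supported by the device.
 */
static bool ntb_feat_match(const struct ntb_feat_table *dev,
			   const struct ntb_feat_table *drv)
{
	assert(dev && drv);

	return (dev->feats & drv->feats) == drv->feats;
}
```

With such a check a client that needs Messaging would simply never be probed against a device that offers only Scratchpads, so the explicit ntb_valid_*_dev_ops() calls in every probe() would not be needed.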
 
> >  	if (ntb_db_is_unsafe(ntb))
> >  		dev_dbg(&ntb->dev, "doorbell is unsafe\n");
> > 
> > @@ -928,7 +935,7 @@ static int tool_probe(struct ntb_client *self, struct ntb_dev *ntb)
> >  	tc->ntb = ntb;
> >  	init_waitqueue_head(&tc->link_wq);
> > 
> > -	tc->mw_count = min(ntb_mw_count(tc->ntb), MAX_MWS);
> > +	tc->mw_count = min(ntb_peer_mw_count(tc->ntb), MAX_MWS);
> >  	for (i = 0; i < tc->mw_count; i++) {
> >  		rc = tool_init_mw(tc, i);
> >  		if (rc)
> > diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> > index 6f47562..d1937d3 100644
> > --- a/include/linux/ntb.h
> > +++ b/include/linux/ntb.h
> > @@ -159,13 +159,44 @@ static inline int ntb_client_ops_is_valid(const struct
> > ntb_client_ops *ops)
> >  }
> > 
> >  /**
> > + * struct ntb_msg - ntb driver message structure
> > + * @type:	Message type.
> > + * @payload:	Payload data to send to a peer
> > + * @data:	Array of u32 data to send (size might be hw dependent)
> > + */
> > +#define NTB_MAX_MSGSIZE 4
> > +struct ntb_msg {
> > +	union {
> > +		struct {
> > +			u32 type;
> > +			u32 payload[NTB_MAX_MSGSIZE - 1];
> > +		};
> > +		u32 data[NTB_MAX_MSGSIZE];
> > +	};
> > +};
> > +
> > +/**
> > + * enum NTB_MSG_EVENT - message event types
> > + * @NTB_MSG_NEW:	New message just arrived and passed to the handler
> > + * @NTB_MSG_SENT:	Posted message has just been successfully sent
> > + * @NTB_MSG_FAIL:	Posted message failed to be sent
> > + */
> > +enum NTB_MSG_EVENT {
> > +	NTB_MSG_NEW,
> > +	NTB_MSG_SENT,
> > +	NTB_MSG_FAIL
> > +};
> > +
> > +/**
> >   * struct ntb_ctx_ops - ntb driver context operations
> >   * @link_event:		See ntb_link_event().
> >   * @db_event:		See ntb_db_event().
> > + * @msg_event:		See ntb_msg_event().
> >   */
> >  struct ntb_ctx_ops {
> >  	void (*link_event)(void *ctx);
> >  	void (*db_event)(void *ctx, int db_vector);
> > +	void (*msg_event)(void *ctx, enum NTB_MSG_EVENT ev, struct ntb_msg *msg);
> >  };
> > 
> >  static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops *ops)
> > @@ -174,18 +205,24 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> > *ops)
> >  	return
> >  		/* ops->link_event		&& */
> >  		/* ops->db_event		&& */
> > +		/* ops->msg_event		&& */
> >  		1;
> >  }
> > 
> >  /**
> >   * struct ntb_ctx_ops - ntb device operations
> > - * @mw_count:		See ntb_mw_count().
> > - * @mw_get_range:	See ntb_mw_get_range().
> > - * @mw_set_trans:	See ntb_mw_set_trans().
> > - * @mw_clear_trans:	See ntb_mw_clear_trans().
> >   * @link_is_up:		See ntb_link_is_up().
> >   * @link_enable:	See ntb_link_enable().
> >   * @link_disable:	See ntb_link_disable().
> > + * @mw_count:		See ntb_mw_count().
> > + * @mw_get_maprsc:	See ntb_mw_get_maprsc().
> > + * @mw_set_trans:	See ntb_mw_set_trans().
> > + * @mw_get_trans:	See ntb_mw_get_trans().
> > + * @mw_get_align:	See ntb_mw_get_align().
> > + * @peer_mw_count:	See ntb_peer_mw_count().
> > + * @peer_mw_set_trans:	See ntb_peer_mw_set_trans().
> > + * @peer_mw_get_trans:	See ntb_peer_mw_get_trans().
> > + * @peer_mw_get_align:	See ntb_peer_mw_get_align().
> >   * @db_is_unsafe:	See ntb_db_is_unsafe().
> >   * @db_valid_mask:	See ntb_db_valid_mask().
> >   * @db_vector_count:	See ntb_db_vector_count().
> > @@ -210,22 +247,38 @@ static inline int ntb_ctx_ops_is_valid(const struct ntb_ctx_ops
> > *ops)
> >   * @peer_spad_addr:	See ntb_peer_spad_addr().
> >   * @peer_spad_read:	See ntb_peer_spad_read().
> >   * @peer_spad_write:	See ntb_peer_spad_write().
> > + * @msg_post:		See ntb_msg_post().
> > + * @msg_size:		See ntb_msg_size().
> >   */
> >  struct ntb_dev_ops {
> > -	int (*mw_count)(struct ntb_dev *ntb);
> > -	int (*mw_get_range)(struct ntb_dev *ntb, int idx,
> > -			    phys_addr_t *base, resource_size_t *size,
> > -			resource_size_t *align, resource_size_t *align_size);
> > -	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> > -			    dma_addr_t addr, resource_size_t size);
> > -	int (*mw_clear_trans)(struct ntb_dev *ntb, int idx);
> > -
> >  	int (*link_is_up)(struct ntb_dev *ntb,
> >  			  enum ntb_speed *speed, enum ntb_width *width);
> >  	int (*link_enable)(struct ntb_dev *ntb,
> >  			   enum ntb_speed max_speed, enum ntb_width max_width);
> >  	int (*link_disable)(struct ntb_dev *ntb);
> > 
> > +	int (*mw_count)(struct ntb_dev *ntb);
> > +	int (*mw_get_maprsc)(struct ntb_dev *ntb, int idx,
> > +			     phys_addr_t *base, resource_size_t *size);
> > +	int (*mw_get_align)(struct ntb_dev *ntb, int idx,
> > +			    resource_size_t *addr_align,
> > +			    resource_size_t *size_align,
> > +			    resource_size_t *size_max);
> > +	int (*mw_set_trans)(struct ntb_dev *ntb, int idx,
> > +			    dma_addr_t addr, resource_size_t size);
> > +	int (*mw_get_trans)(struct ntb_dev *ntb, int idx,
> > +			    dma_addr_t *addr, resource_size_t *size);
> > +
> > +	int (*peer_mw_count)(struct ntb_dev *ntb);
> > +	int (*peer_mw_get_align)(struct ntb_dev *ntb, int idx,
> > +				 resource_size_t *addr_align,
> > +				 resource_size_t *size_align,
> > +				 resource_size_t *size_max);
> > +	int (*peer_mw_set_trans)(struct ntb_dev *ntb, int idx,
> > +				 dma_addr_t addr, resource_size_t size);
> > +	int (*peer_mw_get_trans)(struct ntb_dev *ntb, int idx,
> > +				 dma_addr_t *addr, resource_size_t *size);
> > +
> >  	int (*db_is_unsafe)(struct ntb_dev *ntb);
> >  	u64 (*db_valid_mask)(struct ntb_dev *ntb);
> >  	int (*db_vector_count)(struct ntb_dev *ntb);
> > @@ -259,47 +312,10 @@ struct ntb_dev_ops {
> >  			      phys_addr_t *spad_addr);
> >  	u32 (*peer_spad_read)(struct ntb_dev *ntb, int idx);
> >  	int (*peer_spad_write)(struct ntb_dev *ntb, int idx, u32 val);
> > -};
> > -
> > -static inline int ntb_dev_ops_is_valid(const struct ntb_dev_ops *ops)
> > -{
> > -	/* commented callbacks are not required: */
> > -	return
> > -		ops->mw_count				&&
> > -		ops->mw_get_range			&&
> > -		ops->mw_set_trans			&&
> > -		/* ops->mw_clear_trans			&& */
> > -		ops->link_is_up				&&
> > -		ops->link_enable			&&
> > -		ops->link_disable			&&
> > -		/* ops->db_is_unsafe			&& */
> > -		ops->db_valid_mask			&&
> > 
> > -		/* both set, or both unset */
> > -		(!ops->db_vector_count == !ops->db_vector_mask) &&
> > -
> > -		ops->db_read				&&
> > -		/* ops->db_set				&& */
> > -		ops->db_clear				&&
> > -		/* ops->db_read_mask			&& */
> > -		ops->db_set_mask			&&
> > -		ops->db_clear_mask			&&
> > -		/* ops->peer_db_addr			&& */
> > -		/* ops->peer_db_read			&& */
> > -		ops->peer_db_set			&&
> > -		/* ops->peer_db_clear			&& */
> > -		/* ops->peer_db_read_mask		&& */
> > -		/* ops->peer_db_set_mask		&& */
> > -		/* ops->peer_db_clear_mask		&& */
> > -		/* ops->spad_is_unsafe			&& */
> > -		ops->spad_count				&&
> > -		ops->spad_read				&&
> > -		ops->spad_write				&&
> > -		/* ops->peer_spad_addr			&& */
> > -		/* ops->peer_spad_read			&& */
> > -		ops->peer_spad_write			&&
> > -		1;
> > -}
> > +	int (*msg_post)(struct ntb_dev *ntb, struct ntb_msg *msg);
> > +	int (*msg_size)(struct ntb_dev *ntb);
> > +};
> > 
> >  /**
> >   * struct ntb_client - client interested in ntb devices
> > @@ -310,10 +326,22 @@ struct ntb_client {
> >  	struct device_driver		drv;
> >  	const struct ntb_client_ops	ops;
> >  };
> > -
> >  #define drv_ntb_client(__drv) container_of((__drv), struct ntb_client, drv)
> > 
> >  /**
> > + * struct ntb_bus_data - NTB bus data
> > + * @sync_msk:	Synchronous devices mask
> > + * @async_msk:	Asynchronous devices mask
> > + * @both_msk:	Both sync and async devices mask
> > + */
> > +#define NTB_MAX_DEVID (8*BITS_PER_LONG)
> > +struct ntb_bus_data {
> > +	unsigned long sync_msk[8];
> > +	unsigned long async_msk[8];
> > +	unsigned long both_msk[8];
> > +};
> > +
> > +/**
> >   * struct ntb_device - ntb device
> >   * @dev:		Linux device object.
> >   * @pdev:		Pci device entry of the ntb.
> > @@ -332,15 +360,151 @@ struct ntb_dev {
> > 
> >  	/* private: */
> > 
> > +	/* device id */
> > +	int id;
> >  	/* synchronize setting, clearing, and calling ctx_ops */
> >  	spinlock_t			ctx_lock;
> >  	/* block unregister until device is fully released */
> >  	struct completion		released;
> >  };
> > -
> >  #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
> > 
> >  /**
> > + * ntb_valid_sync_dev_ops() - valid operations for synchronous hardware setup
> > + * @ntb:	NTB device
> > + *
> > + * There might be two types of NTB hardware differed by the way of the settings
> > + * configuration. The synchronous chips allows to set the memory windows by
> > + * directly writing to the peer registers. Additionally there can be shared
> > + * Scratchpad registers for synchronous information exchange. Client drivers
> > + * should call this function to make sure the hardware supports the proper
> > + * functionality.
> > + */
> > +static inline int ntb_valid_sync_dev_ops(const struct ntb_dev *ntb)
> > +{
> > +	const struct ntb_dev_ops *ops = ntb->ops;
> > +
> > +	/* Commented callbacks are not required, but might be developed */
> > +	return	/* NTB link status ops */
> > +		ops->link_is_up					&&
> > +		ops->link_enable				&&
> > +		ops->link_disable				&&
> > +
> > +		/* Synchronous memory windows ops */
> > +		ops->mw_count					&&
> > +		ops->mw_get_maprsc				&&
> > +		/* ops->mw_get_align				&& */
> > +		/* ops->mw_set_trans				&& */
> > +		/* ops->mw_get_trans				&& */
> > +		ops->peer_mw_count				&&
> > +		ops->peer_mw_get_align				&&
> > +		ops->peer_mw_set_trans				&&
> > +		/* ops->peer_mw_get_trans			&& */
> > +
> > +		/* Doorbell ops */
> > +		/* ops->db_is_unsafe				&& */
> > +		ops->db_valid_mask				&&
> > +		/* both set, or both unset */
> > +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> > +		ops->db_read					&&
> > +		/* ops->db_set					&& */
> > +		ops->db_clear					&&
> > +		/* ops->db_read_mask				&& */
> > +		ops->db_set_mask				&&
> > +		ops->db_clear_mask				&&
> > +		/* ops->peer_db_addr				&& */
> > +		/* ops->peer_db_read				&& */
> > +		ops->peer_db_set				&&
> > +		/* ops->peer_db_clear				&& */
> > +		/* ops->peer_db_read_mask			&& */
> > +		/* ops->peer_db_set_mask			&& */
> > +		/* ops->peer_db_clear_mask			&& */
> > +
> > +		/* Scratchpad ops */
> > +		/* ops->spad_is_unsafe				&& */
> > +		ops->spad_count					&&
> > +		ops->spad_read					&&
> > +		ops->spad_write					&&
> > +		/* ops->peer_spad_addr				&& */
> > +		/* ops->peer_spad_read				&& */
> > +		ops->peer_spad_write				&&
> > +
> > +		/* Messages IO ops */
> > +		/* ops->msg_post				&& */
> > +		/* ops->msg_size				&& */
> > +		1;
> > +}
> > +
> > +/**
> > + * ntb_valid_async_dev_ops() - valid operations for asynchronous hardware setup
> > + * @ntb:	NTB device
> > + *
> > + * There might be two types of NTB hardware differed by the way of the settings
> > + * configuration. The asynchronous chips does not allow to set the memory
> > + * windows by directly writing to the peer registers. Instead it implements
> > + * the additional method to communicate between NTB nodes like messages.
> > + * Scratchpad registers aren't likely supported by such hardware. Client
> > + * drivers should call this function to make sure the hardware supports
> > + * the proper functionality.
> > + */
> > +static inline int ntb_valid_async_dev_ops(const struct ntb_dev *ntb)
> > +{
> > +	const struct ntb_dev_ops *ops = ntb->ops;
> > +
> > +	/* Commented callbacks are not required, but might be developed */
> > +	return	/* NTB link status ops */
> > +		ops->link_is_up					&&
> > +		ops->link_enable				&&
> > +		ops->link_disable				&&
> > +
> > +		/* Asynchronous memory windows ops */
> > +		ops->mw_count					&&
> > +		ops->mw_get_maprsc				&&
> > +		ops->mw_get_align				&&
> > +		ops->mw_set_trans				&&
> > +		/* ops->mw_get_trans				&& */
> > +		ops->peer_mw_count				&&
> > +		ops->peer_mw_get_align				&&
> > +		/* ops->peer_mw_set_trans			&& */
> > +		/* ops->peer_mw_get_trans			&& */
> > +
> > +		/* Doorbell ops */
> > +		/* ops->db_is_unsafe				&& */
> > +		ops->db_valid_mask				&&
> > +		/* both set, or both unset */
> > +		(!ops->db_vector_count == !ops->db_vector_mask)	&&
> > +		ops->db_read					&&
> > +		/* ops->db_set					&& */
> > +		ops->db_clear					&&
> > +		/* ops->db_read_mask				&& */
> > +		ops->db_set_mask				&&
> > +		ops->db_clear_mask				&&
> > +		/* ops->peer_db_addr				&& */
> > +		/* ops->peer_db_read				&& */
> > +		ops->peer_db_set				&&
> > +		/* ops->peer_db_clear				&& */
> > +		/* ops->peer_db_read_mask			&& */
> > +		/* ops->peer_db_set_mask			&& */
> > +		/* ops->peer_db_clear_mask			&& */
> > +
> > +		/* Scratchpad ops */
> > +		/* ops->spad_is_unsafe				&& */
> > +		/* ops->spad_count				&& */
> > +		/* ops->spad_read				&& */
> > +		/* ops->spad_write				&& */
> > +		/* ops->peer_spad_addr				&& */
> > +		/* ops->peer_spad_read				&& */
> > +		/* ops->peer_spad_write				&& */
> > +
> > +		/* Messages IO ops */
> > +		ops->msg_post					&&
> > +		ops->msg_size					&&
> > +		1;
> > +}
> 
> I understand why IDT requires a different api for dealing with addressing multiple peers.  I would be interested in a solution that would allow, for example, the Intel driver fit under the api for dealing with multiple peers, even though it only supports one peer.  I would rather see that, than two separate apis under ntb.
> 
> Thoughts?
> 
> Can the sync api be described by some subset of the async api?  Are there less overloaded terms we can use instead of sync/async?
> 

The answer to this concern is mostly provided in the introduction as well. I'll repeat it here in detail. As I said, AMD and Intel hardware support the asynchronous API except for the messaging. Additionally I can even think of emulating messaging using Doorbells and Scratchpads, but not the other way around. Why not? Before answering, here is how the messaging works in IDT switches of both the first and second groups (see the introduction for the description of the groups).

There are four outbound and four inbound message registers for each NTB port in the device. The local root complex can connect any of its outbound message registers to any inbound message register of the IDT switch. When data is written to an outbound message register, it immediately reaches the connected inbound message register. The peer can then read its inbound message register and empty it by clearing the corresponding bit. Then and only then can the next data be written to any outbound message register connected to that inbound message register. So the possible race condition between multiple domains sending a message to the same peer is resolved by the IDT switch itself.
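
The rule above can be modelled with a toy piece of plain C. This is only a sketch of the semantics, not IDT driver code: struct inmsg_reg, outmsg_write() and inmsg_read() are hypothetical names, and returning -1 for a rejected write is my assumption about how a failure could be reported to software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one inbound message register (hypothetical names). */
struct inmsg_reg {
	uint32_t data;
	bool full;	/* status bit: set by a write, cleared by the reader */
};

/* Sender side: 0 on success, -1 while the previous value is still unread. */
static int outmsg_write(struct inmsg_reg *reg, uint32_t val)
{
	assert(reg);

	if (reg->full)
		return -1;

	reg->data = val;
	reg->full = true;
	return 0;
}

/* Receiver side: fetch the value and clear the status bit. */
static int inmsg_read(struct inmsg_reg *reg, uint32_t *val)
{
	assert(reg && val);

	if (!reg->full)
		return -1;

	*val = reg->data;
	reg->full = false;
	return 0;
}
```

Two senders connected to the same inbound register cannot overwrite each other: the second outmsg_write() fails until the receiver has called inmsg_read().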

One would ask: "Why don't you just wrap the message registers back to the same port? It would look just like Scratchpads." Yes, it would. But there are still only four message registers. That's not enough to distribute them between all the possibly connected NTB ports. As I said earlier, there can be up to eight domains connected, so there must be at least seven message registers to cover that design.

However, all the emulations would look ugly anyway. In my opinion it's better to slightly adapt the design to the hardware, rather than the hardware to the design. Following that rule would simplify both the code and its support.

Regarding the API subset. As I said before, the async API is a kind of subset of the synchronous API. We can develop all the memory window related callback methods for the AMD and Intel hardware drivers, which is pretty easy. We can even simulate message registers using Doorbells and Scratchpads, which is not that easy, but possible. Alas, the second group of IDT switches can't implement the synchronous API, as I already said in the introduction.

Regarding the overloaded naming. The "sync/async" names are the best I could think of. If you have any idea how they can be appropriately changed, be my guest. I would be really glad to substitute them with something better.

> > +
> > +
> > +
> > +/**
> >   * ntb_register_client() - register a client for interest in ntb devices
> >   * @client:	Client context.
> >   *
> > @@ -441,10 +605,84 @@ void ntb_link_event(struct ntb_dev *ntb);
> >  void ntb_db_event(struct ntb_dev *ntb, int vector);
> > 
> >  /**
> > - * ntb_mw_count() - get the number of memory windows
> > + * ntb_msg_event() - notify driver context of event in messaging subsystem
> >   * @ntb:	NTB device context.
> > + * @ev:		Event type caused the handler invocation
> > + * @msg:	Message related to the event
> > + *
> > + * Notify the driver context that some event has happened in the messaging
> > + * subsystem. If NTB_MSG_NEW is emitted then a new message has just arrived.
> > + * NTB_MSG_SENT is raised if some message has just been successfully sent to a
> > + * peer. If a message failed to be sent then NTB_MSG_FAIL is emitted. The very
> > + * last argument is used to pass the event related message. It is discarded
> > + * right after the handler returns.
> > + */
> > +void ntb_msg_event(struct ntb_dev *ntb, enum NTB_MSG_EVENT ev,
> > +		   struct ntb_msg *msg);
> 
> I would prefer to see a notify-and-poll api (like NAPI).  This will allow scheduling of the message handling to be done more appropriately at a higher layer of the application.  I am concerned to see inmsg/outmsg_work in the new hardware driver [PATCH 2/3], which I think would be more appropriate for a ntb transport (or higher layer) driver.
> 

Hmmm, that's how it's done.) An MSI interrupt is raised when a new message arrives in the first inbound message register (the rest of the message registers are used as additional data buffers). Then a corresponding tasklet is scheduled to leave the hardware interrupt context. That tasklet extracts the message from the inbound message registers, puts it into the driver's inbound message queue and marks the registers as empty so the next message can be received. Then the tasklet starts a corresponding kernel work thread, which delivers all new messages to a client driver that has previously registered the "ntb_msg_event()" callback method. When the "ntb_msg_event()" callback returns, the passed message is discarded.
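
The two stages of that receive path can be sketched in plain user-space C. This is a simplified model, not the code from PATCH 2/3: struct inq, inq_push(), inq_deliver() and the queue depth are hypothetical, and real tasklets, work threads and locking are left out. Only the message size of four u32 words is taken from NTB_MAX_MSGSIZE in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_WORDS 4	/* same as NTB_MAX_MSGSIZE in the patch */
#define QUEUE_LEN 8	/* assumed queue depth */

struct msg {
	uint32_t data[MSG_WORDS];
};

struct inq {
	struct msg slots[QUEUE_LEN];
	size_t head;	/* next message to deliver */
	size_t tail;	/* next free slot */
};

typedef void (*msg_cb)(void *ctx, const struct msg *m);

/* "Tasklet" stage: copy the registers to the queue, then mark them empty. */
static int inq_push(struct inq *q, const uint32_t regs[MSG_WORDS],
		    int *regs_full)
{
	size_t i;

	assert(q && regs && regs_full);

	if (q->tail - q->head == QUEUE_LEN)
		return -1;	/* queue is full, registers stay occupied */

	for (i = 0; i < MSG_WORDS; i++)
		q->slots[q->tail % QUEUE_LEN].data[i] = regs[i];
	q->tail++;
	*regs_full = 0;
	return 0;
}

/* "Work thread" stage: hand each queued message to the client, then drop it. */
static size_t inq_deliver(struct inq *q, msg_cb cb, void *ctx)
{
	size_t n = 0;

	assert(q && cb);

	while (q->head != q->tail) {
		cb(ctx, &q->slots[q->head % QUEUE_LEN]);
		q->head++;	/* the message is discarded after the callback */
		n++;
	}
	return n;
}

/* Example client handler: accumulate the first word of every message. */
static void sum_first_word(void *ctx, const struct msg *m)
{
	*(uint32_t *)ctx += m->data[0];
}
```

The point of the split is that the registers are freed as soon as the message is copied, so the peer can post the next message while the client callback is still running.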

Description of how messages are sent to a peer is provided below in the corresponding commentary.

> > +
> > +/**
> > + * ntb_link_is_up() - get the current ntb link state
> > + * @ntb:	NTB device context.
> > + * @speed:	OUT - The link speed expressed as PCIe generation number.
> > + * @width:	OUT - The link width expressed as the number of PCIe lanes.
> > + *
> > + * Get the current state of the ntb link.  It is recommended to query the link
> > + * state once after every link event.  It is safe to query the link state in
> > + * the context of the link event callback.
> > + *
> > + * Return: One if the link is up, zero if the link is down, otherwise a
> > + *		negative value indicating the error number.
> > + */
> > +static inline int ntb_link_is_up(struct ntb_dev *ntb,
> > +				 enum ntb_speed *speed, enum ntb_width *width)
> > +{
> > +	return ntb->ops->link_is_up(ntb, speed, width);
> > +}
> > +
> 
> It looks like there was some rearranging of code, so big hunks appear to be added or removed.  Can you split this into two (or more) patches so that rearranging the code is distinct from more interesting changes?
> 

Let's say there was not much rearranging here. I've just put the link-related methods before everything else. The rearranging was done by order of method importance: no memory sharing or doorbell operations can be done before the link is established. The new arrangement is reflected in the ntb_valid_sync_dev_ops()/ntb_valid_async_dev_ops() methods.

> > +/**
> > + * ntb_link_enable() - enable the link on the secondary side of the ntb
> > + * @ntb:	NTB device context.
> > + * @max_speed:	The maximum link speed expressed as PCIe generation number.
> > + * @max_width:	The maximum link width expressed as the number of PCIe lanes.
> >   *
> > - * Hardware and topology may support a different number of memory windows.
> > + * Enable the link on the secondary side of the ntb.  This can only be done
> > + * from only one (primary or secondary) side of the ntb in primary or b2b
> > + * topology.  The ntb device should train the link to its maximum speed and
> > + * width, or the requested speed and width, whichever is smaller, if supported.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_link_enable(struct ntb_dev *ntb,
> > +				  enum ntb_speed max_speed,
> > +				  enum ntb_width max_width)
> > +{
> > +	return ntb->ops->link_enable(ntb, max_speed, max_width);
> > +}
> > +
> > +/**
> > + * ntb_link_disable() - disable the link on the secondary side of the ntb
> > + * @ntb:	NTB device context.
> > + *
> > + * Disable the link on the secondary side of the ntb.  This can only be
> > + * done from only one (primary or secondary) side of the ntb in primary or b2b
> > + * topology.  The ntb device should disable the link.  Returning from this call
> > + * must indicate that a barrier has passed, though with no more writes may pass
> > + * in either direction across the link, except if this call returns an error
> > + * number.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_link_disable(struct ntb_dev *ntb)
> > +{
> > +	return ntb->ops->link_disable(ntb);
> > +}
> > +
> > +/**
> > + * ntb_mw_count() - get the number of local memory windows
> > + * @ntb:	NTB device context.
> > + *
> > + * Hardware and topology may support a different number of memory windows at
> > + * local and remote devices
> >   *
> >   * Return: the number of memory windows.
> >   */
> > @@ -454,122 +692,186 @@ static inline int ntb_mw_count(struct ntb_dev *ntb)
> >  }
> > 
> >  /**
> > - * ntb_mw_get_range() - get the range of a memory window
> > + * ntb_mw_get_maprsc() - get the range of a memory window to map
> 
> What was insufficient about ntb_mw_get_range() that it needed to be split into ntb_mw_get_maprsc() and ntb_mw_get_align()?  In all the places that I found in this patch, it seems ntb_mw_get_range() would have been more simple.
> 
> I didn't see any use of ntb_mw_get_mapsrc() in the new async test clients [PATCH 3/3].  So, there is no example of how usage of new api would be used differently or more efficiently than ntb_mw_get_range() for async devices.
> 

This concern is answered a bit earlier, where you first commented on the splitting of the "ntb_mw_get_range()" method.

You could not find any usage of "ntb_mw_get_mapsrc()" because the name is misspelled. The real method name is "ntb_mw_get_maprsc()" (look more carefully at the name ending), which stands for "Mapping Resources", not "Mapping Source". The ntb/test/ntb_mw_test.c driver was developed to demonstrate how the new asynchronous API is used, including the "ntb_mw_get_maprsc()" method.

> >   * @ntb:	NTB device context.
> >   * @idx:	Memory window number.
> >   * @base:	OUT - the base address for mapping the memory window
> >   * @size:	OUT - the size for mapping the memory window
> > - * @align:	OUT - the base alignment for translating the memory window
> > - * @align_size:	OUT - the size alignment for translating the memory window
> >   *
> > - * Get the range of a memory window.  NULL may be given for any output
> > - * parameter if the value is not needed.  The base and size may be used for
> > - * mapping the memory window, to access the peer memory.  The alignment and
> > - * size may be used for translating the memory window, for the peer to access
> > - * memory on the local system.
> > + * Get the map range of a memory window. The base and size may be used for
> > + * mapping the memory window to access the peer memory.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_mw_get_range(struct ntb_dev *ntb, int idx,
> > -				   phys_addr_t *base, resource_size_t *size,
> > -		resource_size_t *align, resource_size_t *align_size)
> > +static inline int ntb_mw_get_maprsc(struct ntb_dev *ntb, int idx,
> > +				    phys_addr_t *base, resource_size_t *size)
> >  {
> > -	return ntb->ops->mw_get_range(ntb, idx, base, size,
> > -			align, align_size);
> > +	return ntb->ops->mw_get_maprsc(ntb, idx, base, size);
> > +}
> > +
> > +/**
> > + * ntb_mw_get_align() - get memory window alignment of the local node
> > + * @ntb:	NTB device context.
> > + * @idx:	Memory window number.
> > + * @addr_align:	OUT - the translated base address alignment of the memory window
> > + * @size_align:	OUT - the translated memory size alignment of the memory window
> > + * @size_max:	OUT - the translated memory maximum size
> > + *
> > + * Get the alignment parameters to allocate the proper memory window. NULL may
> > + * be given for any output parameter if the value is not needed.
> > + *
> > + * Drivers of synchronous hardware don't have to support it.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_mw_get_align(struct ntb_dev *ntb, int idx,
> > +				   resource_size_t *addr_align,
> > +				   resource_size_t *size_align,
> > +				   resource_size_t *size_max)
> > +{
> > +	if (!ntb->ops->mw_get_align)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->mw_get_align(ntb, idx, addr_align, size_align, size_max);
> >  }
> > 
> >  /**
> > - * ntb_mw_set_trans() - set the translation of a memory window
> > + * ntb_mw_set_trans() - set the translated base address of a peer memory window
> >   * @ntb:	NTB device context.
> >   * @idx:	Memory window number.
> > - * @addr:	The dma address local memory to expose to the peer.
> > - * @size:	The size of the local memory to expose to the peer.
> > + * @addr:	DMA memory address exposed by the peer.
> > + * @size:	Size of the memory exposed by the peer.
> > + *
> > + * Set the translated base address of a memory window. The peer preliminary
> > + * allocates a memory, then someway passes the address to the remote node, that
> > + * finally sets up the memory window at the address, up to the size. The address
> > + * and size must be aligned to the parameters specified by ntb_mw_get_align() of
> > + * the local node and ntb_peer_mw_get_align() of the peer, which must return the
> > + * same values. Zero size effectively disables the memory window.
> >   *
> > - * Set the translation of a memory window.  The peer may access local memory
> > - * through the window starting at the address, up to the size.  The address
> > - * must be aligned to the alignment specified by ntb_mw_get_range().  The size
> > - * must be aligned to the size alignment specified by ntb_mw_get_range().
> > + * Drivers of synchronous hardware don't have to support it.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_mw_set_trans(struct ntb_dev *ntb, int idx,
> >  				   dma_addr_t addr, resource_size_t size)
> >  {
> > +	if (!ntb->ops->mw_set_trans)
> > +		return -EINVAL;
> > +
> >  	return ntb->ops->mw_set_trans(ntb, idx, addr, size);
> >  }
> > 
> >  /**
> > - * ntb_mw_clear_trans() - clear the translation of a memory window
> > + * ntb_mw_get_trans() - get the translated base address of a memory window
> >   * @ntb:	NTB device context.
> >   * @idx:	Memory window number.
> > + * @addr:	The dma memory address exposed by the peer.
> > + * @size:	The size of the memory exposed by the peer.
> >   *
> > - * Clear the translation of a memory window.  The peer may no longer access
> > - * local memory through the window.
> > + * Get the translated base address of a memory window specified for the local
> > + * hardware and allocated by the peer. If the addr and size are zero, the
> > + * memory window is effectively disabled.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_mw_clear_trans(struct ntb_dev *ntb, int idx)
> > +static inline int ntb_mw_get_trans(struct ntb_dev *ntb, int idx,
> > +				   dma_addr_t *addr, resource_size_t *size)
> >  {
> > -	if (!ntb->ops->mw_clear_trans)
> > -		return ntb->ops->mw_set_trans(ntb, idx, 0, 0);
> > +	if (!ntb->ops->mw_get_trans)
> > +		return -EINVAL;
> > 
> > -	return ntb->ops->mw_clear_trans(ntb, idx);
> > +	return ntb->ops->mw_get_trans(ntb, idx, addr, size);
> >  }
> > 
> >  /**
> > - * ntb_link_is_up() - get the current ntb link state
> > + * ntb_peer_mw_count() - get the number of peer memory windows
> >   * @ntb:	NTB device context.
> > - * @speed:	OUT - The link speed expressed as PCIe generation number.
> > - * @width:	OUT - The link width expressed as the number of PCIe lanes.
> >   *
> > - * Get the current state of the ntb link.  It is recommended to query the link
> > - * state once after every link event.  It is safe to query the link state in
> > - * the context of the link event callback.
> > + * Hardware and topology may support a different number of memory windows at
> > + * local and remote nodes.
> >   *
> > - * Return: One if the link is up, zero if the link is down, otherwise a
> > - *		negative value indicating the error number.
> > + * Return: the number of memory windows.
> >   */
> > -static inline int ntb_link_is_up(struct ntb_dev *ntb,
> > -				 enum ntb_speed *speed, enum ntb_width *width)
> > +static inline int ntb_peer_mw_count(struct ntb_dev *ntb)
> >  {
> > -	return ntb->ops->link_is_up(ntb, speed, width);
> > +	return ntb->ops->peer_mw_count(ntb);
> >  }
> > 
> >  /**
> > - * ntb_link_enable() - enable the link on the secondary side of the ntb
> > + * ntb_peer_mw_get_align() - get memory window alignment of the peer
> >   * @ntb:	NTB device context.
> > - * @max_speed:	The maximum link speed expressed as PCIe generation number.
> > - * @max_width:	The maximum link width expressed as the number of PCIe lanes.
> > + * @idx:	Memory window number.
> > + * @addr_align:	OUT - the translated base address alignment of the memory window
> > + * @size_align:	OUT - the translated memory size alignment of the memory window
> > + * @size_max:	OUT - the translated memory maximum size
> >   *
> > - * Enable the link on the secondary side of the ntb.  This can only be done
> > - * from the primary side of the ntb in primary or b2b topology.  The ntb device
> > - * should train the link to its maximum speed and width, or the requested speed
> > - * and width, whichever is smaller, if supported.
> > + * Get the alignment parameters to allocate the proper memory window for the
> > + * peer. NULL may be given for any output parameter if the value is not needed.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_link_enable(struct ntb_dev *ntb,
> > -				  enum ntb_speed max_speed,
> > -				  enum ntb_width max_width)
> > +static inline int ntb_peer_mw_get_align(struct ntb_dev *ntb, int idx,
> > +					resource_size_t *addr_align,
> > +					resource_size_t *size_align,
> > +					resource_size_t *size_max)
> >  {
> > -	return ntb->ops->link_enable(ntb, max_speed, max_width);
> > +	if (!ntb->ops->peer_mw_get_align)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->peer_mw_get_align(ntb, idx, addr_align, size_align,
> > +					   size_max);
> >  }
> > 
> >  /**
> > - * ntb_link_disable() - disable the link on the secondary side of the ntb
> > + * ntb_peer_mw_set_trans() - set the translated base address of a peer
> > + *			     memory window
> >   * @ntb:	NTB device context.
> > + * @idx:	Memory window number.
> > + * @addr:	Local DMA memory address exposed to the peer.
> > + * @size:	Size of the memory exposed to the peer.
> >   *
> > - * Disable the link on the secondary side of the ntb.  This can only be
> > - * done from the primary side of the ntb in primary or b2b topology.  The ntb
> > - * device should disable the link.  Returning from this call must indicate that
> > - * a barrier has passed, though with no more writes may pass in either
> > - * direction across the link, except if this call returns an error number.
> > + * Set the translated base address of a memory window exposed to the peer.
> > + * The local node preliminary allocates the window, then directly writes the
> 
> I think ntb_peer_mw_set_trans() and ntb_mw_set_trans() are backwards.  Does the following make sense, or have I completely misunderstood something?
> 
> ntb_mw_set_trans(): set up translation so that incoming writes to the memory window are translated to the local memory destination.
> 
> ntb_peer_mw_set_trans(): set up (what exactly?) so that outgoing writes to a peer memory window (is this something that needs to be configured on the local ntb?) are translated to the peer ntb (i.e. their port/bridge) memory window.  Then, the peer's setting of ntb_mw_set_trans() will complete the translation to the peer memory destination.
> 

These functions actually do the opposite of what you described:

ntb_mw_set_trans() - the method sets the translated base address retrieved from a peer, so that outgoing writes to a memory window are translated and reach the peer memory destination.

ntb_peer_mw_set_trans() - the method writes the translated base address to the peer configuration space, so that writes issued by the peer are correctly translated on the peer side and reach the local memory destination.

Globally speaking, these methods do the same thing when they are called from opposite domains. That is, a locally called "ntb_mw_set_trans()" does the same thing as "ntb_peer_mw_set_trans()" called from a peer, and vice versa: a locally called "ntb_peer_mw_set_trans()" performs the same procedure as "ntb_mw_set_trans()" called from a peer.

To make things simpler, think of memory windows in terms of the following definition: "A Memory Window is a virtual memory region which locally reflects the physical memory of a peer/remote device." So when we call ntb_mw_set_trans(), we initialize the local memory window, so that the locally mapped virtual addresses are connected to the peer physical memory. When we call ntb_peer_mw_set_trans(), we initialize a peer/remote virtual memory region, so that the peer can successfully perform writes to our local physical memory.

Of course, all actual memory read/write operations must be preceded by the ntb_mw_get_maprsc() and ioremap_nocache() pair of calls. You do the same thing in the client test drivers for AMD and Intel hardware.
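
To illustrate the symmetry described above, here is a minimal user-space sketch. It is only a toy model, not the driver code: struct toy_node and all toy_* helpers are hypothetical names made up for this example, and "addresses" are just offsets into a byte array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE 64

/* Toy node: some "physical" memory plus one outbound memory window. */
struct toy_node {
	uint8_t mem[TOY_MEM_SIZE];	/* local physical memory */
	struct toy_node *peer;		/* the other side of the bridge */
	size_t mw_base;			/* translation: offset inside peer->mem */
	size_t mw_size;			/* zero size means the window is disabled */
};

/* Connect two zeroed nodes to each other. */
static void toy_connect(struct toy_node *a, struct toy_node *b)
{
	memset(a, 0, sizeof(*a));
	memset(b, 0, sizeof(*b));
	a->peer = b;
	b->peer = a;
}

/* Like ntb_mw_set_trans(): program our own window to point at peer memory. */
static int toy_mw_set_trans(struct toy_node *n, size_t addr, size_t size)
{
	if (addr > TOY_MEM_SIZE || size > TOY_MEM_SIZE - addr)
		return -1;
	n->mw_base = addr;
	n->mw_size = size;
	return 0;
}

/* Like ntb_peer_mw_set_trans(): program the peer's window to point at us. */
static int toy_peer_mw_set_trans(struct toy_node *n, size_t addr, size_t size)
{
	return toy_mw_set_trans(n->peer, addr, size);
}

/* Write through the local window; the data lands in the peer's memory. */
static int toy_mw_write(struct toy_node *n, size_t off,
			const void *buf, size_t len)
{
	if (off > n->mw_size || len > n->mw_size - off)
		return -1;
	memcpy(&n->peer->mem[n->mw_base + off], buf, len);
	return 0;
}
```

In this model toy_peer_mw_set_trans() called on node A is literally toy_mw_set_trans() called on node B, which is the "same thing from opposite domains" point made above.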

> > + * address and size to the peer control registers. The address and size must
> > + * be aligned to the parameters specified by ntb_peer_mw_get_align() of
> > + * the local node and ntb_mw_get_align() of the peer, which must return the
> > + * same values. Zero size effectively disables the memory window.
> > + *
> > + * Drivers of synchronous hardware must support it.
> >   *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> > -static inline int ntb_link_disable(struct ntb_dev *ntb)
> > +static inline int ntb_peer_mw_set_trans(struct ntb_dev *ntb, int idx,
> > +					dma_addr_t addr, resource_size_t size)
> >  {
> > -	return ntb->ops->link_disable(ntb);
> > +	if (!ntb->ops->peer_mw_set_trans)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->peer_mw_set_trans(ntb, idx, addr, size);
> > +}
> > +
> > +/**
> > + * ntb_peer_mw_get_trans() - get the translated base address of a peer
> > + *			     memory window
> > + * @ntb:	NTB device context.
> > + * @idx:	Memory window number.
> > + * @addr:	Local dma memory address exposed to the peer.
> > + * @size:	Size of the memory exposed to the peer.
> > + *
> > + * Get the translated base address of a memory window spicified for the peer
> > + * hardware. If the addr and size are zero then the memory window is effectively
> > + * disabled.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_peer_mw_get_trans(struct ntb_dev *ntb, int idx,
> > +					dma_addr_t *addr, resource_size_t *size)
> > +{
> > +	if (!ntb->ops->peer_mw_get_trans)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->peer_mw_get_trans(ntb, idx, addr, size);
> >  }
> > 
> >  /**
> > @@ -751,6 +1053,8 @@ static inline int ntb_db_clear_mask(struct ntb_dev *ntb, u64 db_bits)
> >   * append one additional dma memory copy with the doorbell register as the
> >   * destination, after the memory copy operations.
> >   *
> > + * This is unusual, and hardware may not be suitable to implement it.
> > + *
> 
> Why is this unusual?  Do you mean async hardware may not support it?
> 

Of course I can always return the address of a Doorbell register, but it is not safe to do so with the IDT NTB hardware driver. To put it simply, consider IDT hardware that supports Doorbell bit routing. The local inbound Doorbell bits of each port can be configured either to reflect the state of the global switch doorbell bits or not. Global doorbell bits are set through the outbound doorbell register, which exists for every NTB port. The Primary port is the port that can have access to multiple peers, so the Primary port inbound and outbound doorbell registers are shared between several NTB devices sitting on the Linux kernel NTB bus. As you understand, these devices should not interfere with each other, which can happen if the Doorbell register addresses are used without control. That's why the "ntb_peer_db_addr()" method should not be implemented in the IDT NTB hardware driver.

> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_peer_db_addr(struct ntb_dev *ntb,
> > @@ -901,10 +1205,15 @@ static inline int ntb_spad_is_unsafe(struct ntb_dev *ntb)
> >   *
> >   * Hardware and topology may support a different number of scratchpads.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: the number of scratchpads.
> >   */
> >  static inline int ntb_spad_count(struct ntb_dev *ntb)
> >  {
> > +	if (!ntb->ops->spad_count)
> > +		return -EINVAL;
> > +
> 
> Maybe we should return zero (i.e. there are no scratchpads).
> 

Agreed. I will fix it in the next patchset.
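
Just to show the pattern I mean, here is a minimal sketch of optional callbacks with a safe fallback. The toy_dev and toy_ops structures are hypothetical, cut-down stand-ins for this example only, not the real struct ntb_dev and struct ntb_dev_ops.

```c
#include <errno.h>
#include <stddef.h>

struct toy_dev;

/* Hypothetical cut-down ops table; a NULL pointer means "not supported". */
struct toy_ops {
	int (*spad_count)(struct toy_dev *dev);
	int (*spad_write)(struct toy_dev *dev, int idx, unsigned int val);
};

struct toy_dev {
	const struct toy_ops *ops;
};

/* No callback means the hardware has no scratchpads, so report zero. */
static inline int toy_spad_count(struct toy_dev *dev)
{
	if (!dev->ops->spad_count)
		return 0;

	return dev->ops->spad_count(dev);
}

/* An operation that cannot be performed still reports an error. */
static inline int toy_spad_write(struct toy_dev *dev, int idx,
				 unsigned int val)
{
	if (!dev->ops->spad_write)
		return -EINVAL;

	return dev->ops->spad_write(dev, idx, val);
}
```

This way a client can loop over toy_spad_count() scratchpads without special-casing hardware that has none.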

> >  	return ntb->ops->spad_count(ntb);
> >  }
> > 
> > @@ -915,10 +1224,15 @@ static inline int ntb_spad_count(struct ntb_dev *ntb)
> >   *
> >   * Read the local scratchpad register, and return the value.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: The value of the local scratchpad register.
> >   */
> >  static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
> >  {
> > +	if (!ntb->ops->spad_read)
> > +		return 0;
> > +
> 
> Let's return ~0.  I think that's what a driver would read from the pci bus for a memory miss. 
> 

Agreed. I will make it return -EINVAL in the next patchset.

> >  	return ntb->ops->spad_read(ntb, idx);
> >  }
> > 
> > @@ -930,10 +1244,15 @@ static inline u32 ntb_spad_read(struct ntb_dev *ntb, int idx)
> >   *
> >   * Write the value to the local scratchpad register.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
> >  {
> > +	if (!ntb->ops->spad_write)
> > +		return -EINVAL;
> > +
> >  	return ntb->ops->spad_write(ntb, idx, val);
> >  }
> > 
> > @@ -946,6 +1265,8 @@ static inline int ntb_spad_write(struct ntb_dev *ntb, int idx, u32 val)
> >   * Return the address of the peer doorbell register.  This may be used, for
> >   * example, by drivers that offload memory copy operations to a dma engine.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
> > @@ -964,10 +1285,15 @@ static inline int ntb_peer_spad_addr(struct ntb_dev *ntb, int idx,
> >   *
> >   * Read the peer scratchpad register, and return the value.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: The value of the local scratchpad register.
> >   */
> >  static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
> >  {
> > +	if (!ntb->ops->peer_spad_read)
> > +		return 0;
> 
> Also, ~0?
> 

Agreed. I will make it return -EINVAL in the next patchset.

> > +
> >  	return ntb->ops->peer_spad_read(ntb, idx);
> >  }
> > 
> > @@ -979,11 +1305,59 @@ static inline u32 ntb_peer_spad_read(struct ntb_dev *ntb, int idx)
> >   *
> >   * Write the value to the peer scratchpad register.
> >   *
> > + * Asynchronous hardware may not support it.
> > + *
> >   * Return: Zero on success, otherwise an error number.
> >   */
> >  static inline int ntb_peer_spad_write(struct ntb_dev *ntb, int idx, u32 val)
> >  {
> > +	if (!ntb->ops->peer_spad_write)
> > +		return -EINVAL;
> > +
> >  	return ntb->ops->peer_spad_write(ntb, idx, val);
> >  }
> > 
> > +/**
> > + * ntb_msg_post() - post the message to the peer
> > + * @ntb:	NTB device context.
> > + * @msg:	Message
> > + *
> > + * Post the message to a peer. It shall be delivered to the peer by the
> > + * corresponding hardware method. The peer should be notified about the new
> > + * message by calling the ntb_msg_event() handler of NTB_MSG_NEW event type.
> > + * If delivery is fails for some reasong the local node will get NTB_MSG_FAIL
> > + * event. Otherwise the NTB_MSG_SENT is emitted.
> 
> Interesting.. local driver would be notified about completion (success or failure) of delivery.  Is there any order-of-completion guarantee for the completion notifications?  Is there some tolerance for faults, in case we never get a completion notification from the peer (eg. we lose the link)?  If we lose the link, report a local fault, and the link comes up again, can we still get a completion notification from the peer, and how would that be handled?
> 
> Does delivery mean the application has processed the message, or is it just delivery at the hardware layer, or just delivery at the ntb hardware driver layer?
> 

Let me explain how the message delivery works. When a client driver calls the "ntb_msg_post()" method, the corresponding message is placed in an outbound message queue. Such a message queue exists for every peer device. Then a dedicated kernel work thread is started to send all the messages from the queue. If the kernel thread fails to send a message (for instance, if the peer IDT NTB hardware driver still has not freed its inbound message registers), it makes a new attempt after a small timeout. If the kernel thread still fails to deliver the message after a preconfigured number of attempts, it invokes the ntb_msg_event() callback with the NTB_MSG_FAIL event. If the message is successfully delivered, ntb_msg_event() is called with the NTB_MSG_SENT event.

To be clear, the messages are not transferred directly to the peer memory. Instead they are placed in the IDT NTB switch registers, then the peer is notified that a new message has arrived at the corresponding message registers, and the corresponding interrupt handler is called.

If we lose the PCI Express or NTB link between the IDT switch and a peer, then the ntb_msg_event() method is called with the NTB_MSG_FAIL event.
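
The post/retry/notify flow above can be sketched roughly as follows. This is a simplified single-threaded user-space model, not the driver code: all toy_* names are hypothetical, TOY_MAX_TRIES is an assumed value, and the timeout between attempts is omitted.

```c
#include <stddef.h>

enum toy_msg_event { TOY_MSG_SENT, TOY_MSG_FAIL };

#define TOY_QUEUE_LEN	8
#define TOY_MAX_TRIES	3	/* assumed; the real limit is preconfigured */

struct toy_msg_queue {
	int msgs[TOY_QUEUE_LEN];	/* pending message payloads */
	size_t head, count;
	/* Returns 0 once the message is placed in the message registers. */
	int (*hw_send)(int msg);
	/* Stands in for the client's ntb_msg_event() callback. */
	void (*event)(enum toy_msg_event ev, int msg);
};

/* Like ntb_msg_post(): only enqueue, delivery happens in the worker. */
static int toy_msg_post(struct toy_msg_queue *q, int msg)
{
	if (q->count == TOY_QUEUE_LEN)
		return -1;
	q->msgs[(q->head + q->count++) % TOY_QUEUE_LEN] = msg;
	return 0;
}

/* Body of the work thread: drain the queue, retrying each message. */
static void toy_msg_work(struct toy_msg_queue *q)
{
	while (q->count) {
		int msg = q->msgs[q->head];
		int tries, sent = 0;

		/* The real worker would sleep between attempts. */
		for (tries = 0; tries < TOY_MAX_TRIES && !sent; tries++)
			sent = (q->hw_send(msg) == 0);

		q->head = (q->head + 1) % TOY_QUEUE_LEN;
		q->count--;
		q->event(sent ? TOY_MSG_SENT : TOY_MSG_FAIL, msg);
	}
}

/* Demo stubs: hardware that is busy for the next toy_busy attempts. */
static int toy_busy, toy_sent, toy_failed;

static int toy_hw_send(int msg)
{
	(void)msg;
	if (toy_busy > 0) {
		toy_busy--;
		return -1;	/* peer has not freed its inbound registers */
	}
	return 0;
}

static void toy_count_event(enum toy_msg_event ev, int msg)
{
	(void)msg;
	if (ev == TOY_MSG_SENT)
		toy_sent++;
	else
		toy_failed++;
}
```

With toy_busy set to 2 a posted message is reported as sent on the third attempt, and with toy_busy set to 3 all attempts are used up and the failure event is reported instead.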

> > + *
> > + * Synchronous hardware may not support it.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_msg_post(struct ntb_dev *ntb, struct ntb_msg *msg)
> > +{
> > +	if (!ntb->ops->msg_post)
> > +		return -EINVAL;
> > +
> > +	return ntb->ops->msg_post(ntb, msg);
> > +}
> > +
> > +/**
> > + * ntb_msg_size() - size of the message data
> > + * @ntb:	NTB device context.
> > + *
> > + * Different hardware may support different number of message registers. This
> > + * callback shall return the number of those used for data sending and
> > + * receiving including the type field.
> > + *
> > + * Synchronous hardware may not support it.
> > + *
> > + * Return: Zero on success, otherwise an error number.
> > + */
> > +static inline int ntb_msg_size(struct ntb_dev *ntb)
> > +{
> > +	if (!ntb->ops->msg_size)
> > +		return 0;
> > +
> > +	return ntb->ops->msg_size(ntb);
> > +}
> > +
> >  #endif
> > --
> > 2.6.6
>

Finally, I've answered all the questions. Hopefully things look clearer now.

Regards,
-Sergey
