Message-ID: <4242558.6NaQec4f7j@wuerfel>
Date: Wed, 02 Apr 2014 17:24:24 +0200
From: Arnd Bergmann <arnd@...db.de>
To: linux-arm-kernel@...ts.infradead.org
Cc: zhangfei <zhangfei.gao@...aro.org>, mark.rutland@....com,
devicetree@...r.kernel.org, f.fainelli@...il.com,
linux@....linux.org.uk, eric.dumazet@...il.com,
sergei.shtylyov@...entembedded.com, netdev@...r.kernel.org,
David.Laight@...lab.com, davem@...emloft.net
Subject: Re: [PATCH 3/3] net: hisilicon: new hip04 ethernet driver
On Wednesday 02 April 2014 17:51:54 zhangfei wrote:
> Dear Arnd
>
> On 04/02/2014 05:21 PM, Arnd Bergmann wrote:
> > On Tuesday 01 April 2014 21:27:12 Zhangfei Gao wrote:
> >> +static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
> >
> > While it looks like there are no serious functionality bugs left, this
> > function is rather inefficient, as has been pointed out before:
>
> Yes, we still need more performance tuning in the next step.
> We need to enable the hardware cache-flush feature with the help of the
> arm-smmu; as a result, dma_map_single etc. can be removed.
You cannot remove the dma_map_single call here, but the implementation
of that function will be different when you use the iommu_coherent_ops:
Instead of flushing the caches, it will create or remove an iommu entry
and return the bus address.
I remember you mentioned before that using the iommu on this particular
SoC actually gives you cache-coherent DMA, so you may also be able
to use arm_coherent_dma_ops if you can set up a static 1:1 mapping
between bus and phys addresses.
> >> +{
> >> + struct hip04_priv *priv = netdev_priv(ndev);
> >> + struct net_device_stats *stats = &ndev->stats;
> >> + unsigned int tx_head = priv->tx_head;
> >> + struct tx_desc *desc = &priv->tx_desc[tx_head];
> >> + dma_addr_t phys;
> >> +
> >> + hip04_tx_reclaim(ndev, false);
> >> + mod_timer(&priv->txtimer, jiffies + RECLAIM_PERIOD);
> >> +
> >> + if (priv->tx_count >= TX_DESC_NUM) {
> >> + netif_stop_queue(ndev);
> >> + return NETDEV_TX_BUSY;
> >> + }
> >
> > This is where you have two problems:
> >
> > - if the descriptor ring is full, you wait for RECLAIM_PERIOD,
> > which is far too long at 500ms, because during that time you
> > are not able to add further data to the stopped queue.
>
> Understand
> The idea here is to avoid using the timer as much as possible.
> As experiments show, the best throughput is achieved when buffers are
> reclaimed only from the xmit path.
I'm only talking about the case where that doesn't work: once you stop
the queue, the xmit function won't get called again until the timer
causes the reclaim to be done and the queue to be restarted.
> > - As David Laight pointed out earlier, you must also ensure that
> > you don't have too much /data/ pending in the descriptor ring
> > when you stop the queue. For a 10mbit connection, you have already
> > tested (as we discussed on IRC) that 64 descriptors with 1500 byte
> > frames gives you a 68ms round-trip ping time, which is too much.
>
> When iperf & ping are running together the latency is high; with only
> ping, it is 0.7 ms.
>
> > Conversely, on 1gbit, having only 64 descriptors actually seems
> > a little low, and you may be able to get better throughput if
> > you extend the ring to e.g. 512 descriptors.
>
> OK, will check the throughput with more xmit descriptors.
> But wasn't it said not to use too many descriptors for xmit, since there
> is no xmit interrupt?
The important part is to limit the time that data spends in the queue,
which is a function of the interface tx speed and the number of bytes
in the queue.
> >> + phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
> >> + if (dma_mapping_error(&ndev->dev, phys)) {
> >> + dev_kfree_skb(skb);
> >> + return NETDEV_TX_OK;
> >> + }
> >> +
> >> + priv->tx_skb[tx_head] = skb;
> >> + priv->tx_phys[tx_head] = phys;
> >> + desc->send_addr = cpu_to_be32(phys);
> >> + desc->send_size = cpu_to_be16(skb->len);
> >> + desc->cfg = cpu_to_be32(DESC_DEF_CFG);
> >> + phys = priv->tx_desc_dma + tx_head * sizeof(struct tx_desc);
> >> + desc->wb_addr = cpu_to_be32(phys);
> >
> > One detail: since you don't have cache-coherent DMA, "desc" will
> > reside in uncached memory, so you should try to minimize the number
> > of accesses.
> > It's probably faster if you build the descriptor on the stack and
> > then atomically copy it over, rather than assigning each member at
> > a time.
>
> I am sorry, I don't quite understand, could you clarify more?
> The phys and size etc. of skb->data change, so they need to be assigned.
> If a member's contents stay constant, it can be set at initialization.
I meant you should use 64-bit accesses here instead of multiple 32 and
16 bit accesses, but as David noted, it's actually not that much of
a deal for the writes as it is for the reads from uncached memory.
The important part is to avoid the line where you do 'if (desc->send_addr
!= 0)' as much as possible.
Arnd