netdev - Re: [PATCH] Fix e100 on systems that have cache incoherent DMA

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <474DBFCE.8010101@intel.com>
Date:	Wed, 28 Nov 2007 11:21:50 -0800
From:	"Kok, Auke" <auke-jan.h.kok@...el.com>
To:	David Acker <dacker@...net.com>
CC:	jeff@...zik.org, akpm@...ux-foundation.org, netdev@...r.kernel.org,
	jesse.brandeburg@...el.com, miltonm@....com,
	e1000-devel@...ts.sourceforge.net
Subject: Re: [PATCH] Fix e100 on systems that have cache incoherent DMA

David Acker wrote:
> What is the status of this patch?  Is there anything folks would like me
> to do in order to move it forward?  As an FYI, my company has been using
> this patch since I posted it and so far we have not had any problems
> with it.
> -Ack


Jeff merged it in netdev-2.6#upstream so it is queued for 2.6.25.

However, we could consider pushing this patch to 2.6.24 since it's a valid fix.
Andrew has also been carrying this patch in -mm for a while and so have we over
here without issues.

Jeff, what do you think?

Auke


> 
> Auke Kok wrote:
>> From: David Acker <dacker@...net.com>
>>
>> On the systems that have cache incoherent DMA, including ARM, there
>> is a race condition between software allocating a new receive buffer
>> and hardware writing into a buffer.  The two race on touching the last
>> Receive Frame Descriptor (RFD).  It has its el-bit set and its next
>> link equal to 0.  When hardware encounters this buffer it attempts to
>> write data to it and then update Status Word bits and Actual Count in
>> the RFD.  At the same time software may try to clear the el-bit and
>> set the link address to a new buffer.
>>
>> Since the entire RFD is once cache-line, the two write operations can
>> collide.  This can lead to the receive unit stalling or interpreting
>> random memory as its receive area.
>>
>> The fix is to set the el-bit on and the size to 0 on the next to last
>> buffer in the chain.  When the hardware encounters this buffer it stops
>> and does not write to it at all.  The hardware issues an RNR interrupt
>> with the receive unit in the No Resources state.  Software can write
>> to the tail of the list because it knows hardware will stop on the
>> previous descriptor that was marked as the end of list.
>>
>> Once it has a new next to last buffer prepared, it can clear the el-bit
>> and set the size on the previous one.  The race on this buffer is safe
>> since the link already points to a valid next buffer and the software
>> can handle the race setting the size (assuming aligned 16 bit writes
>> are atomic with respect to the DMA read). If the hardware sees the
>> el-bit cleared without the size set, it will move on to the next buffer
>> and skip this one.  If it sees the size set but the el-bit still set,
>> it will complete that buffer and then RNR interrupt and wait.
>>
>> Signed-off-by: David Acker <dacker@...net.com>
>> Signed-off-by: Auke Kok <auke-jan.h.kok@...el.com>
>> ---
>>
>>  drivers/net/e100.c |  128
>> ++++++++++++++++++++++++++++++++++++++++------------
>>  1 files changed, 99 insertions(+), 29 deletions(-)
>>
>> diff --git a/drivers/net/e100.c b/drivers/net/e100.c
>> index 3dbaec6..2153058 100644
>> --- a/drivers/net/e100.c
>> +++ b/drivers/net/e100.c
>> @@ -106,6 +106,13 @@
>>   *    the RFD, the RFD must be dma_sync'ed to maintain a consistent
>>   *    view from software and hardware.
>>   *
>> + *    In order to keep updates to the RFD link field from colliding with
>> + *    hardware writes to mark packets complete, we use the feature that
>> + *    hardware will not write to a size 0 descriptor and mark the
>> previous
>> + *    packet as end-of-list (EL).   After updating the link, we
>> remove EL
>> + *    and only then restore the size such that hardware may use the
>> + *    previous-to-end RFD.
>> + *
>>   *    Under typical operation, the  receive unit (RU) is start once,
>>   *    and the controller happily fills RFDs as frames arrive.  If
>>   *    replacement RFDs cannot be allocated, or the RU goes non-active,
>> @@ -281,6 +288,7 @@ struct csr {
>>  };
>>  
>>  enum scb_status {
>> +    rus_no_res       = 0x08,
>>      rus_ready        = 0x10,
>>      rus_mask         = 0x3C,
>>  };
>> @@ -952,7 +960,7 @@ static void e100_get_defaults(struct nic *nic)
>>          ((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
>>  
>>      /* Template for a freshly allocated RFD */
>> -    nic->blank_rfd.command = cpu_to_le16(cb_el);
>> +    nic->blank_rfd.command = 0;
>>      nic->blank_rfd.rbd = 0xFFFFFFFF;
>>      nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>>  
>> @@ -1791,15 +1799,12 @@ static int e100_rx_alloc_skb(struct nic *nic,
>> struct rx *rx)
>>      }
>>  
>>      /* Link the RFD to end of RFA by linking previous RFD to
>> -     * this one, and clearing EL bit of previous.  */
>> +     * this one.  We are safe to touch the previous RFD because
>> +     * it is protected by the before last buffer's el bit being set */
>>      if(rx->prev->skb) {
>>          struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
>>          put_unaligned(cpu_to_le32(rx->dma_addr),
>>              (u32 *)&prev_rfd->link);
>> -        wmb();
>> -        prev_rfd->command &= ~cpu_to_le16(cb_el);
>> -        pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
>> -            sizeof(struct rfd), PCI_DMA_TODEVICE);
>>      }
>>  
>>      return 0;
>> @@ -1824,8 +1829,19 @@ static int e100_rx_indicate(struct nic *nic,
>> struct rx *rx,
>>      DPRINTK(RX_STATUS, DEBUG, "status=0x%04X\n", rfd_status);
>>  
>>      /* If data isn't ready, nothing to indicate */
>> -    if(unlikely(!(rfd_status & cb_complete)))
>> +    if (unlikely(!(rfd_status & cb_complete))) {
>> +        /* If the next buffer has the el bit, but we think the receiver
>> +         * is still running, check to see if it really stopped while
>> +         * we had interrupts off.
>> +         * This allows for a fast restart without re-enabling
>> +         * interrupts */
>> +        if ((le16_to_cpu(rfd->command) & cb_el) &&
>> +            (RU_RUNNING == nic->ru_running))
>> +
>> +            if (readb(&nic->csr->scb.status) & rus_no_res)
>> +                nic->ru_running = RU_SUSPENDED;
>>          return -ENODATA;
>> +    }
>>  
>>      /* Get actual data size */
>>      actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
>> @@ -1836,9 +1852,18 @@ static int e100_rx_indicate(struct nic *nic,
>> struct rx *rx,
>>      pci_unmap_single(nic->pdev, rx->dma_addr,
>>          RFD_BUF_LEN, PCI_DMA_FROMDEVICE);
>>  
>> -    /* this allows for a fast restart without re-enabling interrupts */
>> -    if(le16_to_cpu(rfd->command) & cb_el)
>> +    /* If this buffer has the el bit, but we think the receiver
>> +     * is still running, check to see if it really stopped while
>> +     * we had interrupts off.
>> +     * This allows for a fast restart without re-enabling interrupts.
>> +     * This can happen when the RU sees the size change but also sees
>> +     * the el bit set. */
>> +    if ((le16_to_cpu(rfd->command) & cb_el) &&
>> +        (RU_RUNNING == nic->ru_running)) {
>> +
>> +        if (readb(&nic->csr->scb.status) & rus_no_res)
>>          nic->ru_running = RU_SUSPENDED;
>> +    }
>>  
>>      /* Pull off the RFD and put the actual data (minus eth hdr) */
>>      skb_reserve(skb, sizeof(struct rfd));
>> @@ -1870,31 +1895,30 @@ static void e100_rx_clean(struct nic *nic,
>> unsigned int *work_done,
>>      unsigned int work_to_do)
>>  {
>>      struct rx *rx;
>> -    int restart_required = 0;
>> -    struct rx *rx_to_start = NULL;
>> -
>> -    /* are we already rnr? then pay attention!!! this ensures that
>> -     * the state machine progression never allows a start with a
>> -     * partially cleaned list, avoiding a race between hardware
>> -     * and rx_to_clean when in NAPI mode */
>> -    if(RU_SUSPENDED == nic->ru_running)
>> -        restart_required = 1;
>> +    int restart_required = 0, err = 0;
>> +    struct rx *old_before_last_rx, *new_before_last_rx;
>> +    struct rfd *old_before_last_rfd, *new_before_last_rfd;
>>  
>>      /* Indicate newly arrived packets */
>>      for(rx = nic->rx_to_clean; rx->skb; rx = nic->rx_to_clean =
>> rx->next) {
>> -        int err = e100_rx_indicate(nic, rx, work_done, work_to_do);
>> -        if(-EAGAIN == err) {
>> -            /* hit quota so have more work to do, restart once
>> -             * cleanup is complete */
>> -            restart_required = 0;
>> +        err = e100_rx_indicate(nic, rx, work_done, work_to_do);
>> +        /* Hit quota or no more to clean */
>> +        if (-EAGAIN == err || -ENODATA == err)
>>              break;
>> -        } else if(-ENODATA == err)
>> -            break; /* No more to clean */
>>      }
>>  
>> -    /* save our starting point as the place we'll restart the
>> receiver */
>> -    if(restart_required)
>> -        rx_to_start = nic->rx_to_clean;
>> +
>> +    /* On EAGAIN, hit quota so have more work to do, restart once
>> +     * cleanup is complete.
>> +     * Else, are we already rnr? then pay attention!!! this ensures that
>> +     * the state machine progression never allows a start with a
>> +     * partially cleaned list, avoiding a race between hardware
>> +     * and rx_to_clean when in NAPI mode */
>> +    if (-EAGAIN != err && RU_SUSPENDED == nic->ru_running)
>> +        restart_required = 1;
>> +
>> +    old_before_last_rx = nic->rx_to_use->prev->prev;
>> +    old_before_last_rfd = (struct rfd *)old_before_last_rx->skb->data;
>>  
>>      /* Alloc new skbs to refill list */
>>      for(rx = nic->rx_to_use; !rx->skb; rx = nic->rx_to_use = rx->next) {
>> @@ -1902,10 +1926,42 @@ static void e100_rx_clean(struct nic *nic,
>> unsigned int *work_done,
>>              break; /* Better luck next time (see watchdog) */
>>      }
>>  
>> +    new_before_last_rx = nic->rx_to_use->prev->prev;
>> +    if (new_before_last_rx != old_before_last_rx) {
>> +        /* Set the el-bit on the buffer that is before the last buffer.
>> +         * This lets us update the next pointer on the last buffer
>> +         * without worrying about hardware touching it.
>> +         * We set the size to 0 to prevent hardware from touching this
>> +         * buffer.
>> +         * When the hardware hits the before last buffer with el-bit
>> +         * and size of 0, it will RNR interrupt, the RUS will go into
>> +         * the No Resources state.  It will not complete nor write to
>> +         * this buffer. */
>> +        new_before_last_rfd =
>> +            (struct rfd *)new_before_last_rx->skb->data;
>> +        new_before_last_rfd->size = 0;
>> +        new_before_last_rfd->command |= cpu_to_le16(cb_el);
>> +        pci_dma_sync_single_for_device(nic->pdev,
>> +            new_before_last_rx->dma_addr, sizeof(struct rfd),
>> +            PCI_DMA_TODEVICE);
>> +
>> +        /* Now that we have a new stopping point, we can clear the old
>> +         * stopping point.  We must sync twice to get the proper
>> +         * ordering on the hardware side of things. */
>> +        old_before_last_rfd->command &= ~cpu_to_le16(cb_el);
>> +        pci_dma_sync_single_for_device(nic->pdev,
>> +            old_before_last_rx->dma_addr, sizeof(struct rfd),
>> +            PCI_DMA_TODEVICE);
>> +        old_before_last_rfd->size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
>> +        pci_dma_sync_single_for_device(nic->pdev,
>> +            old_before_last_rx->dma_addr, sizeof(struct rfd),
>> +            PCI_DMA_TODEVICE);
>> +    }
>> +
>>      if(restart_required) {
>>          // ack the rnr?
>>          writeb(stat_ack_rnr, &nic->csr->scb.stat_ack);
>> -        e100_start_receiver(nic, rx_to_start);
>> +        e100_start_receiver(nic, nic->rx_to_clean);
>>          if(work_done)
>>              (*work_done)++;
>>      }
>> @@ -1937,6 +1993,7 @@ static int e100_rx_alloc_list(struct nic *nic)
>>  {
>>      struct rx *rx;
>>      unsigned int i, count = nic->params.rfds.count;
>> +    struct rfd *before_last;
>>  
>>      nic->rx_to_use = nic->rx_to_clean = NULL;
>>      nic->ru_running = RU_UNINITIALIZED;
>> @@ -1952,6 +2009,19 @@ static int e100_rx_alloc_list(struct nic *nic)
>>              return -ENOMEM;
>>          }
>>      }
>> +    /* Set the el-bit on the buffer that is before the last buffer.
>> +     * This lets us update the next pointer on the last buffer without
>> +     * worrying about hardware touching it.
>> +     * We set the size to 0 to prevent hardware from touching this
>> buffer.
>> +     * When the hardware hits the before last buffer with el-bit and
>> size
>> +     * of 0, it will RNR interrupt, the RU will go into the No Resources
>> +     * state.  It will not complete nor write to this buffer. */
>> +    rx = nic->rxs->prev->prev;
>> +    before_last = (struct rfd *)rx->skb->data;
>> +    before_last->command |= cpu_to_le16(cb_el);
>> +    before_last->size = 0;
>> +    pci_dma_sync_single_for_device(nic->pdev, rx->dma_addr,
>> +        sizeof(struct rfd), PCI_DMA_TODEVICE);
>>  
>>      nic->rx_to_use = nic->rx_to_clean = nic->rxs;
>>      nic->ru_running = RU_SUSPENDED;
>> -
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html