netdev - skb_shinfo(skb)->nr_frags > 0 while skb_is

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20091007010315.GA26498@neptune.jswright>
Date:	Tue, 6 Oct 2009 19:03:15 -0600
From:	John Wright <john.wright@...com>
To:	netdev@...r.kernel.org, Michael Chan <mchan@...adcom.com>
Cc:	Bob Montgomery <bob.montgomery@...com>
Subject: skb_shinfo(skb)->nr_frags > 0 while skb_is_gso(skb) == 0?

Hello,

Bob Montgomery and I are debugging an OOPS in the bnx2 driver.  The
driver OOPSes in bnx2_tx_int(), getting a NULL pointer dereference when
checking if the skb is GSO.  (This is on 2.6.29, before is_gso was
cached in the tx_buf (commit d62fda08), but bear with me - while kernels
with that commit might not crash in the same place, I think we have
discovered a bug that would manifest itself another way.)

So, first, a question for someone who knows more about sk_buff's than I:
is it reasonable/legal for an skb for which skb_is_gso(skb) == 0 to also
have skb_shinfo(skb)->nr_frags > 0?

If yes, then for Michael, or someone familiar with bnx2 hardware: are
"partial BD completions" (where the hw_tx_cons value is on a ring index
that, on the tx_buf_ring, would have a NULL skb value, and on the
tx_desc_ring, the tx_bd would not have the TX_BD_FLAGS_START flag set)
possible only for gso skb's, or is it possible any time nr_frags > 0?

We have a crash where both of the above conditions seem to be met.  Our
crash actually occurs in skb_is_gso(skb) - when dereferencing skb, which
is NULL - called from this chunk of code in bnx2_tx_int():

...
                tx_buf = &txr->tx_buf_ring[sw_ring_cons];
                skb = tx_buf->skb;

                /* partial BD completions possible with TSO packets */
                if (skb_is_gso(skb)) {
                        u16 last_idx, last_ring_idx;

                        last_idx = sw_cons +
                                skb_shinfo(skb)->nr_frags + 1;
...

An analysis of the disassembly shows that, at the time of the OOPS, RCX
contains the bnapi passed to bnx2_tx_int, RBX contains in its lower two
bytes the value of sw_cons, and R12 contains the value of hw_cons, the
last read value of the hardware consumer pointer.  (See the disassembly
of bnx2_poll_work, which all this was inlined into, at [1].)  Now, from
the OOPS:

[120223.249838] RAX: 000000000000005b RBX: 000000005746575b RCX: ffff8801a43cd680
[120223.249838] R10: 0000000000000006 R11: ffff8800562f6bc0 R12: 000000000000575a

Already, we see hw_cons is 0x575a (corresponding to a ring index of
0x5a), and our sw_cons (which is never supposed to go beyond hw_cons) is
0x575b.  How did this happen?

Unfortunately, it looks like another CPU ran bnx2_start_xmit and put
stuff on this ring before the OOPS handler finished running and
crash_kexec() was called.  So in our actual dump, it looks like the skb
we tried to read was not really NULL:

crash> struct bnx2_napi.tx_ring ffff8801a43cd680
  tx_ring = {
    tx_prod_bseq = 0xf8c4f09b, 
    tx_prod = 0x5783, 
    tx_bidx_addr = 0x12588, 
    tx_bseq_addr = 0x12590, 
    tx_desc_ring = 0xffff8801a3cb4000, 
    tx_buf_ring = 0xffff880325e52000, 
    tx_cons = 0x5746, 
    hw_tx_cons = 0x5746, 
    tx_desc_mapping = 0x1a3cb4000
  }
crash> p *((struct sw_tx_bd *)(0xffff880325e52000) + 0x5b)
$20 = {
  skb = 0xffff8801574c2780
}

But we'll assume for now that the kernel didn't lie, and that it really
was NULL at the time of the OOPS. :)  What's interesting, though, is
the place where the card thinks it is:

crash> p *((struct sw_tx_bd *)(0xffff880325e52000) + 0x5a)
$21 = {
  skb = 0x0
}

Ok, we might expect this if we've just cleared that skb, but we couldn't
have, since that would mean our sw_cons value would equal hw_cons, and
we wouldn't have performed that iteration in the for loop.  (We only try
to clean up a particular ring index after the card has *passed* it.)
Inside of the tx_desc_ring for this bnapi is something more interesting:

crash> p *((struct tx_bd *)(0xffff8801a3cb4000) + 0x5a)
$22 = {
  tx_bd_haddr_hi = 0x2, 
  tx_bd_haddr_lo = 0x5a8acd31, 
  tx_bd_mss_nbytes = 0x36, 
  tx_bd_vlan_tag_flags = 0x42
}

But 0x42 == 0100 0010 == TX_BD_FLAGS_TCP_UDP_CKSUM | TX_BD_FLAGS_END.
Note, TX_BD_FLAGS_START is not one of its flags, and yet it's also not
GSO - if it were, we'd expect TX_BD_FLAGS_SW_LSO (1 << 15) to be set.
See bnx2_start_xmit:

...
        if ((mss = skb_shinfo(skb)->gso_size)) {
                u32 tcp_opt_len;
                struct iphdr *iph;

                vlan_tag_flags |= TX_BD_FLAGS_SW_LSO;
...

(I have verified that this flag, among others, is indeed set on a tx_bd
whose skb we still have and has non-zero gso_size.)

If we go back one, we find a tx_bd with the TX_BD_FLAGS_START set, but
not TX_BD_FLAGS_END:

crash> p *((struct tx_bd *)(0xffff8801a3cb4000) + 0x59)
$24 = {
  tx_bd_haddr_hi = 0x1, 
  tx_bd_haddr_lo = 0xf915d3e, 
  tx_bd_mss_nbytes = 0x42, 
  tx_bd_vlan_tag_flags = 0x82
}

So, presumably, the corresponding skb would have nr_frags = 1.


That said, here's what we *think* is happening:

At the end of one of the iterations through the
while(sw_cons != hw_cons) loop, we have sw_cons == 0x5759, and

  hw_cons = bnx2_get_hw_tx_cons(bnapi);  /* hw_cons = 0x575a */

Then, next time around the loop, sw_cons != hw_cons, so we go back
inside the loop.  But since the skb is not gso, we don't detect that the
card is in the middle of the fragmented skb, and then we go ahead and
increment sw_cons (nr_frags + 1) times, which puts it at 0x575b.  Then,
we clean up an skb and a DMA mapping that the card might still be using,
and go around the loop again.  And we'll keep doing so, until the card
happens to catch up with us or we crash.  I'm not sure what happens to
the card when the skb's and DMA mappings get ripped out from under it,
but I suppose it's not good - in this case, we happened not even to have
any more skb's in the tx_buf_ring to go through, so we crashed before we
could find out. :)

I think this patch (against v2.6.29 - after d62fda08, is_gso and
nr_frags are actually stored in the tx_buf - see attached patch for one
that would apply against linux-2.6.git HEAD) would solve this issue:


diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index a7e688a..db075c8 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2615,8 +2615,8 @@ bnx2_tx_int(struct bnx2 *bp, struct bnx2_napi *bnapi, int 
                tx_buf = &txr->tx_buf_ring[sw_ring_cons];
                skb = tx_buf->skb;
 
-               /* partial BD completions possible with TSO packets */
-               if (skb_is_gso(skb)) {
+               /* partial BD completions possible with fragmented packets */
+               if (skb_shinfo(skb)->nr_frags) {
                        u16 last_idx, last_ring_idx;
 
                        last_idx = sw_cons +


Unfortunately, we haven't been able to reproduce this crash reliably.
(Under heavy network load, we've seen multiple crashes on the same
instruction; but this one seems to be the first that clearly exhibits
the "sw_cons overtakes hw_cons" behavior.  We're still analyzing some
older dumps to see if we can find evidence of the same behavior causing
those crashes.)

Can somebody who knows a bit more about sk_buff's and fragmenting
and/or the bnx2 chipsets comment on this?

I have attached the OOPS message and an equivalent patch that applies
against current linux-2.6.git, and posted a copy of the bnx2_poll_work
disassembly at [1].  I can also provide the debug vmlinux, modules, and
crash dump if necessary.

 [1]: http://free.linux.hp.com/~jswright/bnx2-crash-data/

Thanks!

-- 
+----------------------------------------------------------+
| John Wright <john.wright@...com>                         |
| HP Mission Critical OS Enablement & Solution Test (MOST) |
+----------------------------------------------------------+

View attachment "oops-message.txt" of type "text/plain" (4984 bytes)

View attachment "bnx2-crash.patch" of type "text/x-diff" (568 bytes)