[<prev] [next>] [day] [month] [year] [list]
Message-ID: <CAB_+Fg7ATpMVRd-yRtATzdYKwzbCvBdOxD8UvTa=Pj74ErzgjA@mail.gmail.com>
Date: Fri, 23 Sep 2011 17:43:18 -0700
From: Nandita Dukkipati <nanditad@...gle.com>
To: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc: Netdev <netdev@...r.kernel.org>
Subject: Re: BUG_ON -> WARN_ON in tcp_fragment
+netdev
Thanks, Ilpo.
On Fri, Sep 23, 2011 at 1:01 AM, Ilpo Järvinen
<ilpo.jarvinen@...sinki.fi> wrote:
>
> On Thu, 22 Sep 2011, Nandita Dukkipati wrote:
>
> > I am observing several instances hitting the BUG_ON(len > skb->len) in
> > tcp_fragment on older kernels. The function passing incorrect arguments is
> > tcp_mark_head_lost. I saw your patch changing BUG_ON to WARN_ON in this
> > patch. That's a good change, however I am trying to put in some amount of
> > debugging information to get to the root cause.
>
> Right, that change won't cure the cause. I was able to determine that some
> invariant must be violated to trigger this. However, the lack of
> information about the state of variables in tcp_mark_head_lost prevented
> me known exactly what is wrong (not to speak of from where the problem
> originates, I certainly tried to read some other code too but so far
> I've not been able to figure it out). I never remembered to do the debug
> patch while I would have had time (and I'm sitting at the computer).
>
> > Given that (packets - oldcnt) < tcp_skb_pcount, it should follow that
> > (packets - oldcnt) * mss < skb->len.
> >
> > 1) Is it possible that the skb has multiple small packets each smaller than
> > mss, in which case we shouldn't expect the above relation to hold?
>
> It should not happen because we recalculate on each retransmission using
> effective MSS. SACK block combines skbs but should only do that for
> S-bit'ed skbs requiring reneging at minimum to get them flying again and
> still they're retransmitted through with the same eff MSS check being
> done. However, I've seen some indications that there could be some bug in
> this as some drivers had to circumvent for the case of very small
> MSS'ed super skb (pcount > 1) because some hardware fails to handle that.
> I've no idea how such segments can be constructed atm but it certainly
> seems that you might be on the right tracks.
>
> If you're seeing them it would be easy to do sanity checks in
> tcp_mark_head_lost instead of tcp_fragment and trigger printouts with more
> verbose information. However, it might not solve the quest for the cause
> as this might only be symptom but certainly is still a good initial step.
>
> > If not, then it appears to be one of the two cases -
> > 2) mss updated in the midst of recovery (path mtu discovery?), without
> > corresponding updates of skb->gso_segs or skb-?gso_size?
>
> This is not relevant, as mss is stored into skb and we fetch it from
> there: mss = skb_shinfo(skb)->gso_size. All such mss changes only affect
> newly generated stuff.
>
> > 3) skb->len is inaccurately updated st some point.
>
> This is another possibility. But due to the other bug report I'm more
> inclined towards the first option. Also, I think this would lead to some
> noisy truesize inconsitency but I'm not the expert in those calculations
> :-D.
>
> > Your thoughts on which of these is a plausible scenario (or something other
> > than these!) would be great.
>
> Some pcount/cnt_hint misvalue might also explain something.
>
> --
> i.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists