Message-ID: <alpine.DEB.2.02.1407072357320.7769@dtop>
Date: Tue, 8 Jul 2014 00:01:01 -0700 (PDT)
From: dormando <dormando@...ia.net>
To: Eric Dumazet <eric.dumazet@...il.com>
cc: Alexey Preobrazhensky <preobr@...gle.com>,
Steffen Klassert <steffen.klassert@...unet.com>,
David Miller <davem@...emloft.net>, paulmck@...ux.vnet.ibm.com,
netdev@...r.kernel.org, Kostya Serebryany <kcc@...gle.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Lars Bull <larsbull@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
Bruce Curtis <brutus@...gle.com>,
Maciej Żenczykowski <maze@...gle.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
Subject: Re: [PATCH] ipv4: fix a race in ip4_datagram_release_cb()
On Tue, 8 Jul 2014, Eric Dumazet wrote:
> On Mon, 2014-07-07 at 18:41 -0700, dormando wrote:
>
> > Mostly there, but I think we hit what might be a new bug.. The machines
> > which crashed every few days previously have been stable for weeks.
> >
> > however I had one machine running the new kernel in a larger cluster
> > elsewhere; we had a network event and the one machine on the new kernel
> > panic'ed in ipv4_dst_destroy, but what looks like a new path. Sadly I've
> > had to halt the rollout :( All of the older unfixed kernels survived this
> > particular network event.
> >
> > Unfortunately this is still on 3.10, due to a bad softirq regression in
> > 3.14 I've not had time to track down. I applied all of your patches for
> > what wasn't already in 3.10. The only other change I made was to un-revert
> > 62713c4b6bc10c2d082ee1540e11b01a2b2162ab - which I'd been keeping reverted
> > as it was making crashes much more frequent.
>
> Hmm, always give patch title or a valid sha1 commit, this one is not in
> David trees, so its hard to tell.
>
Damn, sorry. I thought it was valid:

Author: Alexei Starovoitov <ast@...mgrid.com>
Date:   Tue Nov 19 19:12:34 2013 -0800

    ipv4: fix race in concurrent ip_route_input_slow()

    [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]

It's a patch that uses the DST_NOCACHE flag. I can re-add the revert to
my own tree, but it should probably be reviewed again, I guess?

We had another thread about it a while ago. I'd upgraded between stable
revisions of 3.10 (when this patch was added) and machines in one
datacenter started crashing every few hours. That thread never went
anywhere.

I tried removing the revert since your recent patches should've fixed the
underlying problem.

I have no idea whether this patch is the problem or not, though; I'm just
adding the information for completeness. We had no luck at all reproducing
this latest crash.