Message-ID: <48FC75EF.1020705@motion-twin.com>
Date: Mon, 20 Oct 2008 14:13:35 +0200
From: Nicolas Cannasse <ncannasse@...ion-twin.com>
To: swivel@...lls.gnugeneration.com
CC: linux-kernel@...r.kernel.org
Subject: Re: poll() blocked / packets not received ?
swivel@...lls.gnugeneration.com wrote:
> When the other end of the TCP connection is _gone_, that leads me to
> believe a FIN will not be coming, hence the indefinite ESTABLISHED
> state. Why it's gone is a different question; maybe your problem is at
> the other end?
> The end initiating a shutdown has to enter FIN_WAIT_1 then FIN_WAIT_2,
> these transitions require the other side to leave ESTABLISHED (receive a
> FIN then ACK) at the very least to proceed.
>
>> I agree with your comment in general, except that we have been running
>> the same application in single-thread environment for years without
>> running into this very specific problem.
>>
>
> Perhaps when you run in multicore/threaded you are stressing the network
> stacks at both ends more, including everything in-between? The
> threading vs. single process relationship is probably not causal, but
> just coincidental.
Not sure why that should happen, since these are the same servers. The
only thing that changes is the part of the software handling our server
requests: it is either embedded in Apache 1.3 with fork(), or a
standalone multithreaded server acting as an Apache backend.
So the only difference for networking is the additional
Apache<->MT-Server communication, but that should go over 127.0.0.1, so
I think it is purely software and not hardware-related.
> What is the protocol? Are there any timeouts to take care of these
> situations? Do you schedule an alarm or use SO_RCVTIMEO to shutdown
> dead connections and free up consumed threads?
The protocol is MySQL. Since we had the problem with libmysqlclient, we
reimplemented the client from scratch to make sure the problem was not
in that library.
What happens at the protocol level is the following:
a) we connect to the server
b) we make several requests and get the answers back
c) at some (random and rare) point - always after sending a request -
we get stuck waiting for the answer
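For reference, the client side of one request/reply cycle looks roughly
like the sketch below. This is not our actual code, just the shape of
it; 'fd' is the already-connected TCP socket to the MySQL server:

    /* Minimal sketch of one request/reply cycle (not our actual code).
       The request packet has already been written to fd. */
    #include <sys/types.h>
    #include <poll.h>
    #include <unistd.h>

    ssize_t wait_for_reply(int fd, char *buf, size_t len)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int n = poll(&pfd, 1, -1);  /* -1 = no timeout: this is where we hang */
        if (n <= 0)
            return -1;              /* errors (e.g. EINTR) handled by the caller */
        return read(fd, buf, len);  /* read the server's answer */
    }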
Sadly, this hang can happen inside a transaction while we hold the lock
on some shared resource. That blocks the whole website until we run out
of file descriptors because of accept()'ed pending connections. At that
point we get an exception and the server (the multithreaded one, not
MySQL) restarts, which releases the lock.
In other cases, when we don't hold a lock, the thread simply stays
blocked in poll() as I described. After a timeout (I think it is 28800
seconds) the MySQL server closes the connection. The client, which is
waiting in poll(), has no timeout of its own (it relies on the MySQL
server), but it never notices that the socket has been closed either.
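For what it's worth, this is what we would have expected to see when the
MySQL server closes its end (a sketch; handle_closed() is just a
placeholder name for whatever cleanup would follow):

    /* Expected behaviour on a normal server-side close: poll() wakes up
       with POLLIN (often together with POLLHUP) and read() returns 0. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, -1) > 0) {
        char buf[4096];
        ssize_t r = read(fd, buf, sizeof buf);
        if (r == 0)
            handle_closed(fd);  /* peer sent FIN: clean up / reconnect */
    }
    /* In our case poll() never returns at all, as if neither the answer
       nor the server's FIN ever reached the socket. */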
We spent a lot of time investigating signals, since poll() can also be
interrupted by garbage-collector and child-process signals, but we
handle EINTR correctly everywhere it is needed. So unless interrupting
poll() with a signal can somehow consume the data, that is not the
problem here.
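To be concrete, by handling EINTR I mean a retry loop of roughly this
shape (a sketch of the pattern, not our exact code):

    /* Restart poll() when it is interrupted by a signal (GC, SIGCHLD, ...).
       We poll with no timeout, so there is no remaining-timeout bookkeeping. */
    #include <errno.h>
    #include <poll.h>

    static int poll_eintr(struct pollfd *fds, nfds_t nfds, int timeout_ms)
    {
        int n;
        do {
            n = poll(fds, nfds, timeout_ms);
        } while (n < 0 && errno == EINTR);
        return n;
    }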
> TCP being reliable can block indefinitely, you can employ TCP keepalive
> to change indefinite to quite a long time.
Sure. We could also use a client-side timeout, but we don't want to
hold the lock longer than required, and we can't tell the difference
between a request that legitimately takes a long time to complete and a
lost connection.
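If we do end up needing it, enabling keepalive on the client socket
would look roughly like this (a sketch; the TCP_KEEP* options are
Linux-specific and the values below are only examples):

    /* Let the kernel probe an idle connection so a dead peer is eventually
       reported as an error on the socket, even without an application timeout. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int on = 1, idle = 60, intvl = 10, cnt = 5;  /* example values */
    setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);

That would at least distinguish a peer that is really gone from a
request that just takes a long time, since keepalive probes are answered
by the other host's TCP stack as long as the connection still exists
there.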
Hope we can somehow understand what's going on.
Thanks for the answers so far,
Best,
Nicolas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/