linux-kernel - Re: poll() blocked / packets not received ?


Message-ID: <48FC75EF.1020705@motion-twin.com>
Date:	Mon, 20 Oct 2008 14:13:35 +0200
From:	Nicolas Cannasse <ncannasse@...ion-twin.com>
To:	swivel@...lls.gnugeneration.com
CC:	linux-kernel@...r.kernel.org
Subject: Re: poll() blocked / packets not received ?

swivel@...lls.gnugeneration.com wrote:
> When the other end of the TCP is _gone_ that leads me to believe a FIN
> will not be coming, hence the indefinite ESTABLISHED state.  Why it's
> gone is a different question, maybe your problem is at the other end?
> The end initiating a shutdown has to enter FIN_WAIT_1 then FIN_WAIT_2,
> these transitions require the other side to leave ESTABLISHED (receive a
> FIN then ACK) at the very least to proceed.
> 
>> I agree with your comment in general, except that we have been running 
>> the same application in single-thread environment for years without 
>> running into this very specific problem.
>>
> 
> Perhaps when you run in multicore/threaded you are stressing the network
> stacks at both ends more, including everything in-between?  The
> threading vs. single process relationship is probably not causal, but
> just coincidental.

Not sure why this should happen, since these are the same servers. The
only thing that changes is the part of the software that we use to
handle our server requests. It's either embedded in Apache 1.3 with
fork() or a standalone multithreaded server that acts as an Apache
backend.

So the only difference for networking is that we have additional 
Apache<->MT-Server communications, but they should be on 127.0.0.1 so I 
think they are purely software and not hardware-related.

> What is the protocol?  Are there any timeouts to take care of these
> situations?  Do you schedule an alarm or use SO_RCVTIMEO to shutdown
> dead connections and free up consumed threads?

The protocol is MySQL. Since we had the problem with libmysqlclient, we
reimplemented the client from scratch to make sure that the problem was
not in the client software.

What happens at the protocol level is the following:

a) we connect to the server
b) we make several requests and get answers back
c) at some (random+rare) point - always after making a request - we're 
stuck while waiting for the answer.

Sadly, this can happen inside a transaction while we hold the lock on
some shared resource. This locks the whole website until we run out of
file descriptors because of pending accept()ed connections. In that
case we get an exception and the server (the multithreaded one, not
MySQL) restarts, which releases the lock.

In some other cases when we don't hold a lock, the thread remains 
blocked in poll() as I described it. After a timeout (I think it's 28800 
seconds) the MySQL server closes the connection. The client - which is 
waiting in poll() - does not have any timeout activated (it's relying on 
the mysql server). But it doesn't notice that the socket has been closed 
either.

We looked closely at signals, since poll() can also be interrupted by
garbage collector and child-process signals, but we correctly handle
EINTR everywhere it's needed. So unless interrupting poll() with a
signal can somehow consume the data, this is not the problem here.

> TCP being reliable can block indefinitely, you can employ TCP keepalive
> to change indefinite to quite a long time.

Sure. We could also use a client timeout, but we don't want to hold the
lock longer than required, and we can't tell the difference between a
request that takes too long to complete and a lost connection.
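As a sketch of the keepalive suggestion quoted above: on Linux,
keepalive probes are answered by the peer's TCP stack regardless of how
long the request itself takes, so they may help tell a slow request
from a dead connection. The function name enable_keepalive and the
parameter values in the note below are illustrative assumptions, not
settings we have tested.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on fd (Linux-specific tuning options).
 *   idle_s     - seconds of idleness before the first probe
 *   interval_s - seconds between unanswered probes
 *   probes     - unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on error with errno set by setsockopt(). */
int enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_s, sizeof interval_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

With, say, enable_keepalive(fd, 60, 10, 5), an unreachable peer would be
detected after roughly 60 + 5 * 10 = 110 seconds of silence, at which
point a thread blocked in poll() is woken with an error on the socket,
while a long-running request on a live connection is left alone.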

Hope we can somehow understand what's going on.
Thanks for the answers so far,

Best,
Nicolas