Message-ID: <20110507105152.GA13459@jensen.inet.no>
Date:	Sat, 7 May 2011 12:51:52 +0200
From:	Michael Shuldman <michaels@...t.no>
To:	linux-kernel@...r.kernel.org
Cc:	"David S. Miller" <davem@...emloft.net>, karls@...t.no
Subject: bug in select(2) regarding non-blocking connect(2) completion?


Hello, I am occasionally encountering what I believe is a bug in the
kernel.

Below is a strace that I believe shows how the bug manifests itself,
with my comments.


# first select.  All fd's in the write set ([15 17 ... 51 55]) are 
# non-blocking sockets that have had a connect(2) previously issued on
# them, and which have yet to finish connecting as far as we know
# at the time we call select(2).
03:55:31.808548 select(58, [4 8 11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31
32 33 34 35 36 37 39 40 41 43 44 46 48 49 50 52 53 54 57], [15 17 25 29 45 47 51
55], [11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31 32 33 34 35 36 37 39 40
41 43 44 46 48 49 50 52 53 54 57], {1, 0}) = 3 (in [16 26], out [51], left
{1, 0})

# As indicated by the results returned by the above select(2), fd 51 should
# have finished the connect attempt, but when we try to find out whether 
# the connect(2) succeeded or failed, the results are conflicting.

03:55:31.808622 getpeername(51, 0x7fff5d2eaa8c, [0]) = -1 ENOTCONN (Transport
endpoint is not connected)
03:55:31.808900 getsockopt(51, SOL_SOCKET, SO_ERROR, [0], [4]) = 0

# getpeername(2) failing on a socket that has finished connecting should 
# indicate that the connect(2) failed.  Yet when we try to fetch the
# SO_ERROR of the socket, it says no error is currently set.
# We then loop around with select(2) again, and again the same thing
# happens:

03:55:31.809259 select(58, [4 8 11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31
32 33 34 35 36 37 39 40 41 43 44 46 48 49 50 52 53 54 57], [15 17 25 29 45 47 51
55], [11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31 32 33 34 35 36 37 39 40
41 43 44 46 48 49 50 52 53 54 57], {1, 0}) = 3 (in [16 26], out [51], left
{1, 0})
03:55:31.809329 getpeername(51, 0x7fff5d2eaa8c, [0]) = -1 ENOTCONN (Transport
endpoint is not connected)
03:55:31.809640 getsockopt(51, SOL_SOCKET, SO_ERROR, [0], [4]) = 0

...

# finally, getsockopt(2) returns that the connect(2) failed.
03:55:32.521146 getpeername(51, 0x7fff5d2eaa8c, [0]) = -1 ENOTCONN (Transport
endpoint is not connected)
03:55:32.521614 getsockopt(51, SOL_SOCKET, SO_ERROR, [101], [4]) = 0

In other words, select(2) says the socket has finished connecting,
and getpeername(2) neither confirms nor denies this (it can only
confirm that the connect finished successfully).  getsockopt(2) with
SO_ERROR, however, says there is no error on the socket, which,
coupled with getpeername(2) failing, indicates that the connect(2)
has not yet finished.
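
For reference, the check performed after select(2) reports the socket
as writable can be sketched roughly as below.  This is a minimal
illustration, not the proxy's actual code: the names conn_state and
check_connect are hypothetical, and the sketch assumes the same call
order as in the strace (getpeername(2) first, then SO_ERROR).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical names, chosen for this illustration only. */
enum conn_state {
    CONN_ESTABLISHED, /* getpeername(2) succeeded: connect completed   */
    CONN_FAILED,      /* SO_ERROR holds a pending error                */
    CONN_UNDECIDED    /* no peer and no pending error (the odd case)   */
};

/*
 * Classify a non-blocking socket after select(2) marked it writable.
 * On return, *err_out holds the relevant errno value (0 on success).
 */
static enum conn_state check_connect(int fd, int *err_out)
{
    struct sockaddr_storage peer;
    socklen_t peerlen = sizeof(peer);
    int soerr = 0;
    socklen_t errlen = sizeof(soerr);
    int gp_errno;

    *err_out = 0;

    /* A connected socket has a peer address. */
    if (getpeername(fd, (struct sockaddr *)&peer, &peerlen) == 0)
        return CONN_ESTABLISHED;
    gp_errno = errno; /* typically ENOTCONN */

    /* Fetch the pending socket error; reading it also clears it. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &soerr, &errlen) != 0) {
        *err_out = errno;
        return CONN_FAILED;
    }
    if (soerr != 0) {
        *err_out = soerr;
        return CONN_FAILED;
    }

    /* Writable, not connected, and no error set: the state seen above. */
    *err_out = gp_errno;
    return CONN_UNDECIDED;
}
```

With a check like this, the first two iterations in the strace would
end in CONN_UNDECIDED (ENOTCONN with SO_ERROR 0), and the last one in
CONN_FAILED with error 101.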



This does not happen all the time.  E.g., I watched the system for
an hour yesterday, as things were starting up and the number of
concurrent tcp clients gradually increased from zero to around 700,
with no observable problems.  After a while, however, the problem
starts occurring; whether that is related to the increasing number
of clients or to something else, I do not know.

Currently the system has a little over 3,000 clients and the problem
occurs now and then: sometimes several times a minute, sometimes
with dozens of minutes between occurrences.  At the moment,
the last time the problem was detected was 40 minutes ago.

The software the above strace is related to is a proxy server, and
if there are 3000 clients (incoming TCP sessions), there would
normally be 3000 outgoing TCP sessions also.  

uname -a on the system in question reports 
2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:10:13 EDT 2011 x86_64 x86_64
x86_64 GNU/Linux

Thankful for any hints or pointers related to this problem.
With kind regards,

-- 
  _ // 
  \X/ -- Michael Shuldman 
