Message-Id: <1E0490CF-9393-485B-BFB9-119249F4EB65@icloud.com>
Date: Thu, 27 Dec 2018 16:13:29 -0500
From: charles cross <xcross59@...oud.com>
To: netdev@...r.kernel.org
Subject: UDP sendto() fails with EINVAL when host under network load
Hi netdev,
I've got an application that handles network traffic using various protocols. The application consists of a supervisor process and one or more worker processes, plus a watchdog that lets the supervisor kill hung workers, or detect that they have crashed, and start new ones. Originally we had only a single worker process, and the watchdog was a UDP socket on the loopback address through which the supervisor sent a health check and the healthy worker replied. When we improved the application to support multiple worker processes, we were able to simply extend the watchdog to use multicast. This was accomplished with no significant change to the watchdog logic, i.e., the workers just join the multicast group and reply with an ID when the supervisor sends to the group.
The new multicast watchdog works fine except under heavy load. Using the test program curl-loader, we ramp up to several thousand HTTP connections to the worker process. As the load builds, the supervisor health check starts to fail intermittently, until it reaches 100% failure at peak load. The failure occurs when the health check is originated: sendto() fails with EINVAL. As the load drops, sendto() begins to succeed again. The arguments to sendto() do not change during the test. Using printk I have isolated the failure to udp_sendmsg() in net/ipv4/udp.c:
int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len)
Within this function at this block
    /* Lockless fast path for the non-corking case. */
    if (!corkreq) {
        skb = ip_make_skb(sk, fl4, getfrag, msg->msg_iov, ulen,
                          sizeof(struct udphdr), &ipc, &rt,
                          msg->msg_flags);
        err = PTR_ERR(skb);
        if (!IS_ERR_OR_NULL(skb))
            err = udp_send_skb(skb, fl4);
        printk(KERN_ERR "%s goto out from line: %d\n", __FUNCTION__, __LINE__);
        goto out;
    }
the function udp_send_skb() is returning EINVAL.
The kernel is v3.10.0 from upstream RHEL 7.5. Can anyone offer advice before I proceed down the stack to look for the root cause? The behavior (failure under load, recovery once the load is removed) suggests contention for resources, but the EINVAL return code makes no sense to me given that the arguments to sendto() do not change. I am totally unfamiliar with this code, so any help is appreciated.
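In case it helps anyone trying to reproduce this, one way to correlate the failures with load from user space is to tally sendto() outcomes by errno. The following is only a sketch under assumptions: struct wd_stats, the wd_* functions, and the WD_MAX_ERRNO bound are hypothetical names, and err is taken to be the errno value captured immediately after a failed sendto(), or 0 on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define WD_MAX_ERRNO 256  /* arbitrary bound for this sketch */

struct wd_stats {
    unsigned long ok;                      /* successful sends */
    unsigned long by_errno[WD_MAX_ERRNO];  /* slot 0 collects out-of-range values */
};

/* Record one sendto() outcome: 0 for success, otherwise the errno value. */
void wd_record(struct wd_stats *s, int err)
{
    assert(s != NULL);
    if (err == 0)
        s->ok++;
    else if (err > 0 && err < WD_MAX_ERRNO)
        s->by_errno[err]++;
    else
        s->by_errno[0]++;
}

/* Total number of failed sends recorded so far. */
unsigned long wd_failures(const struct wd_stats *s)
{
    unsigned long n = 0;
    int i;

    for (i = 0; i < WD_MAX_ERRNO; i++)
        n += s->by_errno[i];
    return n;
}

/* Print a summary plus one line per errno value that was seen. */
void wd_report(const struct wd_stats *s, FILE *out)
{
    int i;

    fprintf(out, "ok=%lu failed=%lu\n", s->ok, wd_failures(s));
    for (i = 1; i < WD_MAX_ERRNO; i++)
        if (s->by_errno[i])
            fprintf(out, "  errno %d (%s): %lu\n",
                    i, strerror(i), s->by_errno[i]);
}
```

Calling wd_report() at a fixed interval while the load ramps up and down would show whether EINVAL is the only error returned and how its rate tracks the connection count.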
Thanks,
Chris