[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <46dfed54-3634-af7a-2b14-b64b5d33a8e8@fb.com>
Date: Tue, 24 Jul 2018 16:02:50 -0700
From: Yonghong Song <yhs@...com>
To: John Fastabend <john.fastabend@...il.com>,
netdev <netdev@...r.kernel.org>
CC: Martin Lau <kafai@...com>
Subject: Re: selftests/bpf test_sockmap failure
On 7/24/18 3:40 PM, John Fastabend wrote:
> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>> In one of our production machines, tools/testing/selftests/bpf
>> test_sockmap failed randomly like below:
>>
>> ...
>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> ...
>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> ...
>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> ...
>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
>> memory. The kernel is based on 4.16 but backported with latest bpf-next
>> bpf changes.
>>
>> The failed test number (78, 79, 80, or 83) is random. But they all share
>> similar characteristics:
>> . the option rate is greater than one, i.e., more than one
>> sendmsg/sendpage in the sender forked process.
>> . The txmsg_apply is not 0
>>
>> I debugged a little bit. It happens in msg_loop() function below
>> "unexpected timeout" path.
>>
>> ...
>> slct = select(max_fd + 1, &w, NULL, NULL,
>> &timeout);
>> if (slct == -1) {
>> perror("select()");
>> clock_gettime(CLOCK_MONOTONIC, &s->end);
>> goto out_errno;
>> } else if (!slct) {
>> if (opt->verbose)
>> fprintf(stderr, "unexpected
>> timeout\n");
>> errno = -EIO;
>> clock_gettime(CLOCK_MONOTONIC, &s->end);
>> goto out_errno;
>> }
>> ...
>>
>> It appears that when the error happens, the receive process does not
>> receive all bytes sent from the send process and eventually times out.
>>
>> Has anybody seen this issue as well?
>> John, any comments on this failure?
>
> Can you run the test with verbose enabled so we can determine if the tx
> side is even sending the message? Sample patch below. This will allow
> us to see the tx bytes and rx bytes, although it will be a bit noisy.
>
> I notice that the test program is not smart enough to (re)send bytes if
> the sendmsg call doesn't consume all bytes. This is a valid error if
> we get a enomem or other normal error on the tx side. With apply this
> is more likely because every byte (in apply = 1 case) is being sent
> through BPF prog.
The following are some of logs for the failed tests:
...
[TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1
<-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s
0.000000GB/s
rx thread exited with err 1. FAILED
...
[TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1
<-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s
0.000000GB/s
rx thread exited with err 1. FAILED
[TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1
<-> p1, c2 <-> p2
cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s
0.000000GB/s
rx thread exited with err 1. FAILED
[TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: c1
<-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 458805.000000B/s
0.000459GB/s
rx thread exited with err 1. FAILED
>
> If this is not the case I can do some more digging but I've not seen
> this before.
I cannot reproduce it in my FC27 VM (with latest bpf-next) either.
The bug only shows up on our production service which is a lot busier
and has more cpus/memorys and some sysctl configurations are different
from my VM setup.
>
> Thanks!
> John
>
> --- a/tools/testing/selftests/bpf/test_sockmap.c
> +++ b/tools/testing/selftests/bpf/test_sockmap.c
> @@ -1031,6 +1031,7 @@ static int test_exec(int cgrp, struct
> sockmap_options *opt)
> if (err)
> goto out;
>
> + opt->verbose = true;
> err = __test_exec(cgrp, SENDPAGE, opt);
> out:
> return err;
>
>
>>
>> Thanks,
>>
>> Yonghong
>
Powered by blists - more mailing lists