[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff63e289-91db-3dd7-3370-941f8ddb987c@lab.ntt.co.jp>
Date: Wed, 25 Jul 2018 09:49:35 +0900
From: Prashant Bhole <bhole_prashant_q7@....ntt.co.jp>
To: Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
netdev <netdev@...r.kernel.org>
Cc: Martin Lau <kafai@...com>
Subject: Re: selftests/bpf test_sockmap failure
On 7/25/2018 8:02 AM, Yonghong Song wrote:
>
>
> On 7/24/18 3:40 PM, John Fastabend wrote:
>> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>>> In one of our production machines, tools/testing/selftests/bpf
>>> test_sockmap failed randomly like below:
>>>
>>> ...
>>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
>>> memory. The kernel is based on 4.16 but backported with latest bpf-next
>>> bpf changes.
>>>
>>> The failed test number (78, 79, 80, or 83) is random. But they all share
>>> similar characteristics:
>>> . the option rate is greater than one, i.e., more than one
>>> sendmsg/sendpage in the sender forked process.
>>> . The txmsg_apply is not 0
>>>
>>> I debugged a little bit. It happens in msg_loop() function below
>>> "unexpected timeout" path.
>>>
>>> ...
>>> slct = select(max_fd + 1, &w, NULL, NULL,
>>> &timeout);
>>> if (slct == -1) {
>>> perror("select()");
>>> clock_gettime(CLOCK_MONOTONIC,
>>> &s->end);
>>> goto out_errno;
>>> } else if (!slct) {
>>> if (opt->verbose)
>>> fprintf(stderr, "unexpected
>>> timeout\n");
>>> errno = -EIO;
>>> clock_gettime(CLOCK_MONOTONIC,
>>> &s->end);
>>> goto out_errno;
>>> }
>>> ...
>>>
>>> It appears that when the error happens, the receive process does not
>>> receive all bytes sent from the send process and eventually times out.
>>>
>>> Has anybody seen this issue as well?
>>> John, any comments on this failure?
>>
>> Can you run the test with verbose enabled so we can determine if the tx
>> side is even sending the message? Sample patch below. This will allow
>> us to see the tx bytes and rx bytes, although it will be a bit noisy.
>>
>> I notice that the test program is not smart enough to (re)send bytes if
>> the sendmsg call doesn't consume all bytes. This is a valid error if
>> we get a enomem or other normal error on the tx side. With apply this
>> is more likely because every byte (in apply = 1 case) is being sent
>> through BPF prog.
>
> The following are some of logs for the failed tests:
>
> ...
> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> ...
>
> [TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s
> 0.000000GB/s
> rx thread exited with err 1. FAILED
>
> [TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1
> <-> p1, c2 <-> p2
> cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s
> 0.000000GB/s
> rx thread exited with err 1. FAILED
>
> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: c1
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 458805.000000B/s
> 0.000459GB/s
> rx thread exited with err 1. FAILED
>
>
>>
>> If this is not the case I can do some more digging but I've not seen
>> this before.
>
> I cannot reproduce it in my FC27 VM (with latest bpf-next) either.
> The bug only shows up on our production service which is a lot busier
> and has more cpus/memorys and some sysctl configurations are different
> from my VM setup.
In the past I have noticed that on busy machines sometimes data is
delayed so much that select() timeout is triggered. Although it isn't a
solution but you can try to increase the timeout value to check if that
is the case.
-Prashant
Powered by blists - more mailing lists