netdev - Re: selftests/bpf test

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ff63e289-91db-3dd7-3370-941f8ddb987c@lab.ntt.co.jp>
Date:   Wed, 25 Jul 2018 09:49:35 +0900
From:   Prashant Bhole <bhole_prashant_q7@....ntt.co.jp>
To:     Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        netdev <netdev@...r.kernel.org>
Cc:     Martin Lau <kafai@...com>
Subject: Re: selftests/bpf test_sockmap failure



On 7/25/2018 8:02 AM, Yonghong Song wrote:
> 
> 
> On 7/24/18 3:40 PM, John Fastabend wrote:
>> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>>> In one of our production machines, tools/testing/selftests/bpf
>>> test_sockmap failed randomly like below:
>>>
>>> ...
>>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
>>> memory. The kernel is based on 4.16 but backported with latest bpf-next
>>> bpf changes.
>>>
>>> The failed test number (78, 79, 80, or 83) is random. But they all share
>>> similar characteristics:
>>>     . the option rate is greater than one, i.e., more than one
>>>       sendmsg/sendpage in the sender forked process.
>>>     . The txmsg_apply is not 0
>>>
>>> I debugged a little bit. It happens in msg_loop() function below
>>> "unexpected timeout" path.
>>>
>>> ...
>>>                          slct = select(max_fd + 1, &w, NULL, NULL,
>>> &timeout);
>>>                          if (slct == -1) {
>>>                                  perror("select()");
>>>                                  clock_gettime(CLOCK_MONOTONIC, 
>>> &s->end);
>>>                                  goto out_errno;
>>>                          } else if (!slct) {
>>>                                  if (opt->verbose)
>>>                                          fprintf(stderr, "unexpected
>>> timeout\n");
>>>                                  errno = -EIO;
>>>                                  clock_gettime(CLOCK_MONOTONIC, 
>>> &s->end);
>>>                                  goto out_errno;
>>>                          }
>>> ...
>>>
>>> It appears that when the error happens, the receive process does not
>>> receive all bytes sent from the send process and eventually times out.
>>>
>>> Has anybody seen this issue as well?
>>> John, any comments on this failure?
>>
>> Can you run the test with verbose enabled so we can determine if the tx
>> side is even sending the message? Sample patch below. This will allow
>> us to see the tx bytes and rx bytes, although it will be a bit noisy.
>>
>> I notice that the test program is not smart enough to (re)send bytes if
>> the sendmsg call doesn't consume all bytes. This is a valid error if
>> we get a enomem or other normal error on the tx side. With apply this
>> is more likely because every byte (in apply = 1 case) is being sent
>> through BPF prog.
> 
> The following are some of logs for the failed tests:
> 
> ...
> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> ...
> 
> [TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s 
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> 
> [TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> 
> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 458805.000000B/s 
> 0.000459GB/s
> rx thread exited with err 1. FAILED
> 
> 
>>
>> If this is not the case I can do some more digging but I've not seen
>> this before.
> 
> I cannot reproduce it in my FC27 VM (with latest bpf-next) either.
> The bug only shows up on our production service which is a lot busier 
> and has more cpus/memorys and some sysctl configurations are different 
> from my VM setup.

In the past I have noticed that on busy machines sometimes data is 
delayed so much that select() timeout is triggered. Although it isn't a 
solution but you can try to increase the timeout value to check if that 
is the case.

-Prashant