Date:   Tue, 24 Jul 2018 16:02:50 -0700
From:   Yonghong Song <yhs@...com>
To:     John Fastabend <john.fastabend@...il.com>,
        netdev <netdev@...r.kernel.org>
CC:     Martin Lau <kafai@...com>
Subject: Re: selftests/bpf test_sockmap failure



On 7/24/18 3:40 PM, John Fastabend wrote:
> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>> On one of our production machines, tools/testing/selftests/bpf
>> test_sockmap fails randomly, as shown below:
>>
>> ...
>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with err 1. FAILED
>> ...
>>
>> ...
>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with err 1. FAILED
>> ...
>>
>> ...
>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with err 1. FAILED
>> ...
>>
>> ...
>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with err 1. FAILED
>> ...
>>
>> The command line is just `test_sockmap`. The machine has 80 cpus and
>> 256G memory. The kernel is based on 4.16 with the latest bpf-next bpf
>> changes backported.
>>
>> The failing test number (78, 79, 80, or 83) varies from run to run, but
>> the failures all share similar characteristics:
>>     . the option rate is greater than one, i.e., more than one
>>       sendmsg/sendpage call in the forked sender process
>>     . txmsg_apply is not 0
>>
>> I debugged this a little bit. The error happens in the msg_loop()
>> function, in the "unexpected timeout" path below.
>>
>> ...
>>                          slct = select(max_fd + 1, &w, NULL, NULL, &timeout);
>>                          if (slct == -1) {
>>                                  perror("select()");
>>                                  clock_gettime(CLOCK_MONOTONIC, &s->end);
>>                                  goto out_errno;
>>                          } else if (!slct) {
>>                                  if (opt->verbose)
>>                                          fprintf(stderr, "unexpected timeout\n");
>>                                  errno = -EIO;
>>                                  clock_gettime(CLOCK_MONOTONIC, &s->end);
>>                                  goto out_errno;
>>                          }
>> ...
>>
>> It appears that when the error happens, the receive process does not
>> receive all bytes sent from the send process and eventually times out.
>>
>> Has anybody seen this issue as well?
>> John, any comments on this failure?
> 
> Can you run the test with verbose enabled so we can determine if the tx
> side is even sending the message? Sample patch below. This will allow
> us to see the tx bytes and rx bytes, although it will be a bit noisy.
> 
> I notice that the test program is not smart enough to (re)send bytes if
> the sendmsg call doesn't consume all bytes. This is a valid error if
> we get an ENOMEM or some other normal error on the tx side. With apply
> this is more likely because every byte (in the apply = 1 case) is sent
> through the BPF program.
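
For reference, the kind of tx-side retry loop described above might look
roughly like the sketch below. This is hypothetical, not the actual
test_sockmap.c code; the sendmsg_all() helper name is made up for
illustration.

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical sketch only: keep calling sendmsg() until the whole
 * payload has been consumed or a hard error occurs.
 */
static int sendmsg_all(int fd, struct msghdr *msg, size_t total)
{
	size_t sent_total = 0;

	while (sent_total < total) {
		ssize_t sent = sendmsg(fd, msg, 0);

		if (sent < 0) {
			if (errno == EINTR || errno == EAGAIN ||
			    errno == ENOMEM)
				continue;	/* transient error, retry */
			return -errno;		/* hard error, give up */
		}
		sent_total += sent;
		/*
		 * A complete implementation would also advance msg->msg_iov
		 * past the bytes already consumed before retrying, so a
		 * partial send does not transmit the same data twice.
		 */
	}
	return 0;
}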

The following are some of the logs from the failed tests:

...
[TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 0.000000GB/s
rx thread exited with err 1. FAILED
...

[TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s 0.000000GB/s
rx thread exited with err 1. FAILED

[TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 0.000000GB/s
rx thread exited with err 1. FAILED

[TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 458805.000000B/s 0.000459GB/s
rx thread exited with err 1. FAILED
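
As a sanity check on the counters above: for TEST 80 the expected payload
is cnt * iov_count * iov_buf = 2 * 1024 * 256 = 524288 bytes, which matches
the tx_sendmsg TX counter, while the rx side only saw 458805 bytes before
the timeout. The other failing tests show the same pattern (e.g. TEST 78:
512 bytes sent, 147 received), so it looks like the tx side does send the
full payload and the remaining bytes go missing on the rx path.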


> 
> If this is not the case I can do some more digging but I've not seen
> this before.

I cannot reproduce it in my FC27 VM (with the latest bpf-next) either.
The bug only shows up on our production machines, which are a lot busier,
have more cpus and memory, and use some sysctl settings that differ from
my VM setup.

> 
> Thanks!
> John
> 
> --- a/tools/testing/selftests/bpf/test_sockmap.c
> +++ b/tools/testing/selftests/bpf/test_sockmap.c
> @@ -1031,6 +1031,7 @@ static int test_exec(int cgrp, struct sockmap_options *opt)
>          if (err)
>                  goto out;
> 
> +       opt->verbose = true;
>          err = __test_exec(cgrp, SENDPAGE, opt);
>   out:
>          return err;
> 
> 
>>
>> Thanks,
>>
>> Yonghong
> 
