lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6c0bebae-e3b4-8a69-0b51-df7de9414d64@gmail.com>
Date:   Tue, 24 Jul 2018 15:40:14 -0700
From:   John Fastabend <john.fastabend@...il.com>
To:     Yonghong Song <yhs@...com>, netdev <netdev@...r.kernel.org>
Cc:     Martin Lau <kafai@...com>
Subject: Re: selftests/bpf test_sockmap failure

On 07/24/2018 08:45 AM, Yonghong Song wrote:
> In one of our production machines, tools/testing/selftests/bpf
> test_sockmap failed randomly like below:
> 
> ...
> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> ...
> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> ...
> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> ...
> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
> memory. The kernel is based on 4.16 but backported with latest bpf-next
> bpf changes.
> 
> The failed test number (78, 79, 80, or 83) is random. But they all share
> similar characteristics:
>    . the option rate is greater than one, i.e., more than one
>      sendmsg/sendpage in the sender forked process.
>    . The txmsg_apply is not 0
> 
> I debugged a little bit. It happens in msg_loop() function below
> "unexpected timeout" path.
> 
> ...
>                         slct = select(max_fd + 1, &w, NULL, NULL,
> &timeout);
>                         if (slct == -1) {
>                                 perror("select()");
>                                 clock_gettime(CLOCK_MONOTONIC, &s->end);
>                                 goto out_errno;
>                         } else if (!slct) {
>                                 if (opt->verbose)
>                                         fprintf(stderr, "unexpected
> timeout\n");
>                                 errno = -EIO;
>                                 clock_gettime(CLOCK_MONOTONIC, &s->end);
>                                 goto out_errno;
>                         }
> ...
> 
> It appears that when the error happens, the receive process does not
> receive all bytes sent from the send process and eventually times out.
> 
> Has anybody seen this issue as well?
> John, any comments on this failure?

Can you run the test with verbose enabled so we can determine if the tx
side is even sending the message? Sample patch below. This will allow
us to see the tx bytes and rx bytes, although it will be a bit noisy.

I notice that the test program is not smart enough to (re)send bytes if
the sendmsg call doesn't consume all bytes. This is a valid error if
we get a enomem or other normal error on the tx side. With apply this
is more likely because every byte (in apply = 1 case) is being sent
through BPF prog.

If this is not the case I can do some more digging but I've not seen
this before.

Thanks!
John

--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -1031,6 +1031,7 @@ static int test_exec(int cgrp, struct
sockmap_options *opt)
        if (err)
                goto out;

+       opt->verbose = true;
        err = __test_exec(cgrp, SENDPAGE, opt);
 out:
        return err;


> 
> Thanks,
> 
> Yonghong

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ