lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161013231923.j2fidfbtzdp66x3t@thunk.org>
Date:   Thu, 13 Oct 2016 19:19:23 -0400
From:   Theodore Ts'o <tytso@....edu>
To:     Jens Axboe <axboe@...com>
Cc:     Dave Chinner <david@...morbit.com>, linux-ext4@...r.kernel.org,
        fstests@...r.kernel.org, tarasov@...ily.name
Subject: Re: Test generic/299 stalling forever

On Wed, Oct 12, 2016 at 08:39:57PM -0600, Jens Axboe wrote:
> 
> If I just wanted to invoke generic/299 with a setup similar to yours,
> what should I do? I can attempt to reproduce it.

Well, if you're willing to get a Google Compute Engine account (it
will come with a $300 free trial credit, which is plenty for this
exercise), the simplest way to do an exact reproduction would be to
follow the instructions here:

https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

If you use the current default gce-xfstests image, which is
xfstests-201610131131 in the xfstests-cloud project, what I run which
triggers it quite reliably is:

gce-xfstests -C 10 generic/299

On a run that I just did using a 4.8.0+ kernel, it ran 12 times (10
times with the 4k configuration, and then 2 times with the 1k
configuration) before it hung.

FSTESTIMG: xfstests-cloud/xfstests-201610131131
FSTESTVER: e2fsprogs	v1.43.3-30-g8df85fb (Sun, 4 Sep 2016 21:32:35 -0400)
FSTESTVER: fio		fio-2.14-27-gafd2cef (Wed, 12 Oct 2016 08:59:25 -0600)
FSTESTVER: quota	81aca5c (Tue, 12 Jul 2016 16:15:45 +0200)
FSTESTVER: xfsprogs	v4.5.0 (Tue, 15 Mar 2016 15:25:56 +1100)
FSTESTVER: xfstests-bld	de1a881 (Fri, 30 Sep 2016 18:23:15 -0400)
FSTESTVER: xfstests	linux-v3.8-1200-g218a62d (Thu, 29 Sep 2016 23:30:27 -0400)
FSTESTVER: kernel	4.8.0-ext4-14234-gfb44543 #21 SMP Wed Oct 12 23:40:33 EDT 2016 x86_64

So on a system which was hung in this way, I ran "apt-get update ;
apt-get install gdb", and uploaded the build tree (plus source code
for the fio used to build the image) to /root/fio.

root@...tests-201610131148:~# ps axm -o pid,tid,class,pri,psr,pcpu,stat,wchan:20,comm
  PID   TID CLS PRI PSR %CPU STAT WCHAN                COMMAND
	....
19781     - -     -   -  1.6 -    -                    fio
    - 19781 TS   19   0  1.6 Ssl  hrtimer_nanosleep    -
    - 19800 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
    - 19801 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
    - 19802 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
    - 19803 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
	....
root@...tests-201610131148:~# gdb /root/fio/fio
	....
(gdb) dir /root/fio
Source directories searched: /root/fio:$cdir:$cwd
(gdb) attach 19781
Attaching to program: /root/fio/fio, process 19781
	....
(gdb) thread apply all bt

Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fb3de190740 (LWP 19781)):
#0  0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738
#3  0x000000000045d4d3 in run_threads (sk_out=sk_out@...ry=0x0) at backend.c:2268
#4  0x000000000045d80d in fio_backend (sk_out=sk_out@...ry=0x0) at backend.c:2400
#5  0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65

(gdb) thread apply all bt full

Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
        list = {next = 0x7fb3ddfeced0, prev = 0x7fb3ddfeced0}
        td = 0x7fb3c7590610
        io_u = 0x202fac0
        ret = 0
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309
        __res = <optimized out>
        pd = 0x7fb3ddfed700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500339456, -2782054375790187915, 0, 140410502103136, 140410120428088, 
                140410500339456, 2739195402349996661, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
        list = {next = 0x7fb3ddfe4ed0, prev = 0x7fb3ddfe4ed0}
        td = 0x7fb3c7590610
        io_u = 0x203d280
        ret = 0
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309
        __res = <optimized out>
        pd = 0x7fb3ddfe5700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500306688, -2782054375790187915, 0, 140410502103136, 140410120428088, 
                140410500306688, 2739195406644963957, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)):
---Type <return> to continue, or q <return> to quit---
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
        list = {next = 0x7fb3ddfdced0, prev = 0x7fb3ddfdced0}
        td = 0x7fb3c7590610
        io_u = 0x203cf80
        ret = 0
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309
        __res = <optimized out>
        pd = 0x7fb3ddfdd700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500273920, -2782054375790187915, 0, 140410502103136, 140410120428088, 
                140410500273920, 2739195393760062069, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
        list = {next = 0x7fb3ddfd4ed0, prev = 0x7fb3ddfd4ed0}
        td = 0x7fb3c7590610
        io_u = 0x203ccc0
        ret = 0
#2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309
        __res = <optimized out>
        pd = 0x7fb3ddfd5700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500241152, -2782054375790187915, 0, 140410502103136, 140410120428088, 
                140410500241152, 2739195398055029365, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Thread 1 (Thread 0x7fb3de190740 (LWP 19781)):
#0  0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
No locals.
---Type <return> to continue, or q <return> to quit---
#1  0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
        ts = {tv_sec = 0, tv_nsec = 1000000}
#2  0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738
        verify_bytes = 419209216
        elapsed_us = {0, 0, 0}
        td = 0x7fb3c7590610
        sk_out = 0x2039600
        clear_state = 33789440
#3  0x000000000045d4d3 in run_threads (sk_out=sk_out@...ry=0x0) at backend.c:2268
        ret = <optimized out>
        pid = <optimized out>
        this_jobs = 6
        fd = 0x2029720
        map = {0x7fb3c754c000, 0x7fb3c7559ad0, 0x7fb3c75675a0, 0x7fb3c7575070, 0x7fb3c7582b40, 0x7fb3c7590610, 0x0 <repeats 11 times>, 
          0x10000000, 0x0 <repeats 771 times>, 0x7fb3ddf83c1c <check_match+300>, 0x0, 0xd827524, 0x5, 0x7, 0x7fb3de192040, 
          0x7fb3ddf8458e <do_lookup_x+2334>, 0x0, 0x7ffd24b8ea80, 0x7fb3dd2a9d28, 0x7fb3dd2af020, 0x7ffd24b8eb90, 0x3609d4, 
          0x7ffd24b8eb80, 0x0, 0x0, 0x7fb3de1920c0, 0x7fb3de1939d8, 0x406851, 0x7fb3dd2b6d78, 0x401e18, 0x100000000, 0x100000375, 0x0, 
          0x7fb3de1920c0, 0x7ffd24b8ec20, 0x7fb3de19d500, 0x7ffd24b8ec48, 0x7fb3de19d1a8, 0x1, 
          0x7fb3ddf8473d <_dl_lookup_symbol_x+349>, 0x0, 0x7fb3de1920c0, 0x1, 0x0, 0x1, 0x7fb3de19d1a8, 0x0, 0x0, 0x0, 0x0, 0x0, 
          0x7fb3de19d500, 0x7ffd24b8eb90, 0x7ffd24b8eb80, 0xd827524, 0x406851, 0xffffffff, 0x0, 0x7fb3dd2af020, 0x7fb3de1939d8, 
          0x0 <repeats 11 times>, 0x698620 <memcmp@....plt>, 0x0, 0x7ffd24b8ecf8, 0x7ffd24b8ecf8, 0x3, 0x7ffd24b8f6f0, 
          0x7fb3ddf88ecc <_dl_fixup+508>, 0x9000000000001, 0x0, 0x8ff5c28f5c28f, 0x7fb3dd2af020, 0x140, 0x7ffd24b8f6f0, 
          0x7ffd24b91720, 0x7fb3ddf8f2e5 <_dl_runtime_resolve+53>, 0x0, 0x20262a0, 0x2bc4, 0x7ffd24b92530, 0x7ffd24b8f960, 
          0x8ff5c28f5c200, 0x2026270, 0x7fb3de19d1a8, 0x54, 0x46144b <fio_test_cconv+3595>, 0x0, 0x7ffd24b8ece0, 
          0x0 <repeats 16 times>, 0x2026210, 0x0, 0x2026270, 0x0, 0x1, 0x40000000400, 0x1, 0x0, 0x1, 0x100000001, 0x100000000, 0x0, 
          0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x100000001000, 0x1000, 0x0 <repeats 14 times>, 0x1, 0x0, 0x100000000, 0x100000001, 0x0, 0x0, 
          0x0, 0x100000001, 0x400, 0x0 <repeats 87 times>, 0x100000000, 0x100000000, 0x0, 0x89, 0x0 <repeats 20 times>, 0x6400000000, 
          0x6400000064, 0x200000, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1f400000000, 0x1000001f4...}
        this_start = {tv_sec = 0, tv_usec = 0}
        left = <optimized out>
        td = 0x7fb3c7590610
        i = 5
        todo = 6
        nr_running = 0
        m_rate = 0
        t_rate = 0
        nr_started = 6
        spent = <optimized out>
#4  0x000000000045d80d in fio_backend (sk_out=sk_out@...ry=0x0) at backend.c:2400
        td = <optimized out>
        i = <optimized out>
#5  0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65
        ret = 1

> > As I said, I can attach to the hanging fio using a gdb and give you
> > stackdumps for all of the threads if that would be helpful.
> 
> That would definitely be helpful as well, especially if I cannot
> reproduce it.

Here you go!

If I kill -9 the fio process, the tests continue to run, and there are
no kernel complaints, so I'm pretty sure this is a fio issue.

Many thanks!!

   	  	      	     	    	 - Ted

P.S.  In answer to your question from another e-mail message on this
thread, the block device in volved is a Google Compute Engine
Persistent Disk device backed using an SSD.  To a very *rough*
approximation, think about something kinda like a qemu qcow image on
an very fast flash device in terms of performance.  I've seen
the problem before on a 5400 RPM laptop drive, but it's ***much***
rarer.   It reproduces much more easily faster block devices.

P.P.S.   And here's the fio job file that it was running:

###########
# 299 test fio activity
# Filenames derived from jobsname and jobid like follows:
# ..
[global]
ioengine=libaio
bs=128k
directory=/xt-vdc
filesize=5368709120
size=999G
iodepth=128*1
continue_on_error=write
ignore_error=,ENOSPC
error_dump=0
create_on_open=1
fallocate=none
exitall=1

## Perform direct aio, to files which may be truncated
## by external task
[direct_aio]
direct=1
buffered=0
numjobs=4
rw=randwrite
runtime=100*1
time_based

# Perform direct aio and verify data
# This test case should check use-after-free issues
[aio-dio-verifier]
numjobs=1
verify=crc32c-intel
verify_fatal=1
verify_dump=1
verify_backlog=1024
verify_async=4
verifysort=1
direct=1
bs=4k
rw=randrw
filename=aio-dio-verifier

# Perform buffered aio and verify data
# This test case should check use-after-free issues
[buffered-aio-verifier]
numjobs=1
verify=crc32c-intel
verify_fatal=1
verify_dump=1
verify_backlog=1024
verify_async=4
verifysort=1
direct=0
buffered=1
bs=4k
rw=randrw
filename=buffered-aio-verifier
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ