linux-ext4 - Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0094198b6c3382ee2efbd4431e4ad1bfb8cef269.camel@unipv.it>
Date:   Tue, 24 Dec 2019 09:51:16 +0100
From:   Andrea Vai <andrea.vai@...pv.it>
To:     Ming Lei <ming.lei@...hat.com>, "Theodore Y. Ts'o" <tytso@....edu>
Cc:     "Schmid, Carsten" <Carsten_Schmid@...tor.com>,
        Finn Thain <fthain@...egraphics.com.au>,
        Damien Le Moal <Damien.LeMoal@....com>,
        Alan Stern <stern@...land.harvard.edu>,
        Jens Axboe <axboe@...nel.dk>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        USB list <linux-usb@...r.kernel.org>,
        SCSI development list <linux-scsi@...r.kernel.org>,
        Himanshu Madhani <himanshu.madhani@...ium.com>,
        Hannes Reinecke <hare@...e.com>,
        Omar Sandoval <osandov@...com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Greg KH <gregkh@...uxfoundation.org>,
        Hans Holmberg <Hans.Holmberg@....com>,
        Kernel development list <linux-kernel@...r.kernel.org>,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: AW: Slow I/O on USB media after commit
 f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Il giorno mar, 24/12/2019 alle 09.27 +0800, Ming Lei ha scritto:
> Hi Ted,
> 
> On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> > On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > > basically, it's:
> > > 
> > >   mount UUID=$uuid /mnt/pendrive
> > >   SECONDS=0
> > >   cp $testfile /mnt/pendrive
> > >   umount /mnt/pendrive
> > >   tempo=$SECONDS
> > > 
> > > and it copies one file only. Anyway, you can find the whole
> script
> > > attached.
> > 
> > OK, so whether we are doing the writeback at the end of cp, or
> when
> > you do the umount, it's probably not going to make any
> difference.  We
> > can get rid of the stack trace in question by changing the script
> to
> > be basically:
> > 
> > mount UUID=$uuid /mnt/pendrive
> > SECONDS=0
> > rm -f /mnt/pendrive/$testfile
> > cp $testfile /mnt/pendrive
> > umount /mnt/pendrive
> > tempo=$SECONDS
> > 
> > I predict if you do that, you'll see that all of the time is spent
> in
> > the umount, when we are trying to write back the file.
> > 
> > I really don't think then this is a file system problem at
> all.  It's
> > just that USB I/O is slow, for whatever reason.  We'll see a stack
> > trace in the writeback code waiting for the I/O to be completed,
> but
> > that doesn't mean that the root cause is in the writeback code or
> in
> > the file system which is triggering the writeback.
> 
> Wrt. the slow write on this usb storage, it is caused by two
> writeback
> path, one is the writeback wq, another is from ext4_release_file()
> which
> is triggered from exit_to_usermode_loop().
> 
> When the two write path is run concurrently, the sequential write
> order
> is broken, then write performance drops much on this particular usb
> storage.
> 
> The ext4_release_file() should be run from read() or write() syscall
> if
> Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected
> behavior
> for ext4_release_file() to be run thousands of times when just
> running 'cp' once, see comment of ext4_release_file():
> 
> 	/*
> 	 * Called when an inode is released. Note that this is
> different
> 	 * from ext4_file_open: open gets called at every open, but
> release
> 	 * gets called only when /all/ the files are closed.
> 	 */
> 	static int ext4_release_file(struct inode *inode, struct file
> *filp)
> 
> > 
> > I suspect the next step is use a blktrace, to see what kind of I/O
> is
> > being sent to the USB drive, and how long it takes for the I/O to
> > complete.  You might also try to capture the output of "iostat -x
> 1"
> > while the script is running, and see what the difference might be
> > between a kernel version that has the problem and one that
> doesn't,
> > and see if that gives us a clue.
> 
> That isn't necessary, given we have concluded that the bad write
> performance is caused by broken write order.
> 
> > 
> > > > And then send me
> > > btw, please tell me if "me" means only you or I cc: all the
> > > recipients, as usual
> > 
> > Well, I don't think we know what the root cause is.  Ming is
> focusing
> > on that stack trace, but I think it's a red herring.....  And if
> it's
> > not a file system problem, then other people will be best suited
> to
> > debug the issue.
> 
> So far, the reason points to the extra writeback path from
> exit_to_usermode_loop().
> If it is not from close() syscall, the issue should be related with
> file reference
> count. If it is from close() syscall, the issue might be in 'cp''s
> implementation.
> 
> Andrea, please collect the following log or the strace log requested
> by Ted, then
> we can confirm if the extra writeback is from close() or
> read/write() syscall:
> 
> # pass PID of 'cp' to this script
> #!/bin/sh
> PID=$1
> /usr/share/bcc/tools/trace -P $PID  -t -C \
>     't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector,
> args->nr_sector' \
>     't:syscalls:sys_exit_close ' \
>     't:syscalls:sys_exit_read ' \
>     't:syscalls:sys_exit_write '

Meanwhile, I tried to run the test and obtained an error (...usage:
trace [-h] [-b BUFFER_PAGES] [-p PID]...), so assumed the "-P" should
be "-p", corrected and obtained the attached log with ext4 and a slow
copy (2482 seconds) by doing:

- start the test
- look at the cp pid
- run the trace
- wait for the test to finish
- stop the trace.

Thanks,
Andrea

Download attachment "20191224_test_ming.zip" of type "application/zip" (20830 bytes)