linux-kernel - Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191224012707.GA13083@ming.t460p>
Date:   Tue, 24 Dec 2019 09:27:07 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Andrea Vai <andrea.vai@...pv.it>,
        "Schmid, Carsten" <Carsten_Schmid@...tor.com>,
        Finn Thain <fthain@...egraphics.com.au>,
        Damien Le Moal <Damien.LeMoal@....com>,
        Alan Stern <stern@...land.harvard.edu>,
        Jens Axboe <axboe@...nel.dk>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        USB list <linux-usb@...r.kernel.org>,
        SCSI development list <linux-scsi@...r.kernel.org>,
        Himanshu Madhani <himanshu.madhani@...ium.com>,
        Hannes Reinecke <hare@...e.com>,
        Omar Sandoval <osandov@...com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Greg KH <gregkh@...uxfoundation.org>,
        Hans Holmberg <Hans.Holmberg@....com>,
        Kernel development list <linux-kernel@...r.kernel.org>,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: AW: Slow I/O on USB media after commit
 f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

Hi Ted,

On Mon, Dec 23, 2019 at 02:53:01PM -0500, Theodore Y. Ts'o wrote:
> On Mon, Dec 23, 2019 at 07:45:57PM +0100, Andrea Vai wrote:
> > basically, it's:
> > 
> >   mount UUID=$uuid /mnt/pendrive
> >   SECONDS=0
> >   cp $testfile /mnt/pendrive
> >   umount /mnt/pendrive
> >   tempo=$SECONDS
> > 
> > and it copies one file only. Anyway, you can find the whole script
> > attached.
> 
> OK, so whether we are doing the writeback at the end of cp, or when
> you do the umount, it's probably not going to make any difference.  We
> can get rid of the stack trace in question by changing the script to
> be basically:
> 
> mount UUID=$uuid /mnt/pendrive
> SECONDS=0
> rm -f /mnt/pendrive/$testfile
> cp $testfile /mnt/pendrive
> umount /mnt/pendrive
> tempo=$SECONDS
> 
> I predict if you do that, you'll see that all of the time is spent in
> the umount, when we are trying to write back the file.
> 
> I really don't think then this is a file system problem at all.  It's
> just that USB I/O is slow, for whatever reason.  We'll see a stack
> trace in the writeback code waiting for the I/O to be completed, but
> that doesn't mean that the root cause is in the writeback code or in
> the file system which is triggering the writeback.

Wrt. the slow write on this usb storage, it is caused by two writeback
path, one is the writeback wq, another is from ext4_release_file() which
is triggered from exit_to_usermode_loop().

When the two write path is run concurrently, the sequential write order
is broken, then write performance drops much on this particular usb
storage.

The ext4_release_file() should be run from read() or write() syscall if
Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected behavior
for ext4_release_file() to be run thousands of times when just
running 'cp' once, see comment of ext4_release_file():

	/*
	 * Called when an inode is released. Note that this is different
	 * from ext4_file_open: open gets called at every open, but release
	 * gets called only when /all/ the files are closed.
	 */
	static int ext4_release_file(struct inode *inode, struct file *filp)

> 
> I suspect the next step is use a blktrace, to see what kind of I/O is
> being sent to the USB drive, and how long it takes for the I/O to
> complete.  You might also try to capture the output of "iostat -x 1"
> while the script is running, and see what the difference might be
> between a kernel version that has the problem and one that doesn't,
> and see if that gives us a clue.

That isn't necessary, given we have concluded that the bad write
performance is caused by broken write order.

> 
> > > And then send me
> > btw, please tell me if "me" means only you or I cc: all the
> > recipients, as usual
> 
> Well, I don't think we know what the root cause is.  Ming is focusing
> on that stack trace, but I think it's a red herring.....  And if it's
> not a file system problem, then other people will be best suited to
> debug the issue.

So far, the reason points to the extra writeback path from exit_to_usermode_loop().
If it is not from close() syscall, the issue should be related with file reference
count. If it is from close() syscall, the issue might be in 'cp''s
implementation.

Andrea, please collect the following log or the strace log requested by Ted, then
we can confirm if the extra writeback is from close() or read/write() syscall:

# pass PID of 'cp' to this script
#!/bin/sh
PID=$1
/usr/share/bcc/tools/trace -P $PID  -t -C \
    't:block:block_rq_insert "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
    't:syscalls:sys_exit_close ' \
    't:syscalls:sys_exit_read ' \
    't:syscalls:sys_exit_write '


Thanks,
Ming