linux-ext4 - Re: Problem with direct IO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOOPZo7DfbwO5tmpbpNw7T9AgN7ALDc2S8N+0TsDnvEqMZzMmg@mail.gmail.com>
Date:   Thu, 21 Oct 2021 20:11:43 +0800
From:   Zhengyuan Liu <liuzhengyuang521@...il.com>
To:     Jan Kara <jack@...e.cz>
Cc:     viro@...iv.linux.org.uk, Andrew Morton <akpm@...ux-foundation.org>,
        tytso@....edu, linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        linux-ext4@...r.kernel.org,
        刘云 <liuyun01@...inos.cn>,
        Zhengyuan Liu <liuzhengyuan@...inos.cn>
Subject: Re: Problem with direct IO

On Thu, Oct 21, 2021 at 4:03 PM Jan Kara <jack@...e.cz> wrote:
>
> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
> > On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@...e.cz> wrote:
> > > On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > > > we are encounting following Mysql crash problem while importing tables :
> > > >
> > > >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> > > >     fsync() returned EIO, aborting.
> > > >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> > > >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> > > >
> > > > At the same time , we found dmesg had following message:
> > > >
> > > >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> > > >     Possible data corruption due to collision with buffered I/O!
> > > >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> > > >     625 Comm: kworker/42:1
> > > >
> > > > Firstly, we doubled Mysql has operating the file with direct IO and
> > > > buffered IO interlaced, but after some checking we found it did only
> > > > do direct IO using aio. The problem is exactly from direct-io
> > > > interface (__generic_file_write_iter) itself.
> > > >
> > > > ssize_t __generic_file_write_iter()
> > > > {
> > > > ...
> > > >         if (iocb->ki_flags & IOCB_DIRECT) {
> > > >                 loff_t pos, endbyte;
> > > >
> > > >                 written = generic_file_direct_write(iocb, from);
> > > >                 /*
> > > >                  * If the write stopped short of completing, fall back to
> > > >                  * buffered writes.  Some filesystems do this for writes to
> > > >                  * holes, for example.  For DAX files, a buffered write will
> > > >                  * not succeed (even if it did, DAX does not handle dirty
> > > >                  * page-cache pages correctly).
> > > >                  */
> > > >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > > >                         goto out;
> > > >
> > > >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > > ...
> > > > }
> > > >
> > > > From above code snippet we can see that direct io could fall back to
> > > > buffered IO under certain conditions, so even Mysql only did direct IO
> > > > it could interleave with buffered IO when fall back occurred. I have
> > > > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > > > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > > > seems  breaking the semantics of direct IO.
> > > >
> > > > The reproduced  environment is:
> > > > Platform:  Kunpeng 920 (arm64)
> > > > Kernel: V5.15-rc
> > > > PAGESIZE: 64K
> > > > Mysql:  V8.0
> > > > Innodb_page_size: default(16K)
> > >
> > > Thanks for report. I agree this should not happen. How hard is this to
> > > reproduce? Any idea whether the fallback to buffered IO happens because
> > > iomap_dio_rw() returns -ENOTBLK or because it returns short write?
> >
> > It is easy to reproduce in my test environment, as I said in the previous
> > email replied to Andrew this problem is related to kernel page size.
>
> Ok, can you share a reproducer?

I don't have a simple test case to reproduce, the whole procedure shown as
following is somewhat complex.

1. Prepare Mysql installation environment
a.  Prepare a SSD partition  (at least 100G) as the Mysql data
partition, format to Ext3 and mount to /data
# mkfs.ext3 /dev/sdb1
# mount /dev/sdb1 /data
b. Create Mysql user and user group
# groupadd mysql
# useradd -g mysql mysql
c. Create Mysql directory
# mkdir -p /data/mysql
# cd /data/mysql
#  mkdir data tmp run log

2. Install Mysql
a. Download mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar from
https://downloads.mysql.com/archives/community/
b. Install Mysql
#  tar -xvf mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar
# yum install openssl openssl-devel
# rpm -ivh mysql-community-common-8.0.25-1.el8.aarch64.rpm
mysql-community-client-plugins-8.0.25-1.el8.aarch64.rpm \
mysql-community-libs-8.0.25-1.el8.aarch64.rp
mysql-community-client-8.0.25-1.el8.aarch64.rpm \
 mysql-community-server-8.0.25-1.el8.aarch64.rpm
mysql-community-devel-8.0.25-1.el8.aarch64.rpm

3. Configure Mysql
a. # chown mysql:mysql /etc/my.cnf
b. # vim /etc/my.cnf
innodb_flush_method = O_DIRECT
default-storage-engine=INNODB
datadir=/data/mysql/data
socket=/data/mysql/run/mysql.sock
tmpdir=/data/mysql/tmp
log-error=/data/mysql/log/mysqld.log
pid-file=/data/mysql/run/mysqld.pid
port=3306
user=mysql
c. initialize Mysql (problem may reproduce at this stage)
# mysqld --defaults-file=/etc/my.cnf --initialize
d. Start Mysql
# mysqld --defaults-file=/etc/my.cnf &
e. Login into Mysql
# mysql -uroot -p -S /data/mysql/run/mysql.sock
You can see the temporary password from step 3.c
f. Configure access
mysql> alter user 'root'@...calhost' identified by "123456";
mysql> create user 'root'@'%' identified by '123456';
mysql> grant all privileges on *.* to 'root'@'%'; flush privileges;
mysql> create database sysbench;

4. Use sysbench to test Mysql
a. Install sysbench from https://github.com/akopytov/sysbench/archive/master.zip
b. Use following script to reproduce problem (may need dozens of minutes)
    while true ; do
      sysbench /usr/local/share/sysbench/oltp_write_only.lua
--table-size=1000000 --tables=100 \
      --threads=32 --db-driver=mysql --mysql-db=sysbench
--mysql-host=127.0.0.1 --mysql- port=3306 \
      --mysql-user=root --mysql-password=123456
--mysql-socket=/var/lib/mysql/mysql.sock  prepare

      sleep 5
      sysbench /usr/local/share/sysbench/oltp_write_only.lua
--table-size=1000000 --tables=100 \
      --threads=32 --db-driver=mysql --mysql-db=sysbench
--mysql-host=127.0.0.1 --mysql- port=3306 \
      --mysql-user=root --mysql-password=123456
--mysql-socket=/var/lib/mysql/mysql.sock  cleanup

      sleep 5
  done

If you can't reproduce, we could provide a remote environment for you or
connect to your machine to build a reproduced environment.

> > > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > > problem happens? Thanks!
> >
> > Sure, the output is:
> >
> > # dumpe2fs -h /dev/sda3
> > dumpe2fs 1.45.3 (14-Jul-2019)
> > Filesystem volume name:   <none>
> > Last mounted on:          /data
> > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > Filesystem magic number:  0xEF53
> > Filesystem revision #:    1 (dynamic)
> > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > filetype needs_recovery sparse_super large_file
>
> Thanks for the data. OK, a filesystem without extents. Does your test by
> any chance try to do direct IO to a hole in a file? Because that is not
> (and never was) supported without extents. Also the fact that you don't see
> the problem with ext4 (which means extents support) would be pointing in
> that direction.

I am not sure if it trys to do direct IO to a hole or not, is there
any way to check?
If you have a simple test to reproduce please let me know, we are glad to try.

Thanks,