Date:	Sat, 16 May 2015 13:13:32 -0700
From:	santosh shilimkar <santosh.shilimkar@...cle.com>
To:	Ming Lei <ming.lei@...onical.com>, Jens Axboe <axboe@...com>
CC:	Christoph Hellwig <hch@....de>, linux-kernel@...r.kernel.org
Subject: [Regression] Guest fs corruption with 'block: loop: improve performance
 via blk-mq'

Hi Ming Lei, Jens,

While running a few tests with recent kernels on Xen Server,
we saw guest (DomU) disk images getting corrupted while booting.
Strangely, the issue has so far been seen only with disk images on
an ocfs2 volume; if the same image is kept on an EXT3/4 drive, no
corruption is observed. The issue is easily reproducible: you see a
flurry of errors while the guest is mounting its file systems.

After some debugging and bisecting, we narrowed the issue down to
commit b5dd2f6 ("block: loop: improve performance via blk-mq"). With
that commit reverted, the corruption goes away.

Some more details on the test setup (the Dom0 side is sketched below):
1. Upgrade the OVM (Xen) Server kernel (Dom0) to a more recent kernel
   that includes commit b5dd2f6, then boot the server.
2. Create an ocfs2 volume from the Dom0 file system.
3. Keep the guest (VM) disk image on the ocfs2 volume.
4. Boot the guest image (xm create vm.cfg).
5. Observe the VM boot console log. The VM itself uses an EXT3 fs.
   You will see errors like those below, and after this boot the file
   system/disk image is corrupted and mostly won't boot next time.
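
A rough sketch of those Dom0 commands; the device name, mount point,
and image path are placeholders, and the o2cb cluster stack is assumed
to already be configured for the ocfs2 mount:

  # Dom0 side of the repro; device names and paths are placeholders
  mkfs.ocfs2 /dev/sdb1                 # step 2: create the ocfs2 volume
  mount -t ocfs2 /dev/sdb1 /ocfs2
  cp guest-disk.img /ocfs2/vm/         # step 3: guest image on ocfs2
  xm create vm.cfg                     # step 4: boot the guest
  xm console <domain>                  # step 5: watch the boot log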

Trimmed Guest kernel boot log...
--->
EXT3-fs (dm-0): using internal journal
EXT3-fs: barriers not enabled
kjournald starting.  Commit interval 5 seconds
EXT3-fs (xvda1): using internal journal
EXT3-fs (xvda1): mounted filesystem with ordered data mode
Adding 1048572k swap on /dev/VolGroup00/LogVol01.  Priority:-1 extents:1 across:1048572k

[...]

EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 804966: bad block 843250
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394

[...]

EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394

[...]

EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #777661: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

[...]

automount[2605]: segfault at 4 ip b7756dd6 sp b6ba8ab0 error 4 in ld-2.5.so[b774c000+1b000]
EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 34, block = 1114112
EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 0, block = 221
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 709252: bad block 370280
ntpd[2691]: segfault at 2563352a ip b77e5000 sp bfe27cec error 6 in ntpd[b777d000+74000]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #618360: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #709178: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 368277: bad block 372184
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620393
--------------------

From debugging the actual data on disk vs. what is read by the guest
VM, we suspect the *reads* are not actually going all the way to disk
and are possibly returning wrong data: the actual data on the ocfs2
volume at those locations is non-zero, whereas the guest reads it as
zero.
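
One way to spot-check this is to dump the same block from the backing
image on Dom0 and from inside the guest and compare. A rough sketch,
using the bad block 843250 from the log above; the image path is a
placeholder, a 4K ext3 block size is assumed, and IMG_OFF (the byte
offset of the dm-0 volume inside the image) has to be worked out from
the partition/LVM layout:

  # Dom0: raw bytes at the suspect block in the backing image
  dd if=/ocfs2/vm/guest-disk.img bs=4096 \
     skip=$((IMG_OFF / 4096 + 843250)) count=1 2>/dev/null | hexdump -C | head
  # Guest: the same block as seen through dm-0
  dd if=/dev/dm-0 bs=4096 skip=843250 count=1 2>/dev/null | hexdump -C | head

Non-zero bytes on the Dom0 side with an all-zero dump inside the guest
would match the failed-read theory.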

I have tried a few experiments, without much success so far. One thing
I suspected was that requests are now submitted to the backing
file/device concurrently, so I tried moving the submissions under
lo->lo_lock so that they get serialized. I also moved
blk_mq_start_request() into the actual work function, as in the patch
below. Neither helped. I'm reporting the issue to get more ideas on
what could be going wrong. Thanks in advance for any help!

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 39a83c2..22713b2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1480,20 +1480,17 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct loop_device *lo = cmd->rq->q->queuedata;
 
-	blk_mq_start_request(bd->rq);
-
+	spin_lock_irq(&lo->lo_lock);
 	if (cmd->rq->cmd_flags & REQ_WRITE) {
-		struct loop_device *lo = cmd->rq->q->queuedata;
 		bool need_sched = true;
 
-		spin_lock_irq(&lo->lo_lock);
 		if (lo->write_started)
 			need_sched = false;
 		else
 			lo->write_started = true;
 		list_add_tail(&cmd->list, &lo->write_cmd_head);
-		spin_unlock_irq(&lo->lo_lock);
 
 		if (need_sched)
 			queue_work(loop_wq, &lo->write_work);
@@ -1501,6 +1498,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		queue_work(loop_wq, &cmd->read_work);
 	}
 
+	spin_unlock_irq(&lo->lo_lock);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
@@ -1517,6 +1515,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
 		goto failed;
 
+	blk_mq_start_request(cmd->rq);
+
 	ret = 0;
 	__rq_for_each_bio(bio, cmd->rq)
 		ret |= loop_handle_bio(lo, bio);
-- 
1.7.1

Regards,
Santosh
