linux-ext4 - Re: Ext4 corruption with VM images as 3 > drop


Message-Id: <20200321032242.B43A711C06F@d06av25.portsmouth.uk.ibm.com>
Date:   Sat, 21 Mar 2020 08:52:40 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Jan Kara <jack@...e.cz>
Cc:     linux-ext4@...r.kernel.org, "Theodore Y. Ts'o" <tytso@....edu>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>
Subject: Re: Ext4 corruption with VM images as 3 > drop_caches



On 3/20/20 5:19 PM, Jan Kara wrote:
> On Fri 20-03-20 11:04:50, Ritesh Harjani wrote:
>> On 3/19/20 6:54 PM, Ritesh Harjani wrote:
>>> On 3/18/20 9:17 AM, Aneesh Kumar K.V wrote:
>>>> Hi,
>>>>
>>>> With new vm install I am finding corruption with the vm image if I
>>>> follow up the install with echo 3 > /proc/sys/vm/drop_caches
>>>>
>>>> The file system reports below error.
>>>>
>>>> Begin: Running /scripts/local-bottom ... done.
>>>> Begin: Running /scripts/init-bottom ...
>>>> [    4.916017] EXT4-fs error (device vda2): ext4_lookup:1700: inode
>>>> #787185: comm sh: iget: checksum invalid
>>>> done.
>>>> [    5.244312] EXT4-fs error (device vda2): ext4_lookup:1700: inode
>>>> #917954: comm init: iget: checksum invalid
>>>> [    5.257246] EXT4-fs error (device vda2): ext4_lookup:1700: inode
>>>> #917954: comm init: iget: checksum invalid
>>>> /sbin/init: error while loading shared libraries: libc.so.6: cannot
>>>> open shared object file: Error 74
>>>> [    5.271207] Kernel panic - not syncing: Attempted to kill init!
>>>> exitcode=0x00007f00
>>>>
>>>> And debugfs reports
>>>>
>>>> debugfs:  stat <917954>
>>>> Inode: 917954   Type: bad type    Mode:  0000   Flags: 0x0
>>>> Generation: 0    Version: 0x00000000
>>>> User:     0   Group:     0   Size: 0
>>>> File ACL: 0
>>>> Links: 0   Blockcount: 0
>>>> Fragment:  Address: 0    Number: 0    Size: 0
>>>> ctime: 0x00000000 -- Wed Dec 31 18:00:00 1969
>>>> atime: 0x00000000 -- Wed Dec 31 18:00:00 1969
>>>> mtime: 0x00000000 -- Wed Dec 31 18:00:00 1969
>>>> Size of extra inode fields: 0
>>>> Inode checksum: 0x00000000
>>>> BLOCKS:
>>>> debugfs:
>>>>
>>>> Bisecting this finds
>>>> Commit 244adf6426ee31a83f397b700d964cff12a247d3("ext4: make
>>>> dioread_nolock the default")
>>>> as bad. If I revert the same on top of linus
>>>> upstream(fb33c6510d5595144d585aa194d377cf74d31911)
>>>> I don't hit the corruption anymore.
>>>
>>> Tried replicating this and could easily replicate it on Power box.
>>> I tried to reproduce this on x86 too, but could not reproduce on x86.
>>> Now one difference on Power could be that pagesize is 64K and fs
>>> blocksize is 4K.
>>>
>>> The issue looks like the guest qemu image file is not properly written
>>> back, after host does echo 3 > drop_caches. (correct me if this is not
>>> the case).
>>
>> Ok. So I retried this with the "cache=directsync" parameter passed to
>> the drive file. This parameter says it should bypass the host side page
>> cache. With this parameter, I don't see this issue on the Power box.
> 
> OK, so this likely means that there is something hosed in the writeback
> path using unwritten extents when blocksize < pagesize. Maybe we miss some
> conversion of unwritten extent to a written one and thus after dropping
> caches we effectively lose data?
> 

Yes, that seems like it. I will try to create a small test case
with this in mind. I will also go over the unwritten-to-written
conversion path and check what I missed there.

Thanks
ritesh
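
For anyone wanting to check whether a host falls into the blocksize < pagesize case discussed above, here is a minimal Python sketch. It is not from the thread. It assumes that `os.statvfs(path).f_bsize` reflects the filesystem block size, which is the case for ext4 but may not be for other filesystems. The helper names are hypothetical.

```python
import os


def block_and_page_size(path="."):
    """Return (filesystem block size, system page size) in bytes.

    Assumption: f_bsize matches the fs block size (true for ext4).
    """
    page = os.sysconf("SC_PAGE_SIZE")
    block = os.statvfs(path).f_bsize
    return block, page


def has_subpage_blocks(path="."):
    """True when one page spans several filesystem blocks."""
    block, page = block_and_page_size(path)
    return block < page
```

On the Power box described in this thread (64K pages, 4K blocks), `has_subpage_blocks()` called on the ext4 mount would be expected to return True.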





> 
>>> I tried replicating via the test below, but could not reproduce it.
>>>
>>> Any idea what kind of unit test could be written for this?
>>> I am not sure how exactly qemu is writing to its image file.
>>>
>>>
>>> 1. Create 2 files. "mmap-file", "mmap-data".
>>> 2. "mmap-file" is a 2GB sparse file. Then at some random offsets (tried
>>> with both 64KB-aligned and 4KB-aligned offsets), write a
>>> pagesize/blocksize amount of a known data pattern.
>>> 3. These offsets (which are pagesize/blocksize aligned) are recorded into
>>> the "mmap-data" file via normal read/write calls.
>>> 4. Then after we wrote to both files, we munmap the "mmap-file" and
>>> close both of these files.
>>> 5. Then we do echo 3 > drop_caches.
>>> 6. Then in the verify phase, using the offsets written in the "mmap-data"
>>> file, I read the "mmap-file" to verify whether its contents are proper or
>>> not.
>>> With that I could not reproduce this issue.
>>>
>>>
>>> -ritesh
>>>
>>>
>>
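
The six test steps quoted above could be sketched roughly as follows in Python. This is a reconstruction under assumptions, not the test program actually used in the thread. It assumes the pattern writes go through a shared mmap with no explicit msync before munmap. The data pattern, the function names, and the on-disk format of "mmap-data" are all hypothetical. Writing to /proc/sys/vm/drop_caches needs root, so that step is optional here.

```python
import mmap
import os
import random
import struct
import tempfile


def pattern(offset, chunk):
    """Known pattern: the chunk's own offset repeated (chunk must be a multiple of 8)."""
    return struct.pack("<Q", offset) * (chunk // 8)


def write_phase(map_path, data_path, file_size, chunk, count, seed=0):
    """Steps 1-4: write patterns via mmap at random aligned offsets, record them."""
    rng = random.Random(seed)
    offsets = [i * chunk for i in sorted(rng.sample(range(file_size // chunk), count))]
    with open(map_path, "w+b") as f:
        f.truncate(file_size)  # sparse file, no blocks allocated yet
        with mmap.mmap(f.fileno(), file_size) as m:
            for off in offsets:
                m[off:off + chunk] = pattern(off, chunk)
        # leaving the inner with-block unmaps; no explicit msync is done
    with open(data_path, "wb") as d:  # plain write() calls for the offsets
        for off in offsets:
            d.write(struct.pack("<Q", off))
    return offsets


def drop_caches():
    """Step 5: echo 3 > /proc/sys/vm/drop_caches. Returns False if not permitted."""
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
        return True
    except OSError:
        return False


def verify_phase(map_path, data_path, chunk):
    """Step 6: re-read each recorded offset and return the offsets that mismatch."""
    bad = []
    with open(data_path, "rb") as d, open(map_path, "rb") as f:
        while True:
            raw = d.read(8)
            if len(raw) < 8:
                break
            (off,) = struct.unpack("<Q", raw)
            if os.pread(f.fileno(), chunk, off) != pattern(off, chunk):
                bad.append(off)
    return bad


def demo(file_size=4 << 20, chunk=4096, count=16, drop=False):
    """Small self-check in a temp dir (possibly tmpfs, so not an ext4 reproducer)."""
    with tempfile.TemporaryDirectory() as tmp:
        map_path = os.path.join(tmp, "mmap-file")
        data_path = os.path.join(tmp, "mmap-data")
        write_phase(map_path, data_path, file_size, chunk, count)
        if drop:
            drop_caches()
        return verify_phase(map_path, data_path, chunk)
```

To approximate the setup in the thread, one would call `write_phase` with a 2GB `file_size` and a `chunk` of 4096 or 65536 on a file living on the ext4 mount, then `drop_caches()` as root, then `verify_phase`. An empty list from `verify_phase` means every recorded chunk read back intact.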