linux-ext4 - Re: [BUG][ext2] XIP does not work on ext2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOvWMLa38zOigzVkcF78ivtszd6F02aNKsB28=Sd58OMeqb9sQ@mail.gmail.com>
Date:	Thu, 7 Nov 2013 12:14:13 -0800
From:	Andiry Xu <andiry@...il.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Wang Shilong <wangsl-fnst@...fujitsu.com>,
	linux-kernel@...r.kernel.org, linux-ext4@...r.kernel.org,
	Andiry Xu <andiry.xu@...il.com>
Subject: Re: [BUG][ext2] XIP does not work on ext2

On Wed, Nov 6, 2013 at 1:18 PM, Jan Kara <jack@...e.cz> wrote:
> On Tue 05-11-13 17:28:35, Andiry Xu wrote:
>> Hi,
>>
>> On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara <jack@...e.cz> wrote:
>> >   Hello,
>> >
>> > On Mon 04-11-13 18:37:40, Andiry Xu wrote:
>> >> On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara <jack@...e.cz> wrote:
>> >> >   Hello,
>> >> >
>> >> > On Mon 04-11-13 14:31:34, Andiry Xu wrote:
>> >> >> When I'm trying XIP on ext2, I find that xip does not work on ext2
>> >> >> with latest kernel.
>> >> >>
>> >> >> Reproduce steps:
>> >> >> Compile kernel with following configs:
>> >> >> CONFIG_BLK_DEV_XIP=y
>> >> >> CONFIG_EXT2_FS_XIP=y
>> >> >>
>> >> >> And run following commands:
>> >> >> # mke2fs -b 4096 /dev/ram0
>> >> >> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/
>> >> >> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16
>> >> >>
>> >> >> And it shows:
>> >> >> dd: writing `/mnt/ramdisk/test1': No space left on device
>> >> >>
>> >> >> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a
>> >> >> 16MB write should only occupy 1/4 capacity.
>> >> >>
>> >> >> Criminal commit:
>> >> >> After git bisect, it points to the following commit:
>> >> >> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b
>> >> >> Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
>> >> >   Thanks for report and the bisection!
>> >> >
>> >> >> Particularly, the following code:
>> >> >> @@ -1412,9 +1415,11 @@ allocated:
>> >> >> *errp = 0;
>> >> >> brelse(bitmap_bh);
>> >> >> -    dquot_free_block_nodirty(inode, *count-num);
>> >> >> -    mark_inode_dirty(inode);
>> >> >> -    *count = num;
>> >> >> +    if (num < *count) {
>> >> >> +        dquot_free_block_nodirty(inode, *count-num);
>> >> >> +        mark_inode_dirty(inode);
>> >> >> +        *count = num;
>> >> >> +    }
>> >> >>       return ret_block;
>> >> >>
>> >> >> Not mark_inode_dirty() is called only when num is less than *count.
>> >> >> However, I've seen
>> >> >> with the dd command, there is case where num >= *count.
>> >> >>
>> >> >> Fix:
>> >> >> I've verified that the following patch fixes the issue:
>> >> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
>> >> >> index 9f9992b..5446a52 100644
>> >> >> --- a/fs/ext2/balloc.c
>> >> >> +++ b/fs/ext2/balloc.c
>> >> >> @@ -1406,11 +1406,10 @@ allocated:
>> >> >>
>> >> >>         *errp = 0;
>> >> >>         brelse(bitmap_bh);
>> >> >> -       if (num < *count) {
>> >> >> +       if (num <= *count)
>> >> >>                 dquot_free_block_nodirty(inode, *count-num);
>> >> >> -               mark_inode_dirty(inode);
>> >> >> -               *count = num;
>> >> >> -       }
>> >> >> +       mark_inode_dirty(inode);
>> >> >> +       *count = num;
>> >> >>         return ret_block;
>> >> >>
>> >> >>  io_error:
>> >> >>
>> >> >> However, I'm not familiar with ext2 source code and cannot tell if
>> >> >> this is the correct fix. At least it fixes my issue.
>> >> >   With this, you have essentially reverted a hunk from commit
>> >> > 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it
>> >> > should be reverted. num should never ever be greater than *count and when
>> >> > num == count, we the code inside if doesn't do anything useful.
>> >> >
>> >> > I've looked into the code and I think I see the problem. It is a long
>> >> > standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls
>> >> > ext2_get_block() asking for 0 blocks to map (while we really want 1 block).
>> >> > ext2_get_block() just passes that request and ext2_get_blocks() actually
>> >> > allocates 1 block. And that's were the commit you have identified makes a
>> >> > difference because previously we returned that 1 block was allocated while
>> >> > now we return that 0 blocks were allocated and thus allocation is repeated
>> >> > until all free blocks are exhaused.
>> >> >
>> >> > Attached patch should fix the problem.
>> >> >
>> >>
>> >> Thanks for the reply. I've verified that your patch fixes my issue.
>> >> And it's absolutely better than my solution.
>> >>
>> >> Tested-by: Andiry Xu <andiry.xu@...il.com>
>> >   Thanks for testing!
>> >
>> >> I have another question about ext2 XIP performance, although it's not
>> >> quite related to this thread.
>> >>
>> >> I'm testing xip with ext2 on a ram disk drive, the driver is brd.c.
>> >> The RAM disk size is 2GB and I pre-fill it to guarantee that all pages
>> >> reside in main memory.
>> >> Then I use two different applications to write to the ram disk. One is
>> >> open() with O_DIRECT flag, and writing with Posix write(). Another is
>> >> open() with O_DIRECT, mmap() it to user space, then use memcpy() to
>> >> write data. I use different request size to write data, from 512 bytes
>> >> to 64MB.
>> >>
>> >> In my understanding, the mmap version bypasses the file system and
>> >> does not go to kernel space, hence it should have better performance
>> >> than the Posix-write version. However, my test result shows it's not
>> >> always true: when the request size is between 8KB and 1MB, the
>> >> Posix-write() version has bandwidth about 7GB/s and mmap version only
>> >> has 5GB/s. The test is performed on a i7-3770K machine with 8GB
>> >> memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap
>> >> has really bad performance, only 2GB/s for all request sizes.
>> >>
>> >> Do you know the reason why write() outperforms mmap() in some cases? I
>> >> know it's not related the thread but I really appreciate if you can
>> >> answer my question.
>> >   Well, I'm not completely sure. mmap()ed memory always works on page-by-page
>> > basis - you first access the page, it gets faulted in and you can further
>> > access it. So for small (sub page size) accesses this is a win because you
>> > don't have an overhead of syscall and fs write path. For accesses larger
>> > than page size the overhead of syscall and some initial checks is well
>> > hidden by other things. I guess write() ends up being more efficient
>> > because write path taken for each page is somewhat lighter than full page
>> > fault. But you'd need to look into perf data to get some hard numbers on
>> > where the time is spent.
>> >
>>
>> Thanks for the reply. However I have filled up the whole RAM disk
>> before doing the test, i.e. asked the brd driver to allocate all the
>> pages initially.
>   Well, pages in ramdisk are always present, that's not an issue. But you
> will get a page fault to map a particular physical page in process'
> virtual address space when you first access that virtual address in the
> mapping from the process. The cost of setting up this virtual->physical
> mapping is what I'm talking about.
>

Yes, you are right, there are page faults observed with perf. I
misunderstood page fault as copying pages between backing store and
physical memory.

> If you had a process which first mmaps the file and writes to all pages in
> the mapping and *then* measure the cost of another round of writing to the
> mapping, I would expect you should see speeds close to those of memory bus.
>

I've tried this as well. mmap() performance improves but still not as
good as write().
I used the perf report to compare write() and mmap() applications. For
write() version, top of perf report shows as:
33.33%  __copy_user_nocache
4.72%    ext2_get_blocks
4.42%    mutex_unlock
3.59%    __find_get_block

which looks reasonable.

However, for mmap() version, the perf report looks strange:
94.98% libc-2.15.so       [.] 0x000000000014698d
2.25%   page_fault
0.18%   handle_mm_fault

I don't know what the first item is but it took the majority of cycles.

>> Moreover I have done the test on PMFS, a file system
>> for persistent memory, and the result is the same. PMFS reserves some
>> physical memory during system boot and then use it to emulate the
>> persistent memory device, so there shouldn't be any page fault.
>   Again I don't think page faults would be avoided here.
>

Yes, page faults cannot be avoided in this case as well.

Thanks,
Andiry
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html