Date:	Wed, 21 May 2008 00:19:30 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Hisashi Hifumi <hifumi.hisashi@....ntt.co.jp>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment

On Wed, 21 May 2008 15:52:04 +0900 Hisashi Hifumi <hifumi.hisashi@....ntt.co.jp> wrote:

> Hi.
> 
> When we read part of a file through the pagecache, if there is a page in
> the pagecache at the corresponding index but that page is not uptodate,
> read IO is issued and the page is brought uptodate.
> I think this is fine in a pagesize == blocksize environment, but there is
> room for improvement when pagesize != blocksize. In that case a page can
> have multiple buffers, and even if the page is not uptodate, some of its
> buffers can be uptodate. So I suggest that when all the buffers covering
> the part of the file we want to read are uptodate, we use that page and
> copy the data from it to the user buffer even if the page itself is not
> uptodate. This can reduce read IO and improve system throughput.

I suppose that makes sense.

> I did a performance test using the sysbench.

That's not a terribly good benchmark, IMO.  It's too complex.

To work out the best-case for a change like this I'd suggest a
microbenchmark which does something such as seeking all around a file
doing single-byte reads.
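
For the best case, something along these lines would do (untested sketch;
the file path and size are placeholders I've made up):

/* Best-case sketch: single-byte reads at random offsets all around one
 * file.  /tmp/testfile and its size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const off_t size = 1 << 30;		/* pretend the file is 1GB */
	int i, fd = open("/tmp/testfile", O_RDONLY);
	char c;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	srand(1);				/* fixed seed, repeatable runs */
	for (i = 0; i < 1000000; i++) {
		if (pread(fd, &c, 1, rand() % size) < 0) {
			perror("pread");
			return 1;
		}
	}
	close(fd);
	return 0;
}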

Then one should think up a benchmark which demonstrates the worst-case,
such as reading one-byte-quantities from a file at offsets 0, 0x2000,
0x4000, 0x6000, ...  and then read more one-byte-quantities at offsets
0x1000, 0x3000, 0x5000, etc.  That would be a pretty cruel comparison,
but as one tosses in more such artificial workloads, one is in a better
position to work out whether the change is an aggregate benefit.
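
In code, the cruel case could look something like this (again untested,
and the path and file size are made up):

/* Worst-case sketch: one-byte reads at offsets 0, 0x2000, 0x4000, ...,
 * then a second pass at 0x1000, 0x3000, 0x5000, ...
 * /tmp/testfile and its size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const off_t size = 1 << 30;		/* pretend the file is 1GB */
	int fd = open("/tmp/testfile", O_RDONLY);
	off_t off;
	char c;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (off = 0; off < size; off += 0x2000)	/* first pass */
		pread(fd, &c, 1, off);
	for (off = 0x1000; off < size; off += 0x2000)	/* second pass */
		pread(fd, &c, 1, off);
	close(fd);
	return 0;
}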

The results from a great big lumped-together benchmark such as sysbench 
aren't a lot of use to us in predicting how effective this change will
be across all the workloads which the kernel has to serve.

> @@ -932,8 +932,16 @@ find_page:
>  					ra, filp, page,
>  					index, last_index - index);
>  		}
> -		if (!PageUptodate(page))
> -			goto page_not_up_to_date;
> +		if (!PageUptodate(page)) {
> +			if (inode->i_blkbits == PAGE_CACHE_SHIFT)
> +				goto page_not_up_to_date;
> +			if (TestSetPageLocked(page))
> +				goto page_not_up_to_date;
> +			if (!page_has_buffers(page) ||
> +			      !check_buffers_uptodate(offset, desc, page))

We shouldn't do this.

> +				goto page_not_up_to_date_locked;
> +			unlock_page(page);
> +		}

See, the code which you have here is assuming that if PagePrivate is
set, then the thing which is at page.private is a ring of buffer_heads.

But this code (do_generic_file_read) doesn't know that!  Take a look at
afs, nfs, perhaps other filesystems, grep for set_page_private().

Only the address_space implementation (ie: the filesystem) knows
whether page.private holds buffer_heads and only the
address_space_operations functions are allowed to call into library
functions which treat page.private as a buffer_head ring.

Now, your code _may_ not crash, because perhaps there is no filesystem
which puts something else into page.private which also uses
do_generic_file_read().  But it's still wrong.

I guess a suitable fix might be to implement the above using a new
address_space_operations callback:

	if (PagePrivate(page) && aops->is_partially_uptodate) {
		if (aops->is_partially_uptodate(page, desc, offset))
			goto page_ok;	/* OK, we can copy the data */
	}

then implement a generic_file_is_partially_uptodate() in fs/buffer.c
and wire that up in the filesystems.
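
Roughly (untested, and the exact signature and locking would need to be
sorted out), the fs/buffer.c helper could walk the page's buffer ring and
check that every buffer overlapping the requested range is uptodate:

/* Untested sketch of the suggested helper.  Assumes the caller holds the
 * page lock so the buffer ring can't change underneath us. */
int generic_file_is_partially_uptodate(struct page *page,
			read_descriptor_t *desc, unsigned long from)
{
	struct inode *inode = page->mapping->host;
	unsigned blocksize = 1 << inode->i_blkbits;
	unsigned block_start = 0, block_end;
	unsigned to = from + min_t(unsigned, PAGE_CACHE_SIZE - from,
				   desc->count);
	struct buffer_head *bh, *head;

	if (!page_has_buffers(page))
		return 0;

	head = bh = page_buffers(page);
	do {
		block_end = block_start + blocksize;
		/* only buffers overlapping [from, to) need to be uptodate */
		if (block_end > from && block_start < to &&
		    !buffer_uptodate(bh))
			return 0;
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head && block_start < to);

	return 1;
}

Then do_generic_file_read() never has to know what page.private holds; a
filesystem which keeps something other than buffer_heads there simply
doesn't set the callback.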

Note that things like network filesystems can then implement this also.
