Date:   Sat, 24 Jul 2021 12:12:36 -0700
From:   "Andres Freund" <andres@...razel.de>
To:     "Matthew Wilcox" <willy@...radead.org>
Cc:     "James Bottomley" <James.Bottomley@...senpartnership.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org,
        "Linus Torvalds" <torvalds@...ux-foundation.org>,
        "Andrew Morton" <akpm@...ux-foundation.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        "Christoph Hellwig" <hch@....de>,
        "Michael Larabel" <Michael@...haellarabel.com>
Subject: Re: Folios give an 80% performance win

Hi,

On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > > Well, I cut the previous question deliberately, but if you're going to
> > > force me to answer, my experience with storage tells me that one test
> > > being 10x different from all the others usually indicates a problem
> > > with the benchmark test itself rather than a baseline improvement, so
> > > I'd wait for more data.
> > 
> > I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall, MariaDB also uses buffered IO for its WAL (but there was recent work in that area).
> > 
> > Is there a reason fdatasync() of 16MB files would have gotten a lot faster? Or a chance it could be broken?
> > 
> > Some improvement for read-only wouldn't surprise me, particularly if the OS/PG weren't configured for explicit huge pages. Pgbench has a uniform access distribution, so it's *very* TLB-miss heavy with 4k pages.
> 
> It's going to depend substantially on the access pattern.  If the 16MB
> file (oof, that's tiny!) was read in large chunks, or even in small but
> consecutive chunks, the folio changes will allocate larger pages
> (16k, 64k, 256k, ...).  Theoretically it might get up to 2MB pages and
> start using PMDs, but I've never seen that in my testing.

The 16MB files are just for the WAL/journal, and are write-only in a benchmark like this. With pgbench the WAL is written in small consecutive chunks (a few pages at a time, one batch per group commit). Each page is only written once; after a checkpoint the entire file is "recycled" (renamed to a future position in the WAL stream) and reused from the start.

The data files are 1GB.
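
To make that concrete, the segment lifecycle looks roughly like the sketch below. This is not PostgreSQL source, just a simplified standalone illustration; the paths and segment names are made up:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)     /* one 16MB WAL segment */

int main(void)
{
    /* Stand-in for a fully written, pre-checkpoint segment. */
    int fd = open("/tmp/wal-0001", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0 || ftruncate(fd, SEG_SIZE) != 0) {
        perror("create segment");
        return EXIT_FAILURE;
    }
    close(fd);

    /* After the checkpoint the segment's contents are no longer needed:
     * keep the allocated file and just move it to a name further ahead
     * in the WAL sequence... */
    if (rename("/tmp/wal-0001", "/tmp/wal-0009") != 0) {
        perror("rename");
        return EXIT_FAILURE;
    }

    /* ...then overwrite it from offset 0; each page is written exactly
     * once until the segment is full again. */
    fd = open("/tmp/wal-0009", O_WRONLY);
    if (fd < 0) {
        perror("reopen");
        return EXIT_FAILURE;
    }
    close(fd);
    return EXIT_SUCCESS;
}

Recycling means the file's blocks are already allocated when it is reused, rather than new 16MB files being created and filled each time.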


> fdatasync() could indeed have got much faster.  If we're writing back a
> 256kB page as a unit, we're handling 64 times less metadata than writing
> back 64x4kB pages.  We'll track 64x less dirty bits.  We'll find only
> 64 dirty pages per 16MB instead of 4096 dirty pages.

The dirty writes will be 8-32kB or so in this workload - the constant commits require the WAL to be flushed constantly.
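
The flush pattern itself is roughly the following (again just a standalone sketch, not pgbench or PostgreSQL code; the path and the 16kB chunk size are made-up stand-ins for the 8-32kB group commits):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)   /* one WAL segment */
#define CHUNK_SIZE (16 * 1024)          /* one "group commit", 8-32kB range */

int main(void)
{
    static char buf[CHUNK_SIZE];
    memset(buf, 'x', sizeof(buf));

    int fd = open("/tmp/wal-segment", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Fill the segment once, a few pages per commit, syncing every time. */
    for (off_t off = 0; off < SEG_SIZE; off += CHUNK_SIZE) {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t) sizeof(buf)) {
            perror("pwrite");
            return EXIT_FAILURE;
        }
        if (fdatasync(fd) != 0) {       /* the latency-critical part */
            perror("fdatasync");
            return EXIT_FAILURE;
        }
    }

    close(fd);
    return EXIT_SUCCESS;
}

So each fdatasync() only ever has a handful of freshly dirtied pages to push out.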


> It's always possible I just broke something.  The xfstests aren't
> exhaustive, and no regressions doesn't mean no problems.
> 
> Can you guide Michael towards parameters for pgbench that might give
> an indication of performance on a more realistic workload that doesn't
> entirely fit in memory?

Fitting in memory isn't bad - that covers a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...

Michael, where do I find more details about the configuration used during the run?

Regards,

Andres
