Date:	Wed, 10 Oct 2012 23:27:25 +0200
From:	Jan Kara <jack@...e.cz>
To:	Viktor Nagy <viktor.nagy@...4games.com>
Cc:	Jan Kara <jack@...e.cz>, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org,
	"Darrick J. Wong" <djwong@...ibm.com>, chris.mason@...ionio.com
Subject: Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

On Wed 10-10-12 22:44:41, Viktor Nagy wrote:
> On 10/10/2012 06:57 PM, Jan Kara wrote:
> >   Hello,
> >
> >On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
> >>Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
> >>are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
> >>Kernel 2.6.39 works fine.
> >>
> >>How this hurts us in real life: we have a very high performance
> >>game server where MySQL has to do many writes alongside the reads.
> >>All writes and reads are very simple and have to be very quick. If
> >>we run the system with Linux 3.2 we get unacceptable performance.
> >>Now we are stuck with the 2.6.32 kernel here because of this problem.
> >>
> >>I attach a test program I wrote which shows the problem. The
> >>program just writes blocks continuously to random positions in a given
> >>big file. The write rate is limited to 100 MByte/s. On a well-working
> >>kernel it has to run at a constant 100 MByte/s indefinitely.
> >>The test has to be run on a simple HDD.
> >>
> >>Test steps:
> >>1. You have to use an XFS, ext2 or ReiserFS partition for the test;
> >>ext4 forces flushes periodically. I recommend using XFS.
> >>2. Create a big file on the test partition. For 8 GByte RAM you can
> >>create a 2 GByte file. For 2 GByte RAM I recommend creating a 500 MByte
> >>file. File creation can be done with this command:  dd if=/dev/zero
> >>of=bigfile2048M.bin bs=1M count=2048
> >>3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
> >>4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
> >>
> >>In the beginning there can be some slowness even on well-working
> >>kernels. If you create the big file in the same run then it usually
> >>runs smoothly from the beginning.
> >>
> >>I don't know of a setting of the /proc/sys/vm variables which makes this
> >>test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
> >>bug, because if "/proc/sys/vm/dirty_bytes" is much larger than the
> >>test file size, the test program should never be blocked.
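For readers without the attachment, here is a minimal sketch of the kind of workload described above: block-sized writes at random aligned offsets in an existing file, crudely paced. This is not the attached pdflushtest.c. The function name write_random_blocks, the 4 KiB block size and the sleep-per-block pacing are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t even on 32-bit systems */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 4096        /* assumed block size, not taken from the original */

/*
 * Hypothetical helper: write `count` blocks at random block-aligned offsets
 * inside the existing file `path`, sleeping 1/blocks_per_sec between writes.
 * Returns the number of blocks written, or -1 if the file cannot be used.
 */
long write_random_blocks(const char *path, long count, long blocks_per_sec)
{
    char buf[BLOCK_SIZE];
    struct timespec pause;
    long written = 0;
    int fd;
    off_t size;
    int64_t nblocks;

    assert(blocks_per_sec >= 1);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size = lseek(fd, 0, SEEK_END);
    if (size < BLOCK_SIZE) {
        close(fd);
        return -1;
    }
    nblocks = size / BLOCK_SIZE;
    memset(buf, 0xAB, sizeof(buf));

    /* Crude pacing: the real rate ends up somewhat below the target. */
    pause.tv_sec = (blocks_per_sec == 1);
    pause.tv_nsec = (1000000000L / blocks_per_sec) % 1000000000L;

    for (long i = 0; i < count; i++) {
        /* Offset arithmetic is done in 64 bits, never in plain int. */
        int64_t block = (int64_t)rand() % nblocks;
        off_t offset = (off_t)block * BLOCK_SIZE;

        if (pwrite(fd, buf, BLOCK_SIZE, offset) != BLOCK_SIZE)
            break;
        written++;
        nanosleep(&pause, NULL);
    }

    close(fd);
    return written;
}
```

With these assumptions, 100 MByte/s corresponds to blocks_per_sec = 25600, which matches the page rate used in the analysis further down the thread.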
> >   I've run your program and I can confirm your results. As a side note,
> >your test program has a bug: it uses 'int' for offset arithmetic, so when
> >the file is larger than 2 GB you can hit some problems, but for our case
> >that's not really important.
> Sorry for the bug and maybe the poor implementation. I am much
> better at Pascal than at C.
> (You cannot make such a mistake in Pascal (FreePascal). Is there a
> way (compiler switch) in C/C++ to get a warning for this?)
  Actually I somewhat doubt that even FreePascal is able to give you a
warning that arithmetic can overflow...
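A minimal sketch of how the offset computation can avoid the overflow, assuming the offset is formed as block index times block size (the helper name block_offset is hypothetical, and the attached program may compute it differently). The point is only that both operands are 64 bits wide before the multiplication. GCC does document -ftrapv, which traps on signed overflow, but that is a run-time check rather than a compile-time warning.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: byte offset of block `index`, computed in 64 bits.
 * With plain 'int' operands the product overflows once it reaches 2 GiB.
 */
int64_t block_offset(int64_t index, int64_t block_size)
{
    assert(index >= 0 && block_size > 0);
    return index * block_size;
}
```

For example, block 600000 with 4096-byte blocks sits at byte 2457600000, which no longer fits in a 32-bit signed int.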

> >The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
> >writeback when grabbing pages to begin a write". At first sight I was
> >somewhat surprised when I saw that code path in the traces but later, when I
> >did some math, it became clear. What the commit does is that when a page is
> >being written out to disk, we don't allow its contents to be changed and
> >wait for the IO to finish before letting the next write proceed. Now if you
> >have a 1 GB file, that's about 256000 pages. By observation on my test
> >machine, the writeback code keeps around 10000 pages in flight to disk at
> >any moment (this number fluctuates a lot but the average is around that
> >number). Your program dirties about 25600 pages per second. So the
> >probability that one of the dirtied pages is a page under writeback is equal
> >to 1 for all practical purposes (precisely it is
> >1-(1-10000/256000)^25600). Actually, on average you are going to hit about
> >1000 pages under writeback per second, which clearly has a noticeable impact
> >(even a single page can have one). Pity I didn't do the math when we were
> >considering those patches.
> >
> >There were plans to avoid waiting if the underlying storage doesn't need it
> >but I'm not sure how far those plans got (added a couple of relevant CCs).
> >Anyway, yours is about the second or third real workload that sees a
> >regression due to "stable pages", so we have to fix that sooner rather than
> >later... Thanks for your detailed report!
> >
> >								Honza
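The arithmetic quoted above can be checked with a short sketch. It assumes every write lands on a uniformly random page, independently of the others, and it treats the 10000 pages in flight as a fixed number; the function names are made up for illustration.

```c
#include <assert.h>

/*
 * Expected number of writes per second that land on a page currently
 * under writeback: n * (in_flight / total_pages).
 */
double expected_hits(double in_flight, double total_pages, double dirtied_per_sec)
{
    assert(total_pages > 0);
    return dirtied_per_sec * (in_flight / total_pages);
}

/*
 * Probability that at least one of `dirtied` writes hits a page under
 * writeback: 1 - (1 - in_flight/total_pages)^dirtied.
 */
double prob_any_hit(double in_flight, double total_pages, long dirtied)
{
    double miss_one = 1.0 - in_flight / total_pages;
    double miss_all = 1.0;

    assert(total_pages > 0);
    for (long i = 0; i < dirtied; i++)
        miss_all *= miss_one;   /* every write so far missed */
    return 1.0 - miss_all;
}
```

With the numbers from the mail, expected_hits(10000, 256000, 25600) gives 1000 blocked writes per second, and prob_any_hit(10000, 256000, 25600) is indistinguishable from 1 in double precision.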
> Thank you for your response!
> 
> I'm very happy that I've found the right people.
> 
> We develop a game server which gets very high load in some
> countries. We are trying to serve as many players as possible with
> one server.
> Currently the CPU usage is below 50% at peak times. And with
> the old kernel it runs smoothly. pdflush runs non-stop on the
> database disk with ~3 MByte/s of writes (minimal reads).
> This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
> I think we are still below the theoretical limits of this server...
> but only if the disk writes are never done synchronously.
> 
> I will try the 3.2.31 kernel without the problematic commit
> (3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
> write").
> Is it a good idea? Will it be worse than 2.6.32?
  Running without that commit should work just fine unless you use
something exotic like DIF/DIX or similar. Whether things will be worse than
in 2.6.32 I cannot say. For me, your test program behaves fine without that
commit but whether your real workload won't hit some other problem is
always a question. But if you hit another regression I'm interested in
hearing about it :).

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR