lists.openwall.net - Open Source and information security mailing list archives
Date: Thu, 11 Oct 2012 13:58:00 +0100
From: Viktor Nagy <viktor.nagy@...4games.com>
To: Jan Kara <jack@...e.cz>
CC: linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	"Darrick J. Wong" <djwong@...ibm.com>, chris.mason@...ionio.com
Subject: Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour
 (resended, without html)

On 2012.10.11. 11:10, Jan Kara wrote:
> On Thu 11-10-12 11:52:54, Viktor Nagy wrote:
>> On 2012.10.10. 22:27, Jan Kara wrote:
>>> On Wed 10-10-12 22:44:41, Viktor Nagy wrote:
>>>> On 10/10/2012 06:57 PM, Jan Kara wrote:
>>>>> Hello,
>>>>>
>>>>> On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
>>>>>> Since kernel version 3.0, pdflush blocks writes even when the dirty
>>>>>> bytes are well below /proc/sys/vm/dirty_bytes or
>>>>>> /proc/sys/vm/dirty_ratio. Kernel 2.6.39 works fine.
>>>>>>
>>>>>> How this hurts us in real life: we have a very high performance
>>>>>> game server where MySQL has to do many writes along with the reads.
>>>>>> All writes and reads are very simple and have to be very quick. If
>>>>>> we run the system with Linux 3.2 we get unacceptable performance.
>>>>>> We are now stuck with the 2.6.32 kernel because of this problem.
>>>>>>
>>>>>> I attach a test program written by me which shows the problem. The
>>>>>> program just writes blocks continuously to random positions in a
>>>>>> given big file. The write rate is limited to 100 MByte/s. On a
>>>>>> well-working kernel it should run at a constant 100 MByte/s
>>>>>> indefinitely. The test has to be run on a simple HDD.
>>>>>>
>>>>>> Test steps:
>>>>>> 1. Use an XFS, ext2 or ReiserFS partition for the test;
>>>>>> ext4 forces flushes periodically. I recommend XFS.
>>>>>> 2. Create a big file on the test partition. For 8 GByte RAM you can
>>>>>> create a 2 GByte file; for 2 GB RAM I recommend a 500 MByte file.
>>>>>> The file can be created with this command:
>>>>>> dd if=/dev/zero of=bigfile2048M.bin bs=1M count=2048
>>>>>> 3.
compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
>>>>>> 4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
>>>>>>
>>>>>> In the beginning there can be some slowness even on well-working
>>>>>> kernels. If you create the big file in the same run, it usually runs
>>>>>> smoothly from the beginning.
>>>>>>
>>>>>> I don't know of any setting of the /proc/sys/vm variables which makes
>>>>>> this test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a
>>>>>> kernel bug, because if /proc/sys/vm/dirty_bytes is much larger than
>>>>>> the test file size, the test program should never be blocked.
>>>>> I've run your program and I can confirm your results. As a side note,
>>>>> your test program has a bug: it uses 'int' for offset arithmetic, so
>>>>> when the file is larger than 2 GB you can hit some problems, but for
>>>>> our case that's not really important.
>>>> Sorry for the bug and maybe the poor implementation. I am much
>>>> better in Pascal than in C.
>>>> (You cannot make such a mistake in Pascal (FreePascal). Is there a
>>>> way (compiler switch) in C/C++ to get a warning there?)
>>> Actually I somewhat doubt that even FreePascal is able to give you a
>>> warning that arithmetic can overflow...
>> Well, you get a hint at least (FPC 2.6).
>>
>> program inttest;
>>
>> var
>>   i, j : integer;
>>
>> procedure Test(x : int64);
>> begin
>>   Writeln('x=', x);
>> end;
>>
>> begin
>>   i := 1000000;
>>   j := 1000000;
>>   Test(1000000*1000000);
>>   Test(int64(i)*j);
>>   Test(i*j); // result is wrong, but you get a hint here
> You get a hint about automatic conversion from 'integer' to 'int64'? I
> don't have an fpc compiler at hand to check that, but I'd be surprised,
> because that tends to be rather common. I imagine you get the warning if
> the compiler can figure out the numbers in advance. But in your test
> program the situation was more like:
>   ReadLn(i);
>   j := 4096;
>   Test(i*j);
>
> And there the compiler knows nothing about the resulting value...
The FPC is not so clever. If you write Test(i*10), it gives the same hint:
"Converting the operands to "Int64" before doing the multiply could prevent
overflow errors."

>>>>> The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
>>>>> writeback when grabbing pages to begin a write". At first sight I was
>>>>> somewhat surprised when I saw that code path in the traces, but later
>>>>> when I did some math it became clear. What the commit does is that when
>>>>> a page is just being written out to disk, we don't allow its contents
>>>>> to be changed and wait for the IO to finish before letting the next
>>>>> write proceed. Now if you have a 1 GB file, that's 256000 pages. By the
>>>>> observation from my test machine, the writeback code keeps around 10000
>>>>> pages in flight to disk at any moment (this number fluctuates a lot but
>>>>> the average is around that number). Your program dirties about 25600
>>>>> pages per second. So the probability that one of the dirtied pages is a
>>>>> page under writeback is equal to 1 for all practical purposes
>>>>> (precisely, it is 1-(1-10000/256000)^25600). Actually, on average you
>>>>> are going to hit about 1000 pages under writeback per second, which
>>>>> clearly has a noticeable impact (even a single page can). Pity I didn't
>>>>> do the math when we were considering those patches.
>>>>>
>>>>> There were plans to avoid waiting if the underlying storage doesn't
>>>>> need it, but I'm not sure how far those plans got (added a couple of
>>>>> relevant CCs). Anyway, you are about the second or third real workload
>>>>> that sees a regression due to "stable pages", so we have to fix that
>>>>> sooner rather than later... Thanks for your detailed report!
>>>>>
>>>>>								Honza
>>>> Thank you for your response!
>>>>
>>>> I'm very happy that I've found the right people.
>>>>
>>>> We develop a game server which gets very high load in some
>>>> countries. We are trying to serve as many players as possible with
>>>> one server.
>>>> Currently the CPU usage is below 50% at peak times. And with
>>>> the old kernel it runs smoothly. pdflush runs non-stop on the
>>>> database disk with ~3 MByte/s of writes (minimal reads).
>>>> This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
>>>> I think we are still below the theoretical limits of this server...
>>>> but only if the disk writes are never done synchronously.
>>>>
>>>> I will try the 3.2.31 kernel without the problematic commit
>>>> (3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
>>>> write").
>>>> Is it a good idea? Will it be worse than 2.6.32?
>>> Running without that commit should work just fine unless you use
>>> something exotic like DIF/DIX or similar. Whether things will be worse
>>> than in 2.6.32 I cannot say. For me, your test program behaves fine
>>> without that commit, but whether your real workload will hit some other
>>> problem is always a question. But if you hit another regression I'm
>>> interested in hearing about it :).
>> I've just tested it. After I set dirty_bytes above the file size,
>> the writes are never blocked.
>> So it works nicely without the mentioned commit.
>>
>> The problem is that if you read the kernel's documentation about
>> dirty page handling, it does not work that way (with the commit). It
>> works unpredictably.
> Which documentation do you mean exactly? The process won't be throttled
> because of dirtying too much memory, but we can still block it for other
> reasons - e.g. because we decide to evict its code from memory and have
> to reload it again when the process gets scheduled. Or we can block
> during memory allocation (which may be needed to allocate a page you
> write to) if we find it necessary. There are no promises really...
>
>								Honza

OK, it is very hard to get an overview of this whole thing.
I thought I understood the behaviour after checking the file
Documentation/sysctl/vm.txt:

"
dirty_bytes

Contains the amount of dirty memory at which a process generating disk
writes will itself start writeback.
...
"

OK, it does not say explicitly that other things can have an influence too.

Several people are trying to get around the problem caused by the commit
by setting the value of /sys/block/sda/queue/nr_requests to 4 (from 128).
This helped a lot but was not enough for us.

I attach two performance graphs which show our own CPU usage measurement
(red). One-minute averages; the blue line is the SQL time %.

And a nice question: without reverting the patch, is it possible to get
smooth performance (in our case)?

Viktor

Download attachment "perfgraph_linux_3.2.PNG" of type "image/png" (40049 bytes)
Download attachment "perfgraph_linux_2.6.32.PNG" of type "image/png" (20920 bytes)