Date:	Tue, 02 Oct 2007 08:57:53 -0400
From:	Ric Wheeler
CC:	"Feld, Andy",
	Jens Axboe
Subject: batching support for transactions

After several years of helping tune file systems for normal (ATA/S-ATA) 
drives, we have been doing some performance work on ext3 & reiserfs on 
disk arrays.

One thing that jumps out is that the way we currently batch synchronous 
workloads into transactions does really horrible things to performance 
on storage devices that have very low latency.

For example, on a mid-range Clariion box, we can use a single thread to 
write around 750 (10240-byte) files/sec to a single directory in ext3. 
That gives us an average time of around 1.3 ms per file.

With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.
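
To put those two numbers side by side, here is a small sketch that converts a files/sec rate into time per file. It assumes the 234 files/sec figure is the aggregate across both threads; the helper names are made up for illustration.

```c
#include <assert.h>

/* Average wall-clock time per file, in milliseconds, for an
 * aggregate rate given in files/sec. */
static double ms_per_file(double files_per_sec)
{
        assert(files_per_sec > 0.0);
        return 1000.0 / files_per_sec;
}

/* Assumption: the rate is the aggregate over all threads, so each
 * thread sees nthreads times the aggregate per-file time between
 * its own completions. */
static double ms_per_file_per_thread(double files_per_sec, int nthreads)
{
        assert(nthreads > 0);
        return nthreads * ms_per_file(files_per_sec);
}
```

Calling ms_per_file(750.0) gives about 1.33 ms. ms_per_file(234.0) gives about 4.27 ms, and ms_per_file_per_thread(234.0, 2) gives about 8.55 ms, so under that assumption each writer waits several times longer per file than the lone writer did.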

The culprit seems to be the assumptions in journal_stop() which throw in 
a call to schedule_timeout_uninterruptible(1):

         /*
          * Implement synchronous transaction batching.  If the handle
          * was synchronous, don't force a commit immediately.  Let's
          * yield and let another thread piggyback onto this transaction.
          * Keep doing that while new threads continue to arrive.
          * It doesn't cost much - we're about to run a commit and sleep
          * on IO anyway.  Speeds up many-threaded, many-dir operations
          * by 30x or more...
          *
          * But don't do this if this process was the most recent one to
          * perform a synchronous write.  We do this to detect the case where a
          * single process is doing a stream of sync writes.  No point in waiting
          * for joiners in that case.
          */
         pid = current->pid;
         if (handle->h_sync && journal->j_last_sync_writer != pid) {
                 journal->j_last_sync_writer = pid;
                 do {
                         old_handle_count = transaction->t_handle_count;
                         schedule_timeout_uninterruptible(1);
                 } while (old_handle_count != transaction->t_handle_count);
         }

reiserfs and ext4 have similar if not exactly the same logic.
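
For a sense of scale, the argument to schedule_timeout_uninterruptible() is in jiffies, so the length of that wait depends on the HZ the kernel was built with. A minimal sketch of the conversion follows; the function name is hypothetical, and the kernel has its own helpers for this.

```c
#include <assert.h>

/* Length of `ticks` jiffies in microseconds for a kernel built with
 * CONFIG_HZ = hz.  Illustration only; assumes hz divides 1000000. */
static long ticks_to_usecs_sketch(long ticks, long hz)
{
        assert(hz > 0);
        return ticks * (1000000L / hz);
}
```

One tick is 10000 us at HZ=100, 4000 us at HZ=250 and 1000 us at HZ=1000. Against a device that finishes the whole file in about 1.3 ms, a wait on the order of one tick is comparable to or larger than the work it is trying to amortize.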

What seems to be needed here is either a static per file system/storage 
device tunable to allow us to change this timeout (maybe with "0" 
defaulting back to the old reiserfs trick of simply doing a yield()?) or 
a more dynamic, per device way to keep track of the average time it 
takes to commit a transaction to disk. Based on that rate, we could 
dynamically adjust our logic to account for lower latency devices.

A couple of last thoughts. One, if for some reason you don't have a low 
latency storage array handy and want to test this for yourselves, you 
can test the worst case by using a ram disk.

The test we used was fs_mark with 10240-byte files, writing to one 
shared directory while varying the number of threads from 1 up to 40. In 
the ext3 case, it takes 8 concurrent threads to catch up to the 
single-threaded case.
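
For anyone without fs_mark at hand, the per-file work in this kind of synchronous workload is roughly create, write, fsync, close. The sketch below is not fs_mark's code, just a stand-in for that pattern using plain POSIX calls; the function name is made up.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create `path`, write `size` bytes to it, and force it to stable
 * storage before closing.  Returns 0 on success, -1 on any error. */
static int write_synced_file(const char *path, size_t size)
{
        assert(path != NULL && size > 0);

        char *buf = malloc(size);
        if (buf == NULL)
                return -1;
        memset(buf, 'x', size);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                free(buf);
                return -1;
        }

        int rc = 0;
        size_t done = 0;
        while (done < size) {
                /* write() may be short; keep going until all is out */
                ssize_t n = write(fd, buf + done, size - done);
                if (n <= 0) {
                        rc = -1;
                        break;
                }
                done += (size_t)n;
        }

        /* The fsync is what makes the write synchronous. */
        if (rc == 0 && fsync(fd) != 0)
                rc = -1;
        if (close(fd) != 0)
                rc = -1;

        free(buf);
        return rc;
}
```

Calling this in a loop from several threads, each with its own file names in one shared directory and a size of 10240, approximates the test described above.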

We are continuing to play with the code and try out some ideas, but I 
wanted to bounce this off the broader list to see if this makes sense...

