Beruflich Dokumente
Kultur Dokumente
In this post, I present two approaches to directory scanning, one serial and one parallel. Both
approaches return a collection of all files and directories contained by a given directory,
including the contents (recursively) of each sub-directory thereof. For example, scanning the
following directory tree:
…/dir/
…/dir/picture0.jpg
…/dir/picture1.jpg
…/dir/vacation-pictures/
…dir/vacation-pictures/grand-canyon1.jpg
Serial Scanning
In serial scanning, a recursive method lists all files in a given directory, adds these to a list, and,
if the file being added is a directory, invokes itself: The SerialDirScanner class, shown below,
performs this task:
The “scan” method invokes the recursive “scan0” method, which does the real work of scanning
directories. That method fills up a concurrent map with files and directories as it works. Finally,
the “scan” method returns all keys of the concurrent map.
Parallel scanning
Looking at the “scan0” method of the serial scanner shown above, many opportunities for
parallelization become apparent. For example, the listing of different sibling directories (that is,
directories that both have the same parent) can proceed in parallel, as can the listing of
directories that are unrelated to each other (that is, one directory is not an ancestor of the other).
The “DirScan” class shown below performs parallel scanning:
In common with the “SerialDirScanner” class, the “DirScanner” class has a public method
named “scan” that takes a directory as an argument and returns a Set of File objects. The “scan”
method acquires a semaphore of count 1, calls the “scan0” method, and waits to re-acquire the
same semaphore. The “scan0” method performs the actual work of scanning directory contents.
The “DirScanner” class contains an ExecutorService object named “executor” that is used to
schedule scanning tasks.
Note that the “DirScanner” class accepts an integer constructor parameter named “threads”. This
parameter is used to initialize the ExecutorService with a fixed thread pool of this size.
The “scan0” method submits a new task to the executor whenever it finds a sub-directory while
performing a directory listing. The new “task”, in this case, happens to be a recursive invocation
of itself. The “scan0” method increments a counter (an AtomicInteger instance object named
“counter”) before the task is actually submitted. Each time the “scan0” method completes listing
a directory, it decrements the counter, releasing the semaphore when the counter’s value falls
below zero. This allows the “scan” method, which has been waiting to acquire this semaphore, to
“wake up” and continue, effectively returning a Set of File objects from the key set of the
concurrent map “files”.
To test the performance improvement, if any, of the parallelized directory scanner vis-a-vis the
serial scanner, I wrote the following “main” method:
public static void main(String[] args) {
int status = 0;
try {
Set<File> files0 = null;
Set<File> files1 = null;
long t0 = System.nanoTime();
files0 = SerialDirScanner
.listAllContentsUnder(new File(args[0]));
t0 = System.nanoTime() - t0;
System.out.println(t0 + " ticks.");
t0 = System.nanoTime();
files1 = DirScanner.listAllContentsUnder(new File(args[0]), 5);
t0 = System.nanoTime() - t0;
System.out.println(t0 + " ticks.");
System.out.println(files0.size() + "," + files1.size());
System.out.println(files0.size() == files1.size());
} catch (Throwable exc) {
status = 1;
exc.printStackTrace();
} finally {
System.exit(status);
}
}
The argument I pass in to the main method is a directory that is well populated and quite deep.
Running this on my development laptop (with the specs Core 2 Duo 2.26 MHz, 4GB, running
Windows 7 Evaluation build No. 7100), I get:
3666120093 ticks.
2375654123 ticks.
39592,39592
true
3551936784 ticks.
2340802880 ticks.
39592,39592
true
So, there is a performance gain of roughly 33% using the parallel scanner using five threads
(note that the “listAllContentsUnder” method accepts a thread count as the second argument).
Using three threads, I get:
3594733052 ticks.
2335042793 ticks.
39592,39592
true
which is not much different from the readings obtained from using five threads. Using nine
threads, I get:
3563738170 ticks.
2368012499 ticks.
39592,39592
true
which shows that the performance has actually been reduced by increasing the number of
threads. The explanation for this curious observation is that the concurrent map, into which all
threads write data, faces increased contention as the number of threads goes up.
Another interesting observation is that the parallel scanner is actually slower at scanning shallow
directories. Consider the following output while scanning an example:
2602454 ticks.
10136304 ticks.
17,17
true
Conclusion
Using the parallel directory scanner speeds up the scanning of deep (and well populated)
directories by roughly 33%. Shallow directories are better scanned with the serial directory
scanner.