Sie sind auf Seite 1von 5

Fast directory scanning

In this post, I present two approaches to directory scanning, one serial and one parallel. Both
approaches return a collection of all files and directories contained by a given directory,
including the contents (recursively) of each sub-directory thereof. For example, scanning the
following directory tree:

…/dir/

…/dir/picture0.jpg

…/dir/picture1.jpg

…/dir/vacation-pictures/

…dir/vacation-pictures/grand-canyon1.jpg

returns a collection containing four entries, viz. …/dir/picture0.jpg, …/dir/picture1.jpg,


…/dir/vacation-pictures/ and …/dir/vacation-pictures/grand-canyon1.jpg.

Serial Scanning

In serial scanning, a recursive method lists all files in a given directory, adds these to a list, and,
if the file being added is a directory, invokes itself: The SerialDirScanner class, shown below,
performs this task:

public class SerialDirScanner {


    private final ConcurrentMap<File, Boolean> files = new ConcurrentHashMap<File,
Boolean>();

    public SerialDirScanner() {


        super();
    }

    public Set<File> scan(File dir) throws InterruptedException,


            ExecutionException {
        scan0(dir);
        return files.keySet();
    }

    private void scan0(File dir) {


        for (final File file : dir.listFiles()) {
            files.putIfAbsent(file, Boolean.TRUE);
            if (file.isDirectory()) {
                scan0(file);
            }
        }
    } 
    public static Set<File> listAllContentsUnder(File dir)
            throws InterruptedException, ExecutionException {
        SerialDirScanner scanner = new SerialDirScanner();
        return scanner.scan(dir);
    }
}

The “scan” method invokes the recursive “scan0” method, which does the real work of scanning
directories. That method fills up a concurrent map with files and directories as it works. Finally,
the “scan” method returns all keys of the concurrent map.

Parallel scanning

Looking at the “scan0” method of the serial scanner shown above, many opportunities for
parallelization become apparent. For example, the listing of different sibling directories (that is,
directories that both have the same parent) can proceed in parallel, as can the listing of
directories that are unrelated to each other (that is, one directory is not an ancestor of the other).
The “DirScan” class shown below performs parallel scanning:

public class DirScanner {


    private final ConcurrentMap<File, Boolean> files = new ConcurrentHashMap<File,
Boolean>();
    private final ExecutorService executor;
    private final Semaphore sem = new Semaphore(1);
    private final AtomicInteger count = new AtomicInteger();

    public DirScanner(int threads) {


        super();
        executor = Executors.newFixedThreadPool(threads);
    }

    public Set<File> scan(File dir) throws InterruptedException,


            ExecutionException {
        sem.acquire();
        scan0(dir);
        sem.acquire();
        return files.keySet();
    }

    private void scan0(File dir) {


        for (final File file : dir.listFiles()) {
            files.putIfAbsent(file, Boolean.TRUE);
            if (file.isDirectory()) {
                count.incrementAndGet();
                executor.submit(new Runnable() {
                    public void run() {
                        DirScanner.this.scan0(file);
                    }
                });
            }
        }
        if (count.decrementAndGet() < 0) {
            sem.release();
        }
    }
    public void close(){
        executor.shutdown();
    }
    public static Set<File> listAllContentsUnder(File dir, int threads)
            throws InterruptedException, ExecutionException {
        DirScanner scanner = new DirScanner(threads);
        return scanner.scan(dir);
    }
}

In common with the “SerialDirScanner” class, the “DirScanner” class has a public method
named “scan” that takes a directory as an argument and returns a Set of File objects. The “scan”
method acquires a semaphore of count 1, calls the “scan0” method, and waits to re-acquire the
same semaphore. The “scan0” method performs the actual work of scanning directory contents.
The “DirScanner” class contains an ExecutorService object named “executor” that is used to
schedule scanning tasks.

Note that the “DirScanner” class accepts an integer constructor parameter named “threads”. This
parameter is used to initialize the ExecutorService with a fixed thread pool of this size.

The “scan0” method submits a new task to the executor whenever it finds a sub-directory while
performing a directory listing. The new “task”, in this case, happens to be a recursive invocation
of itself. The “scan0” method increments a counter (an AtomicInteger instance object named
“counter”) before the task is actually submitted. Each time the “scan0” method completes listing
a directory, it decrements the counter, releasing the semaphore when the counter’s value falls
below zero. This allows the “scan” method, which has been waiting to acquire this semaphore, to
“wake up” and continue, effectively returning a Set of File objects from the key set of the
concurrent map “files”.

Performance comparison and thread counts

To test the performance improvement, if any, of the parallelized directory scanner vis-a-vis the
serial scanner, I wrote the following “main” method:
public static void main(String[] args) {
    int status = 0;
    try {
        Set<File> files0 = null;
        Set<File> files1 = null;
        long t0 = System.nanoTime();
        files0 = SerialDirScanner
                .listAllContentsUnder(new File(args[0]));
        t0 = System.nanoTime() - t0;
        System.out.println(t0 + " ticks.");
        t0 = System.nanoTime();
        files1 = DirScanner.listAllContentsUnder(new File(args[0]), 5);
        t0 = System.nanoTime() - t0;
        System.out.println(t0 + " ticks.");
        System.out.println(files0.size() + "," + files1.size());
        System.out.println(files0.size() == files1.size());
    } catch (Throwable exc) {
        status = 1;
        exc.printStackTrace();
    } finally {
        System.exit(status);
    }
}

The argument I pass in to the main method is a directory that is well populated and quite deep.
Running this on my development laptop (with the specs Core 2 Duo 2.26 MHz, 4GB, running
Windows 7 Evaluation build No. 7100), I get:

1. Initially, before the directory structure is cached by the operating system:

3666120093 ticks.
2375654123 ticks.
39592,39592
true

2. Subsequently, after the directory structure has been cached:

3551936784 ticks.
2340802880 ticks.
39592,39592
true

So, there is a performance gain of roughly 33% using the parallel scanner using five threads
(note that the “listAllContentsUnder” method accepts a thread count as the second argument).
Using three threads, I get:
3594733052 ticks.
2335042793 ticks.
39592,39592
true

which is not much different from the readings obtained from using five threads. Using nine
threads, I get:

3563738170 ticks.
2368012499 ticks.
39592,39592
true

which shows that the performance has actually been reduced by increasing the number of
threads. The explanation for this curious observation is that the concurrent map, into which all
threads write data, faces increased contention as the number of threads goes up.

Another interesting observation is that the parallel scanner is actually slower at scanning shallow
directories. Consider the following output while scanning an example:

2602454 ticks.
10136304 ticks.
17,17
true

Conclusion

Using the parallel directory scanner speeds up the scanning of deep (and well populated)
directories by roughly 33%. Shallow directories are better scanned with the serial directory
scanner.

Das könnte Ihnen auch gefallen