
Compressing and Decompressing Data Using Java APIs

by Qusay H. Mahmoud with contributions from Konstantin Kladko
February 2002

Many sources of information contain redundant data or data that adds little to the stored information. This results in tremendous amounts of data being transferred between client and server applications or computers in general. The obvious solution to the problems of data storage and information transfer is to install additional storage devices and expand existing communication facilities. To do so, however, requires an increase in an organization's operating costs. One method to alleviate a portion of data storage and information transfer is through the representation of data by more efficient code. This article presents a brief introduction to data compression and decompression, and shows how to compress and decompress data, efficiently and conveniently, from within your Java applications using the java.util.zip package.

While it is possible to compress and decompress data using tools such as WinZip, gzip, and Java ARchive (or jar), these tools are used as standalone applications. It is possible to invoke these tools from your Java applications, but this is not a straightforward approach and not an efficient solution. This is especially true if you wish to compress and decompress data on the fly (before transferring it to a remote machine, for example).

This article:

* Gives you a brief overview of data compression
* Describes the java.util.zip package
* Shows how to use this package to compress and decompress data
* Shows how to compress and decompress serialized objects to save disk space
* Shows how to compress and decompress data on the fly to improve the performance of client/server applications

Overview of Data Compression

The simplest type of redundancy in a file is the repetition of characters. For example, consider the following string:

    BBBBHHDDXXXXKKKKWWZZZZ

This string can be encoded more compactly by replacing each repeated string of characters by a single instance of the repeated character and a number that represents the number of times it is repeated. The earlier string can be encoded as follows:

    4B2H2D4X4K2W4Z

Here "4B" means four B's, "2H" means two H's, and so on. Compressing a string in this way is called run-length encoding.

As another example, consider the storage of a rectangular image. As a single color bitmapped image, it can be stored as shown in Figure 1.

[Figure 1: A bitmap with information for run-length encoding]
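The run-length scheme described above for the string BBBBHHDDXXXXKKKKWWZZZZ can be sketched in a few lines of Java. This is only an illustration, not part of the java.util.zip package: the class name RunLengthSketch and the method name encode are made up for this example, and the sketch assumes that the input itself contains no digits (otherwise the counts and the data could not be told apart).

```java
// Illustrative sketch only: RunLengthSketch and encode are hypothetical names.
final class RunLengthSketch {

    // Encodes each run of identical characters as <count><character>.
    // Assumption: the input contains no digit characters.
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char current = input.charAt(i);
            int runLength = 1;
            // Extend the run while the next character matches the current one.
            while (i + runLength < input.length()
                    && input.charAt(i + runLength) == current) {
                runLength++;
            }
            out.append(runLength).append(current);
            i += runLength;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints 4B2H2D4X4K2W4Z
        System.out.println(encode("BBBBHHDDXXXXKKKKWWZZZZ"));
    }
}
```

Note that a string without repeated characters grows under this scheme ("AB" becomes "1A1B"), which is one reason run-length encoding does not suit every kind of file.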

Another approach might be to store the image as a graphics metafile:

    Rectangle 11, 3, 20, 5

This says that the rectangle starts at coordinate (11, 3) and has a width of 20 and a length of 5 pixels.

The rectangular image can be compressed with run-length encoding by counting identical bits as follows:

    0,40
    0,40
    0,10 1,20 0,10
    0,10 1,1 0,18 1,1 0,10
    0,10 1,1 0,18 1,1 0,10
    0,10 1,1 0,18 1,1 0,10
    0,10 1,20 0,10
    0,40

The first line above says that the first line of the bitmap consists of 40 0's. The third line says that the third line of the bitmap consists of 10 0's followed by 20 1's followed by 10 more 0's, and so on for the other lines.

Note that run-length encoding requires separate representations for the file and its encoded version. Therefore, this method cannot work for all files. Other compression techniques include variable-length encoding (also known as Huffman Coding), and many others. For more information, there are many books available on data and image compression techniques.

There are many benefits to data compression. The main advantage of it, however, is to reduce storage requirements. Also, for data communications, the transfer of compressed data over a medium results in an increase in the rate of information transfer. Note that data compression can be implemented on existing hardware by software or through the use of special hardware devices that incorporate compression techniques. Figure 2 shows a basic data-compression block diagram.

[Figure 2: Data-compression block diagram]

ZIP vs. GZIP

If you are working on Windows, you might be familiar with the WinZip tool, which is used to create a compressed archive and to extract files from a compressed archive. On UNIX, however, things are done a bit differently. The tar command is used to create an archive (not compressed) and another program (gzip or compress) is used to compress the archive.

Tools such as WinZip and PKZIP act as both an archiver and a compressor.
They compress files and store them in an archive. On the other hand, gzip does not archive files. Therefore, on UNIX, the tar command is usually used to create an archive, and then the gzip command is used to compress the archived file.

The java.util.zip Package

Java provides the java.util.zip package for zip-compatible data compression. It provides classes that allow you to read, create, and modify ZIP and GZIP file formats. It also provides utility classes for computing checksums of arbitrary input streams that can be used to validate input data. This package provides one interface, fourteen classes, and two exception classes as shown in Table 1.

Table 1: The java.util.zip package

    Item                  Type             Description
    Checksum              Interface        Represents a data checksum. Implemented by the classes Adler32 and CRC32
    Adler32               Class            Used to compute the Adler32 checksum of a data stream
    CheckedInputStream    Class            An input stream that maintains the checksum of the data being read
    CheckedOutputStream   Class            An output stream that maintains the checksum of the data being written
    CRC32                 Class            Used to compute the CRC32 checksum of a data stream
    Deflater              Class            Supports general compression using the ZLIB compression library
    DeflaterOutputStream  Class            An output stream filter for compressing data in the deflate compression format
    GZIPInputStream       Class            An input stream filter for reading compressed data in the GZIP file format
    GZIPOutputStream      Class            An output stream filter for writing compressed data in the GZIP file format
    Inflater              Class            Supports general decompression using the ZLIB compression library
    InflaterInputStream   Class            An input stream filter for decompressing data in the deflate compression format
    ZipEntry              Class            Represents a ZIP file entry
    ZipFile               Class            Used to read entries from a ZIP file
    ZipInputStream        Class            An input stream filter for reading files in the ZIP file format
    ZipOutputStream       Class            An output stream filter for writing files in the ZIP file format
    DataFormatException   Exception Class  Thrown to signal a data format error
    ZipException          Exception Class  Thrown to signal a zip error

Note: The ZLIB compression library was initially developed as part of the Portable Network Graphics (PNG) standard that is not protected by patents.

Decompressing and Extracting Data from a ZIP file

The java.util.zip package provides classes for data compression and decompression. Decompressing a ZIP file is a matter of reading data from an input stream. The java.util.zip package provides a ZipInputStream class for reading ZIP files. A ZipInputStream can be created just like any other input stream.
For example, the following segment of code can be used to create an input stream for reading data from a ZIP file format:

    FileInputStream fis = new FileInputStream("figs.zip");
    ZipInputStream zin = new ZipInputStream(new BufferedInputStream(fis));

Once a ZIP input stream is opened, you can read the zip entries using the getNextEntry method, which returns a ZipEntry object. If the end-of-file is reached, getNextEntry returns null:

    ZipEntry entry;
    while((entry = zin.getNextEntry()) != null) {
        // extract data
        // open output streams
    }

Now, it is time to set up a decompressed output stream, which can be done as follows:

    int BUFFER = 2048;
    FileOutputStream fos = new FileOutputStream(entry.getName());
    BufferedOutputStream dest = new BufferedOutputStream(fos, BUFFER);

Note: In this segment of code we have used the BufferedOutputStream instead of the ZipOutputStream. The ZipOutputStream and the GZIPOutputStream use internal buffer sizes of 512. The use of the BufferedOutputStream is only justified when the size of the buffer is much more than 512 (in this example it is set to 2048). While the ZipOutputStream doesn't allow you to set the buffer size, in the case of the GZIPOutputStream you can specify the internal buffer size as a constructor argument.

In this segment of code, a file output stream is created using the entry's name, which can be retrieved using the entry.getName method. Source zipped data is then read and written to the decompressed stream:

    int count;
    byte data[] = new byte[BUFFER];
    while ((count = zin.read(data, 0, BUFFER)) != -1) {
        dest.write(data, 0, count);
    }

And finally, close the input and output streams:

    dest.flush();
    dest.close();
    zin.close();

The source program in Code Sample 1 shows how to decompress and extract files from a ZIP archive. To test this sample, compile the class and run it by passing a compressed file in ZIP format:

    prompt> java UnZip somefile.zip

Note that somefile.zip could be a ZIP archive created using any ZIP-compatible tool, such as WinZip.

Code Sample 1: UnZip.java

    import java.io.*;
    import java.util.zip.*;

    public class UnZip {
        // static, so that it can be used from the static main method
        static final int BUFFER = 2048;

        public static void main (String argv[]) {
            try {
                BufferedOutputStream dest = null;
                FileInputStream fis = new FileInputStream(argv[0]);
                ZipInputStream zis = new ZipInputStream(new BufferedInputStream(fis));
                ZipEntry entry;
                while((entry = zis.getNextEntry()) != null) {
                    System.out.println("Extracting: " + entry);
                    int count;
                    byte data[] = new byte[BUFFER];
                    // write the files to the disk
                    FileOutputStream fos = new FileOutputStream(entry.getName());
                    dest = new BufferedOutputStream(fos, BUFFER);
                    while ((count = zis.read(data, 0, BUFFER)) != -1) {
                        dest.write(data, 0, count);
                    }
                    dest.flush();
                    dest.close();
                }
                zis.close();
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }

It is important to note that the ZipInputStream class reads ZIP files sequentially. The class ZipFile, however, reads the contents of a ZIP file using a random access file internally so that the entries of the ZIP file do not have to be read sequentially.

Note: Another fundamental difference between ZipInputStream and ZipFile is in terms of caching. Zip entries are not cached when the file is read using a combination of ZipInputStream and FileInputStream. However, if the file is opened using ZipFile(fileName) then it is cached internally, so if ZipFile(fileName) is called again the file is opened only once. The cached value is used on the second open. If you work on UNIX, it is worth noting that all zip files opened using ZipFile are memory mapped, and therefore the performance of ZipFile is superior to ZipInputStream. If the contents of the same zip file, however, are to be frequently changed and reloaded during program execution, then using ZipInputStream is preferred.

This is how a ZIP file can be decompressed using the ZipFile class:

1. Create a ZipFile object by specifying the ZIP file to be read either as a String filename or as a File object:

    ZipFile zipfile = new ZipFile("figs.zip");

2. Use the entries method, which returns an Enumeration object, to loop through all the ZipEntry objects of the file:

    Enumeration e = zipfile.entries();
    while(e.hasMoreElements()) {
        entry = (ZipEntry) e.nextElement();
        // read contents and save them
    }

3. Read the contents of a specific ZipEntry within the ZIP file by passing the ZipEntry to getInputStream, which will return an InputStream object from which you can read the entry's contents:

    is = new BufferedInputStream(zipfile.getInputStream(entry));

4. Retrieve the entry's filename and create an output stream to save it:

    byte data[] = new byte[BUFFER];
    FileOutputStream fos = new FileOutputStream(entry.getName());
    dest = new BufferedOutputStream(fos, BUFFER);
    while ((count = is.read(data, 0, BUFFER)) != -1) {
        dest.write(data, 0, count);
    }

5. Finally, close all input and output streams:

    dest.flush();
    dest.close();
    is.close();

The complete source program is shown in Code Sample 2. Again, to test this class, compile it and run it by passing a file in a ZIP format as an argument:

    prompt> java UnZip2 somefile.zip

Code Sample 2: UnZip2.java

    import java.io.*;
    import java.util.*;
    import java.util.zip.*;

    public class UnZip2 {
        static final int BUFFER = 2048;

        public static void main (String argv[]) {
            try {
                BufferedOutputStream dest = null;
                BufferedInputStream is = null;
                ZipEntry entry;
                ZipFile zipfile = new ZipFile(argv[0]);
                Enumeration e = zipfile.entries();
                while(e.hasMoreElements()) {
                    entry = (ZipEntry) e.nextElement();
                    System.out.println("Extracting: " + entry);
                    is = new BufferedInputStream(zipfile.getInputStream(entry));
                    int count;
                    byte data[] = new byte[BUFFER];
                    FileOutputStream fos = new FileOutputStream(entry.getName());
                    dest = new BufferedOutputStream(fos, BUFFER);
                    while ((count = is.read(data, 0, BUFFER)) != -1) {
                        dest.write(data, 0, count);
                    }
                    dest.flush();
                    dest.close();
                    is.close();
                }
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }
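The ZipFile steps above can also be sketched in a more compact form. The following is a minimal sketch rather than a replacement for Code Sample 2: the class name ZipFileReadSketch and the method readAll are hypothetical, it assumes a Java version with try-with-resources and generics (both newer than this article), and it collects the entry contents in memory instead of writing them to disk so that the result is easy to inspect.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Illustrative sketch only: ZipFileReadSketch and readAll are hypothetical names.
final class ZipFileReadSketch {
    static final int BUFFER = 2048;

    // Returns a map of entry name -> uncompressed bytes for every file entry.
    static Map<String, byte[]> readAll(File zip) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        // try-with-resources closes the ZipFile even if reading fails
        try (ZipFile zipfile = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> e = zipfile.entries();
            while (e.hasMoreElements()) {
                ZipEntry entry = e.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data
                }
                try (InputStream is = zipfile.getInputStream(entry)) {
                    ByteArrayOutputStream dest = new ByteArrayOutputStream();
                    byte[] data = new byte[BUFFER];
                    int count;
                    while ((count = is.read(data, 0, BUFFER)) != -1) {
                        dest.write(data, 0, count);
                    }
                    contents.put(entry.getName(), dest.toByteArray());
                }
            }
        }
        return contents;
    }
}
```

Calling ZipFileReadSketch.readAll(new File("figs.zip")) would return the contents of every file entry keyed by its name. For large archives, writing each entry to disk as in Code Sample 2 is likely the better choice, since this sketch holds everything in memory.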

Compressing and Archiving Data in a ZIP File

The ZipOutputStream can be used to compress data to a ZIP file. The ZipOutputStream writes data to an output stream in a ZIP format. There are a number of steps involved in creating a ZIP file.

1. The first step is to create a ZipOutputStream object, to which we pass the output stream of the file we wish to write to. Here is how you create a ZIP file entitled "myfigs.zip":

    FileOutputStream dest = new FileOutputStream("myfigs.zip");
    ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(dest));

2. Once the target zip output stream is created, the next step is to open the source data file. In this example, source data files are those files in the current directory. The list method is used to get a list of files in the current directory:

    File f = new File(".");
    String files[] = f.list();
    for (int i=0; i<files.length; i++) {
        System.out.println("Adding: " + files[i]);
        FileInputStream fi = new FileInputStream(files[i]);
        // create zip entry
        // add entries to ZIP file
    }

Note: This code sample is capable of compressing all files in the current directory. It doesn't handle subdirectories. As an exercise, you may want to modify Code Sample 3 to handle subdirectories.

3. Create a zip entry for each file that is read:

    ZipEntry entry = new ZipEntry(files[i]);

4. Before you can write data to the ZIP output stream, you must first put the zip entry object using the putNextEntry method:

    out.putNextEntry(entry);

5. Write the data to the ZIP file:

    int count;
    while((count = origin.read(data, 0, BUFFER)) != -1) {
        out.write(data, 0, count);
    }

6. Finally, you close the input and output streams:

    origin.close();
    out.close();

The complete source program is shown in Code Sample 3.

Code Sample 3: Zip.java

    import java.io.*;
    import java.util.zip.*;

    public class Zip {
        static final int BUFFER = 2048;

        public static void main (String argv[]) {
            try {
                BufferedInputStream origin = null;
                FileOutputStream dest = new FileOutputStream("c:\\zip\\myfigs.zip");
                ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(dest));
                //out.setMethod(ZipOutputStream.DEFLATED);
                byte data[] = new byte[BUFFER];
                // get a list of files from current directory
                File f = new File(".");
                String files[] = f.list();

                for (int i=0; i<files.length; i++) {
                    System.out.println("Adding: " + files[i]);
                    FileInputStream fi = new FileInputStream(files[i]);
                    origin = new BufferedInputStream(fi, BUFFER);
                    ZipEntry entry = new ZipEntry(files[i]);
                    out.putNextEntry(entry);
                    int count;
                    while((count = origin.read(data, 0, BUFFER)) != -1) {
                        out.write(data, 0, count);
                    }
                    origin.close();
                }
                out.close();
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }

Note: Entries can be added to a ZIP file either in a compressed (DEFLATED) or uncompressed (STORED) form. The setMethod method can be used to set the method of storage. For example, to set the method to DEFLATED (compressed) use out.setMethod(ZipOutputStream.DEFLATED), and to set it to STORED (not compressed) use out.setMethod(ZipOutputStream.STORED).

ZIP File Properties

The ZipEntry class describes a compressed file stored in a ZIP file. The various methods contained in this class can be used to set and get pieces of information about the entry. The ZipEntry class is used by the ZipFile and ZipInputStream to read ZIP files, and the ZipOutputStream to write ZIP files. Some of the most useful methods available in the ZipEntry class are shown, along with a description, in Table 2.
Table 2: Some useful methods from the ZipEntry class

    Method Signature                   Description
    public String getComment()         Returns the comment string for the entry, null if none
    public long getCompressedSize()    Returns the compressed size of the entry, -1 if not known
    public int getMethod()             Returns the compression method of the entry, -1 if not specified
    public String getName()            Returns the name of the entry
    public long getSize()              Returns the uncompressed size of the entry, -1 if unknown
    public long getTime()              Returns the modification time of the entry, -1 if not specified
    public void setComment(String c)   Sets the optional comment string for the entry
    public void setMethod(int method)  Sets the compression method for the entry
    public void setSize(long size)     Sets the uncompressed size of the entry
    public void setTime(long time)     Sets the modification time of the entry

Checksums

Some of the other important classes in the java.util.zip package are the Adler32 and CRC32 classes, which implement the java.util.zip.Checksum interface and compute the checksums required for data compression. The Adler32 algorithm is known to be faster than CRC32 and is almost as reliable. The getValue method can be used to obtain the current value of the checksum. The reset method can be used to reset the checksum to its default value.

Checksums can be used to detect corrupted files or messages. For example, suppose you want to create a ZIP file and then transfer it to a remote machine. Once it is at the remote machine, you can use the checksum to check whether the file got corrupted during the transmission. To demonstrate how to create checksums, we modify Code Sample 1 and Code Sample 3 to use CheckedInputStream and CheckedOutputStream as shown in Code Sample 4 and Code Sample 5.

Code Sample 4: Zip.java

    import java.io.*;
    import java.util.zip.*;

    public class Zip {
        static final int BUFFER = 2048;

        public static void main (String argv[]) {
            try {
                BufferedInputStream origin = null;
                FileOutputStream dest = new FileOutputStream("c:\\zip\\myfigs.zip");
                // every byte written to the file also updates the Adler32 checksum
                CheckedOutputStream checksum = new CheckedOutputStream(dest, new Adler32());
                ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(checksum));
                //out.setMethod(ZipOutputStream.DEFLATED);
                byte data[] = new byte[BUFFER];
                // get a list of files from current directory
                File f = new File(".");
                String files[] = f.list();

                for (int i=0; i<files.length; i++) {
                    System.out.println("Adding: " + files[i]);
                    FileInputStream fi = new FileInputStream(files[i]);
                    origin = new BufferedInputStream(fi, BUFFER);
                    ZipEntry entry = new ZipEntry(files[i]);
                    out.putNextEntry(entry);
                    int count;
                    while((count = origin.read(data, 0, BUFFER)) != -1) {
                        out.write(data, 0, count);
                    }
                    origin.close();
                }
                out.close();
                System.out.println("checksum: " + checksum.getChecksum().getValue());
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }

Code Sample 5: UnZip.java

    import java.io.*;
    import java.util.zip.*;

    public class UnZip {
        public static void main (String argv[]) {
            try {
                final int BUFFER = 2048;
                BufferedOutputStream dest = null;
                FileInputStream fis = new FileInputStream(argv[0]);
                // every byte read from the file also updates the Adler32 checksum
                CheckedInputStream checksum = new CheckedInputStream(fis, new Adler32());
                ZipInputStream zis = new ZipInputStream(new BufferedInputStream(checksum));
                ZipEntry entry;
                while((entry = zis.getNextEntry()) != null) {
                    System.out.println("Extracting: " + entry);
                    int count;
                    byte data[] = new byte[BUFFER];
                    // write the files to the disk
                    FileOutputStream fos = new FileOutputStream(entry.getName());
                    dest = new BufferedOutputStream(fos, BUFFER);
                    while ((count = zis.read(data, 0, BUFFER)) != -1) {
                        dest.write(data, 0, count);
                    }
                    dest.flush();
                    dest.close();
                }
                zis.close();
                System.out.println("Checksum: " + checksum.getChecksum().getValue());
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }
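The Checksum interface used by Code Samples 4 and 5 can also be driven directly, without a stream wrapper, which may make the roles of the update, getValue, and reset methods easier to see. The following is a minimal sketch under that assumption: the class name ChecksumSketch and the helper checksumOf are hypothetical, and the byte arrays merely stand in for the contents of a transferred file.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Illustrative sketch only: ChecksumSketch and checksumOf are hypothetical names.
final class ChecksumSketch {

    // Resets the checksum, feeds it all bytes, and returns its current value.
    static long checksumOf(Checksum checksum, byte[] data) {
        checksum.reset();
        checksum.update(data, 0, data.length);
        return checksum.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "some data to protect".getBytes(StandardCharsets.UTF_8);
        byte[] corrupted = original.clone();
        corrupted[0] ^= 1; // flip one bit to simulate damage in transit

        // Either implementation can be used through the same interface.
        Checksum[] algorithms = { new Adler32(), new CRC32() };
        for (Checksum algorithm : algorithms) {
            long sent = checksumOf(algorithm, original);
            long received = checksumOf(algorithm, corrupted);
            System.out.println(algorithm.getClass().getSimpleName()
                    + " detects corruption: " + (sent != received));
        }
    }
}
```

Computing the checksum over the complete byte content in this way is one option when you need to be sure that every byte of a file is covered, independently of how much of it a particular stream happens to read.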

To test Code Sample 4 and 5, compile the classes and then run the Zip class to create a ZIP archive (a checksum value will be calculated and printed on the screen for your information) and then run the UnZip class to decompress the archive (a checksum value will be printed on the console). The two values must be exactly the same; otherwise the file is corrupted. Checksums are very useful in validating data. For example, you can create a ZIP file and send it to your friend along with a checksum. Your friend unzips the file and compares the checksum with the one you provided. If they are the same, your friend knows that the file arrived intact.

Compressing Objects

We have seen how to compress data available in file form and add it to an archive. But what if the data you wish to compress is not available in a file? Assume, for example, that you are transferring large objects over sockets. To improve the performance of your application, you may want to compress the objects before sending them across the network and uncompress them at the destination. As another example, let's say you want to save objects on the disk in compressed format. The ZIP format, which is record-based, is not really suitable for this job. The GZIP format is more appropriate as it operates on a single stream of data.

Now, let's see an example of how to compress objects before writing them on disk and how to decompress them after reading them from the disk. Code Sample 6 is a simple class that implements the Serializable interface to signal the JVM that we wish to serialize instances of this class.

Code Sample 6: Employee.java

    import java.io.*;

    public class Employee implements Serializable {
        String name;
        int age;
        int salary;

        public Employee(String name, int age, int salary) {
            this.name = name;
            this.age = age;
            this.salary = salary;
        }

        public void print() {
            System.out.println("Record for: " + name);
            System.out.println("Name: " + name);
            System.out.println("Age: " + age);
            System.out.println("Salary: " + salary);
        }
    }

Now, write another class that creates a couple of objects from the Employee class. Code Sample 7 creates two objects (sarah and sam) of the Employee class, then saves their state in a file in a compressed format.
Code Sample 7: SaveEmployee.java

    import java.io.*;
    import java.util.zip.*;

    public class SaveEmployee {
        public static void main(String argv[]) throws Exception {
            // create some objects
            Employee sarah = new Employee("S. Jordan", 28, 56000);
            Employee sam = new Employee("S. McDonald", 29, 58000);
            // serialize the objects sarah and sam
            FileOutputStream fos = new FileOutputStream("db");
            GZIPOutputStream gz = new GZIPOutputStream(fos);
            ObjectOutputStream oos = new ObjectOutputStream(gz);
            oos.writeObject(sarah);
            oos.writeObject(sam);
            oos.flush();
            oos.close();
            fos.close();
        }
    }

Now, the ReadEmployee class shown in Code Sample 8 is used to reconstruct the state of the two objects. Once the state has been reconstructed, the print method is invoked on them.

Code Sample 8: ReadEmployee.java

    import java.io.*;
    import java.util.zip.*;

    public class ReadEmployee {
        public static void main(String argv[]) throws Exception {
            // deserialize objects sarah and sam
            FileInputStream fis = new FileInputStream("db");
            GZIPInputStream gs = new GZIPInputStream(fis);
            ObjectInputStream ois = new ObjectInputStream(gs);
            Employee sarah = (Employee) ois.readObject();
            Employee sam = (Employee) ois.readObject();
            // print the records after reconstruction of state
            sarah.print();
            sam.print();
            ois.close();
            fis.close();
        }
    }

The same idea can be used to compress large objects that are sent over sockets. The following segment of code shows how to write objects in a compressed format, from the server to the client:

    // write to client
    GZIPOutputStream gzipout = new GZIPOutputStream(socket.getOutputStream());
    ObjectOutputStream oos = new ObjectOutputStream(gzipout);
    oos.writeObject(obj);
    oos.flush();       // push the serialized bytes into the GZIP stream
    gzipout.finish();  // write the remaining compressed data and the GZIP trailer

And the following segment of code shows how to decompress the objects at the client side once received from the server:

    // read from server
    Socket socket = new Socket(remoteServerIP, PORT);
    GZIPInputStream gzipin = new GZIPInputStream(socket.getInputStream());
    ObjectInputStream ois = new ObjectInputStream(gzipin);
    Object o = ois.readObject();

What about JAR Files?

The Java ARchive (JAR) format is based on the standard ZIP file format with an optional manifest file. If you wish to create JAR files or extract files from a JAR file from within your Java applications, use the java.util.jar package, which provides classes for reading and writing JAR files. Using the classes provided by the java.util.jar package is very similar to using the classes provided by the java.util.zip package as described in this article. Therefore, you should be able to adapt much of the code in this article if you wish to use the java.util.jar package.

Conclusion

This article discussed the APIs that you can use to compress and decompress data from within your applications, with code samples throughout the article to show how to use the java.util.zip package to compress and decompress data. Now you have the tools to utilize data compression and decompression in your applications.

The article also shows how to compress and decompress data on the fly in order to reduce network traffic and improve the performance of your client/server applications. Compressing data on the fly, however, improves the performance of client/server applications only when the objects being compressed are more than a couple of hundred bytes. You would not be able to observe improvement in performance if the objects being compressed and transferred are simple String objects, for example.

For more information

* The java.util.zip Package
* The java.util.jar Package
* Object Serialization
* Transporting Objects over Sockets
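As a closing illustration of the size caveat raised in the Conclusion, the following sketch serializes an object to a byte array with and without a GZIPOutputStream and compares the resulting lengths. It is a rough experiment under stated assumptions, not a benchmark: the class name GzipObjectSketch and its methods are hypothetical, in-memory byte arrays stand in for the socket or file streams used earlier, and the sketch assumes a Java version with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch only: GzipObjectSketch and its methods are hypothetical names.
final class GzipObjectSketch {

    // Serializes the object into a byte array, optionally through a GZIP stream.
    static byte[] toBytes(Serializable obj, boolean compress) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream sink = bytes;
        if (compress) {
            sink = new GZIPOutputStream(bytes);
        }
        // Closing the object stream also finishes and closes the GZIP stream.
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            oos.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Reverses toBytes(obj, true): decompresses and deserializes one object.
    static Object fromGzipBytes(byte[] compressed)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String small = "hi";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("compress me ");
        }
        String large = sb.toString();

        System.out.println("small: raw=" + toBytes(small, false).length
                + " gzip=" + toBytes(small, true).length);
        System.out.println("large: raw=" + toBytes(large, false).length
                + " gzip=" + toBytes(large, true).length);
    }
}
```

For the short string the GZIP header and trailer alone outweigh the data, so the compressed form is larger than the raw one, while the long repetitive string shrinks considerably. How much real objects benefit depends on their size and on how repetitive their serialized form is.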

