Huffman Coding: A Case Study of A Comparison Between Three Different Type Documents

Konstantinos D. Pechlivanis
Abstract - We examine the results of applying Huffman coding to three documents of different styles: a novel from the 19th century, an HTML document from a modern news website, and a C language source code file.

Index Terms - Data compression, Lossless source coding, Entropy rate of a source, Information theory
K. D. Pechlivanis is head of the Department of Informatics and Organization of the Mental Hospital of Corfu, Greece (e-mail: konpexl@gmail.com).

I. INTRODUCTION

Ever since the development of electronic means for transmitting and processing information, there has been a need to reduce the volume of data transmitted. Data compression theory was formulated by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication" [1]. Shannon proved that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. It is possible to compress the source in a lossless manner with a compression rate close to H, and it can be mathematically proven that it is impossible to achieve a better compression rate than H.

One important method of transmitting messages is to transmit sequences of symbols in their place [2]. For the best performance of communication systems, we seek a representation of messages that is as compact as possible, which is achieved by removing the redundancy inherent in them. This process is called source coding. More specifically, source coding is the process of converting the sequence of symbols generated by a source into sequences of code symbols (usually binary sequences), so as to remove the redundancy and obtain a compressed representation of the messages.

Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying marginally important information and removing it [5]. Examples of lossless source coding methods are Shannon coding [1], Shannon-Fano coding [3], Shannon-Fano-Elias coding [4], and Huffman coding [2], where the last one can be proven to be optimal for a specific set of symbols with specific probabilities.

II. HUFFMAN CODING

A. Description - prerequisites

Huffman coding is the optimal symbol-by-symbol code for given probabilities of occurrence of the source symbols, and it can be obtained by means of a simple algorithm. It can be shown that no other such algorithm can lead to the construction of a code with a smaller average codeword length for a given source alphabet. The outcome of the coding is a variable-length code table for encoding the source, derived in a particular way from the estimated probability of occurrence of each possible value of the source symbol [6].

B. The algorithm

According to Huffman's algorithm, the binary encoding of the source symbols follows these steps:

1. The source symbols are arranged in order of decreasing transmission probability.
2. The last two symbols, those with the lowest probability, are combined into one symbol whose probability equals the sum of the probabilities of the two symbols. This reduces the number of symbols in the source alphabet by one.
3. Steps 1 and 2 are repeated until the source alphabet consists of only two symbols. These two symbols are assigned the digits 0 and 1 of the binary code.
4. A "0" and a "1" are assigned to the one and the other symbol, respectively, of each pair that was merged into one in step 2. This step applies to all mergers.
5. The codeword of each symbol is formed by all the bits "0" and "1" associated with that symbol (read from bottom to top), i.e. the digits assigned directly to it or to the merged symbols in which it is involved.

C. Example

The example uses the frequencies of the letters in the English language (according to Wikipedia) as the symbol probabilities.
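The merging steps of the algorithm can be sketched in C as follows. This is a minimal illustration and not the program used in the case study: the function names huffman_code_lengths and average_code_length, the MAX_SYMBOLS bound, and the decision to compute only codeword lengths (rather than the bit patterns) are assumptions made here for brevity. The weights do not have to sum to 1, so raw letter frequencies can be passed in directly.

```c
#include <assert.h>

#define MAX_SYMBOLS 32  /* assumed upper bound for this sketch */

/* Return the index of the lightest live group, ignoring index `skip`
   (pass -1 to ignore nothing). */
static int lightest_group(const double weight[], const int alive[],
                          int n, int skip)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!alive[i] || i == skip)
            continue;
        if (best < 0 || weight[i] < weight[best])
            best = i;
    }
    return best;
}

/* Fill lengths[i] with the Huffman codeword length of symbol i.
   Each merge of two groups adds one bit to every symbol inside them. */
void huffman_code_lengths(const double prob[], int n, int lengths[])
{
    double weight[MAX_SYMBOLS];  /* total weight of each group */
    int alive[MAX_SYMBOLS];      /* 1 while the group has not been absorbed */
    int group[MAX_SYMBOLS];      /* group that currently contains symbol i */

    assert(n >= 2 && n <= MAX_SYMBOLS);
    for (int i = 0; i < n; i++) {
        weight[i] = prob[i];
        alive[i] = 1;
        group[i] = i;
        lengths[i] = 0;
    }
    /* n symbols need exactly n - 1 merges to end up in a single group */
    for (int merges = 0; merges < n - 1; merges++) {
        int a = lightest_group(weight, alive, n, -1);
        int b = lightest_group(weight, alive, n, a);
        for (int i = 0; i < n; i++) {
            if (group[i] == a || group[i] == b) {
                lengths[i]++;   /* one more bit for every merged symbol */
                group[i] = a;   /* group b is absorbed into group a */
            }
        }
        weight[a] += weight[b];
        alive[b] = 0;
    }
}

/* Average codeword length in bits per symbol. The weights are
   normalised here, so they do not have to sum to 1. */
double average_code_length(const double prob[], const int lengths[], int n)
{
    double total = 0.0, weighted = 0.0;
    for (int i = 0; i < n; i++) {
        total += prob[i];
        weighted += prob[i] * lengths[i];
    }
    return weighted / total;
}
```

For the five letters a, b, c, d, e with the weights {0.082, 0.015, 0.028, 0.043, 0.127}, huffman_code_lengths gives codeword lengths of 2, 4, 4, 3 and 1 bits, and average_code_length returns about 2.007 bits per letter, compared with 3 bits per letter for a fixed-length code over five symbols.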
Assume that we have to use the Huffman algorithm to encode the following subset of the English letters, with their probabilities taken from the English letter frequencies:

X = {a, b, c, d, e} and P(X) = {0.082, 0.015, 0.028, 0.043, 0.127}

According to the algorithm, we first arrange the symbols in descending order of transmission probability: e (0.127), a (0.082), d (0.043), c (0.028), b (0.015). In the working table, the first column contains the symbols and the second column their probabilities. In the next step, the two symbols with the lowest probability, c and b, are combined into one symbol with probability equal to the sum of their probabilities, 0.043. Then we re-arrange the symbols, taking the merged symbol into account. In the next step, the two symbols with the lowest probability, now d and the merged symbol, are merged again (0.086), and the remaining symbols are arranged again. We repeat the steps of merging and arranging, so that a joins the merged symbol (0.168), until we finally end up with the last two symbols, e (0.127) and the merged symbol (0.168).

Starting from this last column of probabilities, we assign the digits "0" and "1" to the two symbols, respectively. In the prior probabilities column we also assign "0" and "1" to the two symbols that were combined, and so on. Finally, the codeword of each chosen letter is read off from the digits assigned along its chain of merges.

D. Limitations

Although Huffman's algorithm is optimal for symbol-by-symbol coding with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent. Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream [6].

III. CASE STUDY

A. Description

In order to compare the results of the Huffman coding implementation between different types of human-produced texts, we chose three representative documents of human activity. The first is from the field of literature, the 1876 novel "The Adventures of Tom Sawyer" by Mark Twain [7]. The novel is clearly indicative of the folklore surrounding life on the Mississippi River around the time it was written. The second is a plain HTML document, the front page of the BBC news channel website on 08-11-2012. It represents modern spoken English in the western world. Finally, the last document is the C language source code of the implementation of the Huffman algorithm itself. It represents a strictly technical document with a limited set of words.

B. Implementation

Two programs were used to apply Huffman coding to the three documents. The first one, "letter_count.cpp", accepts as input a file in text format and counts the exact number of letters and spaces in the whole document. It also calculates the frequency of each symbol and writes these values to a text file named "freqs.txt". The second, "huffman.c", is an implementation of the Huffman algorithm. It outputs the compressed binary code to a text file and also calculates the percentage of memory saved by using the algorithm.

C. Results

The first document, "The Adventures of Tom Sawyer", was found to have a total of 379,055 letters and spaces altogether. The uncompressed text would have been encoded using 552 original bits in total for all letters, and after the application of Huffman coding the total bits used are 322. This yields 58.33% saved memory.

The second document, from www.bbc.com, was found to have a total of 88,687 letters and spaces altogether. The uncompressed text would have been encoded using 824 original bits in total for all letters, and after the application of Huffman coding the total bits used are 623. This yields 75.61% saved memory.

The third document, "huffman.c", was found to have a total of 3,367 letters and spaces altogether. The uncompressed text would have been encoded using 144 original bits in total for all letters, and after the application of Huffman coding the total bits used are 78. This yields 54.17% saved memory.

IV. CONCLUSION

According to the results of applying Huffman coding to the three documents, we can say that the gain was greatest for the bbc.com HTML document. This may be a result of the frequent appearance in this document of certain tags such as <div>, <p>, etc. Also interesting is the fact that the amount of compression for the other two documents is almost equal. Further research with more documents is needed in order to reach a safe conclusion.

APPENDIX

Source code of the first C program: letter_count.cpp

#include <cstdio>
#include <cctype>

int main()
{
    FILE *input, *output;
    int c;  /* int rather than char, so that EOF can be detected reliably */
    const char letters[] = "abcdefghijklmnopqrstuvwxyz ";
    int count[27] = {0};
    int i, lettercount = 0;
    char filename[20];

    /* get input details from user */
    printf("Type the name of the file to process: ");
    if (scanf("%19s", filename) != 1)
        return 1;
    input = fopen(filename, "r");
    if (input == NULL) {
        printf("File doesn't exist\n");
        return 1;
    }
    output = fopen("freqs.txt", "w");
    if (output == NULL) {
        fclose(input);
        return 1;
    }

    /* count letters (case-insensitively) and spaces until EOF (end of file) */
    while ((c = getc(input)) != EOF) {
        c = tolower(c);  /* all characters to lowercase */
        if (c >= 'a' && c <= 'z') {
            count[c - 'a']++;
            lettercount++;
        } else if (c == ' ') {
            count[26]++;
            lettercount++;
        }
    }
    fclose(input);

    for (i = 0; i <= 26; i++) {
        printf("%c: %d\n", letters[i], count[i]);
        fprintf(output, "%d\n", count[i]);
    }
    fclose(output);
    printf("There are %d letters in the text\n", lettercount);
    return 0;
}

Source code of the second C program: huffman.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* number of decimal digits of x */
#define len(x) ((int)log10(x)+1)

int frequencies[27];

/* Node of the huffman tree */
struct node {
    int value;
    char letter;
    struct node *left, *right;
};
typedef struct node Node;

/* finds and returns the smallest sub-tree in the forest */
int findSmaller(Node *array[], int differentFrom)
{
    int smaller;
    int i = 0;

    while (array[i]->value == -1)
        i++;
    smaller = i;
    if (i == differentFrom) {
        i++;
        while (array[i]->value == -1)
            i++;
        smaller = i;
    }
    for (i = 1; i < 27; i++) {
        if (array[i]->value == -1)
            continue;
        if (i == differentFrom)
            continue;
        if (array[i]->value < array[smaller]->value)
            smaller = i;
    }
    return smaller;
}

/* builds the huffman tree and returns its address by reference */
void buildHuffmanTree(Node **tree)
{
    Node *temp;
    Node *array[27];
    int i, subTrees = 27;
    int smallOne, smallTwo;

    for (i = 0; i < 27; i++) {
        array[i] = malloc(sizeof(Node));
        array[i]->value = frequencies[i];
        array[i]->letter = i;
        array[i]->left = NULL;
        array[i]->right = NULL;
    }
    while (subTrees > 1) {
        smallOne = findSmaller(array, -1);
        smallTwo = findSmaller(array, smallOne);
        temp = array[smallOne];
        array[smallOne] = malloc(sizeof(Node));
        array[smallOne]->value = temp->value + array[smallTwo]->value;
        array[smallOne]->letter = 127;  /* marks an internal (merged) node */
        array[smallOne]->left = array[smallTwo];
        array[smallOne]->right = temp;
        array[smallTwo]->value = -1;    /* marks the slot as already merged */
        subTrees--;
    }
    *tree = array[smallOne];
    return;
}

/* builds the table with the bits for each letter. 1 stands for binary 0
   and 2 for binary 1 (used to facilitate arithmetic). The code is kept as
   decimal digits in an int, so codewords longer than 9 bits may overflow
   a 32-bit int. */
void fillTable(int codeTable[], Node *tree, int Code)
{
    if (tree->letter < 27)
        codeTable[(int)tree->letter] = Code;
    else {
        fillTable(codeTable, tree->left, Code*10+1);
        fillTable(codeTable, tree->right, Code*10+2);
    }
    return;
}

/* function to compress the input */
void compressFile(FILE *input, FILE *output, int codeTable[])
{
    unsigned char bit, x = 0;
    int c;  /* int rather than char, so that EOF can be detected reliably */
    int n, length, bitsLeft = 8;
    int originalBits = 0, compressedBits = 0;

    /* as in the original program, reading stops at the first newline
       (ASCII 10); the EOF test prevents an endless loop when there is none */
    while ((c = fgetc(input)) != EOF && c != 10) {
        if (c == 32)
            n = codeTable[26];
        else if (c >= 97 && c <= 122)
            n = codeTable[c-97];
        else
            continue;  /* skip characters without a code (not a-z or space) */
        originalBits++;
        length = len(n);
        while (length > 0) {
            compressedBits++;
            bit = n % 10 - 1;
            n /= 10;
            x = x | bit;
            bitsLeft--;
            length--;
            if (bitsLeft == 0) {
                fputc(x, output);
                x = 0;
                bitsLeft = 8;
            }
            x = x << 1;
        }
    }
    if (bitsLeft != 8) {
        x = x << (bitsLeft-1);
        fputc(x, output);
    }

    /* print details of compression on the screen; the percentage printed
       is the compressed size divided by the original size */
    fprintf(stderr, "Original bits = %d\n", originalBits*8);
    fprintf(stderr, "Compressed bits = %d\n", compressedBits);
    fprintf(stderr, "Saved %.2f%% of memory\n", ((float)compressedBits/(originalBits*8))*100);
    return;
}

/* invert the codes in codeTable2 so they can be used with mod operator by compressFile function */
void invertCodes(int codeTable[], int codeTable2[])
{
    int i, n, copy;

    for (i = 0; i < 27; i++) {
        n = codeTable[i];
        copy = 0;
        while (n > 0) {
            copy = copy * 10 + n % 10;
            n /= 10;
        }
        codeTable2[i] = copy;
    }
    return;
}

int main()
{
    Node *tree;
    int codeTable[27], codeTable2[27];
    int i, n;
    char filename[20];
    FILE *input, *freqsin, *output;

    /* get input details from user */
    printf("Type the name of the file to process: ");
    if (scanf("%19s", filename) != 1)
        return 1;
    input = fopen(filename, "r");
    freqsin = fopen("freqs.txt", "r");
    output = fopen("output.txt", "w");
    if (input == NULL || freqsin == NULL || output == NULL) {
        fprintf(stderr, "Could not open the input file, freqs.txt or output.txt\n");
        return 1;
    }
    for (i = 0; i <= 26; i++) {
        fscanf(freqsin, "%d", &n);
        frequencies[i] = n;
        printf("frequencies[%d]=%d\n", i, frequencies[i]);
    }
    buildHuffmanTree(&tree);
    fillTable(codeTable, tree, 0);
    invertCodes(codeTable, codeTable2);
    compressFile(input, output, codeTable2);
    fclose(input);
    fclose(freqsin);
    fclose(output);
    return 0;
}

REFERENCES
[1] C. E. Shannon, "A Mathematical Theory of Communication". The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.
[2] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the I.R.E., September 1952.
[3] R. M. Fano, "The transmission of information". Technical Report No. 65 (Cambridge (Mass.), USA: Research Laboratory of Electronics at MIT), 1949.
[4] T. M. Cover and Joy A. Thomas, Elements of information theory (2nd ed.). John Wiley and Sons. pp. 127-128. ISBN 978-0-471-24195-9, 2006.
[5] Data Compression, Wikipedia. Available: http://en.wikipedia.org/wiki/Data_compression
[6] Huffman Coding, Wikipedia. Available: http://en.wikipedia.org/wiki/Huffman_coding
[7] The Project Gutenberg. Available: http://www.gutenberg.org/files/74/74-h/74-h.htm
[8] BBC - Homepage. Available: http://www.bbc.com
Konstantinos D. Pechlivanis was born in Thessaloniki in 1967. He studied mathematics at the Aristotle University of Thessaloniki, Greece (1992). Next, he studied informatics at the Alexandrium Technological and Educational Institute of Thessaloniki, Greece (2006). He worked as a teacher of mathematics in private and public schools from 1993 until 2007. During his studies in informatics, he worked at the Center for Software Innovation, Sonderborg, Denmark, for vocational training, on a scholarship from the Leonardo Programme of the European Union, from 2005 until 2006. He also worked as a teacher of informatics in primary and high school from 2006 to 2007. Since 2007 he has been head of the Department of Informatics and Organization of the Mental Hospital of Corfu, Greece. His research interests are in the field of formal methods for requirements specification.
