An Efficient Technique for Text Compression

Md. Abul Kalam Azad, Rezwana Sharmeen, Shabbir Ahmad and S. M. Kamruzzaman
Department of Computer Science & Engineering, International Islamic University Chittagong, Chittagong, Bangladesh.
Department of Computer Science & Engineering, Manarat International University, Dhaka, Bangladesh.
Email: {azadarif, r_sharmin_79, bappi_51, smk_iiuc}@yahoo.com
Abstract
Storing a word or a whole text segment requires a large amount of storage space; typically a character requires 1 byte of memory. Compression is therefore important for data management, and for text data the compression must be lossless. We propose a lossless compression method for text data based on a two-level approach: first reduction, then compression. Reduction is performed using a word lookup table rather than a traditional indexing system; compression is then performed using currently available compression methods. The word lookup table will be a part of the operating system, and the reduction will be done by the operating system. Under this method each word is replaced by an address value, which can quite effectively reduce the persistent memory required for text data. At the end of the first level, a binary file containing the addresses is generated. Since the proposed method does not use any compression algorithm in the first level, this file can be compressed using popular compression algorithms, finally providing a great deal of data compression on purely English text data.
Keywords
Text, reduction, compression, lookup table, size.
1. Introduction
A text segment is a collection of words, and a word consists of characters. Characters are the basic units of a word, so to store a text segment every word must be stored separately, and to store a word every character it contains must be stored. This storing mechanism, which is what current systems use, requires a huge amount of disk space and makes the text segment consume more space than necessary. Suppose a text segment contains some word n times, and the word is 7 characters long (taken as an average length); then the repeated presence of that same word requires n*7 bytes. If some sort of indexing within the text segment is done, the segment size can be reduced, but this process is still not effective because it needs extra space for the indexing table and may sometimes increase the file size rather than decrease it. If instead a word index table is used to index the text segment, and that table is not part of the text segment but rather part of the operating system, the memory requirement of the text segment can be reduced effectively. The proposed method decreases the persistent memory requirement by approximately 50% or more, generating a binary file in the process. After the reduction, traditional compression is applied over the binary file. Since the file generated through the lookup table addresses has already decreased the memory requirement of the text data, applying the Deflate algorithm, an improved combination of the Huffman coding algorithm and the LZ-77 coding algorithm, over this binary file can reduce the memory requirement quite dramatically.
The lookup table is a special related table in which numeric item values are classified into categories. An INFO lookup table contains at least two items: the relate item and an item named either symbol or level. In our proposal the lookup table will hold an address value for each word. At the time of memory-requirement compression, the word in memory will be replaced by that value; since the value needs only a small number of bits to represent itself, it will reduce the storage requirement. No language is a rigid body: languages are always expanding, enriched day by day by the invention of new words and also by the adoption of new words. So some challenges were faced in choosing the boundary of the language. Moreover, the following things had to be considered.

The 1st International Conference on Information Management and Business (IMB2005)
The base of our proposed methodology is the words in the English dictionary, but due to spelling mistakes there may be many words that will not match any entry in the dictionary.

Technical words, chemical names and names of different species of animals number more than 1 million, but those words are generally not added to dictionaries. So it must be decided whether those names will be allowed in the lookup table.

Sometimes words from another language that have not yet been adopted into English may be found in the text segment. Though such a word may be common in use, it will be considered a special word, that is, a constant word.
2. Proposed method
The proposed compression will be carried out on the basis of a two-level approach. In the first level the text will be reduced using a word lookup table; in the second level the text will be compressed using the Deflate algorithm.
To reduce the size of the text segment, the proposed approach will use a word lookup table, which will act like a word store: each particular word will be assigned an address value, and any particular word will be determined by that value. For that reason, when a word is stored it will not be stored as a byte stream based on each character's ASCII value; rather it will be stored as a fixed-size bit stream forming the address value, which references the address of the word in the word table.
A word lookup table is a special tabular data file containing the text dimension of a word as an attribute of an address, which is used to pop up text to display the possible text data for a field. For our purpose we have decided to use a 19-bit word lookup table, which can contain 2^19 = 524,288 entries. That is, a 19-bit lookup table can index 0.524 million different words, whereas current English is estimated to have a total of about 0.470 million words [3, 6]. A 19-bit lookup table can therefore easily index any English text. Since there are only about 0.470 million different words in English, about (0.524 - 0.470) = 0.054 million entries remain empty; after deducting some entries for special-situation handling, approximately 52,000 entries are still empty. These empty entries will be used for further improvement of the proposed methodology. The block diagram of the word lookup table is shown below in Figure 1.
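The capacity arithmetic above can be checked with a short sketch (plain Python, illustrative only; the 0.470 million vocabulary estimate is the paper's figure):

```python
# Capacity of a 19-bit word lookup table versus the estimated
# English vocabulary of ~0.470 million words (figure from the paper).
ADDRESS_BITS = 19
capacity = 2 ** ADDRESS_BITS          # number of distinct addresses
english_words = 470_000               # approximate English vocabulary

print(capacity)                       # 524288 entries
print(capacity - english_words)       # 54288 spare entries
```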
Figure 1: Block diagram of the word lookup table (address boundaries 0, 2,000, 472,000 and 524,288 separate the special situation handling addresses, the filled word lookup table entries and the empty entries).
From experiment it was found that an English word consists of 6.91 characters on average. That is why, in the word lookup table, the size of any word is taken as 7 characters on average. The storing architecture of the word is shown below in Figure 2.
Figure 2: Word storing architecture in the lookup table (fields: index address, word stored in details, ending signal).
According to the proposed architecture, the first 19 bits of a word lookup table entry determine the address of the word in the table, and the next run of 6-bit combinations holds the particular word stored in detail; here 7 characters is taken as an average. Finally, the last portion is a 6-bit zero value, usually the combination 000000. A 6-bit character is used because the word lookup table will consist of dictionary words, so only general characters and punctuation symbols are needed, and that is possible using 6-bit characters. An example of an entry in the word lookup table is shown below in Figure 3.
Figure 3: Words in the lookup table (fields: index address, word stored in details, ending signal).
The word lookup table has some special situation handling addresses, for several reasons. In text data there may be a name, a constant word or a spelling mistake. As the proposed reduction is lossless, such words need to be represented too, but they will not be found in the word lookup table. In that case a termination signal, a certain 19-bit address value, will be placed in the file. This value tells the reduction machine that, from then on until a new word from a lookup table entry is encountered, all the data is to be treated as 6-bit ASCII characters. Then, after the ASCII values, if any lookup table word is encountered, a zero-valued 6-bit combination is added, representing the termination of the ASCII values and the restart of address values. The address pattern of the special situation handling addresses is shown in the following table, Table 1.
Table 1: Special situation handling addresses

Index address    Word stored in details    Ending signal
…                …                         …

One of the main exceptions in the word lookup table is that for each punctuation sign there are two different entries. One entry consists of only the punctuation sign, and the other consists of the punctuation sign with a no-space protection. Although a space is generally placed after each punctuation sign, in some cases no space follows the punctuation sign in the source file; the machine would then count the words before and after the punctuation as a single word and treat it as a spelling mistake. To protect memory consumption, the machine will handle this problem as in Table 2.

Table 2: Entry of different punctuation signs

Index address    Word stored in details    Ending signal
…                .                         …
…                ,                         …

The text segment will be indexed as a series of consecutive 19-bit addresses. The white space between each respective word is excluded here: in the binary text stream each 19 bits represents a word, and after each word a white space is automatically added. A question may arise as to how multiple spaces are to be resolved. The answer is that for each such character in the ASCII table the lookup table has three entries: single, double and triple. In this way the presence of multiple repeated characters is resolved. Some examples are shown in Figures 4, 5, 6 and 7, where LUT stands for lookup table.
Figure 4: Example 1
Figure 5: Example 2
Figure 6: Example 3
Figure 7: Example 4
Now, to build up a 19-bit word lookup table with 67-bit (average) entries — a 19-bit address, 7 characters of 6 bits each, and a 6-bit ending signal — the proposed approach needs a memory space of about:

Space = 2^19 * 67 bits = 524,288 * 67 bits = 35,127,296 bits = 4,390,912 bytes = 4,288 kilobytes = 4.1875 megabytes

So a 19-bit word lookup table needs only 4.1875 MB of memory space to generate and store.
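The memory calculation above, with the 67-bit average entry implied by the paper's own formula (19-bit address + 7 × 6-bit characters + 6-bit terminator), works out as:

```python
# Verify the lookup-table storage calculation from the paper.
entries = 2 ** 19                     # 524,288 addresses
entry_bits = 19 + 7 * 6 + 6           # address + avg 7 chars + terminator = 67
total_bits = entries * entry_bits     # 35,127,296 bits
total_bytes = total_bits // 8         # 4,390,912 bytes
print(total_bytes / 1024 / 1024)      # 4.1875 MB
```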
The word lookup table will be a continuous bit stream with the following pattern. As the word lookup table will be a very large database, and the proposed methodology must be fast, the searching needs to be fast. Binary search would normally be a better option, but since the lookup table is a continuous database, binary search is not possible. Using a hash table and then doing a linear search is a far better solution.
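The hash-then-linear-search strategy can be sketched as follows. Bucketing by the first letter of the word is an assumption suggested by Table 3's "Starting Character" column, not a scheme the paper spells out:

```python
# Sketch: bucket the vocabulary by starting character (per Table 3),
# then linear-search only within the matching bucket.
from collections import defaultdict

def build_buckets(words):
    buckets = defaultdict(list)
    for address, word in enumerate(words):   # address = table index
        buckets[word[0]].append((word, address))
    return buckets

def lookup(buckets, word):
    for w, address in buckets.get(word[0], []):  # linear scan of one bucket
        if w == word:
            return address
    return None                                   # not found -> escape path

buckets = build_buckets(['apple', 'ant', 'bat'])
print(lookup(buckets, 'ant'))          # 1
print(lookup(buckets, 'zebra'))        # None
```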
Table 3: The hash table

Index address    Starting character

Figure 8: How the word lookup table will be stored.
For the compression of the text data a fixed algorithm will be carried out over the text. The general algorithm for converting unreduced text to reduced text is shown below:
Algorithm UnRedToRed( ) {
1. Read file.
2. Read characters to form a word until empty.
3. Find its appropriate address from the hash table.
4. Find the word in the lookup table.
5. If found then
6. {
7.   Check case.
8.   If case = lower then
9.     Fetch address
10.  else
11.  {
12.    Do the case management.
13.    Fetch address.
14.  }
15.  Print the address.
16. }
17. else
18. {
19.   Give termination symbol.
20.   Start ASCII storage (word).
21. }
22. Go to step 1.
23. End.
}

The general algorithm for converting reduced text back to unreduced text is shown below:
Algorithm RedToUnRed( ) {
1. Read file.
2. Fetch address.
3. Check address status.
4. If word then
5.   Print the word.
6. If situation handling operator then
7.   Do according to it.
8. Go to step 2.
9. End.
}

The actual traditional compression is done at this level using the Deflate compression algorithm, which compresses the text data using both Huffman coding and the LZ-77 algorithm. The reduced text file is regenerated as a binary file, and the Deflate compression is carried out over that binary file.
3. Experimental result
From Example 1 it is seen that the general size of the text is 176 bits, whereas using the 19-bit word lookup table the size becomes 133 bits, a reduction of 24.43%. For Example 2, the general size is 248 bits but the 19-bit word lookup table size is 133 bits, a reduction of 46.37%. For Example 3, the general size is 376 bits and the 19-bit word lookup table size is 133 bits, a reduction of 64.62%. Consequently, for Example 4 the general size is 512 bits and the 19-bit word lookup table size is 171 bits, a reduction of 66.60%. Here is another example of the experimental data; the following text segment, Example 5, was copied 24 times to make a large text segment. Example 5 is shown below:

“Although computers may have basic similarities, performance will differ markedly between them, and just the same as it does with cars. The PC contains several processes running at the same time, often at different speeds, so a fair amount of coordination is required to ensure that they don't work against each other. Most performance problems arise from bottlenecks between components that are not necessarily the best for a particular job, but a result of compromise between price and performance. Usually, price wins out and you have to work around the problems this creates. The trick to getting the most out of any machine is to make sure that each component is giving of its best, and then eliminate potential bottlenecks between them. You can get a bottleneck simply by having an old piece of equipment that is not designed to work at modern high speed - a computer is only as fast as its slowest component, but bottlenecks can also be due to badly written software.”

Another experimental data set was created by copying the following text segment, Example 6, 32 times to make a large text segment.
Example 6 is shown below:

“In the current world we have high powerful processors and high capability storages devices not only in the micro computer but also in PDA’s. That’s why it not difficult to store or to manipulate a file. But it is still difficult to transfer file or data through communication medium. The reason is that the signal capacities of the carriers are not sufficient enough. And this problem is deeply felt in internet communication. In the case of text transfer if we can minimize the text size it will increase the faster portability of the text files. This can be done by indexing the text and by generating a lookup table which will be used to index the text and that will decrease the number of Bytes needed to define a particular text.”

The reduction results for Examples 5 and 6 are shown below in Table 4.
Table 4: Size reduction result

                                     Example 5   Example 6
In general situation
  Words                              3,984       4,544
  Characters                         19,392      19,328
  Characters with white space        23,378      23,519
  Text size (bytes)                  23,378      23,519
In 19-bit word lookup table
  Words                              3,984       4,320
  Punctuation                        361         224
  Words with punctuation             4,345       4,544
  Per-word text size (bits)          19          19
  Text size (bytes)                  10,320      10,792
Text size reduction status
  General situation (bytes)          23,378      23,519
  19-bit word lookup table (bytes)   10,320      10,792
  Size reduced (bytes)               13,058      12,727
  Reducing percentage                55.86%      54.11%
We also experimented with the text size for two stories by Leo Tolstoy and a couple of articles published in local English newspapers. The results for the stories by Tolstoy were 55.91% and 47.32%. For the articles in daily newspapers, the results were 40.24%, 55.64%, 64.36%, 52.75%, 49.16% and 56.97%. That is, we finally got an average reduction rate of 53.4188%.
In general, existing compression methods have compression rates from 12% up to a highest of about 50%. But with our method we found a 53% reduction at the very start of the approach, before any compression is applied; further improvement of the approach will therefore increase the reduction rate.
In comparison with currently available zip software, we found the following outputs for the same file. The comparison is shown in Table 5.
Table 5: Comparison with zip software

Compression type     Size
Normal               78.53 KB
Proposed method      14.38 KB
Gzip                 29.61 KB
Winzip               31.27 KB
Chart 1: Comparison with zip software
4. Conclusion
In this paper we set out to provide a whole new compression method. As the world moves towards the goal of providing the highest service at the lowest expense, this word lookup table method will let any text segment use less memory space without losing any of its features; rather, it will increase its usability and portability. It will decrease the memory area occupied by text segments in any type of file, freeing a huge amount of memory, and will also decrease transfer time through FTP or SMTP.
5. Limitations and future work
This method may be applied more efficiently if suitable algorithms are applied for determining the address value and doing memory management. Our intention is to use the Deflate algorithm to decrease the index address memory requirement as well as the constant words' memory requirement.
6. References
… Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms.
[20] Abraham Silberschatz, Peter Baer Galvin, Greg Gagne, "Operating System Concepts", John Wiley & Sons, Inc.
[21] …