Data Compression: Theory and Demonstration

In this project, we explored the foundations of data compression using elements from information theory. This topic is very well summarized in Chapters (5) in Covers & Thomas: Elements of Information Theory. Our task was to read the relevant sections carefully and then prepare a self-contained report on the subject. The project is broken down into four milestones that lead us through understanding the various components of this project.

Report Link

Overview

Information theorists think of a discrete source as a discrete random variable with a finite number of elements in the sample space (these are typically referred to as alphabets). Data is generated from the source by sampling (independently) the discrete random variable. Coding is concerned with finding a unique 1-1 mapping between the alphabets and sequences of 0s and 1s. For example, a source that has four possible elements in the sample space can be successfully encoded with two bits. In general, a source with $N$ alphabets can be encoded with $Log(N)$ bits. The code refers to the coding scheme, and the code word refers to the representation of one element of the alphabet.

Probability theory has been successfully used in the context of data compression. The main idea behind compression is to avoid encoding all alphabets with the same length codes, as we suggested earlier. Instead, you assign the shortest code word to the most frequent out- comes (the ones that have the highest probability). If this is done in a clever way, you may be able to represent a source with a code that is much shorter (on average) than the number of bits needed to encode the number of elements in the sample space ($Log(N)$). Information theory presented a systematic way of answering the question of obtaining the minimum av- erage code length for a given source. The main mathematical quantity introduced to answer this question is Entropy.

Project Elements and Required Reading

There four main elements to this project that are necessary to address the question of Data Compression:

Understanding the concepts behind variable length codes as well as and the computation of an average code length.
Understanding the concept of one-to-one codes and prefix codes.
Understanding the characterization of the minimum average code length of a source.
Understanding Huffman Codes as a constructive procedure to obtain an optimal code (minimizing average code length).

To complete these components of the project, we first experimented with different sources (Random Variables) using different coding schemes. For that, you will need to use some software package that enables sampling random variables. Matlab is one possible environment. Secondly, we need to read specific sections from Elements of Information Theory by Covers and Thomas. The book provides the theoretical development of Data Compression. The theoretical development of Data Compression is the main core of this project.

Milestones and Deliverables

This project will be divided into four milestones, each with an associated set of deliverables.

Milestone 1

In this part of the project, we generated samples of discrete random variable with a given probability mass function (pmf). To do this, we used Matlab as a simulation environment. The function ‘rand' in matlab computes random samples of a continuous random variable that is uniformly distributed over $[0,1]$. It is possible to generate samples of a general discrete random variable $X$ over a finite alphabet (with a given pmf) by recognizing the following: Let $F_X$ be the cumulative mass function associated with $X$. If you define $Y = F_X(X)$, then the new derived random variable is uniformly distributed on $[0, 1]$. If you use ‘rand' to generate uniform samples, then samples of $X$ can be obtained from ${F_X^{−1}}(Y)$. You also wrote a short matlab function that accepts the pmf of $X$ and outputs $M$ independent samples of $X$.

We will consider three pmfs defined on an alphabet of size 8:

$ p_1(X) = \left\{ \begin{array}{cc} 1/8 & x=1 \\ 1/8 & x=2 \\ 1/8 & x=3 \\ 1/8 & x=4 \\ 1/8 & x=5 \\ 1/8 & x=6 \\ 1/8 & x=7 \\ 1/8 & x=8 \end{array} \right. $ $\qquad$ $ p_2(X) = \left\{ \begin{array}{cc} 1/16 & x=1 \\ 1/16 & x=2 \\ 5/16 & x=3 \\ 4/16 & x=4 \\ 2/16 & x=5 \\ 2/16 & x=6 \\ 0 & x=7 \\ 1/16 & x=8 \end{array} \right. $ $\qquad$ $ p_3(X) = \left\{ \begin{array}{cc} 0 & x=1 \\ 0 & x=2 \\ 1/4 & x=3 \\ 1/4 & x=4 \\ 0 & x=5 \\ 1/4 & x=6 \\ 1/4 & x=7 \\ 0 & x=8 \end{array} \right. $

As mentioned in the introduction, random variables are good probabilistic models of an alphabet. Think of a source being the english language. Sentences are constructed by sampling words from a probability mass function. However, it is true that some words are used more often than others (‘the' is one of the most commonly used words in english). Hence, the corresponding pmf is not uniform. In the three examples above, imagine a language comprising 8 symbols (we denote by 1, . . . 8), and sentences constructed by randomly selecting symbols according to one of the pmfs. It is in this sense that we think of a random variable as a source.

Deliverables: For this part of the project, we wrote a matlab function that provides independent samples of a general discrete random variable (using the idea described above). Using this function, we generated 100 samples from each of the three pmfs described above. We also plotted a histogram of the data set you generated (a histogram has the alphabet on the x-axis and the percentage of occurrence on the $Y$-axis). All findings are explained using the law of large numbers.

Milestone 2

In this part of the project, we evaluated two coding schemes for each of the above examples. The first code is a fixed-length code as all symbols are assigned a code of length 3, while the second code is variable length code as codes with different lengths are assigned to different symbols. The coding schemes are described as follows:

$ \begin{array}{cc} Code1: & \begin{array}{lll} 1 & \rightarrow & 000 \\ 2 & \rightarrow & 100 \\ 3 & \rightarrow & 010 \\ 4 & \rightarrow & 001 \\ 5 & \rightarrow & 100 \\ 6 & \rightarrow & 110 \\ 7 & \rightarrow & 101 \\ 8 & \rightarrow & 111 \\ \end{array} \end{array} $ $\qquad$ $ \begin{array}{cc} Code1: & \begin{array}{lll} 1 & \rightarrow & 1101101 \\ 2 & \rightarrow & 1101100 \\ 3 & \rightarrow & 10 \\ 4 & \rightarrow & 001 \\ 5 & \rightarrow & 111 \\ 6 & \rightarrow & 1100 \\ 7 & \rightarrow & 1101101 \\ 8 & \rightarrow & 11010 \\ \end{array} \end{array} $

Deliverables: The first deliverable of this part was related to codes. after verifying that the variable code has two important properties:

Property 1: Each symbol is mapped uniquely to a code word.
Property 2: Any sequence of possible codes words can be decoded without any ambi- guity. This is straightforward with the fixed length code but can be tricky with the variable length code. For example, using the second code, the sequence $011001110010$ uniquely refers to the symbols $365334$.

The second deliverable of this part was related to the computation of the average code length. From the data we generated in the previous example, we used the two coding schemes to compute the ‘sample' mean of the code length. We did this by adding the lengths of each code word for each sample and then dividing by the number of samples. We did this for each of the pmf given above. We also derived another prefix code and repeat the computation above to compare our code to the above codes in terms of average code length?

Finally, compute the expected value of the code length for the two codes and the three pmfs. Compare the exact calculations to the sample averages that you computed earlier. Comment on the results.

Milestone 3: Optimal Code Length

In this part of the project, you learned about optimal compression strategies.

Deliverables: In this part of the report, we summarized the main theoretical results on Data compression. We did that in the following order:

Explaining Kraft's Inequality and explaining how it imposes constraints on the code lengths of a given prefix variable code. We used a binary graph to explain this inequality.
Describing the direct formulation of the problem as a constrained minimization of the expected code length.
Explaining how the solution depends on the entropy of the source. We computed the Entropy for the three examples above.

Milestone 4: Huffman Codes

The previous section demonstrated how Entropy provides a characterization of the optimal compression of a source. Huffman codes provide a constructive procedure to obtain an optimal code.

Deliverables: For this part of the report, we explained graphically how Huffman codes are constructed through various examples. We gave an intuitive explanation why such codes are optimal. Apply this scheme to the examples above to derive optimal codes.

Final Report and Interactions

We prepared a clearly written report that summarizes all our findings and includes all the deliverables of all the four milestones.

Love,
Ahmed Jazzar