All sources I’ve read quote the following process for generating Huffman codes:
- Arrange the elements in ascending order of their frequencies.
- Create a binary tree by repeatedly combining the two least frequent elements into a single node, with the sum of their frequencies as the frequency of the new node.
- Assign ‘0’ to the left branch and ‘1’ to the right branch at each step.
- The Huffman code for each element is obtained by traversing from the root of the tree to the respective leaf node, recording the path taken (0 for left, 1 for right).
Let’s try this for the following frequency map:

```
A -> 1, B -> 2, C -> 3, D -> 4
```
Making the tree:

```
          (ABCD, 10)
          /        \
     (ABC, 6)     D:4
     /      \
  (AB, 3)   C:3
   /   \
  A:1   B:2
```
Following this tree to get the codes:

```
A -> 000, B -> 001, C -> 01, D -> 1
```
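The merging procedure above can be sketched in Python with a min-heap (this is a hedged illustration, not from the original post; note that ties between equal frequencies can be broken either way, so the exact 0/1 assignments may differ from the hand-built tree, but the code *lengths* come out the same):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two least
    frequent nodes, exactly as in the listed steps."""
    tiebreak = count()  # unique tiebreaker so the heap never compares trees
    # Heap entries: (frequency, tiebreaker, tree); a tree is either a
    # symbol string (leaf) or a (left, right) pair (internal node).
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # least frequent
        f2, _, right = heapq.heappop(heap)  # second least frequent
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # 0 for the left branch
            walk(tree[1], prefix + "1")  # 1 for the right branch
        else:
            codes[tree] = prefix or "0"  # single-symbol alphabet edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 1, "B": 2, "C": 3, "D": 4})
print(codes)
# Code lengths are 3, 3, 2, 1 for A, B, C, D – matching the tree above,
# even if the particular 0/1 labels differ due to tie-breaking.
```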
Obviously, the max depth and the max code length do not follow the logical rule of being log2(nElements), which in this case should be two.
Optimally, the codes would be:

```
A -> 00, B -> 01, C -> 10, D -> 11
```
So what’s going on here? What am I doing wrong?
> Solution:
Imagine encoding the string "ABBCCCDDDD" – a sample string in which each symbol occurs with the given frequency.
The Huffman encoding you came up with encodes it as 0000010010101011111 – 19 bits.
Your encoding, where you optimized for max code length, gives 00010110101011111111 – 20 bits.
Your mistake is thinking that max depth is what’s being optimized (or what needs to be optimized) – Huffman coding actually optimizes for total encoded text length.
And as a side note, as I mentioned in a comment, Huffman produces "an" optimal solution, not "the" optimal solution: I think it’s possible for a lower-depth solution to be just as good, but it won’t be better.