Vowel Compressibility and the Top 5000 Words in English
I recently learned that the popular term for removing vowels from words, generally for the purpose of censorship but also for naming your trendy Yahoo-owned startup, is “disemvoweling”. It works without losing the meaning because English words are generally comprehensible without vowels, especially with the help of context. Researchers have found that the inclusion of vowels within words can actually be confusing to native speakers of languages that do not write vowels, or that write them as diacritics.
Given that vowels may not be necessary for effective communication in English, let’s take a look at the degree to which English would be compressed if we removed them entirely.
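For concreteness, here is a minimal Python sketch of disemvoweling. The function name `disemvowel` is my own, and it assumes that only a, e, i, o, and u count as vowels, so y is treated as a consonant.

```python
# Assumption: only a, e, i, o, u count as vowels; "y" is kept as a consonant.
VOWELS = frozenset("aeiouAEIOU")


def disemvowel(text: str) -> str:
    """Return text with every vowel removed; all other characters are kept."""
    return "".join(ch for ch in text if ch not in VOWELS)


print(disemvowel("Vowel compressibility"))  # Vwl cmprssblty
```

One caveat: words made up entirely of vowels, such as “a” or “I”, vanish completely under this rule.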
Mark Davies at Brigham Young University has compiled a list of what he believes to be the 5000 most commonly used words in English. Using this list, I ran a few numbers on how much the average word shrinks once its vowels are removed.
Of the top 5000 words, the average length is 6.4556 letters and the median is 6 letters, so the median is reasonably close to the average. The population standard deviation in length is 2.35 letters. The average number of vowels per word is 2.0298, so the average number of consonants per word is 4.426. This indicates that 68.55% of the letters in an average word will be retained after removing vowels, or that 31.45% of letters will be removed.
Thus, removing vowels shrinks the average word by 31.45%, to 68.55% of its original size.
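The arithmetic behind these figures can be sketched in Python. This is not the exact calculation I ran: `vowel_stats` is a hypothetical helper, the four-word `sample` list is purely illustrative and is not Davies's list, and y is again assumed to be a consonant. The last lines simply divide the two averages quoted above, which agrees with the 31.45% figure to within rounding.

```python
from statistics import mean, median, pstdev

# Assumption: only a, e, i, o, u count as vowels.
VOWELS = frozenset("aeiou")


def vowel_stats(words):
    """Summarize word lengths and vowel counts for a list of words."""
    lengths = [len(w) for w in words]
    vowel_counts = [sum(ch in VOWELS for ch in w.lower()) for w in words]
    mean_length = mean(lengths)
    mean_vowels = mean(vowel_counts)
    return {
        "mean_length": mean_length,
        "median_length": median(lengths),
        "stdev_length": pstdev(lengths),  # population standard deviation
        "mean_vowels": mean_vowels,
        "mean_consonants": mean_length - mean_vowels,
        # Share of letters removed, taken as a ratio of the two averages.
        "removed_fraction": mean_vowels / mean_length,
    }


# Illustrative stand-in for the real 5000-word list.
sample = ["the", "people", "because", "government"]
stats = vowel_stats(sample)
print(stats)

# The same ratio, using the averages quoted in the text.
reported_removed = 2.0298 / 6.4556
print(f"{reported_removed:.2%}")  # 31.44%
```

Swapping `sample` for the full word list, one word per entry, should produce the statistics for the whole corpus.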
To put that in perspective, one study of archive formats found that the size reduction achieved by common zip compression can be as low as 0.62% for MP3s, which are notoriously difficult to compress. By comparison, the 7z format can achieve up to 98.65% compression on tabular data. According to that same study, the size reduction for Word documents, which are the closest analogue to pure text, is less than 10% for some archive formats and over 90% for the 7z, b1, and arc formats.
In other words, removing all vowels from a Word document is likely a more effective form of compression than using a zip archive!
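If you want to try the comparison on your own text, here is a rough sketch. It uses Python's standard `zlib` module, which implements DEFLATE, the method zip archives commonly use. It works on plain text rather than actual Word files, so its numbers are not comparable to the study's. The names `size_reduction` and `compare` are my own. Keep in mind that disemvoweling is lossy, while DEFLATE restores the original exactly.

```python
import zlib

# Assumption: only a, e, i, o, u count as vowels.
VOWELS = frozenset("aeiouAEIOU")


def size_reduction(original: bytes, reduced: bytes) -> float:
    """Fraction of the original size saved (negative if the data grew)."""
    return 1 - len(reduced) / len(original)


def compare(text: str) -> dict:
    """Compare disemvoweling against DEFLATE on the same non-empty text."""
    if not text:
        raise ValueError("text must not be empty")
    raw = text.encode("utf-8")
    disemvoweled = "".join(ch for ch in text if ch not in VOWELS).encode("utf-8")
    deflated = zlib.compress(raw, 9)  # level 9 = strongest compression
    return {
        "disemvoweled": size_reduction(raw, disemvoweled),
        "deflate": size_reduction(raw, deflated),
    }


result = compare("the quick brown fox")
print(result)
```

On very short inputs DEFLATE can make the data larger because of its fixed overhead, so a fair test needs at least a few paragraphs of text.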