I recently came across the need to build search into my Windows Phone application, Study The Word. In most cases, searches are performed by querying a reverse index. A reverse index is basically a table which catalogs, in my case, terms found within a body of text and their locations. So one term, say “cars”, may have x amount of locations. If you are talking about the web, a term may be associated with a particular web address, or in the case of a book, a term may be associated with a particular chapter (in this case you may want to capture, line index or paragraph, etc). In my case, what I was indexing, the Bible, can easily be associated with a book, chapter, then verse.
As stated above, my index is for a Windows Phone application, so you immediately run into the need to save space. After initially converting my index to XML, the file was over 25MB. I initially trimmed it down a bit by doing various tricks, but ultimately, the file was still too large for a mobile device.
Next was to find a way to compress my index, then decompress it at will. Windows Phone does not come with a built in way to compress and decompress data. Fortunate for myself, as well as other Windows Phone developers out there, there is SharpCompress, a compression library for .NET/Mono/Silverlight/WP7.
Next we have to think about how we are going to compress our file. We still have to make the decision on compressing the file as a whole, or individual parts. In my case, I don’t want to have to spend the time to initially decompress the entire file at first start-up. Although this would in turn make things faster later, it still would have an added execution time cost of decompression, in turn making the user experience less enjoyable. So, I came up with compressing indexes associated with certain terms, based on the length of the index, decompressing them on the fly when needed. There is about a three hundred character threshold to where compression becomes effective, therefore I decided to compress anything over this threshold and store it in an XML attribute of that particular term.
The resulting compressed elements file was under 3MB. Yay!
Below is source code of my full String Compress method that uses the SharpComprss tools, which is based off some code I found around the net.
String Compression Source
public class StringCompression { public static string Compress(string text) { if (text.Length < 300) return text; //Transform string into byte[] byte[] byteArray = new byte[text.Length]; int index = 0; //Redo this w/o ToCharArray conversion foreach (char item in text.ToCharArray()) byteArray[index++] = (byte)item; //Prepare for compress MemoryStream ms = new MemoryStream(); BZip2Stream gzip = new BZip2Stream(ms, CompressionMode.Compress, true); //Compress gzip.Write(byteArray, 0, byteArray.Length); gzip.Close(); //Transform byte[] zip data to string byteArray = ms.ToArray(); ms.Close(); gzip.Dispose(); ms.Dispose(); return Convert.ToBase64String(byteArray); } public static string Decompress(string compressedText) { byte[] byteArray; //Transform string into byte[] try { byteArray = Convert.FromBase64String(compressedText); } catch { return compressedText; } //Prepare for decompress MemoryStream ms = new MemoryStream(byteArray); BZip2Stream gzip = new BZip2Stream(ms, CompressionMode.Decompress); //Decompress byte[] buffer = StreamToByteArray(gzip); //Transform byte[] unzip data to string StringBuilder sb = new StringBuilder(); //Read the number of bytes GZipStream red and do not a for each bytes in resultByteArray; for (int i = 0; i < buffer.Length; i++) sb.Append((char)buffer[i]); gzip.Close(); ms.Close(); gzip.Dispose(); ms.Dispose(); return sb.ToString(); } public static byte[] StreamToByteArray(Stream input) { byte[] buffer = new byte[16 * 1024]; MemoryStream ms = new MemoryStream(); int read; while ((read = input.Read(buffer, 0, buffer.Length)) > 0) ms.Write(buffer, 0, read); return ms.ToArray(); } } |
Compressed XML Example
<Term name="BEGINNING" locations="QlpoOTFBWSZTWU+hQPYAAD6YAAAEf/BAAgdoM3ZDU8mjRSBp5JRlAJNRTJ6IEmUlQR6fOgs+7tv3OtGj5R3mOpnPJPLkMaI0s6rWd58ru94tvqTxKSi969Ei63mZ5qHvj5gVSVxfAt1KV3QI5Q86DJgSALkrtNxw0VYy1sbVIbaLtztsKbenDXm3kvVlOTEmJMfbjhTnzRLhQwE+oOhQbQph0nMvP1//YPDl82Lo4XLyVBholYrMpeQ57mFeIiYogkYriBoZhEVCiJIPh8zRxI+SAqgF9c78mBk2sqjT8HyDuwZPz0wiBb4jHgoKFeJLyia0xLdqjuR8UL2tgAGeEj6m09WVed7zs3nd8PI7S8L9ryJHECoykU1qpjlLsLKVRyRgVu5JfPi6ehkCyBKdpBsKEBpo+JK1BWiHmZ3QJa7sIUjpyrdCwxky8Yj+LuSKcKEgn0KB7A==" /> |