Needle in a Haystack: Understanding the Tradeoff between Accuracy and Efficiency in Searching of High Dimensional Big Data | Cambridge Centre for Data-Driven Discovery

Liang Wang, Computer Laboratory

The Internet overloads its users with information, making effective content-based filtering crucial to user experience and work efficiency. Latent semantic analysis has long been recognised as a promising information retrieval technique for finding relevant articles in large text corpora. We build Kvasir, a semantic recommendation system, on top of latent semantic analysis and other state-of-the-art technologies to integrate an automated, proactive content provision service seamlessly into web browsing. We use the processing power of Apache Spark to scale Kvasir up into a practical Internet service. Here we present the architectural design of Kvasir, along with our solutions to the technical challenges encountered in implementing the system.

Specifically, we focus on the algorithmic and engineering solutions adopted in Kvasir to tackle its scalability challenges by finding a good tradeoff between accuracy and efficiency when searching large, high-dimensional datasets.
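
To make the idea of latent semantic analysis concrete, the following is a minimal sketch using NumPy only. It is not Kvasir's implementation: the function names (`lsa_embed`, `top_match`), the toy term-document counts, and the choice of cosine similarity are all assumptions made for illustration. The rank `k` of the truncated SVD is one place where an accuracy/efficiency tradeoff shows up: a smaller `k` gives cheaper searches over shorter vectors, at the cost of discarding some of the original signal.

```python
import numpy as np

def lsa_embed(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD.

    term_doc has one row per term and one column per document.
    Returns an array with one row per document and k columns.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the k largest singular values and their right singular vectors.
    return (np.diag(s[:k]) @ vt[:k]).T

def top_match(doc_vecs, query_idx):
    """Return the index of the document most similar (cosine) to query_idx."""
    normed = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # exclude the query document itself
    return int(np.argmax(sims))

# Made-up term-document counts (rows: terms, columns: documents).
counts = np.array([
    [2, 1, 0, 0],   # hypothetical term "spark"
    [1, 2, 0, 0],   # hypothetical term "cluster"
    [0, 0, 3, 1],   # hypothetical term "recipe"
    [0, 0, 1, 2],   # hypothetical term "cooking"
], dtype=float)

docs = lsa_embed(counts, k=2)
nearest = top_match(docs, query_idx=0)
print(nearest)
```

In this toy corpus, documents 0 and 1 share vocabulary and documents 2 and 3 share a different vocabulary, so the nearest neighbour of document 0 in the latent space is document 1. This exhaustive comparison costs time linear in the number of documents per query; a system operating at Internet scale would presumably need an approximate index instead, which is a further point at which accuracy is traded for speed.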