The Unreasonable Effectiveness of Co-occurrence Based Models | Cambridge Centre for Data-Driven Discovery

Gabriel Recchia, CRASSH

Given the current excitement over complex and computationally intensive machine learning techniques such as deep learning and conditional random fields, it may seem unlikely that much useful information could be obtained simply by computing normalized counts of the number of times that pairs of words co-occur (within a sentence or a lexical window of a particular size, for example). However, co-occurrence-based methods have proven surprisingly useful in a wide variety of natural language processing tasks. Furthermore, their simplicity and transparency afford them certain advantages in particular contexts. I will present research demonstrating the surprising success of co-occurrence-based methods in a variety of circumstances, including predicting the degree to which two words are similar in meaning, predicting the excavation sites of archaeological artifacts, and predicting the grammatical classes of words in an unsupervised fashion. Practical tips and software packages for such methods will also be discussed.
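As a rough illustration of what "normalized co-occurrence counts" can mean in practice, the sketch below counts word pairs that fall within a small lexical window and normalizes them with pointwise mutual information (PMI). PMI is only one common choice of normalization and is an assumption here, not necessarily the measure used in the talk; the function names `cooccurrence_counts` and `pmi`, the window size, and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count words, and unordered word pairs at most `window` tokens apart."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, w in enumerate(tokens):
            # Look only forward so each pair of positions is counted once.
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    pair_counts[frozenset((w, v))] += 1
    return word_counts, pair_counts


def pmi(w, v, word_counts, pair_counts):
    """Pointwise mutual information (base 2) of two words.

    Returns -inf when the pair never co-occurs in the corpus.
    """
    joint = pair_counts[frozenset((w, v))]
    if joint == 0:
        return float("-inf")
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    p_joint = joint / n_pairs
    p_w = word_counts[w] / n_words
    p_v = word_counts[v] / n_words
    return math.log2(p_joint / (p_w * p_v))


if __name__ == "__main__":
    # Toy corpus, invented purely for illustration.
    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["a", "cat", "purred"],
    ]
    words, pairs = cooccurrence_counts(corpus, window=2)
    print(pmi("the", "sat", words, pairs))
    print(pmi("cat", "dog", words, pairs))
```

Pairs that co-occur more often than their individual frequencies would predict receive a positive score, while pairs that never co-occur receive negative infinity. On realistic corpora, such scores are usually smoothed or clipped at zero before being used as features.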