Datasets released by Google

For all the machine learning fans out there, here is a short list of datasets Google has released over the years.

  • Word co-occurrence counts for training n-gram language models (translation, spelling correction, speech recognition): blog post (a toy bigram example follows this list)
  • Job queue traces from Google clusters: blog post, data
  • 800M documents (search corpus) annotated with Freebase entities: blog post
  • Wikilinks, 40M disambiguated mentions in 10M web pages linked to Wikipedia entities: blog post
  • Human-judged corpus of binary relations about Wikipedia public figures (pairings of people with Freebase concepts, annotated with a supporting document and a human-rater confidence score): blog post, data
  • Wikipedia Infobox edit history (39M updates to attributes of 1.8M entities): blog post
  • Triples of (phrase, Wikipedia entity URL, number of times the phrase appears on the page at that URL), useful for building entity dictionaries (see the sketch after this list): blog post
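
To make the n-gram item above concrete, here is a minimal sketch of how co-occurrence counts drive a language model: estimating P(w2 | w1) from bigram and unigram counts. The toy counts are purely illustrative and not drawn from the Google data.

```python
from collections import Counter

# Toy counts standing in for real n-gram data (assumption: the actual
# corpus provides raw frequencies that can be loaded into counters).
bigrams = Counter({("new", "york"): 500, ("new", "jersey"): 120})
unigrams = Counter({"new": 1000})

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("new", "york"))  # 0.5
```

A real spelling corrector or translation model would add smoothing for unseen word pairs, but the counting core is the same.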
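For the last item, here is a minimal sketch of turning the (phrase, entity URL, count) triples into a phrase-to-entity dictionary by keeping, for each phrase, the entity it most often refers to. The TSV filename and column order are assumptions, not the dataset's documented format.

```python
import csv
from collections import defaultdict

def build_entity_dictionary(path="phrase_entity_counts.tsv"):
    # Assumed layout: one tab-separated triple per line, in the order
    # (phrase, Wikipedia entity URL, occurrence count).
    counts = defaultdict(lambda: defaultdict(int))
    with open(path, newline="", encoding="utf-8") as f:
        for phrase, url, count in csv.reader(f, delimiter="\t"):
            counts[phrase][url] += int(count)
    # For each phrase, keep the entity it most frequently points to.
    return {phrase: max(urls, key=urls.get) for phrase, urls in counts.items()}

# Usage (assumes the hypothetical TSV file above exists locally):
dictionary = build_entity_dictionary()
print(dictionary.get("jaguar"))  # -> the entity URL this phrase most often names
```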

For other data sources, see the related discussion on Hacker News.