High Performance Text Processing in Machine Learning

This talk covers the rapid development of high-performance, scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling, and general machine learning.

We demonstrate how Python modules, and in particular the Rosetta Python library, can be used to process, clean, and tokenize large volumes of text data, extract features from it, and finally build statistical models.
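The clean, tokenize, and extract-features steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the Rosetta API: the function names, the token pattern, and the tiny stopword set are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")

# Hypothetical, deliberately tiny stopword set for illustration.
STOPWORDS = frozenset({"the", "a", "of", "and"})


def clean(text):
    """Normalize case and trim surrounding whitespace."""
    return text.lower().strip()


def tokenize(text):
    """Split cleaned text into alphanumeric tokens."""
    return TOKEN_RE.findall(clean(text))


def bag_of_words(tokens, stopwords=STOPWORDS):
    """Count the tokens that are not stopwords."""
    return Counter(t for t in tokens if t not in stopwords)


def extract_features(docs):
    """Yield one bag-of-words Counter per document, one document at a time."""
    for doc in docs:
        yield bag_of_words(tokenize(doc))
```

Because `extract_features` is a generator, it can be fed an open file or any other iterable of documents and never needs the whole corpus in memory at once.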

The Rosetta library focuses on creating small and simple modules (each with a command-line interface) that use very little memory and are parallelized with the multiprocessing package. We will touch on LDA topic modeling and different implementations thereof (Vowpal Wabbit and Gensim).
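The general pattern of parallelizing per-document work with the multiprocessing package can be sketched as follows. This is not Rosetta's implementation; `count_tokens`, `parallel_token_counts`, and their parameters are hypothetical names chosen for the example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter
from multiprocessing import Pool


def count_tokens(line):
    """Worker: count whitespace-separated, lowercased tokens in one line."""
    return Counter(line.lower().split())


def parallel_token_counts(lines, processes=2, chunksize=1000):
    """Sum per-line token counts, optionally across worker processes.

    With processes <= 1 the work runs serially in the current process.
    """
    total = Counter()
    if processes <= 1:
        for counts in map(count_tokens, lines):
            total.update(counts)
        return total
    with Pool(processes) as pool:
        # Results are merged as they arrive instead of being collected
        # into a list first; order does not matter for a sum.
        for counts in pool.imap_unordered(count_tokens, lines, chunksize=chunksize):
            total.update(counts)
    return total
```

When calling this with `processes` greater than 1 from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.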

The talk will be part presentation and part “real life” example tutorial.