![Using Shapeless for Data Cleaning in Apache Spark](/images/post/2018-03-02-generic-derivation-for-spark-data-cleaning_huf204ca923e9c13678a0936e032a01eeb_111446_1110x0_resize_q95_box.jpg)
# Using Shapeless for Data Cleaning in Apache Spark
When it comes to importing data into a big data infrastructure like Hadoop, Apache Spark is one of the most widely used tools for ETL jobs. Because input data – in this case CSV – often contains invalid values, a data cleaning layer is needed. Most data cleaning tasks are very specific and therefore need to be implemented depending on your data, but some tasks can be generalized…
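As a minimal sketch of the generalization idea, the following Shapeless-based type class derives a "cleaner" for arbitrary case classes, applying a field-level rule (here: trimming whitespace from strings) to every field. The `Cleaner` type class, its instances, and the `Person` example are hypothetical illustrations, not taken from the article itself:

```scala
import shapeless._

// Type class: knows how to clean a value of type A
trait Cleaner[A] { def clean(a: A): A }

object Cleaner {
  def apply[A](implicit c: Cleaner[A]): Cleaner[A] = c
  def instance[A](f: A => A): Cleaner[A] =
    new Cleaner[A] { def clean(a: A): A = f(a) }

  // Field-level rules: trim strings, leave other primitives untouched
  implicit val stringCleaner: Cleaner[String] = instance(_.trim)
  implicit val intCleaner: Cleaner[Int]       = instance(identity)

  // Structural derivation over HLists
  implicit val hnilCleaner: Cleaner[HNil] = instance(identity)
  implicit def hconsCleaner[H, T <: HList](
      implicit head: Lazy[Cleaner[H]],
      tail: Cleaner[T]): Cleaner[H :: T] =
    instance { case h :: t => head.value.clean(h) :: tail.clean(t) }

  // Generic derivation: map a case class to its HList representation and back
  implicit def genericCleaner[A, R](
      implicit gen: Generic.Aux[A, R],
      repr: Lazy[Cleaner[R]]): Cleaner[A] =
    instance(a => gen.from(repr.value.clean(gen.to(a))))
}

case class Person(name: String, age: Int)
```

With such a derivation in place, a cleaner for any case class whose fields have instances comes for free, e.g. `Cleaner[Person].clean(Person("  Alice ", 30))` yields `Person("Alice", 30)`; in a Spark job the same function could be applied inside a `Dataset[Person].map`.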