Thursday, August 7, 2014

Text classification with Apache Spark 1.1 (sentiment classification)

A word-to-vector-space-model converter was recently implemented in Apache Spark MLlib, so it is now possible to perform text classification. Let's see how it works on a sentiment analysis task.

  • Download the Pang and Lee sentence polarity dataset v1.0 from http://www.cs.cornell.edu/people/pabo/movie-review-data/. It contains 5331 positive and 5331 negative processed sentences.
  • Clone and build the latest version of Apache Spark, which contains the HashingTF and MulticlassMetrics classes.
  • Code snippet. It can be executed in the Spark shell or as a separate application that depends on spark-core and mllib:
        import org.apache.spark.SparkContext
        import org.apache.spark.mllib.feature.HashingTF
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.classification.NaiveBayes
        import org.apache.spark.mllib.evaluation.MulticlassMetrics

        /* instantiate the Spark context (not needed when running inside the Spark shell) */
        val sc = new SparkContext("local", "test")
        /* word to vector space converter, limited to 10000 words (hash buckets) */
        val htf = new HashingTF(10000)
        /* load positive and negative sentences from the dataset */
        /* label 1 = positive class, 0 = negative class */
        /* tokenize sentences and transform them into vector space model */
        val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
          .map { text => new LabeledPoint(1, htf.transform(text.split(" ")))}
        val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
          .map { text => new LabeledPoint(0, htf.transform(text.split(" ")))}
        /* split the data 60% for training, 40% for testing */
        val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
        val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)
        /* union train data with positive and negative sentences */
        val training = posSplits(0).union(negSplits(0))
        /* union test data with positive and negative sentences */
        val test = posSplits(1).union(negSplits(1))
        /* Multinomial Naive Bayesian classifier */
        val model = NaiveBayes.train(training)
        /* predict */
        val predictionAndLabels = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
        /* metrics */
        val metrics = new MulticlassMetrics(predictionAndLabels)
        /* output F1-measure for all labels (0 and 1, negative and positive) */
        metrics.labels.foreach( l => println(metrics.fMeasure(l)))
  • I got an F1-measure of around 74% for both classes. Similar results can be observed with Weka:
  • 0.7377086668191173
    0.7351650888940199

Build a specific Maven project

Assume there is a parent project with child modules. You can build a single child from the parent folder with the following:
mvn install -pl NAME -am
NAME is the folder of the child module; the -am flag also builds the modules it depends on.
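For example, to build only Spark's MLlib module together with the modules it depends on, you could run the following from the Spark source root (assuming the module folder is named mllib):
mvn install -pl mllib -am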
http://stackoverflow.com/questions/1114026/maven-modules-building-a-single-specific-module

Wednesday, August 6, 2014

How to use Apache Spark libraries that were compiled locally in your project

Officially released versions of the Apache Spark libraries are available in Maven Central (http://search.maven.org/), so you can always add them as dependencies in your project and Maven will download them. See how to create a Maven project that uses the Spark libraries at avulanov.blogspot.com/2014/07/how-to-create-scala-project-for-apache.html. However, I want to use the latest build of Spark in my Maven project, and moreover my own custom build. There are at least two ways of doing this. The first is to build Apache Spark with the install goal (or to run install for a particular Spark module):
  mvn -Dhadoop.version=1.2.1 -DskipTests clean install
The second option:
  • Compile your local version of Apache Spark with
    mvn -Dhadoop.version=1.2.1 -DskipTests clean package
  • Install the jar of the module you need into your local Maven repository, e.g. for spark-core:
    mvn install:install-file -Dfile=/spark/core/target/spark-core_2.10-1.1.0-latest.jar -DpomFile=/spark/core/pom.xml -DgroupId=org.apache.spark -Dversion=1.1.0-latest -DartifactId=spark-core_2.10
    • Reference the new version of this library (1.1.0-latest) in your pom.xml; see the sketch after this list.
    • There might be a problem with imports and their versions, so try to run mvn install (I run it from IntelliJ IDEA). In my case Maven didn't like that the asm and lz4 dependencies didn't have versions specified. Specify them if needed.
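A minimal sketch of the corresponding dependency entry in the project's pom.xml, assuming the custom build was installed with version 1.1.0-latest as above:
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0-latest</version>
    </dependency>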

Tuesday, August 5, 2014

Hashing trick for word dictionary

One can use a hash function to build a dictionary and convert text documents to a vector space representation. The dictionary size N has to be specified and the documents tokenized into terms. Then Hash(term) mod N is the index of the term in the VSM. More details at: http://en.wikipedia.org/wiki/Feature_hashing and http://www.shogun-toolbox.org/static/notebook/current/HashedDocDotFeatures.html
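A minimal Scala sketch of the idea, with an assumed dictionary size N = 10000 and plain hashCode-based indexing (not Spark's exact implementation):

    val n = 10000
    /* map a term to an index in [0, n) */
    def termIndex(term: String): Int = {
      val h = term.hashCode % n
      if (h < 0) h + n else h
    }
    /* build a sparse term-frequency representation of a tokenized document */
    def toVector(tokens: Seq[String]): Map[Int, Double] =
      tokens.groupBy(termIndex).mapValues(_.size.toDouble).toMap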