Gathering the Movie Data

 

The Movie survey was conducted using the Consensus tool to guarantee anonymity. While this tool provided a simple user interface and robust management, the schema it used to store the data was not suited to any sort of analysis. The data was stored in the form “SurveyID,” “QuestionID,” “Answer,” and the Movies, Actors, and Directors fields were simply long strings of text. To mine the data, we first needed to transform it into a more suitable format: a table for each multiple-answer question (such as Hobbits), plus one table for all of the single-answer questions with a single row for each respondent. Additionally, we needed to parse the individual movies, actors, and directors out of the text fields.
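
As a rough illustration of the reshaping described above, here is a minimal Python sketch. It is not the package we built; it only mimics the idea on a handful of made-up rows. The question names other than Movies, Actors, Directors, and Hobbits, the sample answers, the semicolon delimiter, and the assumption that a multiple-answer question is stored as one row per selected answer are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows in the (SurveyID, QuestionID, Answer) shape.
rows = [
    (1, "Age", "34"),
    (1, "Hobbits", "Frodo"),
    (1, "Hobbits", "Sam"),
    (1, "Movies", "Star Wars; The Matrix"),
    (2, "Age", "27"),
    (2, "Movies", "Alien"),
]

# Assumed classification of questions; the real survey's list is not shown here.
SINGLE_ANSWER = {"Age"}                         # pivoted into one row per respondent
MULTI_ANSWER = {"Hobbits"}                      # one child-table row per answer
FREE_TEXT = {"Movies", "Actors", "Directors"}   # long strings that must be parsed


def transform(rows, delimiter=";"):
    """Split name/value rows into a respondent table and per-question child tables."""
    respondents = defaultdict(dict)    # SurveyID -> {QuestionID: Answer}
    child_tables = defaultdict(list)   # table name -> [(SurveyID, value), ...]
    for survey_id, question_id, answer in rows:
        if question_id in SINGLE_ANSWER:
            # Pivot: each single-answer question becomes a column.
            respondents[survey_id][question_id] = answer
        elif question_id in MULTI_ANSWER:
            child_tables[question_id].append((survey_id, answer))
        elif question_id in FREE_TEXT:
            # Parse: break the long string into individual items.
            for item in answer.split(delimiter):
                item = item.strip()
                if item:
                    child_tables[question_id].append((survey_id, item))
    return dict(respondents), dict(child_tables)


respondents, tables = transform(rows)
```

Each entry in `tables` corresponds to one of the per-question tables, keyed back to the respondent by SurveyID, while `respondents` holds the single pivoted row per respondent.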

 

To accomplish this task, we used Yukon Data Transformation Services (DTS). Yukon DTS allowed us to easily split, convert, parse, and pivot the data gathered by Consensus into the eight tables we needed for our data mining task. Here is an image of the pipeline task (dubbed the “Octopus”) that performed most of this work.