MovieNight: An Interactive Visualization of Careers of Actors and Directors


ABM Musa
abmmusa at gmail dot com

Khairi Reda
mreda2 at uic dot edu

Introduction

MovieNight is an interactive visualization of the careers of roughly 750,000 actors, actresses, and directors who were active at some point between 1910 and 2010. The data for this visualization comes from the Internet Movie Database (IMDb). The visualization lets users see how the ratings of actors and directors change across their careers, and compare the careers of multiple people by viewing their rating charts together.

The visualization

The visualization shows a number of graphs that are aligned along the X axis (time). Each graph charts the rating of a single actor, actress, or director over time. The rating for a given year is calculated by averaging the ratings of the movies the actor or director worked on in that year. The visualization provides a number of interactive features:

The graph can be rendered in three modes:

a bar chart coding the role of the cast member (actor/director) using color
a bar chart coding the genre of the movies the cast member worked on
a line chart, which works well when brushing actor names (more on this later)

Hovering the mouse over a time step shows a list of the movies the cast member worked on, along with the individual rating of each movie. The graph can display either the absolute rating or a relative rating calculated from the average rating of all movies across all years. The number of graphs can be increased or decreased using a slider located in the bottom panel of the visualization. Additionally, the time period shown in the graph can be expanded or contracted to show any time frame within 1910 to 2010. The mouse wheel expands or contracts the time frame, and dragging with the left mouse button pans the graph in time.
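The zooming and panning behavior described above can be sketched as a small time-window helper. This is only an illustrative sketch: the class name TimeWindow, its fields, and the zoom factor convention are assumptions and are not taken from the MovieNight source.

```java
// Hypothetical helper: names and the zoom convention are assumptions for illustration.
class TimeWindow {
    static final int MIN_YEAR = 1910;
    static final int MAX_YEAR = 2010;

    double start = MIN_YEAR;
    double end = MAX_YEAR;

    // Zooms around the window centre: factor < 1 contracts the span, factor > 1 expands it.
    void zoom(double factor) {
        double centre = (start + end) / 2.0;
        double half = (end - start) / 2.0 * factor;
        // Keep the visible span between 1 year and the full 100 years.
        half = Math.max(0.5, Math.min(half, (MAX_YEAR - MIN_YEAR) / 2.0));
        start = centre - half;
        end = centre + half;
        clamp();
    }

    // Pans by a number of years; negative values move toward 1910.
    void pan(double years) {
        start += years;
        end += years;
        clamp();
    }

    // Shifts the window back inside [MIN_YEAR, MAX_YEAR] without changing its span.
    private void clamp() {
        if (start < MIN_YEAR) {
            end += MIN_YEAR - start;
            start = MIN_YEAR;
        }
        if (end > MAX_YEAR) {
            start -= end - MAX_YEAR;
            end = MAX_YEAR;
        }
    }
}
```

In a Processing sketch, a mouse-wheel handler would plausibly call zoom with a factor slightly above or below 1, and a drag handler would call pan with the drag distance converted from pixels to years.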

The visualization also contains a listing of the top/bottom actors, actresses, and directors for every decade, starting with the 1910s and ending with the 2000s. The criteria used to determine these listings are discussed later. To bring up any of these listings, the user right-clicks anywhere in the timeline within the desired decade. This opens a pop-up menu from which the user can select a listing of the top/bottom actors or directors of that decade.
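The per-year rating that each graph in this section charts can be sketched as follows. This is a minimal sketch under stated assumptions: the Credit type and its fields are hypothetical, and because the text does not say whether the relative rating is a difference or a ratio, the sketch assumes a simple difference from the overall average.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical type: one movie a person worked on (field names are assumptions).
final class Credit {
    final String title;
    final int year;
    final double rating;

    Credit(String title, int year, double rating) {
        this.title = title;
        this.year = year;
        this.rating = rating;
    }
}

class YearlyRating {
    // Groups a person's credits by year and averages the movie ratings within each year.
    static Map<Integer, Double> averageByYear(List<Credit> credits) {
        Map<Integer, double[]> sums = new TreeMap<>(); // year -> {sum of ratings, movie count}
        for (Credit c : credits) {
            double[] acc = sums.computeIfAbsent(c.year, y -> new double[2]);
            acc[0] += c.rating;
            acc[1] += 1;
        }
        Map<Integer, Double> result = new TreeMap<>();
        for (Map.Entry<Integer, double[]> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }

    // Assumption: "relative" is taken to mean the difference from the overall average.
    static double relative(double rating, double overallAverage) {
        return rating - overallAverage;
    }

    public static void main(String[] args) {
        List<Credit> credits = List.of(
            new Credit("Movie A", 1994, 8.0),
            new Credit("Movie B", 1994, 6.0),
            new Credit("Movie C", 1996, 7.5));
        System.out.println(averageByYear(credits));
    }
}
```

Years in which the person has no credits simply have no entry in the returned map, which leaves a gap in the chart rather than a zero.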

Interactivity

Person selection:

A person is selected using a drop-down box that lists every person whose name matches what the user has typed so far. This gives an interactive, user-friendly way to quickly pick one person out of a large pool (roughly 750,000 actors and directors). A multi-level index map (4 levels of indexing) is used to update the large list quickly. Actors and actresses are shown in different colors, and directors are marked with an icon next to their name, so this all-in-one drop-down box provides a single place to choose a person. Additionally, brushing the names in the list updates the graph interactively, allowing the user to glance over a large number of actors or directors in a few seconds.
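One plausible reading of the 4-level index map is a lookup structure keyed on the first four characters of a name, so that each keystroke narrows the candidate list without scanning all 750,000 entries. The sketch below illustrates that reading only. The PrefixIndex class, its methods, and the case-insensitive matching are assumptions, not the actual MovieNight data structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixIndex {
    static final int LEVELS = 4; // assumed to correspond to the four indexing levels

    // One node per prefix character; a name is stored at the deepest node it reaches.
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        final List<String> names = new ArrayList<>();
    }

    private final Node root = new Node();

    // Files the name under (at most) its first LEVELS characters.
    void add(String name) {
        String key = name.toLowerCase();
        Node node = root;
        for (int i = 0; i < LEVELS && i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.names.add(name);
    }

    // Returns all names starting with the typed text (case-insensitive).
    List<String> lookup(String typed) {
        String prefix = typed.toLowerCase();
        Node node = root;
        for (int i = 0; i < LEVELS && i < prefix.length(); i++) {
            node = node.children.get(prefix.charAt(i));
            if (node == null) {
                return new ArrayList<>();
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, prefix, out);
        return out;
    }

    // Gathers names from the reached node and everything below it,
    // filtering by the full prefix when it is longer than LEVELS characters.
    private void collect(Node node, String prefix, List<String> out) {
        for (String n : node.names) {
            if (n.toLowerCase().startsWith(prefix)) {
                out.add(n);
            }
        }
        for (Node child : node.children.values()) {
            collect(child, prefix, out);
        }
    }
}
```

With this layout, the first four keystrokes each descend one level, and any further typing only filters the comparatively small bucket that was reached.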

Graph count slider:

This slider controls the number of graphs in the visualization. More graphs allow multiple actors/directors to be compared simultaneously, but decrease the amount of space available to chart each career's ratings.

Data preprocessing and mining

The data was obtained from the IMDb website in text format. It was then preprocessed, filtered, and mined, and the results were saved in a format that loads faster in the visualization program.

The filtering process consists of the following two steps:

First, a series of 'grep' filters was applied to the data to remove TV shows, TV series, and TV-only movie titles. This was done by composing a series of regular expressions that match the undesired listings. The invert-match option (-v) instructs grep to output only the lines that do not match, i.e., the desired ones.
Although the above step filters out a great deal of undesired data, it is fairly limited, as many of the undesired listings cannot be captured by a regular expression. For example, the actor listing file requires a context-free grammar to capture the desired listings. Therefore, a parsing program was developed to take care of this. The parsing program was also written in Processing. The advantage of this is that the parsing program can parse the data, construct an internal representation of the data in memory, and then automatically dump the data to disk using Java's serialization mechanism without having to write a second, more efficient parser.
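The first step was done with grep, but the same invert-match idea can be expressed in Java, the language underlying the rest of the project. The sketch below is not the filter set that was actually used. The two patterns are assumptions about how the IMDb text files mark TV content (quoted titles for series, a "(TV)" suffix for TV-only movies), and the TitleFilter class is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TitleFilter {
    // Assumed markers of undesired listings; the real expressions are not given in the text.
    static final List<Pattern> UNDESIRED = List.of(
        Pattern.compile("^\""),        // title starts with a quote: assumed TV series
        Pattern.compile("\\(TV\\)"));  // "(TV)" tag: assumed TV-only movie

    // Keeps only the lines that match none of the undesired patterns,
    // which is the effect of chaining grep -v once per pattern.
    static List<String> keepDesired(List<String> lines) {
        return lines.stream()
            .filter(line -> UNDESIRED.stream().noneMatch(p -> p.matcher(line).find()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "\"Some Show\" (1999)",
            "Some Movie (1999)",
            "Some TV Movie (2001) (TV)");
        System.out.println(keepDesired(lines));
    }
}
```

As the second step notes, line-by-line matching like this cannot handle listings whose meaning depends on surrounding lines, which is why a separate parser was still needed.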

Data structure

A bipartite graph was constructed in memory, with actors and movies as its nodes and links between them. The links were implemented using memory references. Java's serialization mechanism supports serialization of arbitrary Java objects, so once the graph is constructed, the parsing program can instruct the Java virtual machine to serialize it automatically. When the data needs to be loaded for visualization, the same graph can be reconstructed in memory by deserializing it from disk with no custom code. Unfortunately, the JVM serializer does not work well with deeply nested graphs. Therefore, certain elements of the serializer had to be overridden, and the memory-based references were replaced with weak references. Nevertheless, the result was efficient, with a loading time of about 20-30 seconds for the entire filtered database. The advantage of loading the entire database into memory in the visualization program is that additional features can be added, such as brushing actor names interactively from a pool of 750,000 actors and directors.
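The basic round trip described above can be sketched with Java's standard ObjectOutputStream and ObjectInputStream. This is a simplified illustration: the Person, Movie, and GraphStore classes are hypothetical, the data is written to a byte array rather than a file, and the sketch uses plain default serialization, so it does not reproduce the serializer changes or the weak references mentioned in the text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical node types of the bipartite graph; links are ordinary object references.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final List<Movie> movies = new ArrayList<>();
    Person(String name) { this.name = name; }
}

class Movie implements Serializable {
    private static final long serialVersionUID = 1L;
    final String title;
    final List<Person> cast = new ArrayList<>();
    Movie(String title) { this.title = title; }
}

class GraphStore {
    // Adds one edge of the bipartite graph in both directions.
    static void link(Person p, Movie m) {
        p.movies.add(m);
        m.cast.add(p);
    }

    // Serializes the people and, through their references, every reachable movie.
    static byte[] save(List<Person> people) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(people));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds the same graph, including shared references, with no custom parsing code.
    @SuppressWarnings("unchecked")
    static List<Person> load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (List<Person>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Default serialization follows references recursively, so a long chain of linked objects can exhaust the call stack. That is one likely reason a large actor/movie graph would need the kind of customization the text describes.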

Data mining

The data was mined to get the following information:

Top/bottom actors, actresses, and directors for every decade between 1910 and 2010. The algorithm that determines these listings makes one pass over each of the lists of actors, actresses, and directors, independently. For an actor to be considered for the top/bottom listing of a decade, they must meet the following requirements: they worked on at least 7 movies during that decade, and each of these movies has at least 100 votes. The rating of an actor in that decade is calculated by averaging the ratings of all the movies released in that decade in which that actor was a cast member. A list of the 20 top/bottom actors, actresses, and directors is maintained, and each actor/director is inserted into this list if appropriate. The complexity of the algorithm is m*n (where n is the number of actors and m is the average number of movies per actor). In practice it is linear in n (as n is far larger than m).
Additional statistics are compiled during the data mining phase. Among them are the average, minimum, and maximum ratings of movies across all years.
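The decade listing algorithm above can be sketched as a single pass that keeps a bounded list of the best candidates. The sketch rests on stated assumptions: all type and method names are hypothetical, the eligibility rule is read as "at least 7 movies in the decade that each have at least 100 votes", and the average is taken over those counted movies. A bounded priority queue stands in for the list of 20, since the text does not say how that list is maintained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class DecadeRanking {
    static final int MIN_MOVIES = 7;
    static final int MIN_VOTES = 100;

    // Hypothetical input and output types.
    static final class Film {
        final int year; final double rating; final int votes;
        Film(int year, double rating, int votes) {
            this.year = year; this.rating = rating; this.votes = votes;
        }
    }
    static final class Career {
        final String name; final List<Film> films;
        Career(String name, List<Film> films) { this.name = name; this.films = films; }
    }
    static final class Score {
        final String name; final double rating;
        Score(String name, double rating) { this.name = name; this.rating = rating; }
    }

    // Returns the person's average rating for the decade, or null if they are not eligible.
    static Double decadeRating(Career c, int decadeStart) {
        double sum = 0;
        int count = 0;
        for (Film f : c.films) {
            boolean inDecade = f.year >= decadeStart && f.year < decadeStart + 10;
            if (inDecade && f.votes >= MIN_VOTES) {
                sum += f.rating;
                count++;
            }
        }
        if (count < MIN_MOVIES) {
            return null;
        }
        return sum / count;
    }

    // One pass over all people, keeping at most `size` entries with the highest ratings.
    static List<Score> top(List<Career> people, int decadeStart, int size) {
        Comparator<Score> byRating = Comparator.comparingDouble((Score s) -> s.rating);
        PriorityQueue<Score> heap = new PriorityQueue<>(byRating); // weakest kept entry at the head
        for (Career c : people) {
            Double r = decadeRating(c, decadeStart);
            if (r == null) {
                continue;
            }
            heap.add(new Score(c.name, r));
            if (heap.size() > size) {
                heap.poll(); // drop the weakest entry to stay within the limit
            }
        }
        List<Score> result = new ArrayList<>(heap);
        result.sort(byRating.reversed()); // best first
        return result;
    }
}
```

Calling top(people, 1990, 20) would give the 1990s listing under these assumptions. The bottom listing follows by reversing the comparator, and because the list size is a fixed constant, the overall cost stays proportional to m*n as stated above.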

last update: oct 22, 09