The following was originally written for the VHX Developers Blog as part of a writeup about a "Hack Day" project where employees are encouraged to explore a project that interests them. I used this time to develop a working MVP to determine Film Personality for already released films.
As the resident data scientist, I get to do a lot of interesting things, mostly centered around understanding how to make our platform better for our publishers and their customers. But for our last hack day, I was looking for a way to better understand the content itself. Enter Film Personality.
The idea stems from something in Brand Strategy called "brand personality," which postulates that there are five major dimensions of personality: Excitement, Sincerity, Competence, Sophistication, and Ruggedness.
Sounds a little hokey, no? While reducing all human traits into five dimensions isn't a great way to describe people, I thought it might serve as a better shorthand for discussing films than the current standard of genre, where the description "Drama" could apply to almost anything (so long as no one is enjoying themselves.)
There are a few key elements needed for putting this together: a common language used to describe films, enough people using this common language, and the ability to quickly parse their conversations to extract the key descriptive words. I found it useful to consider critics' reviews as that source of common language. They say similar things about similar types of films, even if they disagree on the quality of the film. Words like "exciting," "intelligent," and "imaginative" still show up, even if the critic then follows them up by saying, "but I still hated this movie."
I figured the simplest way to test this out was to use the Rotten Tomatoes API to pull in reviews for a given film and parse those using the lovely Python package Natural Language Toolkit, or NLTK. I'll walk you through a bit of the code, which turned out to be surprisingly simple for an MVP:
The first API call to Rotten Tomatoes returns the JSON of the movie the user requested, based on a matching title. It depends on two things: 1) the quality of the Rotten Tomatoes search, and 2) the user actually spelling the film title correctly. Once we have the movie ID, we call the API again to get the reviews.
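Roughly, that two-step lookup could look like the sketch below. The base URL, endpoint paths, and JSON field names ("movies", "id", "reviews", "quote") are assumptions for illustration, as are the helper names `fetch_json` and `get_review_snippets`; this is not the exact code from the project.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URL for the public Rotten Tomatoes API; adjust to whatever you have access to.
BASE_URL = "http://api.rottentomatoes.com/api/public/v1.0"


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def get_review_snippets(title, api_key, fetch=fetch_json):
    """Search for a film by title, then return the review quotes for the first match."""
    # Step 1: title search. The "movies" and "id" field names are assumptions.
    search_url = f"{BASE_URL}/movies.json?" + urlencode({"q": title, "apikey": api_key})
    movies = fetch(search_url).get("movies", [])
    if not movies:
        raise LookupError(f"No film found for title {title!r}")
    movie_id = movies[0]["id"]

    # Step 2: reviews for that movie ID. "reviews" and "quote" are assumed field names too.
    reviews_url = (
        f"{BASE_URL}/movies/{quote(str(movie_id))}/reviews.json?"
        + urlencode({"apikey": api_key})
    )
    reviews = fetch(reviews_url).get("reviews", [])
    # Skip reviews that have no snippet text.
    return [review["quote"] for review in reviews if review.get("quote")]
```

Passing the fetcher in as a parameter keeps the two-call flow easy to exercise with canned JSON, without touching the network.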
Once we get the JSON containing the review snippets, we tokenize each snippet and tag its parts of speech using NLTK. This returns a list of tuples, each pairing an individual word with a tag ('NN' for nouns, 'VB' for verbs, etc.). The parts of speech most descriptive of a film's personality are adjectives and verbs in the gerund form, that is, ending in "-ing."
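A minimal sketch of the tagging and filtering step, assuming the tagger emits Penn Treebank tags (adjectives as 'JJ', 'JJR', 'JJS' and gerunds as 'VBG'). The function names `tag_snippet` and `descriptive_words` are hypothetical, and `tag_snippet` needs NLTK's tokenizer and tagger data downloaded before it will run.

```python
def tag_snippet(snippet):
    """Tokenize a review snippet and tag each token with its part of speech."""
    import nltk  # imported lazily so the filter below works without NLTK installed

    return nltk.pos_tag(nltk.word_tokenize(snippet))


def descriptive_words(tagged_tokens):
    """Keep adjectives (JJ, JJR, JJS) and gerunds/present participles (VBG)."""
    return [
        word.lower()
        for word, tag in tagged_tokens
        if tag.startswith("JJ") or tag == "VBG"
    ]
```

For a single review quote, the two compose as `descriptive_words(tag_snippet(quote))`.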
To associate these words with the film dimensions we defined earlier, I created a small dictionary of seed words and added synonyms from NLTK's WordNet synonym sets. I also removed the words "good" and "bad" because that's, like, your opinion, man.
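The dictionary-building step might look something like the sketch below. The seed words are hypothetical placeholders rather than the project's actual list, and `wordnet_synonyms` assumes the WordNet corpus has been downloaded for NLTK.

```python
def wordnet_synonyms(word):
    """Collect the lemma names from every WordNet synset for a word."""
    from nltk.corpus import wordnet  # lazy import; needs the WordNet corpus

    return {
        lemma.lower().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemma_names()
    }


# Hypothetical seed words for each dimension (illustration only).
SEEDS = {
    "excitement": ["exciting", "daring", "imaginative"],
    "sincerity": ["honest", "wholesome", "cheerful"],
    "competence": ["intelligent", "reliable", "successful"],
    "sophistication": ["charming", "glamorous", "elegant"],
    "ruggedness": ["tough", "rugged", "gritty"],
}

# Pure opinion words that say nothing about personality.
EXCLUDED = {"good", "bad"}


def build_dictionary(seeds, synonyms_of=wordnet_synonyms, excluded=EXCLUDED):
    """Expand each dimension's seed words with synonyms, then drop excluded words."""
    dictionary = {}
    for dimension, words in seeds.items():
        expanded = set(words)
        for word in words:
            expanded |= set(synonyms_of(word))
        dictionary[dimension] = expanded - set(excluded)
    return dictionary
```

Because WordNet synsets cover every sense of a word, the expanded sets can pick up unrelated meanings, so the result is worth a manual skim.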
What we get when we run all this madness is a list of descriptive words from the reviews, and a score for each of the five dimensions. While this is currently nothing more than a proof of concept, it shows that we can mine this data for specific words and link those words to personality dimensions, and that these dimensions actually align with our conceptions of the films we put in.
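One simple way to turn the matched words into scores is to count hits per dimension, as in the sketch below. The plain count is an assumption about how scoring could work, and `score_film` is a hypothetical name.

```python
def score_film(words, dictionary):
    """Count how many descriptive words fall into each dimension's vocabulary."""
    scores = {dimension: 0 for dimension in dictionary}
    for word in words:
        for dimension, vocabulary in dictionary.items():
            if word in vocabulary:
                scores[dimension] += 1
    return scores
```

Repeated words count every time they appear, so a film that many critics call "exciting" scores higher on Excitement than one described that way once.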
There are definitely areas for improvement. The dictionary needs to be much larger to account for more words, and I would love to bring in full reviews rather than just the snippets from Rotten Tomatoes. Hopefully these iterations will show that there is a better way to define films, and that perhaps film personality is it.