Quick intro to the Kickstarter Data

In [1]:
from __future__ import print_function
import numpy as np
import json
In [2]:
with open("kickstarter.jsonlist") as f:
    # The first line of the file holds the whole dataset as one JSON list
    dataset = json.loads(f.readline())
np.random.shuffle(dataset)  # just for fun :-)

Let's get some basic statistics, to know what we are working with here...

In [3]:
print("There are {} projects in the dataset".format(len(dataset)))
There are 45815 projects in the dataset

What information does each project contain?

In [4]:
print(type(dataset[0]))
<type 'dict'>
In [5]:
print(dataset[0].keys())
[u'raised', u'sub_category', u'text', u'creator_num_backed', u'featured', u'result', u'duration', u'category', u'goal', u'creator_facebook_connect', u'projectId', u'lon', u'has_video', u'comments', u'faqs', u'start_date', u'rewards', u'end_date', u'parent_category', u'updates', u'lat', u'short_text', u'name', u'url', u'backers']
In [6]:
for i in range(5):
    print("{}: {}".format(dataset[i]['name'],
                                "Success" if dataset[i]['result'] else "Failure"))
    print(dataset[i]['url'] + "\n")
No Regrets for Our Youth: Success

Documentary: Music on Foot (Walking Massachusetts): Success

Already There: The Story of the Kwoncok Project: Failure

Public Arts Project 66: Success

Life Abstract: Failure

Interesting! We have the following fields of interest...

  1. The text of the Kickstarter Project
  2. The category of the Kickstarter Project
  3. Some information about rewards, backers, etc.
  4. Whether or not the project was successful

How many projects are successful?

In [7]:
print("Success rate of Kickstarter projects: {}/{}".format(len([d for d in dataset if d['result']]), len(dataset)))
Success rate of Kickstarter projects: 23604/45815

How many backers do projects generally get?

Let's make a histogram!

In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
# Total backers per project, summed over all of its reward tiers
plt.hist([sum(r['num_backers'] for r in p['rewards']) for p in dataset], bins=100)

This is a very skewed distribution! Let's see if we can make the plot more readable by excluding the 1000 largest projects and using a log scale on the y-axis.

In [9]:
# Total backers per project, summed over all of its reward tiers
n_backers = np.array([sum(r['num_backers'] for r in p['rewards']) for p in dataset])
n_backers = np.sort(n_backers)
n_backers = n_backers[:-1000]  # drop the 1000 largest projects
plt.hist(n_backers, bins=100, log=True)
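
For a distribution this skewed, comparing the mean with the median can be easier to read than a histogram. The sketch below uses a small made-up list (`toy_backers`, not taken from the dataset) and a hypothetical `median` helper to show the idea; on the real data one would pass the per-project backer counts instead.

```python
# Hypothetical backer counts, standing in for the real per-project totals
toy_backers = [0, 1, 3, 5, 8, 12, 20, 35, 60, 5000]

def median(values):
    """Middle value of a list (mean of the two middle values if the length is even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0

toy_mean = sum(toy_backers) / float(len(toy_backers))
toy_median = median(toy_backers)

# A mean far above the median is a quick sign of a long right tail
print("mean={:.1f} median={:.1f}".format(toy_mean, toy_median))
```

A single very large project is enough to pull the mean well away from the median, which is the same effect that squashes the first histogram.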

What are the types of categories in our dataset?

In [10]:
print(set([x['category'] for x in dataset]))
set([u'', u'Film & Video', u'Fashion', u'Art', u'Publishing', u'Food', u'Photography', u'Comics', u'Design', u'Games', u'Theater', u'Music', u'Technology', u'Dance'])
In [11]:
print(set([x['sub_category'] for x in dataset]))
set([u'', u'Jazz', None, u'Performance Art', u'Conceptual Art', u'Poetry', u'Fiction', u'Classical Music', u'Animation', u'Art Book', u'Digital Art', u'Indie Rock', u'Board & Card Games', u'Painting', u'Crafts', u'Video Games', u'Illustration', u'Public Art', u'Country & Folk', u'Open Hardware', u'Narrative Film', u'Electronic Music', u'Journalism', u'Webseries', u'Graphic Design', u'Short Film', u'Product Design', u"Children's Book", u'World Music', u'Rock', u'Documentary', u'Hip-Hop', u'Open Software', u'Pop', u'Nonfiction', u'Periodical', u'Sculpture', u'Mixed Media'])
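
The output above includes both an empty string and `None` among the sub-categories, so missing labels appear in two forms. A minimal sketch of one way to fold them together, assuming a hypothetical helper `normalize_label` and a made-up `toy_projects` list (neither is part of the original notebook):

```python
from collections import Counter

def normalize_label(label, missing="(none)"):
    """Map both None and the empty string to a single placeholder."""
    return label if label else missing

# Made-up records that mimic the 'sub_category' field
toy_projects = [
    {"sub_category": "Jazz"},
    {"sub_category": ""},
    {"sub_category": None},
    {"sub_category": "Jazz"},
]

# Count projects per normalized sub-category
counts = Counter(normalize_label(p["sub_category"]) for p in toy_projects)
print(counts.most_common())
```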

How many projects are in each category? What are the success rates for each category?

In [12]:
from collections import defaultdict
# Group projects by their top-level category
cat_to_proj = defaultdict(list)
for p in dataset:
    cat_to_proj[p['category']].append(p)
In [13]:
for c, projs in cat_to_proj.iteritems():
    print("{}: {} ({:.3f}%)".format(c, len(projs), 100.*len([p for p in projs if p['result']==1])/len(projs)))
: 5 (0.000%)
Film & Video: 13502 (47.786%)
Fashion: 1134 (31.481%)
Art: 4236 (53.447%)
Publishing: 4761 (36.967%)
Food: 1431 (48.008%)
Photography: 1508 (43.899%)
Comics: 1068 (51.966%)
Design: 1507 (44.658%)
Games: 1728 (41.725%)
Theater: 2484 (67.351%)
Music: 10884 (63.929%)
Technology: 808 (38.243%)
Dance: 759 (70.224%)
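
The listing above is easier to compare once it is sorted. As a small sketch, the figures printed above are copied into a list (`category_stats` is just an illustrative name), checked against the dataset size, and ranked by success rate:

```python
# (category, number of projects, success rate in %) copied from the output above
category_stats = [
    ("", 5, 0.000),
    ("Film & Video", 13502, 47.786),
    ("Fashion", 1134, 31.481),
    ("Art", 4236, 53.447),
    ("Publishing", 4761, 36.967),
    ("Food", 1431, 48.008),
    ("Photography", 1508, 43.899),
    ("Comics", 1068, 51.966),
    ("Design", 1507, 44.658),
    ("Games", 1728, 41.725),
    ("Theater", 2484, 67.351),
    ("Music", 10884, 63.929),
    ("Technology", 808, 38.243),
    ("Dance", 759, 70.224),
]

# The per-category counts should add up to the full dataset size
total = sum(n for _, n, _ in category_stats)

# Rank categories from highest to lowest success rate
ranked = sorted(category_stats, key=lambda row: row[2], reverse=True)
for name, n, rate in ranked[:3]:
    print("{}: {:.3f}% of {} projects".format(name, rate, n))
```

The counts sum to 45815, matching the dataset size reported earlier, and the three highest success rates belong to Dance, Theater, and Music.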

Question of the day: can we predict success given the text of a project?
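
Before building anything text-based, it helps to know what a trivial predictor would score. The sketch below uses the counts printed earlier (23604 successes out of 45815 projects) to compute the accuracy of always guessing the more common outcome; `majority_baseline` is a hypothetical helper name, not something from the notebook.

```python
def majority_baseline(n_positive, n_total):
    """Accuracy of always predicting whichever class is more common."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / float(n_total)

# Counts taken from the success-rate cell earlier in the notebook
baseline = majority_baseline(23604, 45815)
print("Majority-class baseline accuracy: {:.1%}".format(baseline))
```

A model that uses the project text would need to beat roughly this figure to be worth the effort.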