Data Sources
This page describes several places where you might look for interesting data to use in your SPIS projects.
Here is an overview, followed by more specific information about each one.
- Julian McAuley’s Amazon Data sets: http://jmcauley.ucsd.edu/data/amazon/
- The CORGIS datasets for Python: https://think.cs.vt.edu/corgis/python/index.html
- CORGIS is a Collection of Really Great, Interesting, Situated Datasets, collected by Austin Cory Bart, a Ph.D. student in Computer Science Education at Virginia Tech, along with several other collaborators.
- The datasets are updated periodically and cover many topics, including Art, Economics, Geography, History, Literature, Music, Politics, and Travel. There are over 40 datasets, each of which comes with Python code to access it.
- Many of these datasets are of sufficient size to be considered “big data”, but only if you are careful about setting the test=False parameter. Read the documentation for each dataset carefully. More info below.
- New York Times Data Journalism: http://www.nytimes.com/section/upshot
- Nate Silver’s Election analysis and more: http://fivethirtyeight.com/
- JavaScript Library for Data Manipulation: https://d3js.org
Working with Reddit Data
- Reddit Data Visualization: https://www.reddit.com/r/dataisbeautiful/
- Articles from the SPIS 2016 website that relate to getting Reddit data:
- Python: JSON, an article about JSON in Python in general, but it uses data from Reddit as an example.
- Python: Requests: User-Agent, a general article about setting the User-Agent header to avoid problems when accessing web content from Python, again using Reddit as an example (see the sketch below).
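As a rough illustration of what those two articles cover, here is a minimal sketch. It assumes the requests library is installed, uses a placeholder User-Agent string you would replace with your own, and assumes the subreddit’s .json feed has the usual Reddit Listing structure:
import requests

# A descriptive User-Agent string (this one is just a placeholder) helps
# avoid being blocked or rate-limited when requesting Reddit pages from Python.
headers = {"User-Agent": "spis-data-project by /u/your_username"}

# Appending .json to a subreddit URL returns its front-page listing as JSON.
url = "https://www.reddit.com/r/dataisbeautiful/.json"

response = requests.get(url, headers=headers)
data = response.json()

# Print the title of each post in the listing.
for post in data["data"]["children"]:
    print(post["data"]["title"])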
CORGIS datasets for Python
When working with the CORGIS datasets for Python, be sure to read the part about the test=False parameter.
For many of these datasets, you only get a small sample of the data when you use this code:
import cars
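# These calls use the default test setting, so they return only a small sample of the data.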
list_of_car = cars.get_cars()
list_of_car = cars.get_cars_by_year("2001")
list_of_car = cars.get_cars_by_make("'Pontiac'")
If instead you set the test parameter to False, you get a much larger data set that could be considered “big data”:
import cars
# These may be slow!
list_of_car = cars.get_cars(test=False)
list_of_car = cars.get_cars_by_year("2001", test=False)
list_of_car = cars.get_cars_by_make("'Pontiac'", test=False)
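To see the difference in scale for yourself, a minimal sketch along these lines should work, assuming the CORGIS functions return ordinary Python lists as the examples above suggest (the exact record counts will vary by dataset):
import cars

# Default call: small built-in sample.
sample = cars.get_cars()

# Full dataset: this call may take noticeably longer.
full = cars.get_cars(test=False)

print("sample size:", len(sample))
print("full dataset size:", len(full))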