Cool Data Resources
Jonathan Graves, resources, UBC, learning, research, GitHub
Over time, I’ve been slowly collecting interesting datasets and places where datasets are located to help students locate or get inspired about data. Here is my current list - if you have suggestions, email me to add them at jonathan.graves@ubc.ca or submit an issue here!
If you’re searching for an idea, browse around and see if there’s a cool dataset you think might be fun to explore.
I make no particular guarantees about the accuracy or availability of any of these resouces. Your mileage may vary!
Other Notes:
- ★ means a particularly interesting dataset
- ☆ means a data clearinghouse, with many datasets available.
- I’ve also loosely divided them up by category
- This page uses Nicky Case’s excellent Nutshell tool to add comments. :Click on them to see the comments
:x clicky
Just like this!
Search Engines
- Google Data Search: Google has a beta version of a dataset search, which uses metadata to find and select datasets from sites online. It’s still in a rough version, and isn’t as clean as it could be, but it’s still quite useful: https://toolbox.google.com/datasetsearch
Political, Media, Journalism
- OpenSecrets: this is US political data on lobbying, politicians, campaign finance, and other related topics. Uncover corruption! Hold politicians accountable! Re-enact Mr Smith Goes to Washington! http://www.opensecrets.org/resources/create/data.php
- PEW Journalism Data: US data on news, the media, election campaign coverage, etc. Is the media biased? Maybe you can find out! http://www.journalism.org/datasets/
- Russian Twitter Troll Data: 538 published 3 million tweets by identified Russian-controlled troll accounts, designed to influence and manipulate public opinion. Spies! Politics! Exciting stuff!
https://github.com/fivethirtyeight/russian-troll-tweets/ (user-friendlier versions and ideas here)
Business, Internet, Web 2.0-ish
- Yelp! Dataset: a GIGANTIC dataset on restaurants, food reviews, and everything Yelp-centric. Lots of spatial data about restaurants. Does immigration lead to better restaurant options? Find out! https://www.yelp.ca/dataset
- Uber Movement: data on UBER trips in different cities, updated over time. Good for cities? Not? Will it work? Where? Try to figure it out! https://movement.uber.com/cities?lang=en-CA
- Cryptocurrencies (BitCoin, etc.): is it a fad? Is the next big thing? Draw your own conclusions instead of listening to your annoying cousin at Thanksgiving who won’t stop talking about it! https://coinmetrics.io/data-downloads/
The Coinbase API for developers is also pretty useful https://developers.coinbase.com/
- Fama-French Data Library: do you like the Fama-French model? What about huge amounts of factor and returns data? OK, here you go: https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
Data repositories
- ★☆ Kaggle: a big warehouse of all kinds of cool Web 2.0-ish data gathered from online. Seriously, so much data; they even run competitions, which you might be able to win! (Be sure to check license and copyright, though) https://www.kaggle.com/datasets
- Also includes lots of “open data” initiatives, like New York, etc.
- ☆ ICPSR: Michigan’s gigantic repository of hundreds and hundreds of social science and statistics data-sets. The old-school Web 1.0 way to get data, particularly useful for government or political datasets, or ones other people have used. In particular, has US Census and other related statistical data. https://www.icpsr.umich.edu/icpsrweb/
- ★☆ ABACUS: UBC’s giant data warehouse, with all kinds of different studies. ICPSR for Canadians.
Big benefit: has all of the standard Statistics Canada microdatasets like the GSS or LFS. http://dvn.library.ubc.ca/dvn/
- ☆ Literally what the URL says (be sure to check license and copyright, though) https://www.cooldatasets.com/
- ☆ Amazon’s AWS: gigantic (literally) collection of Amazon data on every concievable topic relevant for learning about the world. Mostly designed for machine learning purposes, but maybe you can find something interesting before our robot overlords do? https://aws.amazon.com/public-datasets/
Demographic and Microeconomic
- PUMF: the Statistics Canada Public use microdata is all on Abacus, above. This is dozens of detail microdatasets. See the directory.
- KEEP: the Korean Education & Employment Panel Survey, a large panel dataset of nationally representative Koreans from childhood, and their vocational outcomes. Very rich! https://www.krivet.re.kr/eng/eu/eg/euCAADs.jsp
- American Community Survey (ACS): one of the most valuable US Census Bureau products, the ACS is available at both household and individual, with 5% microdata samples available at both the 1 year and 5 year levels. https://www.census.gov/programs-surveys/acs/microdata/access.html
- There are two formats available on their direct download site (SAS and
csv
) and a data picker if you know what you already want. The data is also available in aggregate national and state-level forms. - You can find https://www.census.gov/programs-surveys.html
- See also the R
acs
package: https://cran.r-project.org/web/packages/acs/acs.pdf
- There are two formats available on their direct download site (SAS and
- Socioeconomic High-resolution Rural-Urban Geographic Platform for India (SHRUG): a very large set of very detailed data covering India. Covers 500,000 villages and 8000 towns over 25 years, all with common geographic identifiers. https://www.devdatalab.org/shrug
Geospatial and City-specific Data
- New York City’s Open Data: all you wanted to know about New York! Match it up with the Yelp Data? Find the best pizza joint? https://opendata.cityofnewyork.us/
- Vancouver City Data: http://vancouver.ca/your-government/open-data-catalogue.aspx http://vancouver.ca/your-government/vanmap.aspx
- Cancensus: for outstanding geospatial census data from Canada, including rare tax assessment data, check out the
cancensus
package: https://github.com/mountainMath/cancensus
Macroeconomic Data
- World Bank Data: all of the world bank’s development indicators, and data from their many surveys they’ve collected over the years https://data.worldbank.org/
They also have a STATA interface, where you can import the data directly into STATA once you know the variables you want. See the documentation
They also have an even better R package (
wbstats
) with some very solid documentation
- CANSIM: StatsCanada’s macroeconomic and financial database. If you’re looking for Canadian macro data, here’s the source - https://www150.statcan.gc.ca/n1/en/type/data
CANSIM data is usually easier to pull indirectly; check out NESSTAR on the UBC library page. There’s also an R package that can directly collect StatsCan data. If you use R the excellent
cansim
package does this too: https://mountainmath.github.io/cansim/
- FRED: the US Federal Reserve’s main economic database for macro statistics; includes other countries as well, and is very easy to use https://fred.stlouisfed.org/
What do you know, there’s an R package for FRED too:
fredr
- Penn World Tables: the “gold standard” for development macrodata, especially involving productivity https://www.rug.nl/ggdc/productivity/pwt/
Sports Data
Sports sports sports sports! A big thank-you to my student Jack Siemens, who knows everything about baseball and analytics. He curated most of this list; I’ve added his comments in quotes here (as JS).
- NHL Historical Data: all the “standard” hockey scores for every player and team since 1917. Figure out what made the Great One so great. http://www.nhl.com/stats/
- ☆ Football Data: football fans are the classic fanatics, and they’re also fanatic about their data. Here’s a collection of a bunch of different resources. https://datahub.io/awesome/football
- ☆ Hockey Data: not to be outdone, the hockey fans have struck back with their own giant list of hockey data. Take that, soccer people! Link
- Cricsheet: it’s cricket. I don’t know anything about cricket. Maybe you do. Maybe you can explain it to me. https://cricsheet.org/
Reference Sites
- Pro Football Reference (NFL) - http://www.pro-football-reference.com/
- Basketball Reference - NBA/WNBA - NBA/WNBA - http://www.basketball-reference.com/
- Hockey Reference (NHL) - http://www.hockey-reference.com/
- Football Reference (Soccer) - http://fbref.com/
Baseball
- Retrosheet: literally every single baseball play and event since 1915. https://www.retrosheet.org/game.htm
- It is painful and ancient, but new generations must suffer!
- ☆ Fangraphs - Baseball: your single source for all stats baseball. https://www.fangraphs.com/
- Baseball Savant: very detailed, MLB-sourced data on raw stats. https://baseballsavant.mlb.com/
- Baseball Reference: a more modern version of Retrosheet, without the suffering. Historical data. https://www.baseball-reference.com/
baseballr
: you know it exists. A baseball data API, pulling from most of the above. There’s also a Python version. https://billpetti.github.io/baseballr/
R
-packages for Sports
cfbfastR
- College Football - https://github.com/sportsdataverse/cfbfastRhoopR
- NBA - https://github.com/sportsdataverse/hoopRwehoop
- WNBA/NCAA Womens - https://github.com/sportsdataverse/wehoopsportyR
- Visualization Package - https://github.com/sportsdataverse/sportyRchessR
- Chess - https://github.com/jaseziv/chessR/fastRhockey
- NHL/PHF - https://github.com/sportsdataverse/fastRhockeyworldfootballR
- International Football (Soccer) - https://github.com/jaseziv/worldfootballR/
Someone who knows about cricket or American football make me some suggestions!
There are many more options out there, so keep your eyes open!
:x comment1
I can’t express how much I love this website. I’ve spent hundreds of hours browsing it, and it’s perfect in nearly every way. I wouldn’t be enrolled in this class right now without Fangraphs, and I say that with absolute certainty. In terms of its function, it’s a one stop shop for all things Sabermetrics - from basic stats like Retrosheet, to hyper advanced ones, to articles, to projection models, to scouting, to community research. This website is the backbone of the analytically inclined baseball community. - JS
:x comment2
Baseball Savant is a MLB run site dedicated to Statcast, an initiative introduced in 2015 to collect data from every action that takes place on the field through the lenses of around a dozen high frame rate cameras (75-300fps). Compared to FanGraphs, which is primarily calculated metrics from in-game results, all Statcast metrics have a level of rawness to them. Instead of ERA, you will find average exit velocity and launch angle. - JS
:x comment3
Baseball Reference is essentially the 2022 version of Retrosheet. For those who are more interested in historical research, this is the best option available. - JS
:x comment4
When it comes to running more intensive data analysis, I prefer to utilize this API package to receive data. It is directly linked to Statcast data, as well as FanGraphs, which makes it super easy. Also, to my understanding this (and a similar package for Python) are the only methods of retrieving raw play-by-play statcast data, which is necessary in creating advanced models - JS