Big data datasets (large dataset examples)
When you first start working with MapReduce, Hadoop, MongoDB, or any other NoSQL approach, you might need some good sample big data datasets. Fortunately, those are pretty easy to find these days.
As I worked through some Hadoop and MongoDB tutorials last year, I made notes of the big data datasets I kept encountering, and jotted down their URLs. I just ran across my notes again, and thought I'd share the information.
Here then is a collection of publicly available big data datasets you can use in your own tests and examples:
- U.S. patent data
- Public data sets on AWS (Amazon)
- The Lemur project ClueWeb09 dataset (1B web pages)
- U.S. Census genealogy data
- Large health data sets (ehdp.com)
The Quora website has a list of large, publicly-available datasets.
A website named BigFastBlog has a list of large datasets.
Depending on your specific needs related to MapReduce, Hadoop, MongoDB, or NoSQL in general, hopefully some of those "big data" datasets will be helpful.
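If you just want a quick way to poke at one of these datasets before setting up Hadoop, here is a minimal sketch of a MapReduce-style count in plain Python. It is only an illustration, not how Hadoop itself runs a job. The column names (`patent_id`, `country`) and the inline sample data are hypothetical stand-ins; a real file from any of the sources above will have its own layout.

```python
import csv
import io
from collections import Counter
from functools import reduce


def map_record(row, column):
    """Map step: emit a (key, 1) pair for one record."""
    return (row[column], 1)


def reduce_counts(acc, pair):
    """Reduce step: fold one (key, count) pair into the running totals."""
    key, count = pair
    acc[key] += count
    return acc


def count_by_column(lines, column):
    """Stream records one at a time so the whole file never sits in memory."""
    reader = csv.DictReader(lines)
    pairs = (map_record(row, column) for row in reader)  # lazy generator
    return reduce(reduce_counts, pairs, Counter())


# Hypothetical sample data standing in for a downloaded CSV file.
# For a real file, pass open("your_file.csv", newline="") instead.
sample = io.StringIO("patent_id,country\n1,US\n2,JP\n3,US\n")
counts = count_by_column(sample, "country")
print(counts.most_common())
```

Because the reader and the generator are both lazy, the same function should work on a file much larger than memory, as long as the number of distinct keys stays reasonable.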
As usual, reporting live from Boulder, Colorado, this is Alvin Alexander of Valley Programming. (You can find more information about me as Alvin Alexander on Twitter, as well as thousands of programming tutorials on devdaily.com.)