Big data datasets (large dataset examples)
When you first start working with MapReduce, Hadoop, MongoDB, or any other NoSQL approach, you might need some good sample big data datasets. Fortunately, those are pretty easy to find these days.
As I worked through some Hadoop and MongoDB tutorials last year, I made notes of the big data datasets I kept encountering, and jotted down their URLs. I just ran across my notes again, and thought I'd share the information.
Here then is a collection of publicly available big data datasets you can use in your own tests and examples:
- U.S. patent data
- Public data sets on AWS (Amazon)
- The Lemur project ClueWeb09 dataset (1B web pages)
- U.S. Census genealogy data
- Large health data sets (ehdp.com)
The Quora website has a list of large, publicly-available datasets.
A website named BigFastBlog has a list of large datasets.
Depending on your specific needs related to MapReduce, Hadoop, MongoDB, or NoSQL in general, hopefully some of these "big data" datasets will be helpful.
As usual, reporting live from Boulder, Colorado, this is Alvin Alexander of Valley Programming. (You can find more information about me as Alvin Alexander on Twitter, as well as thousands of programming tutorials on devdaily.com.)