The TRUTH About Big Data
Andrew McAfee was right when he prophesied, “The world is one big data problem.”
According to Seagate's Data Age 2025 report, the global datasphere will grow from 45 zettabytes to 175 zettabytes by 2025, and nearly 30% of the world's data will need real-time processing. These massive volumes of data are what people loosely refer to as Big Data.
However, there is more to Big Data than the adjective suggests.
In this blog post, we'll learn:
- What is Big Data
- Why Do You Need Big Data
- How Big Data Works
What is Big Data
Big data is defined as “The data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods.”
It isn't something that fits in a regular database or a spreadsheet. Big Data requires advanced ways of processing and comes in varying formats, both structured and unstructured.
While structured data is logged in an organized format, typically rows and columns that fit a relational database, unstructured data has no predefined schema: it could be free-form text, code, a markdown file, images, or video. Either kind can also arrive as streaming data, which flows in real time and demands an immediate reaction. Think fraud detection on a bank transaction.
Both formats flow into a data hub, from where they feed various data analysis, data science, and integration applications.
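To make the distinction concrete, here is a minimal Python sketch. The field names, values, and the "fraud-like" keyword rule are purely illustrative, not from any real system:

```python
# Structured data: a fixed schema of rows and columns,
# exactly what a relational database expects.
transactions = [
    {"id": 1, "account": "A-100", "amount": 42.50},
    {"id": 2, "account": "A-100", "amount": 9800.00},
]

# Unstructured data: free-form text with no predefined schema.
support_ticket = "My card was charged twice at the airport, please help!"

# Structured data can be queried directly...
large = [t for t in transactions if t["amount"] > 5000]
print(len(large))  # → 1

# ...while unstructured data needs extra processing (search, NLP, etc.)
flagged = "charged twice" in support_ticket.lower()
print(flagged)  # → True
```

The point is not the specific query but the shape of the data: rows with named columns can be filtered mechanically, while free text has to be interpreted first.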
Over time, researchers have identified three traits, the famous "V"s, that qualify data as Big Data:
Volume: The size of data, which is insanely big
Velocity: The speed at which the data is coming in, i.e., real-time over batch processing
Variety: The different formats of data ranging from structured to unstructured
Now that we understand what Big Data is, let's find out whether the fuss around it is real and how it evolved. For that, you should know:
Why Do You NEED Big Data
Four prominent use cases have driven the innovation and development of what we call Big Data:
- Logs: A log is a record of your activity. Every time you visit a website, open an app, and so on, that action is recorded as a log entry. Multiply the number of actions you perform by the billions of people worldwide with access to devices and the internet. Yes, that staggering number is "just" the textual record of your activity.
- Internet of Things (IoT): The number of IoT devices worldwide is forecast to almost triple, from 8.74 billion in 2020 to more than 25.4 billion in 2030, massively inflating Big Data. It is hard to even estimate how much all your smart gadgets, TVs, watches, fridges, and so on contribute.
- Media: On sites like YouTube and many others, users can upload all kinds of heavyweight content. A single video file on YouTube can easily exceed one gigabyte, and with millions of people uploading every day, it really adds up.
- Cloud: With the advent of cloud technologies, a whole new suite of apps can generate and consume ever more data.
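As a concrete illustration of the first item, a single web-server log line already carries several structured fields. The sketch below parses one made-up line in the classic Common Log Format; the IP, path, and values are hypothetical:

```python
import re

# A made-up log line in the classic Common Log Format.
line = '203.0.113.7 - - [12/Mar/2023:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'

# Pull out the client IP, timestamp, request, status code, and byte count.
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(.*?)" (\d{3}) (\d+)'
m = re.match(pattern, line)
ip, ts, request, status, size = m.groups()

print(ip)      # → 203.0.113.7
print(status)  # → 200
```

Multiply one such line by every click from billions of users and the "logs" use case above stops sounding abstract.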
By now, you may be overwhelmed by how much Big Data is generated. To make things concrete, let's look at an example.
Every second, 147,000 photos are uploaded to Facebook. Given that Facebook has 2.23 billion monthly active users, you can begin to estimate the staggering amount of data Facebook generates each second from images alone!
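A quick back-of-the-envelope calculation shows how fast that rate compounds. The 2 MB average photo size below is an assumption for illustration, not a published figure:

```python
photos_per_second = 147_000
seconds_per_day = 60 * 60 * 24                 # 86,400 seconds in a day

photos_per_day = photos_per_second * seconds_per_day
print(photos_per_day)                          # → 12700800000, about 12.7 billion photos a day

avg_photo_mb = 2                               # assumed average size, illustration only
petabytes_per_day = photos_per_day * avg_photo_mb / 1_000_000_000
print(round(petabytes_per_day, 1))             # → 25.4 petabytes of photo data per day
```

Even under that modest size assumption, photos alone would produce tens of petabytes a day, which is exactly the scale where "traditional methods" stop working.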
This brings us to the next question:
How Big Data Works
Without diving into the advanced and sophisticated details, let's look at the basic structure of how Big Data flows.
The Big Data files, photos, videos, and so on arrive in a central repository called the data hub. From there, the data is loaded into reliable distributed systems that scale storage and process queries in parallel.
An example of such a system is Hadoop, where data is written in triplicate to avoid losing information if a node goes down. Queries then run on that data in parallel.
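The replication idea can be sketched in a few lines of Python. This is a toy model of HDFS-style 3x replication, not Hadoop's actual implementation; node names and the block API are invented for the sketch:

```python
import random

REPLICATION_FACTOR = 3  # HDFS's default: each block is stored on 3 nodes

# A tiny "cluster": node name -> list of blocks it holds.
nodes = {f"node-{i}": [] for i in range(5)}

def write_block(block_id):
    # Place copies of the block on 3 distinct nodes.
    for node in random.sample(list(nodes), REPLICATION_FACTOR):
        nodes[node].append(block_id)

def read_block(block_id, failed):
    # Any surviving replica can serve the read.
    for node, blocks in nodes.items():
        if node not in failed and block_id in blocks:
            return node
    return None

write_block("block-1")
# Even with two nodes down, at least one of the three copies survives.
print(read_block("block-1", failed={"node-0", "node-1"}) is not None)  # → True
```

With three replicas on distinct nodes, any two simultaneous node failures still leave one readable copy, which is why triplicate writing is such a common default.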
No matter how advanced the data warehouse is, there are fundamental trade-offs, captured by the CAP theorem.
Eric Brewer formulated this theorem, which states that a distributed system can provide only two of the following three guarantees:
- Consistency
- Availability
- Partition Tolerance
Consistency means that every time someone reads data from your database, they receive the most up-to-date information or get an error. Take a bank transaction: if I bought something and then tried to buy something else a millisecond later, that second transaction needs to see that the first one completed. Otherwise, we could have fraud on our hands, and people could steal a ton of money.
The second is Availability: every request to your database receives a response, though not necessarily the most up-to-date and accurate one.
The P in our CAP theorem is Partition Tolerance: the system continues to operate even when the network between its nodes breaks down and messages are lost or delayed. In modern distributed systems, partition tolerance is not an option; it's a necessity. You obviously can't serve 2.23 billion monthly active users from a single machine, and once data is spread across many machines, network failures between them are inevitable. So in practice, the real choice is between consistency and availability.
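A toy model makes the trade-off tangible. Below, a primary and a replica each hold a value; during a simulated partition the replica stops receiving updates, so a "CP"-style read refuses to answer while an "AP"-style read serves possibly stale data. This is an illustrative sketch, not any real database's API:

```python
class TinyStore:
    def __init__(self):
        self.primary = {"balance": 100}
        self.replica = {"balance": 100}
        self.partitioned = False  # simulated network partition flag

    def write(self, key, value):
        self.primary[key] = value
        if not self.partitioned:
            self.replica[key] = value  # replication only works when connected

    def read(self, key, mode):
        if mode == "CP":
            # Consistency over availability: refuse to answer during a partition.
            if self.partitioned:
                raise RuntimeError("partition: cannot guarantee fresh data")
            return self.replica[key]
        # "AP": availability over consistency, so the answer may be stale.
        return self.replica[key]

store = TinyStore()
store.partitioned = True
store.write("balance", 40)          # the replica never hears about this write

print(store.read("balance", "AP"))  # → 100 (stale but available)
```

An "AP" read always responds but may be out of date, exactly the Availability behavior described above; a "CP" read in the same situation would raise an error rather than risk serving stale data.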
Take the most popular such system, Apache Hadoop, for example. Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. Because it is open source, thousands of contributors keep enhancing it and building new platforms on top of it, making it even more valuable.
Thus, Big Data technology is ever-evolving, and the fuss about it is REAL.
So that was just a taste of the world of Big Data. You should now be equipped to hold a conversation about Big Data, and at least know what you don't know.
To learn more, head over to freethedataacademy.com/yt to see our entire catalog and sign up for a seven-day free trial, so you can start learning today to elevate your career tomorrow.