I have been working on enterprise projects across all kinds of tech stacks, and one thing has been constant in every one of them: EVERY SINGLE PROJECT NEEDS A DATA WAREHOUSE.
From providing accurate and up-to-date reporting for strategic planning to helping you find new revenue streams and profit sources, data warehouse technology is now the cornerstone of every successful organization’s IT strategy. The end objective is to make better decisions with the available information. Thus, data warehousing is an essential piece of working in the field of data.
In this blog post, we’ll learn:
- What is Data Warehousing
- Why do we need a Data Warehouse
- How does Data Warehousing work
What is Data Warehousing?
To understand data warehousing, we first need to understand what a data warehouse is. According to Oracle, "A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics." To leverage the value of their large amounts of historical data, many companies across the globe have taken up data warehousing.
Data warehousing is the process of collecting and organizing disparate sources of information for analysis. "Disparate sources" is the key term here, and we'll unpack it with an example shortly; for now, data warehousing helps consolidate and better organize the stream of information that enters from different corporate branches and operations centers. This restructuring makes the data more analysis-focused and human-oriented, helping you quickly look up anything in the data.
In today's world of modern technology, the data warehouse itself may be more of a concept than the physical machine it was 10 or 15 years ago. Data warehousing takes information management to a higher level, so knowing what purpose data warehouses serve is of paramount importance.
Why do we need a Data Warehouse?
Just storing information isn't enough. It must be stored so that it's readily available for analysis and retrieval.
Let's understand this with the example of a business with a simple structure of Sales, Operations, HR, and Finance departments, each with its own disparate database and its own systems running.
While the sales team, reviewing the data at the start of the week, may report booking $50,000 in profit, the finance team may report a loss for the same period because of differences in how their systems record the data. These data silos end up delivering wrong or incomplete information when queried.
A data warehouse helps break down these silos by providing a single source of truth. The same data that previously lived in separate silos now flows into one consolidated location. If a manager asks the same question, how much profit did we book for a certain period, every department now gives one consistent, standardized, and accurate answer. Thus, a data warehouse keeps the credibility of the information intact. That leaves us with a question: how is it done? Do we just copy and paste the data into a single place and, voila, off we go? Not really. To make the data simple to analyze, the data warehousing process is an elaborate one.
How does data warehousing work?
The data warehousing process can be broken into three buckets: ETL, data modeling, and querying.
ETL is the process of extracting, transforming, and loading data into the data warehouse. It brings data from all the different departments into a single location. There could be one or a hundred different ETL processes, usually running on a schedule, maybe overnight or every 10 minutes, to bring that data in and update the warehouse. ETL aims to convert the information from the various source systems (silos) into standardized, cleaned-up, and consistent forms. It involves staging and cleansing the data before delivering it to the end users.
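The extract-transform-load flow described above can be sketched in Python. This is a minimal sketch, not a real pipeline: the CSV layout, the column names (`AcctNo`, `Amount`, `State`), and the `sales` table are all hypothetical, assumed only for illustration.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a departmental CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: standardize names, types, and formats so all sources agree."""
    out = []
    for r in rows:
        out.append({
            "account_id": r["AcctNo"].strip(),
            "amount": float(r["Amount"]),
            "state": r["State"].strip().upper(),
        })
    return out

def load(rows, conn):
    """Load: insert the cleaned rows into the warehouse table."""
    conn.executemany(
        "INSERT INTO sales (account_id, amount, state) VALUES (?, ?, ?)",
        [(r["account_id"], r["amount"], r["state"]) for r in rows],
    )
    conn.commit()
```

In practice each departmental source would get its own `transform` mapping, and a scheduler would run the whole flow overnight or every few minutes.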
Staging simply means keeping a carbon copy of the raw data. The first principle here is that you store everything and log where it came from and when you got it.
Cleansing means making the data understandable for everyone. For example, the raw data may have California recorded as Calif, CA, Cal, etc. To fix errors and normalize the data, you look up the ISO codes in a table dedicated to states and their ISO codes, update all the mentions of California to CA, and prepare the data for delivery.
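The California example above can be sketched as a small lookup. The variants table here is a hand-built assumption for illustration; a real cleansing step would draw on a maintained ISO reference table.

```python
# Hypothetical lookup table mapping messy state variants to ISO codes.
STATE_CODES = {
    "california": "CA", "calif": "CA", "cal": "CA", "ca": "CA",
    "new york": "NY", "ny": "NY",
}

def normalize_state(raw):
    """Map a free-form state string to its ISO code; keep unknowns as-is."""
    key = raw.strip().lower().rstrip(".")
    return STATE_CODES.get(key, raw)
```

Keeping unknown values unchanged (instead of dropping them) lets the delivery step log exactly which rows could not be normalized.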
In delivery, an essential step is logging what changes were made, so that if you're ever asked what the original data looked like and why you changed it, you have an answer.
When we store the data, we have to put it in a format that makes sense and can be queried easily. Hundreds of rows across different columns are not very useful without an understanding of what those rows mean and how they relate, which is why data modeling is essential.
There are generally two main ways to think about how to store the data: a normalized format and a de-normalized format.
Normalization refers to storing data in a way that allows it to be related and joined together. For example, one row might hold an account number, followed by the company’s name and the state/country it's from. Another row may reference the same account number but a different state. How do we know that these two rows refer to the same account? To relate them, they need specific columns with the same values. A normalized model keeps those fields together, so that when we try to join the two rows it's easy, because the fields sit next to each other and share the same format.
Alternatively, we could store the same data in a de-normalized model, repeating the descriptive fields on every row so that everything about an account sits in one wide record instead of being spread across multiple tables. The advantage is that queries against those fields need fewer joins, so performance is very high. The disadvantage is that it's harder to join this data with other information.
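The trade-off between the two shapes can be illustrated with SQLite. The table and column names here are hypothetical, chosen only to show the same question answered with and without a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: account details live in one table, referenced by key.
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, company TEXT, state TEXT)")
conn.execute("CREATE TABLE sales_norm (account_id TEXT, amount REAL)")
conn.execute("INSERT INTO accounts VALUES ('A1', 'Acme', 'CA')")
conn.execute("INSERT INTO sales_norm VALUES ('A1', 100.0)")

# De-normalized: company and state repeated on every sales row.
conn.execute("CREATE TABLE sales_wide (account_id TEXT, company TEXT, state TEXT, amount REAL)")
conn.execute("INSERT INTO sales_wide VALUES ('A1', 'Acme', 'CA', 100.0)")

# Same question, two shapes: the wide table answers it without a join.
joined = conn.execute("""
    SELECT a.company, s.amount FROM sales_norm s
    JOIN accounts a ON a.account_id = s.account_id
""").fetchone()
wide = conn.execute("SELECT company, amount FROM sales_wide").fetchone()
```

With one row the difference is invisible, but over millions of rows the wide table trades duplicated storage for join-free reads.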
The question in dimensional modeling is where we want to draw the line between normalized and de-normalized and which one we prefer for storage.
The correct answer is "It depends." It depends on the volume and consistency of the data, the queries you want to run against it, and so on.
Once we have our data modeled and stored, we need to query it and get meaningful results. The star schema is the most widely used method for joining information together. It consists of a fact table surrounded by dimension tables containing descriptive attributes about the rows in the fact table.
In a dataset, the fact table, the middle point of the star, generally stores the account numbers, sales amounts, and so on. It is surrounded by dimensions that provide contextual information about the rows in the fact table: typically a customer dimension containing your customer information, a date dimension holding the date and time of every transaction, and so on.
The star schema is the simplest style of data mart and the approach most widely used to develop data warehouses, in contrast to the highly normalized designs traditionally used for transaction processing. When you're writing SQL code, it's really nice not to have to join 50 different tables with all kinds of join conditions and make things overly complicated.
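A minimal star schema can be sketched with SQLite. The table names (`fact_sales`, `dim_customer`, `dim_date`) and the sample values are assumptions for illustration, not a prescribed layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table in the middle, two dimension tables around it.
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_sales (customer_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme', 'CA');
INSERT INTO dim_date VALUES (20240101, '2024-01-01', '2024-01');
INSERT INTO fact_sales VALUES (1, 20240101, 50000.0);
""")

# A typical star-schema query: join the fact table to its dimensions
# and aggregate. Two simple, readable joins answer the business question.
row = conn.execute("""
    SELECT c.name, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY c.name, d.month
""").fetchone()
```

Because every dimension joins directly to the fact table, adding a new question (profit by state, by month, by customer) is just another `GROUP BY`, not a new web of joins.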
Thus, a DATA WAREHOUSE Can Save You Time, Money, and Headaches
Data warehousing, whether or not it takes a physical manifestation such as a star schema, is all aimed at helping you deliver more value to your users. Wherever you're going in data, data warehousing is a core principle you need to grasp.
To continue your quest to learn all things data, head over to freethedataacademy.com/yt to see our entire catalog and sign up for a seven-day free trial. Start learning today to elevate your career tomorrow!