Over the years, the roles in the field of data have transformed dramatically, but the concepts behind them and the functions they serve have stayed the same. Data has been around forever, powering our lives. Starting in the early ’60s, companies looked at ways to organize this information and make something useful out of it. Going into the ’90s, we had a new concept called star schemas, which we’ll touch on later. Also in the ’90s, with the boom of the internet, the concept of big data was introduced. The original idea of the data warehouse came from Bill Inmon, considered by many to be the father of the data warehouse, who defined it as “a subject-oriented, integrated, time-variant and non-volatile collection of data to support management’s decision-making process.”
That’s a jargony, technical way of saying you want to model your data in a way that mirrors your business, so that when you ask questions of this data, the answers make sense and align with how the business works. In practice, this became incredibly difficult, and data warehousing got a bad rap because of it. The projects would take forever, run on, and never actually be finished. As businesses and technology advanced and rapidly changed, this model simply couldn’t keep up. That’s where Dr. Ralph Kimball comes in. In his book, The Data Warehouse Toolkit, he introduced a similar idea that was smaller and easier to manage. It is what we call a star schema. With this model, you have a mini data warehouse for each business process, and they share some common data sets. Collectively, these are known as star schemas. They eventually make up your entire business but are much more flexible and easier to build. Let’s look at an example here.
In the star schema, we’re looking at the sales process. So we have a table; think of it as a location where we put data. Every time we have a sale, we log it here: we write it down, we put it into a spreadsheet. From there, we want to look at the sales by customer. We want to figure out who bought what, and we need to know where those customers are from. We also want a separate way of looking at sales by region: who made that sale, what team they’re a part of, what group they’re in, what location they’re from. We could add another table here for the salesperson. If we wanted to look at sales over time, say how they were last year compared to this year, we would need a separate table for that as well. Last, making up our mini star schema here, we have the product: what was sold. The idea was to structure a mini data warehouse around a single business process like you’re seeing here. In the middle, in light blue, are the fact records, or fact tables as you may see them called. These are just transactions, logs of what happened, with pointers to all the other relevant dimensions, which are what you’re seeing in light purple. These dimensions and facts make up a star-like shape that makes it easy for analysts to query the data, ask questions, get answers, and help the business grow.
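To make the sales star schema described above a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. Every table name, column name, and row of sample data is a hypothetical stand-in chosen for illustration; none of it comes from the diagram itself. It shows one fact table of sales pointing at customer, salesperson, product, and date dimensions, plus two of the questions an analyst might ask.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: four dimension tables around one fact table.
# All names and sample values are illustrative assumptions.
conn.executescript("""
CREATE TABLE dim_customer    (customer_id    INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT, team TEXT);
CREATE TABLE dim_product     (product_id     INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date        (date_id        INTEGER PRIMARY KEY, year INTEGER);

-- The fact table logs each sale and points at every dimension.
CREATE TABLE fact_sales (
    sale_id        INTEGER PRIMARY KEY,
    customer_id    INTEGER REFERENCES dim_customer(customer_id),
    salesperson_id INTEGER REFERENCES dim_salesperson(salesperson_id),
    product_id     INTEGER REFERENCES dim_product(product_id),
    date_id        INTEGER REFERENCES dim_date(date_id),
    amount         REAL
);

INSERT INTO dim_customer    VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East');
INSERT INTO dim_salesperson VALUES (1, 'Dana', 'Enterprise'), (2, 'Lee', 'SMB');
INSERT INTO dim_product     VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date        VALUES (1, 2023), (2, 2024);
INSERT INTO fact_sales VALUES
    (1, 1, 1, 1, 1, 100.0),
    (2, 1, 1, 2, 2, 250.0),
    (3, 2, 2, 1, 2, 75.0);
""")

# Question 1: total sales by customer region (fact joined to one dimension).
sales_by_region = conn.execute("""
    SELECT c.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Question 2: total sales by year, for a this-year versus last-year comparison.
sales_by_year = conn.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY d.year
    ORDER BY d.year
""").fetchall()

print(sales_by_region)
print(sales_by_year)
```

Notice that each question is the same shape: start from the fact table, join to whichever dimension you care about, and aggregate. That repeatable pattern is a big part of why this layout is easy for analysts to query.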
Meanwhile, if the customer data changes or needs a new dimension added to it, the platform supports that and it is easily done. The changes could be made in a short period, and the business would immediately get value from them. Around that time, we started hearing the term business intelligence in American corporate circles. Today, business intelligence and data science are essentially the same thing, sometimes also called data analytics. Back then, it was still a niche area that was more IT than it was business. This was the first time we tried to bridge those gaps. With the star schema, the practice of structuring your data in such a way to allow questions to be asked was becoming more prevalent. Dr. Ralph Kimball propelled this part of the industry forward to where it is today, making it possible for a lot of us to do our jobs in a much more effective way.
We also saw new tools come out that changed how all these things work. Statisticians ran algorithms on large data sets to predict future outcomes. Analysts once confined to Excel now wrote SQL code against databases and created visual ways for folks to interact with the data. Engineers were helping to set up and manage all of these things so that the analysts and statisticians, or as we now call them, data scientists, were able to be more effective. These roles all kind of blended together into what is now a data ecosystem or a data team. Since these roles have blurred, morphed, and evolved over time, I’ve done a lot of research prior to making this course to keep all the information as up to date as it could be. Aside from my own experience as a consultant across many industries and companies, I’ve recently interviewed a lot of folks in the industry to make sure that I had the most current information for you.
Let’s look at how the technology stack in this field works, and then how the roles map to that stack. At the base layer, you have the data platform, which I’m generalizing here as the place where everyone in the data organization goes to get data for their work. We’ll dig into this later. Think of it as a grocery store filled with all the stuff you could want for whatever occasion. From there, we have our Analysis layer. This is where we take our fresh data from the platform, apply business logic, and add relevant context to it. From that, we’re able to deliver fantastic prepared meals for our users to consume.
Continuing with the grocery-store analogy, this is our backyard barbecue with some hamburgers and hot dogs, or tofu dogs if you’re into that kind of thing, and some refreshing beverages. A simple but very fulfilling meal. But what if our users have a more refined palate? Well then, we need to introduce more advanced ways of analyzing the data, such as statistical modeling and forecasting. This is where you get your Data Science layer. These are the smaller, less used, but important processes and ways of turning this data into something valuable.
Let’s see how roles in the data field map to these technology layers. The foundations of the data organization are built by the Data Engineering team. In this role, you would interface with the teams that manage where data is created, including the Software Development and App Dev teams and the Systems teams who manage the systems provided by third-party vendors. You would also interface with the Analyst groups and the Data Science team, because they are your customers. You’re responsible for handling all the data coming into your platform, ensuring its quality, its accuracy, and its completeness, and then making it easy for all the analysts and scientists to consume that data and use it. From there, data analysts take this data, apply business logic, and add some relevant context to help teams understand what’s going on. Then they present it to the front-line teams in a way that makes their workflows more efficient and helps them find new opportunities or ways to grow.
I’ve seen this work best when the data presented lines up directly with the tools the team is using. For example, if you’re supporting the sales team and you want to integrate data into their workflow, you will likely want that data delivered directly into their customer relationship management (CRM) tool. This way, they don’t have to step outside their normal workflow. Their job is to make sales. They shouldn’t have to leave that process to use the analysis and benefit from it. It needs to be directly integrated into whatever tools and platforms they use to do their work. Data analysis is the easiest role to get started in. The tools for it are becoming very accessible and easy to learn. If you have basic Excel skills, you can quickly transition to a lot of the powerful tools that data analysts use regularly. This leaves our data scientists with an easier job of preparing the forecasts, the predictions, and the more specific answers to complex questions.
I like to think of data scientists as the special forces on the team. When a problem has many angles that all need to be carefully considered in a model and the answer to a question needs to be statistically valid, you need a data scientist. The reason I say they’re the special forces is that most businesses don’t need a big group of data scientists. A lot of the business questions that you have on a day-to-day basis, or on an operational or tactical level, can be answered by a good data analyst. That is an effective, ground-level kind of role that I think a lot of people will be able to start with and then grow from. The data scientists are the ones handling the bet-the-company decisions, the ones that really take a long time to study because the answer needs to be super specific and accurate.
To be clear, I’m describing what I’ve witnessed in my career at all the companies I’ve consulted for and worked at. Other companies will have a different mix of these roles. An advanced company like Netflix will have a lot of data scientists working on things like complex algorithms integrated into their actual product feed, while a small retail company won’t need many people doing things at that high level of precision. More likely, what they’ll need are people on the analyst side answering questions much more quickly, with a much simpler look at things that will help guide their business. So both roles are valuable. And depending on the industry you’re in or how the company works, you will have a varying mix of these different roles within your data team.
This brings us to a role that’s often not discussed in the data team, but one I think is super important. This is the Data Product Manager. Now when I say product manager, I’m talking about data products. You can think of a data product as anything from a written report by an analyst or a data scientist answering a specific question, to a tool that supports a team and makes them more self-sufficient. Anything in between is what I would call a data product. The Data Product Manager is someone who binds this all together and is the glue of the group. They bring together the right people at the right time to make the product successful. A background in data would be very helpful because it gives them domain expertise. When working with end users and trying to translate what those users need into something for a data engineer, they would benefit from any kind of prior knowledge or experience in that area.
As a last bit, I want to mention that you’ll hear different terms for these roles. The terms I’m using are the most current ones, but they will change down the road. However, the functions and the concepts behind them are consistent. And those things have been around, as I mentioned earlier, for a long time. So let’s keep going on our journey here and dig into each one of these roles in more depth to understand what they do.