Job titles in the field of analytics tend to have very scientific-sounding names: data analyst, data scientist, data engineer. The truth is that data artist might be just as fitting. Okay, okay, I can hear the moans through the internet on that one. But if you think about it, the goal of the data analyst is to find the deeper meaning in the data, to dream up questions worth asking and then to ask and answer them. The questions aren’t necessarily known ahead of time. And why would you want them to be? In that sense, an empty analytics dashboard is no more than a blank canvas that must be artistically crafted into form.
Poetry aside, this raises the question of how to design your data pipeline and warehouse to make this possible. And the first thing to consider is that data usually needs to be combined, condensed, or in some other way reorganized before it can be presented neatly and efficiently in charts and graphs. This is called transformation. Why is this necessary? Because the raw data in your database is typically organized in a way that is optimized for individual transactions (inserting new orders or customer information), not in the form of summary tables. In fact, transforming data ahead of time is usually necessary because condensing and summarizing data can be slow, and you don’t want to do it at the moment you are rendering your charts. (Imagine loading a dashboard that takes ten minutes or more to return results.)
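To make that concrete, here is a minimal sketch of a transformation, assuming a hypothetical raw orders table with order_id, customer_id, order_total, and created_at columns. The raw table is great for inserting one order at a time; the summary table is pre-crunched so a dashboard can read it instantly.

    -- Hypothetical example: roll raw, transaction-optimized orders up
    -- into a daily summary table that charts can query quickly.
    CREATE TABLE daily_sales_summary AS
    SELECT
        CAST(created_at AS DATE) AS order_date,
        COUNT(*)                 AS order_count,
        SUM(order_total)         AS total_revenue
    FROM orders
    GROUP BY CAST(created_at AS DATE);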
For decades, the conventional wisdom has been that you use ETL to move data between your database and your data warehouse. ETL stands for Extract, Transform, Load…in that order! In other words, you first extract your data from your database, transform it in some way so that it’s digestible, de-duped, and summarized, and then load it into your warehouse in condensed form. Then your data analyst, using data visualization tools, has an easier time generating charts from these summary tables. Sounds good, right? Well, actually this turns out to be the wrong approach for two major reasons.
The first reason is that the data analyst is often a separate person in an organization from the person who sets up ETL. Typically, a data engineer, not a data analyst, is responsible for designing the systems and processes for moving and storing data. This means that the data analyst would need to coordinate with the data engineer to get data summarized in different ways. It’s not an impossible hurdle, but it’s an obstacle.
The second reason, and perhaps the most important one, is that you want to put the design of the transformations in the hands of your data analysts (your data artists, remember). If you hand them only summary tables that have already been transformed, you are limiting the analysts to pre-canned and limited information. It’s almost as if you are giving them only answers, with limited ability to ask additional, creative questions. When you put all of your data in a warehouse and allow your analysts to experiment with ways to slice and dice it, you end up getting better insights into your data. This, in turn, allows you to make better business decisions.
Why We Like Matillion for ELT
Matillion is a platform that allows you to visually orchestrate the ELT process with drag-and-drop components that connect together to form a pipeline of data as it’s extracted, loaded, and transformed. More importantly, it works by first extracting your data from your database and loading all of it into your warehouse. Only then do you transform it and generate additional tables, such as summary tables, alongside the originals.
In other words, Matillion allows you to move all of your data into the warehouse first. Then, you can craft and design, re-craft and re-design, your transformed tables anytime you want without having to go extract and load any additional data. That’s classic ELT, and it’s quite powerful!
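To illustrate the ELT pattern (this is a generic sketch, not Matillion-specific, and the table and column names are hypothetical), the transformation step becomes just another query that runs inside the warehouse, because the raw tables are already there:

    -- ELT sketch: customers and orders are already loaded into the
    -- warehouse, so the transformation runs there as plain SQL.
    -- Changing your mind later means rewriting this query, not
    -- re-extracting or re-loading any data.
    CREATE TABLE customer_order_summary AS
    SELECT
        c.customer_id,
        c.customer_name,
        COUNT(o.order_id)  AS lifetime_orders,
        SUM(o.order_total) AS lifetime_revenue
    FROM customers AS c
    LEFT JOIN orders AS o
        ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.customer_name;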
Matillion is best for data analysts who are also data engineers, as it can be quite technical to configure and set up. However, it’s probably your best bet for very large datasets. It runs on a cloud instance in AWS, Google Cloud, or Azure. You spin up the instance when you need it and spin it back down when you don’t. You are only billed when it’s running, and it’s priced by the hour, not by the row as many other ELT/ETL tools are. That makes it a cost-effective solution when there is a lot of data.
Why We Like Looker and Specifically Looker Persistent Derived Tables (PDTs)
Okay, so suppose your data analyst is not also a data engineer. Suppose you want your analyst working with your data, but they aren’t going to be setting up your data pipeline. And perhaps they aren’t going to be meddling with AWS policies that spin up ETL tools or scheduling ELT jobs.
Looker is a data analytics platform that lets you explore, analyze, and share real-time business analytics easily. It has its own modeling language, called LookML, which is used to describe dimensions, metrics, and data relationships in general. It provides an additional abstraction on top of SQL so that you can describe how you want your data to be aggregated, summarized, and ultimately presented to the user via dashboards and visualizations.
For example, you can write LookML that creates a view using all customers and orders and shows the top customers for a given period of time in a so-called derived table. Yes, that ultimately translates to a set of SQL queries (and, by the way, Looker can seamlessly generate the SQL in many different dialects depending on your database connection). But the analyst gets to write in a high-level language, while Looker takes care of a lot of the low-level heavy lifting to generate new views.
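As a rough sketch (the view, table, and column names here are hypothetical, and the exact SQL will depend on your warehouse), a derived table along those lines might look something like this in LookML:

    # Hypothetical sketch of a LookML derived table: top customers by
    # spend, built from raw orders and customers tables in the warehouse.
    view: top_customers {
      derived_table: {
        sql:
          SELECT
            customers.customer_id,
            customers.customer_name,
            SUM(orders.order_total) AS total_spent
          FROM orders
          JOIN customers ON customers.customer_id = orders.customer_id
          WHERE orders.created_at >= CURRENT_DATE - 30  -- last 30 days
          GROUP BY 1, 2 ;;
      }

      dimension: customer_id {
        primary_key: yes
        type: number
        sql: ${TABLE}.customer_id ;;
      }
      dimension: customer_name {
        type: string
        sql: ${TABLE}.customer_name ;;
      }
      measure: total_spent {
        type: sum
        sql: ${TABLE}.total_spent ;;
      }
    }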
But the real magic comes with Persistent Derived Tables (PDTs). You can use the same LookML to indicate that you want your abstraction written back to the data warehouse as a cached table. In doing so, it becomes a persistent transformation.
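In LookML terms, making a derived table like the one above persistent is roughly a matter of adding a persistence parameter to the derived_table block (the trigger below is a hypothetical, daily-rebuild example):

    # Hypothetical sketch: adding a persistence setting turns the derived
    # table into a PDT. sql_trigger_value rebuilds the cached table when
    # its result changes (here, once a day); persist_for and
    # datagroup_trigger are other ways to control the refresh schedule.
    derived_table: {
      sql:
        SELECT customer_id, SUM(order_total) AS total_spent
        FROM orders
        GROUP BY 1 ;;
      sql_trigger_value: SELECT CURRENT_DATE ;;
    }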
In this way, Looker places the analysts in the middle of the ELT pipeline. It gives them tools to create views (aka derived tables) on a whim and to write them back as new transformed tables to the data warehouse. And then those persistent tables can be updated on a schedule, also controlled by the data analyst. In short, they can transform the data any way they want, change their minds and ask new questions any time they want, yet benefit from the performance of having the data cached in pre-crunched tables. We think this is a unique and wonderful tool for analysts.
We recommend you learn more about the differences between ELT and ETL. We’d love to chat about anything data-related! Send us a note at hello@hashpath.com