We get it. It’s a confusing world out there in the land of data warehouses. And all you want to do is get your data analytics story going. Where do you even begin? You’ll find many articles out there about benchmarks (recorded in log-scale seconds…what’s that?) and I/O per second (IOPS) measurements. Meanwhile, you are comfortable with your Postgres database or your SQL Server. Why do you even need to move to a data warehouse? Can’t you just stick with what you have?
In this post, we are going to discuss some of the high-level points you should be thinking about when choosing a data warehouse. Here are some of the typical questions you might have about data warehouses:
Do I even need a data warehouse for analytics? Why can’t I just use the database I already have?
Yes, you absolutely need a data warehouse if you want to do analytics. First of all, your primary database is busy doing what it does. It’s running your business. It’s capturing orders or customer interactions or research data. You wouldn’t want to slow down your production environment with complex analytics queries, right? But why not just clone your database nightly and use that second copy (which won’t be busy at all) as your analytics database?
That will certainly help, but you’ll quickly find that your analytics queries are really slow. Here’s why: if you are using Postgres, MySQL, or SQL Server to run your business, you are running a transactional database (aka OLTP). A transactional database is designed to process inserts and updates of data very quickly and securely. That’s because these databases, under the hood, are all built around rows. Insert rows. Delete rows. Find rows. They store the data in rows at the lowest levels (in block storage) to make transactions run as fast as possible. So, for example, if you want to add a new customer record or process a new order, the database is really good and fast at that because you are operating on lots of columns but a small number of rows.
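To make that “lots of columns, one row” shape concrete, here is a minimal sketch of a transactional insert against Postgres using psycopg2 (the orders table, its columns, and the connection string are all hypothetical):

```python
# A minimal sketch of the OLTP pattern: one short transaction touching a single
# row across many columns. Table, columns, and connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Insert one new order: many columns, exactly one row.
    cur.execute(
        """
        INSERT INTO orders (customer_id, status, total, currency, created_at)
        VALUES (%s, %s, %s, %s, NOW())
        """,
        (42, "pending", 129.99, "USD"),
    )
# The `with conn` block commits the transaction on success.
```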
Analytics, on the other hand, is all about summarizing. You want your metrics, your sums and averages, your totals for the month and quarter. For those types of aggregate functions, you are going to be operating over a small number of columns but a large number of rows. For example, if you are totaling the sales for a given quarter, you will be adding up all the values in the sales amount column. If all the records are stored in row format, the database will need to fetch each row (at a low level, that means fetching a block of storage that contains that row), jump to that column within that block, and pull out the value. If there are 50,000 rows, that can mean fetching tens of thousands of separate blocks of storage and finding the column within each of those blocks. Sounds exhausting, doesn’t it?
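In query form, that quarterly-sales example looks something like the snippet below (table and column names are hypothetical). Notice that it only needs two columns, but every row:

```python
# The shape of the analytics query from the example above: one column summed
# across many rows. Table and column names are hypothetical.
quarterly_sales_sql = """
    SELECT SUM(sale_amount) AS total_sales
    FROM sales
    WHERE sale_date >= DATE '2021-01-01'
      AND sale_date <  DATE '2021-04-01'
"""
# On a row store, answering this touches every block that holds a sales row,
# even though only two columns (sale_date, sale_amount) are actually needed.
```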
That’s where data warehouses come in. Data warehouses are specialized databases designed to make analytics faster. They store data in column format, which means that all the data for a given column is stored together. Imagine all the values of a column packed together neatly in a package (one or more blocks), without data from other columns mixed in or getting in the way. Then the database can simply fetch a block from storage and walk through all the values sequentially, all in one place. If it sounds easier and faster, it is! So we definitely recommend choosing one of the many powerful data warehouses as the underlying data source for your analytics rather than trying to make your relational database do the job. You’ll be able to ask and answer your business questions faster, model bigger and more complex analytics, and ultimately make smarter decisions for your business.
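Here is a toy Python sketch of the two layouts (in-memory lists standing in for blocks on disk, not a real storage engine), just to show why totaling one column is a more natural fit for the columnar shape:

```python
# Toy illustration of the two layouts (not a real storage engine).
# Row-oriented: each record keeps all of its columns together.
rows = [
    {"order_id": 1, "customer": "a", "region": "east", "sale_amount": 120.0},
    {"order_id": 2, "customer": "b", "region": "west", "sale_amount": 75.5},
    {"order_id": 3, "customer": "c", "region": "east", "sale_amount": 310.0},
]

# Column-oriented: each column's values are packed together.
columns = {
    "order_id":    [1, 2, 3],
    "customer":    ["a", "b", "c"],
    "region":      ["east", "west", "east"],
    "sale_amount": [120.0, 75.5, 310.0],
}

# Totaling sales from the row layout means visiting every record and picking
# one field out of each...
total_from_rows = sum(r["sale_amount"] for r in rows)

# ...while the column layout hands us the one array we care about, already
# packed together, and we just scan it sequentially.
total_from_columns = sum(columns["sale_amount"])

assert total_from_rows == total_from_columns == 505.5
```

Real column stores layer compression and block skipping on top of this, but the core idea is the same.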
So which data warehouse platform do I choose?
In this post, we’ll be discussing three data warehouse options: Google BigQuery, Amazon Redshift, and Snowflake. There are others, but those are the big ones currently. In order to choose the right one, you’ll need to ask yourself some questions. How important is it for you to save every last millisecond (or in some cases microsecond) of query time? What do you plan to do with the data, and where are your data visualization tools running? How will you get data in and out of your warehouse? We aren’t going to discuss all of those topics in detail in this post, but here are the key things to keep in mind:
Location, location, location! If your apps, other services, and, more importantly, your relational database are running on Amazon Web Services, you probably want to keep things in the neighborhood and go with Redshift or Snowflake (which primarily runs on AWS). It obviously makes sense to keep your services together and avoid transporting data all over the internet. It can slow things down or, worse, be expensive. Likewise, if you have standardized on Google Cloud Platform (many large customers have, including Apple and Snap), then BigQuery may be the obvious choice. In general, if your cloud strategy involves more than one or two cloud infrastructure and database vendors, it’s probably more complicated than it needs to be.
And then there is the tradeoff between having the ability to tune the data warehouse to eke out every last drop of performance versus having it just generally work well and fast enough. We’ve found that BigQuery is the best choice if you need a good data warehouse that just works for most small and medium cases. You load your data in and you are good to go. There aren’t a lot of knobs for tuning or any special database knowledge required. On the other hand, Amazon Redshift has lots of tunables and tends to outperform BigQuery most of the time. These types of comparisons are easily found online, but they boil down to that main point.
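As a rough sketch of that difference (the project, bucket, dataset, and table names below are hypothetical): in BigQuery you typically just load the data and start querying, while in Redshift you also choose how each table is distributed across nodes and sorted on disk.

```python
# BigQuery: load the data and start querying; no physical layout to declare.
from google.cloud import bigquery  # pip install google-cloud-bigquery

bq = bigquery.Client()
load_job = bq.load_table_from_uri(
    "gs://my-bucket/sales/*.parquet",          # hypothetical bucket
    "my_project.analytics.sales",              # hypothetical destination table
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()  # wait for the load to finish

# Redshift: the same table, but you also decide how it is distributed across
# nodes and sorted on disk, and those choices drive query performance.
redshift_ddl = """
    CREATE TABLE analytics.sales (
        order_id    BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        sale_amount DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);
"""
```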
Sometimes it’s important to be able to have fine-grained control over your data warehouse performance. These reasons include having extremely large datasets, really large or complex queries that need a little grease, or conversely having really small queries that you run so frequently that you want them to return as quickly as possible. In any case, we believe you can’t really go wrong with either BigQuery or Redshift in terms of data warehouse performance, and it’s other factors such as cost and co-location that you should be considering, based on your specific use case.
But there is a new kid on the block that’s getting a lot of attention and that’s Snowflake (okay it’s actually been around for several years but it’s growing quickly). It’s worth discussing this solution specifically because it’s quite different from some of the others, which leads to the next question we get a lot….
What’s so different about Snowflake and why should I consider it?
Amazon Web Services is the largest of the cloud providers, so it’s not surprising that its data warehousing product, Amazon Redshift, is also dominant. Redshift is widely used and popular. But it has one major limitation: it scales (and therefore prices) storage and compute together. So if you need more storage, you end up paying for more compute. And if you just want a boost in performance, you are going to end up paying for more storage.
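Here is a back-of-the-envelope illustration of what coupled scaling means for the bill (the node size and price below are made-up placeholders, not real Redshift numbers):

```python
# Toy arithmetic only: node specs and price are hypothetical, not real
# Redshift figures. Check current node types and pricing for real numbers.
import math

storage_per_node_tb = 2.0     # hypothetical storage per node
price_per_node_hour = 1.00    # hypothetical $/hour per node

data_size_tb = 12.0           # you only need more *storage*...
nodes_needed = math.ceil(data_size_tb / storage_per_node_tb)  # -> 6 nodes

# ...but because storage and compute come bundled per node, the monthly bill
# scales with node count whether or not you use the extra compute.
monthly_cost = nodes_needed * price_per_node_hour * 24 * 30
print(nodes_needed, monthly_cost)  # 6 nodes, $4,320/month in this toy example
```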
Along comes Snowflake, where the basic idea is: we can also run on AWS, but we can do it better and cheaper. They do this by decoupling compute and storage and letting you scale, and pay for, each independently. When you need more compute, Snowflake spins up more compute on Amazon EC2, and when you need more storage, it uses Amazon S3.
With Redshift, you have to decide ahead of time how big your warehouse needs to be, whereas with Snowflake you can decide later, or never: it’s always as big as it needs to be. Perhaps your business needs a whole bunch of analytics horsepower after the holiday season to crunch some big numbers, but the rest of the year your requirements are more modest. Snowflake allows small companies to burst up to really large warehouses and then spin them down just as quickly. It allows David-sized firms to have Goliath-sized warehouses only when they need them. That’s super powerful. (It should be noted that Google BigQuery also separates storage from compute, and did it first, starting back in 2010. But as mentioned above, sometimes you just want to keep it all in the neighborhood and all on AWS.)
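That burst-up-then-spin-down pattern comes down to a couple of SQL statements in Snowflake. Here’s a rough sketch using the snowflake-connector-python package (the account, user, and warehouse name are hypothetical placeholders):

```python
# A rough sketch of "burst up, then spin back down" in Snowflake.
# Account, credentials, and warehouse name are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical
    user="analytics_user",       # hypothetical
    password="...",              # use a real secret store in practice
)
cur = conn.cursor()

# Scale the warehouse up for the post-holiday crunch...
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE'")

# ...run the heavy quarter-end queries here...

# ...then drop back down and let the warehouse suspend itself when idle, so you
# stop paying for compute while the data itself stays put in S3.
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300")

cur.close()
conn.close()
```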
Snowflake ends up being better than Redshift in many cases in terms of performance and storage costs. And in general, it’s better at giving you control over compute and storage resources and the ability to scale.
In summary, we like all of these options and believe all have their place in the ecosystem. Like everything else it comes down to your specific use case and data sets.
At Hashpath, we’re here to help. Send us a note at hello@hashpath.com. We’d love to chat about anything data-related!