This blog post is an account of my first six months at Collectors building a data platform from scratch. It is part memoir, part instructional manual for data teams embarking on a “build a data platform” journey. Most of it is based on facts. All of it is based on my own personal experience and opinions.

Music to my ears

“Yeah, it’s entirely green field. You’d be building a brand new data platform from scratch.” I was sitting outside a coffee shop in New York’s Lower East Side, my phone balanced against a sugar jar on the table, trying to make sure Dan, the CTO of Collectors, could see more than just my chin on our video call. I hadn’t exactly picked the best location to talk about a possible future job with my possible future boss, but sometimes you just have to make things work.

Based on Dan’s overview of the current state of data at Collectors, there was a huge opportunity to build something new and exciting and help make the organization more data-driven. I started getting giddy, brainstorming possible solutions with Dan, rattling off the list of “modern data stack” type tools I’d been working with for the past few months. After spending several years building a data platform and analytics pipelines with healthcare data, followed by a short stint at an open source data quality tool company, I had taken the summer off to re-orient myself, learn more about current data tooling, and figure out where exactly in the vast space of “data” I wanted to position myself. Maybe this was my opportunity to finally “do the thing”, put all my learnings into action, and build something from the ground up?

Alright, what we got here?

Fast forward a month, and I’m here at Collectors to actually “do the thing”. Can’t be too hard, can it? I start talking to an enormous list of stakeholders to get a picture of the current landscape of data reporting across the organization, as well as understand the world of collectibles – also known as The Hobby – that’s entirely new to me.

Here’s what I’m looking at: Most of the data from our production databases, like the grading system and our websites, is currently stored in on-prem Microsoft SQL Server instances. We run scheduled stored procedures to generate reporting tables, and use SSRS (SQL Server Reporting Services) to create fairly static “reports” (think tables with numbers and a date filter) that are served via a website. In addition, a number of tech-savvy users frequently run SQL queries against the production databases (I screamed just a little here) to create elaborate Excel spreadsheets. Manually. Every week. (I screamed a little more here.) Data requests are submitted to the IT team, which reroutes them to the Database Admin (DBA) team, who in turn either create new stored procedures and reports, or run ad hoc queries and send over numbers or spreadsheets via email. This is all done for different brands and teams across Collectors with acronyms such as PSA, PSA/DNA, and PCGS, which are slowly starting to make sense to me. Oh, and by the way, we also acquired this other company that keeps its data in an entirely separate PostgreSQL database that we haven’t really looked at yet – good luck!

I try not to be too judgmental when it comes to technology choices. There’s usually a good reason for why a tech stack or workflow turned out a certain way: it was the only thing available at the time, it grew organically, historical reasons. Also, I often work with the people who built those systems, and talking down the work someone did is never a good way to start off a relationship – you gotta have some empathy, man. In this case, it was clear that the system had grown organically out of a need to just get some numbers, but it no longer served the requirements of a fast growing organization that wanted to consistently use data in its decision making processes. Both the technology and the workflows meant that the “data producers” (the application teams and DBA team) were removed from the actual use cases and often had to “throw data over the fence”, whereas the “data users” had little ability to drill into data, slice and dice it any way they wanted, or get deeper into analyses that tackled more complex questions than just “getting numbers”.

To the drawing board!

After several weeks of conversations, endless pages of notes, diagrams, and a lot of confusion about card and coin terminology (wait, what is a “flip” again?), I started to form a clearer picture of the data flow across Collectors as well as some key data needs, which would inform the design of our centralized data platform. First of all, what did we actually want to accomplish and why? Well, as with any organization, Collectors constantly tracks metrics in order to inform business decisions: How many items of which type are collectors submitting to us? How fast are we grading them? Are there any new trends emerging in the world of collectibles that we should be prepared for? Do our turnaround times for customer service requests meet our standards? Rather than using anecdotes or “gut feelings” to answer those types of questions, we needed a reliable, centralized source of information. What was currently buried in various reports and Excel spreadsheets should become a collection of dashboards and ad-hoc analyses using a modern Business Intelligence (BI) tool that would allow users to filter, sort, and drill down into the data. The dashboards would be built on top of a reusable and well documented metrics layer, which would be fed by clean and validated data that’s updated periodically. Sounds pretty neat, right?

I then started sketching out some key principles for the architecture of the data platform: Most importantly, we had to avoid hitting the production databases as much as possible. I had already gone through the rite of passage of slowing down the production system for several minutes with a long-running query, and I was assured that this wasn’t the first time that had happened. Second, as a one-person data team (for the time being), I wanted to do as little infrastructure maintenance as I could get away with. This was part of the reason why I chose to use “modern data stack” type tools, a term that usually refers to a set of data infrastructure tools that are often open source first, with paid cloud-based SaaS (Software as a Service) versions. Think software like dbt (data build tool) and Apache Airflow. And finally, while I didn’t have particularly strict budget limitations, I had to keep an eye on cost and make sure it was reasonable and defensible for a platform that would be a “work in progress” for a while.

Since there were no clear needs for real-time data analytics, I settled on building a fairly traditional ELT (Extract-Load-Transform) type data pipeline with the following components:

  • Data extracts to move data from our production databases and third party systems into a centralized data warehouse
  • The data warehouse itself for storage and compute
  • Data transformations to turn the raw data into de-normalized tables and aggregated analytics tables
  • Optional, once we need it: a workflow orchestration tool that connects all the parts of the pipeline and triggers each process once the previous one completes. More about that later.
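To make the transformation step a bit more concrete, here’s a purely illustrative sketch of a dbt-style model that rolls raw submission records up into a daily analytics table – all table and column names here are hypothetical, not our actual schema:

```sql
-- models/marts/daily_submissions.sql (hypothetical example)
-- Aggregates cleaned-up submission data into one row per day and item type.
select
    date(submitted_at)   as submission_date,
    item_type,
    count(*)             as submission_count,
    avg(grading_days)    as avg_grading_days
from {{ ref('stg_submissions') }}
group by 1, 2
```

In dbt, `ref()` points at another model, so an aggregate like this automatically runs after the staging model it depends on.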

For each of these components, we shortlisted a few candidates and embarked on several weeks of demos and vendor calls.

Get in <data person>, we’re going shopping!

Oh, the vendor calls. This was definitely one of the tasks I had not expected to be part of my role as a regular ol’ data engineer: evaluating tools, talking to vendors, putting together a budget, and going through security reviews of every component. How many records are we going to churn through in our data warehouse? What exactly does “SOC2 compliant” mean? Can we justify spending $60,000/year on a BI tool that will likely require us to hire a specialist to get any value out of it? (Spoiler: No.) Despite my experience building and maintaining data pipelines for years, this was an entirely new process to me, and it definitely posed a challenge for my engineering mindset of “getting stuff done”. Heads up, “data team of one” type folks, this may be you, too! I will be forever grateful to my colleague Colin Andrade, head of BI and Analytics at our sister company Goldin Auctions, who coordinated and led a lot of these vendor calls and let me watch and learn. After several weeks of sample projects, occasionally frustrating sales calls (“Can you please just put us in touch with a technical person… please?”), Google spreadsheets and pro/con docs, we finally made a decision for our data stack. Here’s our final selection:

Stitch for automated data extraction from our production databases and loading into our data warehouse. We also evaluated Fivetran and Hevo, and landed on Stitch for pricing and “it’s the other big name” reasons.

Google BigQuery as our data warehouse. We were also considering Snowflake as one of the most popular options in the market, as well as Amazon Redshift, since a lot of our production infrastructure already lives in AWS. This was a long back and forth, but we eventually settled on BigQuery due to the low maintenance effort and comparatively low estimated cost.

dbt for data transformations. Aside from writing a lot of elaborate SQL queries, there didn’t seem to be a real alternative to dbt. Plus, their documentation and community support are unparalleled – this was a no-brainer.

Tableau as our BI tool. This was probably the most controversial decision, and we evaluated several other tools in the space before settling on Tableau. Looker and Mode didn’t make the cut due to pricing considerations – we’re looking at $40,000–$60,000 per year for a “work in progress” data platform, for which even simple filtering and sorting capabilities would be a huge upgrade over the current reporting platform. Additionally, Looker has a steep learning curve and requires learning LookML to get value out of it – as a data team of one, this was not really an option for me. This is not to say that we won’t consider these more powerful tools in the future, but for the time being, we needed something simpler. Other contenders were Metabase, Power BI (which we stopped evaluating when we found out there’s no desktop app for our macOS laptops), and a fairly impressive new platform called Glean, which felt just a little more targeted at analysts than what we were looking for. In the end, Tableau made the cut due to its per-seat license pricing model, the fairly quick learning curve, and existing in-house experience on our team.

We have liftoff!

Aah, finally, we have a decision! We’ve picked our stack, the budget is signed off, my knuckles are cracked, let’s start building some pipelines! But, not so fast. I ran into several other roadblocks before I could really start building something. It took us a little while to iron out administrative details such as who in the organization should own all the accounts, which credit card to use, who to send the invoices to, and how many licenses to get. Note to self: This is a great thing to figure out early on when embarking on such a project. Next up, we had to get into our technical starting blocks and work through things like production database access via an SSH tunnel, which required a lot of back and forth with the Stitch team, and some seriously strong nerves from our database admin and system engineering folks. Oh, and don’t forget that we also needed user groups with the right permissions for BigQuery in Google Cloud Platform. Pro tip: If you send your colleagues a giant box of snacks when they do some heavy lifting for you, they’re more likely to forgive you.

Alright, now we’re ready to roll! With the help of our DBA team, I took an existing “daily production” report from the reporting portal, and worked backwards through the existing stored procedures that generated the report to create a shortlist of relevant tables we wanted to integrate from the production databases. In short: I read a lot of SQL and tried to reverse-engineer what was going on. I then connected those tables to Stitch (after having gone through the SSH tunnel exercise I described in the previous paragraph) and hit “Extract” to pull data and load it to BigQuery. Seeing the tables being created in BigQuery and having the first rows of data show up gave me strong “open the floodgates!” vibes (for my fellow Germans: “Ozapft is!”) and felt like an exciting milestone. 

Next up, I started writing staging models in dbt for some light column renaming (snake_case all the way), created some dim and fact models for key entities, and attempted to reproduce the existing reporting query in dbt. The dbt run output lives in a dataset (schema) in BigQuery, which we called our “production” schema. Finally, I connected the resulting table to Tableau and created a first visualization, which I proudly showed to the folks on my team. And this is the magic of the “modern data stack” and SaaS products: Once everything was actually signed off, it took me less than a day from start to finish to build an entire pipeline and data visualization. 
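To give a flavor of what those staging models look like, here’s a minimal dbt staging model along the lines of what I was writing – the source and column names are made up for illustration, not our real schema:

```sql
-- models/staging/stg_submissions.sql (hypothetical example)
-- Light cleanup only: rename the raw SQL Server columns to snake_case
-- and cast types; no business logic lives in staging models.
select
    SubmissionID                    as submission_id,
    CustomerID                      as customer_id,
    ItemType                        as item_type,
    cast(SubmittedAt as timestamp)  as submitted_at
from {{ source('grading', 'Submissions') }}
```

Downstream dim, fact, and reporting models then build on these staging models via `ref()`, which is what makes the renaming reusable instead of repeated in every query.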

In part 2 of this blog post series, we’ll take a closer look at the next stage of the data platform project: iteration and validation.

We’re hiring at Collectors! We’ve got open roles on the Data Platform team and many other teams: