The Amundsen Project

The primary focus of my 8-week stint with the Data Engineering team at RigUp in the summer of 2020 was a data catalog project. A data catalog organizes large databases behind a platform that is more human-friendly and searchable. A couple of years ago, the rideshare company Lyft set out to minimize the time people spent searching for data by creating an open-source data cataloging tool named Amundsen.

Amundsen’s tag-based and search bar-based system makes it extremely easy to search through the tables in a database. Additionally, the large swaths of metadata that Amundsen collects allow the tool to display useful information, such as the most popular tables, usage statistics per table, and where a table was sourced or generated from.

Another intern and I, with the support of the Data Engineering team, were tasked with implementing a version of Amundsen that could be used by RigUp personnel, mainly the business intelligence analysts who were the target users for the tool. After using Docker to quickly set up Amundsen’s five microservices on a GCP virtual machine, most of the initial work focused on connecting the data pipeline from Snowflake to Amundsen’s graph database, Neo4j. This was done almost painlessly using pre-built connectors from Amundsen’s data builder library.

A cool part of the project that I got to handle was generating custom usage statistics from user data in Snowflake and ingesting those statistics into Amundsen. This powered the "popular table" and "usage data" features that I briefly mentioned earlier. The general solution was to take each user’s SQL queries and parse them for table names, counting the presence of an individual table name in a query as one read. Those reads were then aggregated into a dataframe and uploaded daily to a table in Snowflake that held all of the usage data. That table was then processed and read by Amundsen.
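
To make the read-counting idea concrete, here is a minimal sketch of the approach described above. It is not the code we ran at RigUp: the regular expression, the function names (`extract_tables`, `count_reads`, `to_rows`), and the `(user, query)` input shape are all assumptions made for illustration. It also uses plain Python structures where the real pipeline used a dataframe, and a simple regex like this one will miss or misread some SQL (for example subqueries, CTE names, or `EXTRACT(... FROM ...)`).

```python
import re
from collections import Counter

# Hypothetical pattern: capture the (optionally dotted) identifier that
# follows FROM or JOIN. This is a rough heuristic, not a real SQL parser.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w$]*(?:\.[A-Za-z_][\w$]*){0,2})",
    re.IGNORECASE,
)


def extract_tables(query):
    """Return the set of table names referenced in one SQL query.

    Using a set means a table mentioned several times in the same
    query still counts as a single read for that query.
    """
    return {name.upper() for name in TABLE_PATTERN.findall(query)}


def count_reads(query_log):
    """Aggregate read counts per (user, table) from (user, query) pairs."""
    reads = Counter()
    for user, query in query_log:
        for table in extract_tables(query):
            reads[(user, table)] += 1
    return reads


def to_rows(reads):
    """Flatten the counts into row dicts, ready to load into a usage table."""
    return [
        {"user": user, "table": table, "read_count": count}
        for (user, table), count in sorted(reads.items())
    ]
```

Calling `to_rows(count_reads(query_log))` on one day's query history would give one row per user and table, which is the shape a daily upload to a usage table needs.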

We also got the opportunity to use tools like Apache Airflow, CircleCI, and GCP to make the development and deployment of Amundsen a lot more efficient. By the end of the 8 weeks, RigUp's data catalog was ready to go and accessible internally.
