odpf/meteor — Metadata Extractor
Before getting into Meteor, let's build some context around its purpose and get familiar with some terms.
Metadata is a self-referential form of information: it describes the data contained in a data source without the need to go through all of it. In the Data Engineering ecosystem at Gojek, Kafka has been widely used for event streaming, and if you need to load real-time streaming data from Kafka into data stores, the topic information needs to be fetched first. The initial version of Meteor was meant to handle this specific job and had a few metadata extractors, such as topics, BQTables, BQDatasets, and protobuf.
The previous Meteor-as-a-service had its drawbacks, and with the new pipeline coming up, it was worth the non-trivial jump to a newer approach: a single solution for all the available data sources, whether a database, dashboard, data bucket, users, or anything else, that can process the metadata and easily sink it to the destination in whatever way the user wants.
TLDR: The old Meteor lacked support for all the different data sources, and there were many more features that could have been added, so the team moved on to build a plugin-based metadata extractor.
Terminologies: Some of the terms we need to be clear on to understand the entire working of Meteor (just trying to be beginner friendly, sharing the spoils of my naivety). These terms may have other meanings in a different project; here we talk about them w.r.t Meteor.
- Job: One task; here it refers to metadata extraction from one data source.
- Source: Our data source; it can be a database, dashboard, etc. One job in Meteor allows exactly one source.
- Processor: After the metadata is extracted, we may process it before further steps. There can be multiple processors in a single recipe.
- Sinks: The final destination(s) for the metadata we have extracted from the source. Meteor supports multiple sinks; examples are a simple log in the console or output through Kafka.
- Recipes: The instructions in a yaml/yml file that tell the extractor to perform a task; a recipe contains the information about the source, processors, and sinks.
So, as you may have gathered from the terminologies, to perform a job in Meteor we need to write a recipe.yaml file, which contains information about the source of our metadata, from its type to its configuration, as well as information about the processors and sinks.
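To make that concrete, here is a minimal sketch of what such a recipe might look like. The field names and config keys below are illustrative, not authoritative; the odpf/meteor docs define the exact schema for each source, processor, and sink plugin.

```yaml
# sample-recipe.yaml (illustrative sketch; verify keys against the Meteor docs)
name: sample-postgres-job
source:
  type: postgres                 # the single data source this job extracts from
  config:
    connection_url: postgres://user:pass@localhost:5432/mydb
processors:
  - name: enrich                 # optional; a recipe may list several processors
    config:
      attributes:
        team: data-engineering
sinks:
  - name: console                # multiple sinks are allowed
  - name: kafka
    config:
      brokers: localhost:9092
      topic: metadata-topic
```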
What Kind of Metadata is Being Extracted?
Different data sources require different kinds of metadata to represent themselves, and hence Meteor follows different models for the different kinds of metadata a source can provide. For example, Postgres, or any other database that stores data in the form of tables, returns a list of tables containing the metadata about each of those tables. Similarly, Metabase and Grafana return a list of dashboards with information about the charts on them.
Features
A few other features that set Meteor apart from the previous one are:
- More support for various data sources.
- The plugin-based approach, which gives the flexibility to add more extractors and sinks easily.
- Since it is written in Golang, there are no dependencies, only a single binary file, hence it is easy to use.
- Easy to set up and use, with a very well-planned CLI; it is really smooth to get info about all the features you need or to generate recipes for the various configs you might require.
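As a rough illustration of that CLI flow (the exact subcommands and flags may differ between versions, so treat these as assumptions and consult `meteor --help` and the odpf/meteor docs for the real interface):

```shell
# discover the available commands and plugins
meteor --help

# generate a recipe skeleton, then run the job it describes
meteor gen recipe
meteor run sample-recipe.yaml
```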
Features under Analysis
A few of the good features Meteor may pick up as it keeps growing:
- The ease with which data discovery can be addressed. For instance, suppose you are looking for some information about a particular database to which you have access but have not set up a workbench to explore. You might want a CLI that can help you explore the multiple dashboard services your org supports before actually setting one of them up.
- Setting up data lineage is also a future issue Meteor aims to work on, with the purpose of making cross-team information sharing about the resources in use easier, without the need to ask everyone about them.
About ODPF
The Open Data Platform, or ODPF, is an org on GitHub that hosts the open source initiatives from Gojek’s Data Engineering team and is open for everyone to contribute and collaborate.
Meteor is also open to open source contributors, so feel free to try out the project by following the docs at odpf/meteor.