Extensible telematics data pipeline — for time-series data visualization
Recently while working on a project I was delighted to learn about multiple fleet management platforms but surprised to see that all of them suffer from similar problems. Like holding your data hostage, limited API support, archaic architecture not taking advantage of the latest advancements in time-series DBs to process data in a more efficient manner.
Let's get terminology straightened out:
- Geotab — they provide OBDII dongles and a platform to store and analyze data
- InfluxDB — to make more sense out of time-series data
- Grafana — Data visualization tool
Why to even embark on this journey? you may ask, few reasons
- They store all your data in their cloud
- They use MySQL to store the data
- Their platform gives you fleet management capabilities but enough for your data analytics team
So how do we build this pipeline:
- Leverage Geotab .NET API adapter to pull data
- Save data into your MYSQL DB (finally, you have control over your data), this could be counter-intuitive like why should you replicate data that is already existing:
- You will have full control over the data once it's in your DB
- You will have the audit history of tables that get updated regularly - Create an InfluxDB instance
- Use python MYSQL API to retrieve data and then use InfluxDB 2.1 API to store in time-series format
- Use Grafana and Flux to process and visualize data
Simple! yes, but there is a lot more involved if you would like to do it right. Let’s dive deep
To future proof this pipeline I decided to leverage cloud computing:
- EC2 — to host the Geotab API Adapter
- AWS RDS MYSQL — a self-managed MYSQL instance (a few important gotchas on it later on)
- EC2 + EBS — to host the MYSQL to Influx adapter, Grafana server, and InfluxDB
- VPC and Security groups to avoid unwanted lurkers
Gotcha’s
- AWS RDS MYSQL could be very expensive especially due to IOPS when doing bulk data transfer. To keep the cost low transfer in chunks
- Make sure to configure your InfluxDB so that DB is located on EBS else if the EC2 instance goes down you’ll lose all your time-series data
- Work backward from your analytics needs and store specific data into Influx
- If you are confused about what should be the tags and what should be the fields in Influxdb, you are not alone. Just remember Keys are indexed and fields are not. For every tag declared you’ll have the same copy of the data.
- Flux could be alien-like when coming from SQL but use the InfluxDB GUI and the query builder to get the hang of it.
- As for Grafana + Flux, the “pivot” function is your best friend
One final note, if done right this can be a pipeline to get any data into InfluxDB. I was able to extend it to another telematics solution in less than a few hours. Where most of the time was spent understanding the data structure of the new data source and what to bring into the time-series database