
Data Pipeline to support ML deployed services in Google Cloud Platform


The Brand

The client is a higher education institution that is more than 50 years old. The institute offers undergraduate, graduate, and doctoral degrees in multiple streams, such as business management. An overwhelming majority of its students use the website and enrol in the online courses.


The Opportunity

The client wanted to set up a cloud-based solution to collect user-level data, store it in a data warehouse, and build pipelines that connect it to an ML model. The ever-increasing number of users on the website called for highly scalable and fault-tolerant data pipelines and connectors.


The Approach

Google Cloud Platform as the core infrastructure

Google Cloud Platform offers services that can be combined into a unified platform for running production ETL pipelines, ad-hoc analytics, and machine learning that auto-scales. Google Cloud compute services were used for offline analysis through scheduled jobs, and serverless computing was used to launch the real-time connectors.

Data Pipelining

Initially, the connector script ran once to pull historical data for the last year, and this data was used to train the model. The trained model is stored as pickle files in GCP. After that, on a daily basis, the connector fetches the data, cleanses it, and stores it in BigQuery. ML models running inside Cloud Run fetch the data from BigQuery and score it. The scoring output is uploaded to GA and also stored in BigQuery as columnar data. The whole process and its components were designed and deployed using Python.
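The daily cleanse-and-score step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the client's actual code: the feature names, the weights, and the logistic scoring function are hypothetical, and plain Python lists stand in for the rows that would be read from and written back to BigQuery.

```python
import math
import pickle

# Hypothetical model parameters; the real model and its features
# are not described in this case study.
MODEL_PARAMS = {"bias": -2.0, "weights": {"pageviews": 0.3, "session_minutes": 0.1}}


def cleanse(rows):
    """Drop rows without a client id and coerce feature values to floats."""
    clean = []
    for row in rows:
        if not row.get("client_id"):
            continue
        clean.append({
            "client_id": row["client_id"],
            "pageviews": float(row.get("pageviews") or 0),
            "session_minutes": float(row.get("session_minutes") or 0),
        })
    return clean


def score(params, row):
    """Return a logistic score between 0 and 1 for one cleansed row."""
    z = params["bias"] + sum(w * row[name] for name, w in params["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))


def run_daily_batch(model_blob, raw_rows):
    """Load the pickled parameters, cleanse the rows, and score each one."""
    params = pickle.loads(model_blob)
    return [
        {"client_id": row["client_id"], "intent_score": score(params, row)}
        for row in cleanse(raw_rows)
    ]


# Stand-ins for the stored pickle file and one day of fetched rows.
model_blob = pickle.dumps(MODEL_PARAMS)
raw_rows = [
    {"client_id": "a1", "pageviews": "10", "session_minutes": 5},
    {"client_id": "", "pageviews": "3", "session_minutes": 1},
    {"client_id": "b2", "pageviews": None, "session_minutes": None},
]
scored = run_daily_batch(model_blob, raw_rows)
```

Here the row with an empty client id is dropped during cleansing, and the remaining two rows each receive an intent score. In a deployed pipeline the input and output lists would be replaced by reads from and writes to the data warehouse.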

ML Implementation

The ML algorithm was released as an endpoint in GCP that could be called by authorized online tags. It was integrated with GA for further audience creation and reporting, and with Data Studio for model performance monitoring dashboards.
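As a rough sketch of how such an endpoint might accept only authorized callers, the Python below checks a token before scoring a request. Everything named here is an assumption made for illustration: the `X-Tag-Token` header, the token value, the feature names, and the weights are hypothetical, and the function is framework-neutral rather than tied to any particular GCP service.

```python
import hmac
import json
import math

# Hypothetical values for illustration only.
ALLOWED_TOKENS = {"tag-token-123"}
WEIGHTS = {"pageviews": 0.3, "session_minutes": 0.1}
BIAS = -2.0


def is_authorized(token):
    """Compare the token against each allowed token in constant time."""
    return any(
        hmac.compare_digest(token.encode(), allowed.encode())
        for allowed in ALLOWED_TOKENS
    )


def handle_score_request(headers, body):
    """Return (status_code, json_body) for a single scoring request."""
    if not is_authorized(headers.get("X-Tag-Token", "")):
        return 401, json.dumps({"error": "unauthorized"})
    try:
        payload = json.loads(body)
        features = {name: float(payload[name]) for name in WEIGHTS}
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "invalid payload"})
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    z = max(min(z, 50.0), -50.0)  # clamp to keep math.exp from overflowing
    return 200, json.dumps({"intent_score": 1.0 / (1.0 + math.exp(-z))})
```

Keeping the handler a pure function of headers and body makes it stateless, which is what lets a serverless platform run many copies of it in parallel.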


The Solution

Deployment 1 – Offline Model:


A Python-based connector hosted on the cloud runs daily and retrieves the latest clickstream data from Google Analytics / BigQuery. It provides this data as input to the ML model to calculate an intent score, then uploads the intent score back into GA as a custom dimension and into BigQuery. The offline model uses several Google Cloud Platform services: Cloud Functions, Cloud Scheduler, Compute Engine, and BigQuery.

Deployment 2 – Online real time model:


A Python-based real-time connector hosted on the cloud acts as a scalable service that calculates the intent score while the user is on the website.

Auto Refresh:

The GCP data pipeline is a managed ETL service that launches a virtual machine at a scheduled time each day to run the Python-based connector. The connector pulls data from the GA API, processes it, uses it as input to the ML model, and pushes the output to BigQuery and GA. The pipeline alerts admins by email if it fails or if any error occurs during the ETL run.
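The alert-on-failure behaviour can be sketched in Python as a small wrapper around the ETL steps. This is a minimal illustration, not the deployed implementation: the step names, the `notify` callback, and the recipient address are hypothetical, and the sketch builds the email message without sending it.

```python
from email.message import EmailMessage


def build_alert(step_name, error, recipients):
    """Build an email describing which step failed and why."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL pipeline failed at step: {step_name}"
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"{type(error).__name__}: {error}")
    return msg


def run_pipeline(steps, notify, recipients):
    """Run (name, callable) steps in order, passing each result to the next.

    On the first failure, hand an alert email to `notify` and stop.
    Returns True if every step succeeded, otherwise False.
    """
    data = None
    for name, step in steps:
        try:
            data = step(data)
        except Exception as error:
            notify(build_alert(name, error, recipients))
            return False
    return True
```

In use, `notify` could be any callable that delivers the message, for example the `send_message` method of an `smtplib.SMTP` connection, while a test can pass `list.append` to capture alerts without sending anything.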

