First 12 months activity summary

September 19, 2021

Consulting work

  • Data integration project in New Zealand: working in an AWS environment to prepare data ingestion from a replicated hospital PAS into an analytics data warehouse. SQL Server, DXC WebPas
  • Data integration project in Zurich: working on customer cloud VMs to prepare data ingestion from a hospital PAS into an analytics data warehouse. PostgreSQL, custom PAS
  • Supporting a software delivery project with SA Health
  • Large-scale platform design, integration and delivery project, integrating data from 30+ hospitals across multiple organisations. Working closely with Microsoft Consulting Services Europe to design the solution architecture for an Azure platform that ingests data from on-premises systems into Azure, processes it, and publishes analytics back to on-premises clients. Building technical teams to support engineering and delivery.

Secret Labs development

Secret Labs have been working on a product called DataStream. This is an incubator project that is more research and design than a product build. The intention of the project is to explore contemporary technologies that support data movement and processing. It is part research and part internal professional development, allowing Secret Labs staff to get hands-on experience working in depth with modern data, clustering and microservices technologies by building supporting tools.

The first iteration of DataStream was a Python-based back end using Pandas for schema-less data wrangling. This iteration was a proof of concept to expose some of the challenges of building an ETL tool. The Python-based DataStream Engine implemented a dependency DAG (Directed Acyclic Graph) that was calculated from the tasks within a job, and a scheduler was implemented using a cron library. Multiple front-end UIs were developed, including a plain CLI and an advanced CLI with auto-complete and curses-based terminal management. Most UI effort went into an Electron-based client built with React, using gRPC for client-server communication. This client implemented a basic code editor and a visual designer for describing an ETL job in YAML with schema validation, as sketched below. Additional features included job scheduling, user management and user profile settings.
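
To make the shape of this concrete, here is a minimal sketch of a YAML job being loaded and its tasks ordered through a dependency DAG. The job format, field names and tasks are illustrative assumptions, not the actual DataStream schema, and Python's standard-library graphlib stands in for the engine's own DAG calculation.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+
import yaml                             # pip install pyyaml

# Hypothetical job definition for illustration only.
JOB_YAML = """
name: nightly-warehouse-load
schedule: "0 2 * * *"   # cron expression handed to the scheduler
tasks:
  - name: extract_pas
    depends_on: []
  - name: clean_patients
    depends_on: [extract_pas]
  - name: clean_admissions
    depends_on: [extract_pas]
  - name: load_warehouse
    depends_on: [clean_patients, clean_admissions]
"""

job = yaml.safe_load(JOB_YAML)

# Build the dependency graph: task name -> set of upstream task names.
graph = {task["name"]: set(task["depends_on"]) for task in job["tasks"]}

# static_order() yields tasks with every dependency ahead of its dependants,
# and raises graphlib.CycleError if the job is not actually acyclic, so a
# bad job is rejected at load time rather than mid-run.
print(list(TopologicalSorter(graph).static_order()))
# e.g. ['extract_pas', 'clean_patients', 'clean_admissions', 'load_warehouse']
```

Computing the whole execution order up front, rather than discovering dependencies at run time, is what lets an engine like this reject cyclic jobs before anything executes.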

During mid-2021, work commenced on integrating the application to run on a Kubernetes cluster. The key driver for this was a shift away from using Python for the data processing. Large enterprises use tools such as Kafka and Spark for processing data; these tools generally require Kubernetes clusters and only run on Linux, which conflicts with a primary design goal of DataStream: being able to run on a Windows server or a developer's desktop. DataStream is essentially trying to fill the gap between SQL Server Integration Services and tools like Flyte. We have experimented with running Kubernetes clusters on products like minikube on the desktop. Minikube is not a production-worthy environment, but if a cluster runs on minikube it will be portable to other cluster environments, making it a useful stepping stone into the world of Kubernetes clusters.
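
As a rough illustration of that portability, the official Kubernetes Python client can load configuration the same way against a local minikube cluster as against any other cluster; only the kubeconfig differs. This is a hedged sketch, not code from DataStream:

```python
# Configuration loading with the official Kubernetes Python client
# (pip install kubernetes). Nothing here is minikube-specific: the same
# code targets a local minikube cluster via ~/.kube/config and falls back
# to in-cluster credentials when running inside a pod.
from kubernetes import client, config

def make_api() -> client.CoreV1Api:
    try:
        config.load_kube_config()       # desktop: `minikube start` populates this
    except config.ConfigException:
        config.load_incluster_config()  # inside a pod: mounted service account
    return client.CoreV1Api()
```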

This work with Kubernetes has changed the direction of DataStream. Ideas had been emerging for DataStream to integrate with the Kubernetes API to expose monitoring and management in the app, and potentially to manage deployments to other cluster environments, supporting, for example, the model of developing on a single machine and deploying into a cloud cluster. As a result, the scope of DataStream has been significantly refactored, and we are currently exploring where it fits in the broader data platform landscape.
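
As one example of what that monitoring integration might look like, the sketch below uses the official Kubernetes Python client to summarise pod status. The "datastream" namespace is an illustrative assumption, not the app's actual deployment layout.

```python
from kubernetes import client, config

def pod_summary(namespace: str = "datastream") -> dict[str, str]:
    """Return pod name -> phase: the raw material for a monitoring view."""
    api = client.CoreV1Api()
    return {pod.metadata.name: pod.status.phase
            for pod in api.list_namespaced_pod(namespace).items}

if __name__ == "__main__":
    config.load_kube_config()  # swap for load_incluster_config() inside a pod
    for name, phase in pod_summary().items():
        print(f"{name}: {phase}")
```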

Some organisations, like Confluent and Databricks, have built large businesses around providing data management as a service, and it is not feasible to try to compete with such companies. DataStream may always remain a research and development project, but we hope to learn a lot from the experience. We intend to focus more on the tools that enable working with Kafka, Spark and other related technologies. A positive outcome from this project would be to spin off some useful microservices and other tools that others can use in their own work.

Follow-up blogs will focus on specific technical challenges that we encounter.

Thank you for reading.