Welcome to my portfolio

Data Engineer building scalable data systems

I design, build, and optimize data pipelines, cloud platforms, and analytics systems that power modern businesses.

About Me

Passionate about data engineering and building reliable systems

I'm Victor Kipruto Rop, a Data Engineer who turns raw data into scalable, reliable systems. I design and implement robust data infrastructure, specializing in building ETL/ELT pipelines, managing big data ecosystems, and optimizing data workflows for enterprise-scale applications.

My approach focuses on reliability, scalability, and maintainability. I believe in solving complex data problems with elegant solutions and fostering data-driven cultures within organizations.

Data Architecture

Designing and implementing scalable data systems

Pipeline Engineering

Building robust ETL/ELT processes

Cloud Data

Leveraging AWS and GCP for data solutions

5+ Years Experience

25+ Projects Built

10+ Technologies

By The Numbers

Impact and achievements

Data Pipelines Built

Data Processed

Uptime Achievement

Cost Savings

Engineers Mentored

Tech Stack Expertise

Technical Skills

Technologies and tools I work with daily

Languages

Expert

Python SQL Scala Java Bash

Data & ETL

Expert

Airflow Spark Kafka dbt Hadoop

Databases

Advanced

PostgreSQL BigQuery Snowflake MongoDB Redshift

Cloud Platforms

Expert

AWS Google Cloud Azure Databricks

DevOps & Tools

Intermediate

Docker Kubernetes Git Jenkins Terraform

Visualization

Advanced

Tableau Power BI Grafana Looker

Featured Projects

Real-world solutions and case studies

Featured

Real-Time Data Pipeline

Built a high-throughput streaming data pipeline using Apache Kafka and Spark Streaming to process 1M+ events per minute with fault tolerance.

1M+ Events/min

99.9% Uptime

Kafka Spark AWS

View on GitHub

Production

Data Warehouse Modernization

Migrated a legacy data warehouse to Snowflake, reducing infrastructure costs by 40% and making queries 5x faster.

40% Cost Reduction

5x Faster Queries

Snowflake dbt Python

View on GitHub

Architecture

Data Lakehouse Design

Designed and implemented a data lakehouse solution using Delta Lake and Databricks for unified analytics across business units.

4 Business Units

100TB+ Data Volume

Databricks Delta Lake Spark

View on GitHub

Orchestration

Workflow Orchestration

Developed an Apache Airflow-based orchestration framework with custom operators, monitoring, and alerting that manages 200+ daily workflows.

200+ Daily Workflows

99.9% Uptime SLA

Airflow Python Docker

View on GitHub

Analytics

Analytics Platform

Engineered a scalable analytics platform processing terabytes of data with sub-second query response times using BigQuery.

<1s Query Time

10B+ Records

BigQuery GCP SQL

View on GitHub

Governance

Data Governance Suite

Implemented comprehensive data governance with metadata management, quality checks, and compliance monitoring across the enterprise.

500+ Data Assets

100% Compliant

Great Expectations Data Catalog Python

View on GitHub

Case Studies

In-depth technical deep dives into major projects

Enterprise

Building a 1M+ Events/Min Real-Time Pipeline

Challenge: Legacy batch processing couldn't handle real-time ingestion from multiple sources at 1M+ events per minute.

Solution: Architected Kafka + Spark Streaming infrastructure on AWS with auto-scaling and fault tolerance.

✓ 99.99% uptime ✓ 10x throughput increase ✓ $500K annual savings

Kafka Spark AWS Kubernetes

Data Warehouse

Migrating Legacy DW to Cloud (40% Cost Reduction)

Challenge: On-premises data warehouse costing up to $2M annually, with complex legacy queries and 500+ dependent reports.

Solution: Designed and executed a migration to Snowflake with dbt transformations, automated testing, and a zero-downtime cutover.

✓ 40% cost reduction ✓ 5x query speedup ✓ Zero-downtime migration

Snowflake dbt Python SQL

Architecture

Building a Data Lakehouse for 4 Business Units

Challenge: 4 business units with siloed data, inconsistent data models, no unified analytics capability.

Solution: Implemented Delta Lake + Databricks lakehouse with medallion architecture (bronze/silver/gold layers).

✓ Unified analytics view ✓ 100TB+ data volume ✓ 50% faster insights

Databricks Delta Lake Spark MLflow
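
The bronze/silver/gold layering named in the case study above can be illustrated with a small sketch. This is not the project's Delta Lake or Databricks code: it uses plain Python lists in place of tables, and the field names (`user_id`, `amount`) and the helper functions `to_silver` and `to_gold` are hypothetical, chosen only to show how each layer refines the one before it.

```python
from collections import defaultdict

# Bronze: raw events kept exactly as ingested, including bad rows.
bronze = [
    {"user_id": "u1", "amount": "10.5"},
    {"user_id": "u1", "amount": "4.5"},
    {"user_id": None, "amount": "3.0"},   # missing key
    {"user_id": "u2", "amount": "oops"},  # unparseable amount
    {"user_id": "u2", "amount": "7"},
]


def to_silver(rows):
    """Silver: drop rows that fail validation and cast fields to proper types."""
    clean = []
    for row in rows:
        if row["user_id"] is None:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append({"user_id": row["user_id"], "amount": amount})
    return clean


def to_gold(rows):
    """Gold: aggregate cleaned rows into a business-level table."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user_id"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'u1': 15.0, 'u2': 7.0}
```

Keeping the bronze rows untouched means the silver and gold layers can be rebuilt at any time if a validation rule changes.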

What Others Say

Feedback from colleagues and clients

Sarah Kim

CTO at DataTech Corp

"Victor architected our data warehouse migration project beautifully. His expertise in Snowflake and attention to detail saved us thousands in infrastructure costs."

Michael Peterson

Lead Data Scientist at Analytics Hub

"Working with Victor on the Airflow orchestration framework was a game-changer. The system is robust, maintainable, and well-documented."

Jennifer Lee

VP Engineering at CloudScale

"Victor's ability to solve complex data engineering challenges with elegant solutions is exceptional. A true professional to work with."

Latest Articles

Insights and best practices in data engineering

Data Engineering

How I Design Reliable ETL Pipelines

A deep dive into designing ETL pipelines that handle millions of records daily with zero data loss. Learn about idempotency, error handling, and monitoring strategies.

Jan 20, 2026 8 min read

Read Article
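
The idempotency idea mentioned in the article summary above can be sketched briefly. This is an illustration under stated assumptions, not code from the article: the `events` table, its columns, and the `load_batch` helper are hypothetical, and an in-memory SQLite database stands in for a real warehouse. The point is that a load keyed on a primary key can be replayed after a failure without creating duplicates.

```python
import sqlite3


def load_batch(conn, rows):
    """Write rows keyed by id; replaying the same batch leaves the table unchanged."""
    # INSERT OR REPLACE overwrites any existing row with the same primary key,
    # so a retried batch does not produce duplicate records.
    conn.executemany(
        "INSERT OR REPLACE INTO events (id, payload) VALUES (?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

batch = [(1, "a"), (2, "b")]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)  # 2
```

The same pattern applies to warehouse MERGE statements: as long as each record has a stable key, a retry converges to the same final state.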

Best Practices

Common Data Engineering Mistakes (And How to Avoid Them)

Five critical mistakes I've seen in data engineering projects, with actionable solutions. From schema design to resource optimization, here's what you need to know.

Jan 15, 2026 10 min read

Read Article

Cloud Architecture

Scaling Data Pipelines on AWS: A Practical Guide

Strategies for scaling data pipelines from thousands to millions of events per minute. Covers auto-scaling, cost optimization, and monitoring.

Jan 10, 2026 12 min read

Read Article

Code Snippets

Real examples from my projects

Python

Data Quality Check Framework


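from pyspark.sql.functions import col

# data_quality_check is a project-level decorator, defined elsewhere.
@data_quality_check
def validate_pipeline_output(df):
    """Fail fast on empty output or null ids, then return summary statistics."""
    assert df.count() > 0, "Empty output"
    assert df.filter(col("id").isNull()).count() == 0, "Null ids found"
    return df.toPandas().describe()

# spark_df is the Spark DataFrame produced by the pipeline.
result = validate_pipeline_output(spark_df)
# Row 0 of describe() is "count"; column 0 is the first numeric column.
print(f"Records: {result.iloc[0, 0]}")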
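
The snippet above uses a `data_quality_check` decorator without showing it. One possible shape for such a decorator is sketched below; this is an assumption about how it could work, not the framework's actual implementation. The logger name and the `non_empty` example check are hypothetical, and the sketch avoids Spark so it runs on its own.

```python
import functools
import logging

logger = logging.getLogger("data_quality")


def data_quality_check(func):
    """Wrap a validation function: log the outcome and re-raise failures."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except AssertionError as exc:
            # A failed assert inside the check is logged, then propagated
            # so the calling pipeline step still fails.
            logger.error("Quality check %s failed: %s", func.__name__, exc)
            raise
        logger.info("Quality check %s passed", func.__name__)
        return result

    return wrapper


@data_quality_check
def non_empty(rows):
    """Example check: the output must contain at least one row."""
    assert len(rows) > 0, "Empty output"
    return len(rows)
```

Calling `non_empty([1, 2, 3])` returns 3 and logs a pass, while `non_empty([])` logs the failure and raises `AssertionError`.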

SQL

Optimized ETL Transformation


-- 10x faster medallion transform
-- Keeps each user's busiest hour over the last 7 days (Spark/Databricks SQL).
SELECT
    user_id,
    window(event_time, '1 hour') AS time_bucket,
    COUNT(*) AS event_count,
    -- latency_ms is an assumed name for the latency column
    APPROX_PERCENTILE(latency_ms, 0.95) AS p95_latency
FROM raw_events
WHERE event_date >= current_date - 7
GROUP BY ALL
QUALIFY ROW_NUMBER() OVER
    (PARTITION BY user_id ORDER BY event_count DESC) = 1;

Airflow

Dynamic Workflow Generation


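from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# DATASETS (list of dataset names) and extract_data are defined elsewhere.
# start_date is a placeholder; Airflow needs one before it can schedule runs.
with DAG(
    "dynamic_etl",
    schedule="0 2 * * *",  # daily at 02:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    for dataset in DATASETS:
        extract = PythonOperator(
            task_id=f"extract_{dataset}",
            python_callable=extract_data,
            op_kwargs={"src": dataset},
        )
        load = SparkSubmitOperator(
            task_id=f"load_{dataset}",
            application="transform.py",
        )
        extract >> load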

Certifications & Credentials

Professional credentials and recognitions

AWS Certified Solutions Architect

Professional Level

Issued: March 2024 | Expires: March 2026

Verify →

Google Cloud Professional Data Engineer

Professional Level

Issued: January 2024 | Expires: January 2026

Verify →

Databricks Certified Data Engineer

Lakehouse Platform

Issued: November 2023 | Expires: November 2025

Verify →

Confluent Certified Developer

Apache Kafka

Issued: September 2023 | Expires: September 2025

Verify →

GitHub Activity

Open source contributions and code activity

Contributions This Year

1,247

Public Repositories

Total Stars Received

1,540

Languages Used

Top Repositories

data-pipeline-toolkit ⭐ 380

spark-optimization-guide ⭐ 245

airflow-patterns ⭐ 189

kafka-schema-registry ⭐ 156

View Full Profile →

Speaking & Publications

Talks, articles, and thought leadership

Conference Talk

Building Fault-Tolerant Data Pipelines at Scale

Oct 2024 Data Summit NYC

45-minute deep dive on architecture patterns, failure modes, and recovery strategies for streaming data pipelines processing 10M+ events daily.

Watch Talk →

Webinar Series

Data Engineering Best Practices (4-Part Series)

Aug 2024 2,400+ attendees

Comprehensive series covering schema design, testing strategies, monitoring patterns, and cost optimization for modern data platforms.

View Series →

Published Article

The Complete Guide to Data Lakehouse Architecture

Jun 2024 50K+ reads

In-depth guide covering medallion architecture, Delta Lake implementation, data governance, and migration strategies from traditional warehouses.

Read Article →

Conference Talk

Optimizing Spark for Cost & Performance

May 2024 Spark Summit

Technical workshop demonstrating tuning techniques that reduced query costs by 60% while improving execution time by 3x.

View Workshop →

Open Source Contributions

Projects I maintain and contribute to

data-pipeline-toolkit

Maintained

Production-ready Python framework for building robust, testable data pipelines with built-in monitoring and error handling.

⭐ 380 stars 🍴 45 forks 📦 10K+ downloads/month

Python Pytest SQLAlchemy

Repository →

spark-optimization-guide

Maintained

Comprehensive resource for optimizing Apache Spark jobs. Includes benchmarks, patterns, and real-world tuning examples.

⭐ 245 stars 🍴 32 forks 📚 80+ code examples

Scala Spark Jupyter

Repository →

Apache Airflow Contributor

Active Contributor

Regular contributor to the Apache Airflow project. 15+ PRs merged focusing on performance and reliability improvements.

📝 15+ PRs merged 💬 40+ issue discussions 🔧 Code reviewer

Python Airflow Kubernetes

Contribute →

Resume

Download my full professional resume

Victor Kipruto Rop

Senior Data Engineer | Cloud Architecture Specialist

Contact

📧 kiprutovictor39@gmail.com

📱 +254723484552

💼 GitHub | LinkedIn

Summary

Senior Data Engineer with 5+ years of experience designing and implementing scalable data systems. Expertise in ETL/ELT pipelines, cloud platforms (AWS/GCP), and modern data stack technologies. Proven track record of reducing infrastructure costs by 40% while improving system reliability and performance.

Core Competencies

Apache Spark Kafka Airflow Python SQL AWS Google Cloud Snowflake BigQuery Docker Kubernetes dbt

Certifications

✓ AWS Certified Solutions Architect - Associate

✓ Google Cloud Professional Data Engineer

✓ Databricks Certified Data Engineer

✓ Confluent Certified Developer

Download Full Resume (PDF)

Experience

My professional journey in data engineering

Senior Data Engineer

Tech Company • 2022 - Present

Leading data architecture initiatives, designing scalable systems for 1M+ daily users, mentoring junior engineers, and implementing best practices across data teams.

Architected cloud-native data platform on AWS
Reduced data processing costs by 35%
Built real-time analytics dashboard

Data Engineer

Analytics Startup • 2020 - 2022

Developed ETL pipelines, optimized warehouse queries, built data infrastructure from scratch, and established data quality frameworks for a fast-growing startup.

Built ETL pipelines handling 100GB daily
Designed snowflake schema for analytics
Implemented data quality monitoring

Data Engineer Intern

Fortune 500 Company • 2019 - 2020

Assisted in data pipeline development, learned industry best practices, worked with Spark and Hadoop clusters, and contributed to data warehouse optimization.

Optimized Spark jobs for 30% speedup
Developed Python data pipelines
Documented data lineage

Let's Build Something Great

I'm always interested in discussing data engineering challenges and opportunities

Email

kiprutovictor39@gmail.com

Usually responds in 24 hours

LinkedIn

Connect with me

Open to collaborations

Phone

+254723484552

Available for calls

GitHub

Check my code

Explore my projects
