Hello Data Enthusiasts 👨‍💻!

I am Vedavyas. Thanks for dropping by my portfolio site!

As a passionate Data Engineer, I bring extensive experience in building scalable, efficient data pipelines and architectures that support complex data analysis and machine learning models. Skilled in a wide range of technologies including Python, SQL, Apache Spark, and Hadoop, I excel in designing and implementing robust solutions for big data challenges. My expertise extends to cloud platforms such as AWS and Azure, where I've deployed and managed data lakes and warehouses, optimizing for performance and cost-effectiveness.

Having graduated from Indiana University Bloomington in May 2025, I am preparing to begin my professional career in Cloud Data Engineering. The experience I have gained so far has given me a diverse skill set that I am eager to apply to real-world challenges. I am also open to relocation. If you are looking for a dedicated and motivated Cloud/Data Engineer who is ready to learn and make a substantial impact, let's connect!

Technical Skills:
Programming & Databases: Python, R, MySQL, PostgreSQL, MongoDB, Snowflake, Amazon DynamoDB, Apache Cassandra, PrestoDB, Neo4j Graph Database
Cloud & Big Data Technologies: AWS Glue, Amazon Redshift, Amazon Athena, AWS EMR, Amazon S3, Azure Databricks, Azure Data Factory, Azure Synapse Analytics, Azure Event Hubs, Google BigQuery, Google Cloud Storage
Data Engineering & DevOps: Apache Spark, Apache Hadoop, Apache Kafka, Apache Hive, Apache Airflow, PySpark, Pandas, NumPy, dbt (data build tool), Docker, Kubernetes, Jenkins, Git, Celery, ETL/ELT Pipelines, Data Modeling, Spark Streaming
Business Intelligence & Analytics: Tableau, Microsoft Power BI, Amazon QuickSight, scikit-learn, TensorFlow, PyTorch, Random Forest, XGBoost, SVM, Linear Regression, Logistic Regression, Hypothesis Testing, Dimensional Modeling, Statistical Analysis
Web Development & Tools: Django, Flask, Linux, React.js, HTML5, CSS3, JavaScript, Postman, Redis, RESTful APIs, FastAPI

Things I am good at


  • Being open to learning
  • Playing any outdoor sport
  • Engaging in healthy discussions
  • Being an excellent team player
  • Being a mentor
  • Volunteering in social activities

My Education

Master of Science in Data Science, Indiana University Bloomington, USA,
August 2023 - May 2025

CGPA: 3.8 out of 4

Coursework: Big Data Applications, Advanced Database Concepts, Applied Algorithms, Applied Database Techniques, Software Engineering, Computer Networks, Data Mining


Bachelor of Technology in Computer Science, Anna University, Chennai, India, August 2017 - June 2021

CGPA: 8.89 out of 10

Coursework: Data Structures, Database Management Systems, Computer Organization and Architecture, Operating Systems, Cloud Computing, Linux Programming

My Experience

Indiana University,   United States,   May 2024 - Present (Full-Time)

Data Engineer:
  • Architected healthcare RCM platform using Azure Data Factory and Databricks to process 2M+ daily EMR records, implementing Bronze-Silver-Gold Medallion architecture in ADLS Gen2 and Delta Lake for 500GB/day analytics with HIPAA compliance.
  • Standardized diverse hospital schemas using Common Data Model (CDM) and implemented SCD Type 2 in Delta tables, integrating EMR systems, claims processing, and healthcare APIs into unified Gold layer Enterprise Data Warehouse for comprehensive RCM analytics.
  • Spearheaded metadata-driven ETL pipelines using ADF orchestration and Databricks Spark to handle 10M+ daily transactions, implementing parallel processing and incremental watermarking across 10 tables with audit logging.
  • Optimized data processing performance through parallel batch execution and incremental load strategies, reducing ETL runtime from 4 to 1.5 hours and maintaining comprehensive audit trails while ensuring data quality validation across healthcare workflows.
  • Rolled out AR aging Power BI dashboards and authored cross-team RFCs guiding RCM lakehouse roadmap, accelerating claim denial recovery by 35% and safeguarding $2.1M revenue while exemplifying Ownership and Customer Obsession principles.
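
The incremental watermarking mentioned in the bullets above can be illustrated with a minimal Python sketch. This is not the production ADF/Databricks pipeline: the `incremental_load` helper, the `claims` table name, and the `updated_at` column are hypothetical, and an in-memory dict stands in for a watermark/audit table.

```python
from datetime import datetime

def incremental_load(rows, watermarks, table):
    """Return rows newer than the stored watermark and advance the watermark.

    rows: list of dicts carrying an 'updated_at' datetime (hypothetical column).
    watermarks: dict mapping table name -> last loaded 'updated_at' value.
    """
    last = watermarks.get(table, datetime.min)
    new_rows = [r for r in rows if r["updated_at"] > last]
    if new_rows:
        # Move the watermark forward to the newest row in this batch.
        watermarks[table] = max(r["updated_at"] for r in new_rows)
    return new_rows

# Hypothetical source rows and a watermark left behind by a previous run.
source = [
    {"claim_id": 1, "updated_at": datetime(2024, 6, 1, 8, 0)},
    {"claim_id": 2, "updated_at": datetime(2024, 6, 1, 9, 0)},
    {"claim_id": 3, "updated_at": datetime(2024, 6, 2, 7, 30)},
]
watermarks = {"claims": datetime(2024, 6, 1, 8, 30)}

batch = incremental_load(source, watermarks, "claims")   # picks up claims 2 and 3
rerun = incremental_load(source, watermarks, "claims")   # nothing new on a rerun
```

Because only rows past the watermark are read, a rerun over unchanged source data loads nothing, which is what keeps incremental runs short compared with full reloads.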

Techtinium,   India,   September 2022 - July 2023 (Full-time)

Data Engineer:
  • Engineered real-time data pipeline using AWS Kinesis and Apache Spark Streaming to process 1,000+ orders/minute with 5-second micro-batches, integrating streaming analytics with Amazon Redshift for executive dashboards and reducing data latency by 85%.
  • Established robust Spark Streaming data quality framework with 10-minute watermarking and automated schema validation using Delta Lake, achieving 99.9% data accuracy while reducing late arrival data loss by 95% across critical analytics workflows.
  • Streamlined ETL orchestration with Apache Airflow Python DAGs and CloudWatch monitoring, improving pipeline efficiency by 30% and reducing processing costs by 25% while maintaining 99.9% system uptime.
  • Revamped Redshift data warehouse with star schema modeling and optimized sort keys and distribution keys, reducing query times by 40% and scaling to 1.5M daily transactions with 3x improved storage efficiency.
  • Created Power BI Premium dashboards with DirectQuery optimization and row-level security, delivering real-time insights in 30 seconds for 50+ users while reducing manual reporting overhead by 70%.
  • Worked with different product vendors on POCs and post-sales implementation.
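
The 10-minute watermarking idea from the bullets above can be sketched in plain Python. This is a simplified illustration, not Spark code: Spark Structured Streaming advances its watermark per micro-batch, while this sketch updates it per event. The `split_by_watermark` helper and the event fields are hypothetical.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(minutes=10)

def split_by_watermark(events, delay=WATERMARK_DELAY):
    """Walk events in arrival order and separate on-time from too-late ones.

    The watermark trails the largest event time seen so far by `delay`;
    an event older than the watermark is treated as too late.
    """
    max_event_time = None
    accepted, dropped = [], []
    for event in events:
        ts = event["event_time"]
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        watermark = max_event_time - delay
        if ts >= watermark:
            accepted.append(event)
        else:
            dropped.append(event)
    return accepted, dropped

# Hypothetical order events, listed in the order they arrive.
arrivals = [
    {"order_id": "A", "event_time": datetime(2023, 1, 1, 12, 0)},
    {"order_id": "B", "event_time": datetime(2023, 1, 1, 12, 15)},
    {"order_id": "C", "event_time": datetime(2023, 1, 1, 12, 8)},  # 7 minutes behind: kept
    {"order_id": "D", "event_time": datetime(2023, 1, 1, 12, 2)},  # 13 minutes behind: dropped
]
accepted, dropped = split_by_watermark(arrivals)
```

A longer delay keeps more late events at the cost of holding state for longer; the 10-minute value here simply mirrors the figure quoted above.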

Kaar Technologies,   India,   December 2020 - August 2022 (Full-time)

Data Engineer:
  • Implemented end-to-end batch ETL pipeline using AWS services (S3, EMR, Lambda, CloudWatch, Glue, Athena), Snowflake, Apache Spark, and Airflow to process 50GB+ daily retail data from 5 tables, reducing processing time by 60% and enabling Tableau dashboards for 300K+ transactions.
  • Automated daily data extraction from Snowflake OLTP database using SQL stored procedures and CloudWatch-Lambda triggers with Airflow orchestration, achieving 99% pipeline reliability and eliminating 8 hours of manual work weekly.
  • Orchestrated ETL workflows using Apache Airflow in Docker containers on EC2, managing Spark job submission to Amazon EMR cluster with DAG monitoring and achieving 95% reliability across 12 interdependent tasks.
  • Optimized PySpark transformation scripts and automated data catalog systems on Amazon EMR using AWS Glue crawlers, delivering 70% query performance improvement and enabling analysts to ship Tableau dashboards 5x faster with Parquet-format output.
  • Mentored 2 junior developers and led campus data-club sessions, achieving 25% improvement in on-boarding efficiency while fostering inclusive team culture and strengthening organizational data analytics capabilities.
  • Attained an impressive 89% evaluation during the internship tenure, culminating in a successful transition to a full-time role as a Data Engineer within the dynamic Data Analytics team.

Kaar Technologies,   India,   July 2020 - December 2020 (Internship)

Data Analyst Intern:
  • Engineered React-based Customer, Vendor, and Employee portals integrated via SAP PI/PO middleware, utilizing SAP as the backend. Devised automated chat-bot for user-friendly access, Employee Maintenance, and system health checks.
  • Performed Exploratory Data Analysis and handled unbalanced datasets to train the ML models. Built an internal chat-bot to create new incidents, automate data look-ups, etc., and integrated it with a Flutter mobile application.
  • Leveraged Microsoft Access and SQL to extract and manipulate raw data, generating ad-hoc reports that boosted data-driven decisions in Power BI, enhancing trend visualization and optimizing KPIs for a 22% increase in efficiency.
  • Implemented thorough data quality checks for Access and Power BI, ensuring precise updates; resulted in smoother database operations and an 18% reduction in data discrepancies on dashboards.
  • As the internship progressed, my commitment to excellence and the pursuit of knowledge became evident, leading to a seamless transition into a full-time role with the esteemed Data Analytics team.

My Certifications

I love earning certifications. Below are a few that I have completed so far, and I am working on a couple more in parallel.

AWS Certified Data Engineer – Associate

Skills learned and applied: AWS Glue, Athena, Redshift, DynamoDB, EC2, Amazon Kinesis and many other AWS Data Engineering services

Databricks certified Data Engineer Associate

Skills learned and applied: Apache Spark, Delta Lake, Databricks, Lakehouse, Delta Live Tables, Data Pipelines, ETL, Production, SQL, Python

Microsoft Certified: Power BI Data Analyst Associate

Skills learned and applied: Prepare the data, Model the data, Visualize and analyze the data

Astronomer Certification for Apache Airflow 3 Fundamentals

Skills learned and applied: Apache Airflow, DAGs, Data Pipelines, Orchestration, Scheduling

Databricks certified Spark 3.0 Developer Associate

Skills learned and applied: Spark architecture, Spark SQL functions, UDFs, DataFrames, Adaptive query execution, Python

AWS Academy Graduate - AWS Academy Cloud Architecting

Skills learned and applied: Architecting Solutions On AWS, AWS Cloud, Best Practices, Building Infrastructure On AWS

Azure Fundamentals and Azure Data Fundamentals

Skills learned and applied: Cloud concepts, Azure architecture and services, Azure management and governance, Describe core data concepts, Identify considerations for relational data on Azure, Describe considerations for working with non-relational data on Azure, Describe an analytics workload on Azure

Apache Kafka Developer

Individuals who successfully complete the Confluent Fundamentals Accreditation have an understanding of Apache Kafka and Confluent Platform. Users are able to: explore use cases, have general knowledge of Kafka’s core concepts, understand the ability of Kafka as a highly scalable, highly available, and resilient real-time event streaming platform.

AWS Academy Graduate - AWS Academy Cloud Web Application Builder

Skills learned and applied: Architecting Solutions On AWS, AWS, AWS Academy, AWS Cloud, Building Infrastructure On AWS, Web Applications

AWS Academy Graduate - AWS Academy Cloud Foundations

Skills learned and applied: AWS Architecture, AWS Cloud, AWS Core Services, AWS Pricing, AWS Support

My Projects

Below are a few hands-on projects that I worked on.

E-commerce Business Intelligence Platform with dbt and Kafka

This project enables real-time web activity analytics, giving the business immediate insight into user behavior and website interactions such as clicks, devices used, and network information. A Python 3 fake-data generator simulates web activity at one row per second, and a Kafka producer streams those rows into a Kafka broker. The resulting Kafka–Spark pipeline ingests roughly 86,400 web events per day (1 row per second × 86,400 seconds) into a data lake on Google Cloud Storage, supporting minute-level data updates. dbt (data build tool) transformations, automated within a Dockerized Airflow environment, process the raw data from the data lake. The refined data is then loaded into BigQuery, Google's cloud data warehouse, which feeds real-time dashboards built in Google Data Studio (now Looker Studio). The underlying GCP infrastructure, including BigQuery datasets, Kafka, Spark clusters, and Airflow instances, is provisioned and managed with Terraform for automation and consistency.
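
As a rough illustration of the generator stage described above, here is a minimal Python sketch that produces one synthetic web event per second of simulated time and serializes each event the way a Kafka producer payload might look. The field names, the `generate_events` helper, and the sample values are assumptions for illustration, not the project's actual schema, and no Kafka client is used.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical attribute values for the simulated web activity.
DEVICES = ["mobile", "desktop", "tablet"]
PAGES = ["/home", "/product", "/cart", "/checkout"]

def generate_events(start, count, seed=42):
    """Yield `count` synthetic click events, one per simulated second."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    for i in range(count):
        yield {
            "event_time": (start + timedelta(seconds=i)).isoformat(),
            "user_id": rng.randint(1, 1000),
            "page": rng.choice(PAGES),
            "device": rng.choice(DEVICES),
        }

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = list(generate_events(start, count=5))

# JSON-encoded bytes, the kind of value a producer could send to a topic.
payloads = [json.dumps(e).encode("utf-8") for e in events]

# One row per second adds up to this many rows per day.
EVENTS_PER_DAY = 24 * 60 * 60
```

In a real pipeline each payload would be handed to a Kafka producer and the loop would sleep for a second between rows; both are left out here so the sketch stays self-contained.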

Skills learned and applied: BigQuery, Airflow, Apache Kafka, Spark, Docker

Real-Time Food Analytics Platform with Snowflake CDC

The Real-Time Food Analytics Platform with Snowflake CDC project, primarily developed in Python and built upon a robust Dimensional Modeling framework, was engineered to provide real-time insights into food order fulfillment metrics, addressing the critical business need for accelerated decision-making. Technically, this highly scalable solution centers on an engineered Snowflake Medallion platform, designed to efficiently process substantial data volumes including 10M+ daily transactions and 500GB of food records. It leverages Snowflake Streams for Change Data Capture (CDC) to ingest incremental data changes in near real-time, and implements SCD Type 2 for comprehensive historical tracking within its dimension tables. The project's structured approach is based on Dimensional Modeling, utilizing fact tables like order_item_fact to capture granular business events and metrics, complemented by various dimension tables (such as customer, menu, restaurant, date, location, and delivery agent dimensions) which provide descriptive attributes for rich analytical querying and reporting. The curated and transformed data is then surfaced through real-time Streamlit dashboards which are performance-optimized via Materialized Views, ultimately empowering users to make 30% faster decisions for critical order fulfillment metrics. The entire platform is deployed on AWS, leveraging its cloud capabilities for scalability and reliability.
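
The SCD Type 2 handling mentioned above can be sketched in plain Python. In Snowflake this logic would normally be written in SQL against a stream; the version below only illustrates the idea, and the `apply_scd2` helper plus the `valid_from`, `valid_to`, and `is_current` column names are assumptions.

```python
from datetime import date

def apply_scd2(dimension, incoming, key, tracked, today):
    """Apply SCD Type 2 changes to `dimension` (a list of dicts) in place.

    Changed rows are closed out and a new current version is appended,
    so the full history of each key is preserved.
    """
    current = {row[key]: row for row in dimension if row["is_current"]}
    for record in incoming:
        existing = current.get(record[key])
        if existing and all(existing[c] == record[c] for c in tracked):
            continue  # tracked attributes unchanged, nothing to do
        if existing:
            # Close the old version instead of overwriting it.
            existing["is_current"] = False
            existing["valid_to"] = today
        new_row = dict(record, valid_from=today, valid_to=None, is_current=True)
        dimension.append(new_row)
        current[record[key]] = new_row
    return dimension

# Hypothetical customer dimension with one current row.
customer_dim = [
    {"customer_id": 1, "city": "Chennai", "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
]
changes = [
    {"customer_id": 1, "city": "Bloomington"},  # attribute change
    {"customer_id": 2, "city": "Chicago"},      # brand-new customer
]
apply_scd2(customer_dim, changes, key="customer_id", tracked=["city"],
           today=date(2024, 6, 1))
```

After the call, customer 1 has two rows (the closed Chennai version and the current Bloomington version), which is what lets a fact table join to the attributes that were valid at order time.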

Skills learned and applied: Snowflake, Amazon Web Services (AWS), Dimensional Modeling, Streamlit, Extract, Transform, Load (ETL)

Real-Time-Credit-Card-Fraud-Detection-Pipeline

This project aims to prevent identity theft by detecting unusual credit card activity, a type of fraud that has skyrocketed in recent years. I used an AWS S3 bucket, AWS EMR, Hive, Hadoop, MongoDB, PySpark, and Apache Kafka to build it. All the files needed to run the project are attached in the Media section. The following checks are performed to detect fraud in credit card transactions:
  • A transaction is considered fraudulent if its amount exceeds the Upper Control Limit (UCL) computed over the card's last 10 transactions.
  • A transaction is considered fraudulent if the credit score is less than 200.
  • The geolocation of each transaction is captured, and the distance and time between consecutive transactions are calculated. If the speed implied by that distance and time does not exceed 900 km/hr, the current transaction is treated as genuine; otherwise, it is treated as fraudulent.
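
The three checks described above can be sketched in plain Python as follows. This is an illustration rather than the project's PySpark code: the UCL formula (mean plus three standard deviations over the last 10 amounts) is an assumed definition, and the `is_fraud` helper and transaction fields (`amount`, `ts` in epoch seconds, `lat`, `lon`) are hypothetical.

```python
import math
from statistics import mean, pstdev

SPEED_LIMIT_KMH = 900
MIN_CREDIT_SCORE = 200

def upper_control_limit(amounts):
    """Assumed UCL definition: mean + 3 * population standard deviation."""
    return mean(amounts) + 3 * pstdev(amounts)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))

def is_fraud(txn, last_amounts, prev_txn, credit_score):
    """Return True if any of the three rules flags the transaction."""
    recent = last_amounts[-10:]
    if len(recent) == 10 and txn["amount"] > upper_control_limit(recent):
        return True  # rule 1: amount above the UCL
    if credit_score < MIN_CREDIT_SCORE:
        return True  # rule 2: credit score too low
    hours = (txn["ts"] - prev_txn["ts"]) / 3600
    distance = haversine_km(prev_txn["lat"], prev_txn["lon"], txn["lat"], txn["lon"])
    if hours <= 0:
        return distance > 0  # same instant but a different place
    return distance / hours > SPEED_LIMIT_KMH  # rule 3: implied travel speed

# Hypothetical card history and transactions.
history = [90, 110] * 5  # last 10 amounts: mean 100, std dev 10, UCL 130
prev = {"amount": 100, "ts": 0, "lat": 13.08, "lon": 80.27}
same_city = {"amount": 120, "ts": 3600, "lat": 13.08, "lon": 80.27}
far_away = {"amount": 120, "ts": 3600, "lat": 40.71, "lon": -74.01}
big_spend = {"amount": 500, "ts": 3600, "lat": 13.08, "lon": 80.27}

results = {
    "same_city": is_fraud(same_city, history, prev, credit_score=700),
    "far_away": is_fraud(far_away, history, prev, credit_score=700),
    "big_spend": is_fraud(big_spend, history, prev, credit_score=700),
    "low_score": is_fraud(same_city, history, prev, credit_score=150),
}
```

Only the `same_city` transaction passes all three rules; each of the other three trips exactly one rule.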

Skills learned and applied: AWS(S3, EMR), Hadoop, Hive, MongoDB, PySpark, Apache Kafka

Spotify Data Pipeline: Extract, Transform, and Analyze with AWS

This project involves building an end-to-end data pipeline for Spotify data using AWS services. It encompasses extracting data from the Spotify API, storing it in AWS S3, and implementing automated transformation processes using AWS Lambda. The pipeline includes scheduled data extraction, data cleaning and formatting, and automated triggers for transformation based on data updates. The transformed data is then stored back in S3 with proper organization. The project also leverages AWS Glue and Athena for creating analytics tables and enabling efficient querying. Key skills learned from this project include working with AWS services (S3, Lambda, Glue, Athena), API integration, data extraction and transformation, automated pipeline development, cloud-based data storage and organization, and setting up data analytics infrastructure. This comprehensive solution provides a scalable and automated approach to processing and analyzing Spotify data, offering valuable insights for various analytical purposes.
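
The transformation step described above can be illustrated with a small Python function of the kind a Lambda handler might call. The payload shape below is an assumption loosely modeled on a playlist-items style API response, and `flatten_tracks` is a hypothetical helper, not the project's actual code.

```python
def flatten_tracks(playlist_response):
    """Flatten a nested playlist payload into one flat dict per track."""
    rows = []
    for item in playlist_response.get("items", []):
        track = item.get("track") or {}
        if not track.get("id"):
            continue  # skip entries with no usable track
        rows.append({
            "track_id": track["id"],
            "track_name": track.get("name"),
            "album_name": (track.get("album") or {}).get("name"),
            "artists": ", ".join(a["name"] for a in (track.get("artists") or [])),
            "duration_ms": track.get("duration_ms"),
            "added_at": item.get("added_at"),
        })
    return rows

# Hypothetical raw payload as it might be stored in the S3 "raw" prefix.
sample = {
    "items": [
        {
            "added_at": "2024-01-01T00:00:00Z",
            "track": {
                "id": "t1",
                "name": "Song One",
                "album": {"name": "Album A"},
                "artists": [{"name": "Artist X"}, {"name": "Artist Y"}],
                "duration_ms": 210000,
            },
        },
        {"added_at": "2024-01-02T00:00:00Z", "track": None},  # removed track
    ]
}
rows = flatten_tracks(sample)
```

Flat rows like these can be written back to S3 as CSV or Parquet, which is the form Glue crawlers and Athena tables work with most easily.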

Skills learned and applied: Application Design, Python, Infrastructure As Code, Amazon Web Services, AWS Elastic Cloud Compute, AWS Relational Database Service, AWS Simple Storage Service

Employee Churn Prediction

Retaining current employees is often more difficult for an HR team than recruiting new ones. A business that loses a valuable employee suffers losses in productivity, time, money, and other areas. These losses could be reduced if HR were able to foresee which employees are considering leaving their positions, so this project explores the employee turnover problem from a machine learning perspective. When layoffs become necessary as part of organizational changes, the company can also use churn modeling to make a rational decision rather than selecting layoff candidates at random.

Skills learned and applied: Data collection, Data Mining, Data Pre Processing, Python, Pandas, Exploratory Data Analysis, Dimensionality Reduction, Machine Learning algorithms, Tuning ML model performance

Tableau-Dashboard-for-HR-Data-along-with-EDA

I had the privilege of leading a project that used Tableau for in-depth analysis of our Human Resources data. Every company has an HR department that handles various recruitment and placement tasks. In this project, I worked with a massive dataset to extract valuable insights that can help the HR department improve its work and better understand the recruitment process in the market. The resulting dashboards provide the insights needed for informed decision-making, helping steer the business in the right direction and drive substantial improvements in overall performance.

Skills learned and applied: Data Collection, Data Cleaning, Data Transformation, Data Modeling, Data pre-processing, Information Visualization, Business Intelligence, Tableau

Leadership and Involvement

  • Played a pivotal role in mentoring new hires at Kaar Technologies by offering technical expertise and guidance, ensuring successful integration into the SAP Data Analytics team; resulted in a 30% decrease in onboarding time and increased overall team efficiency by 25%.
  • Received the Best Performer of the Month award four times for project performance.
  • Participated in the campus hiring process and onboarded many employees.
  • Placed 5th among 21 teams at the on-site finals, Kaizen Robotics Program of IIT Madras, Chennai.

Contact Me

🏠 Bloomington, Indiana, US.

If you have any potential opportunity for me, or just want to get in touch, please feel free to drop me an email. You can also reach out to me on LinkedIn from the bottom-right corner of this page.

I look forward to connecting with other industry professionals, sharing knowledge, and exploring new opportunities in the data space. Because in this realm of information, a warm welcome is just a JOIN operation away. 😊