HR Workday Data Pipeline

Overview

This standalone project provides a complete data generation and cloud deployment pipeline for Workday-style HR data. The generator creates realistic transactional HR datasets for a simulated 10,000-employee North American financial services organization, producing 40,000+ transactions across multiple data domains. The AWS deployment component provides full infrastructure-as-code for S3, Glue, and Redshift Serverless.

Data Model

core_hr_employees (10K records): Non-transactional employee master data

job_movement_transactions: Hires, promotions, terminations, moves

compensation_change_transactions (20K records): Merit increases, market adjustments, equity

worker_movement_transactions (13K records): Transfers, relocations, org changes

Total: 40,000+ transactions over 1 year
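
To illustrate how the master table and a transaction table relate, here is a minimal sketch that builds a small employee master and a set of compensation changes keyed to it. It uses only the standard library's random module in place of Faker and NumPy so that it runs without dependencies. The column names, salary range, adjustment range, and the year 2024 are all assumptions made for illustration, not the project's actual schema.

```python
import random
from datetime import date, timedelta

# Hypothetical reason codes, taken from the table description above.
REASONS = ["Merit Increase", "Market Adjustment", "Equity"]


def make_employees(n, rng):
    """Build n master rows keyed by a synthetic employee_id (assumed columns)."""
    return [
        {
            "employee_id": f"EMP{i:06d}",
            "base_salary": round(rng.uniform(45_000, 180_000), 2),  # assumed range
        }
        for i in range(1, n + 1)
    ]


def make_comp_changes(employees, n, year, rng):
    """Build n compensation-change rows, each referencing an existing employee."""
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    rows = []
    for i in range(1, n + 1):
        emp = rng.choice(employees)
        effective = start + timedelta(days=rng.randrange(days_in_year))
        rows.append(
            {
                "transaction_id": f"CMP{i:07d}",
                "employee_id": emp["employee_id"],
                "effective_date": effective.isoformat(),
                "reason": rng.choice(REASONS),
                "change_pct": round(rng.uniform(0.01, 0.08), 4),  # assumed 1-8%
            }
        )
    return rows


rng = random.Random(42)  # fixed seed so repeated runs give the same rows
employees = make_employees(100, rng)
comp_changes = make_comp_changes(employees, 200, 2024, rng)
```

The 1:2 ratio of employees to compensation changes mirrors the 10K and 20K record counts listed above, scaled down for a quick run.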

Architecture

The pipeline follows a two-stage approach:

Data Generation: Python-based generator using Faker and NumPy to produce Workday-style CSV files with realistic organizational hierarchies (Business Unit → Division → Department → Team), compensation structures, and temporal patterns

AWS Deployment: CloudFormation-based infrastructure including encrypted S3 storage, Redshift Serverless with optimized table design, Glue ETL jobs that load data with native Redshift COPY commands, daily automated scheduling, and full IAM role definitions
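
The load step in the second stage can be sketched as a helper that assembles one Redshift COPY statement per table. This is only a sketch under stated assumptions: the bucket name, the IAM role ARN, and the one-prefix-per-table S3 layout are hypothetical placeholders, and the project's actual Glue job may build or run its statements differently.

```python
# Hypothetical placeholders; substitute the values from your own deployment.
BUCKET = "example-hr-data-bucket"
ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole"

# Table names from the data model above.
TABLES = [
    "core_hr_employees",
    "job_movement_transactions",
    "compensation_change_transactions",
    "worker_movement_transactions",
]


def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that loads headered CSV files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1\n"      # skip the CSV header row
        "DATEFORMAT 'auto';"    # let Redshift infer the date format
    )


# Assumed layout: each table's CSV files sit under a prefix named after the table.
statements = [
    build_copy_statement(table, f"s3://{BUCKET}/{table}/", ROLE_ARN)
    for table in TABLES
]

print(statements[0])
```

Because the statement is built by string interpolation, the table names and URIs should come from trusted configuration rather than user input.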

Tools & Technologies

AWS S3, AWS Redshift Serverless, AWS Glue, AWS CloudFormation, Python, Faker, NumPy, Pandas, Workday

Governance Notes

This project generates entirely synthetic HR data using statistical distributions modeled on publicly available workforce statistics. No real employee data, proprietary Workday configurations, or confidential organizational information was used. The organizational structure, compensation bands, and demographic distributions are fictional.