Overview
This standalone project provides a complete data generation and cloud deployment pipeline for Workday-style HR data. The generator creates realistic transactional HR datasets for a simulated 10,000-employee North American financial services organization, producing more than 40,000 transactions across multiple data domains. The AWS deployment component provides full infrastructure-as-code for S3, Glue, and Redshift Serverless.
Data Model
Non-transactional employee master data
Hires, promotions, terminations, moves
Merit increases, market adjustments, equity
Transfers, relocations, org changes
Total: 40,000+ transactions over 1 year
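As a rough illustration of the volumes above, the sketch below produces 40,000 transaction rows for 10,000 employees over a 365-day window and serializes them as CSV. It is not the project's generator: it uses the standard library's `random` as a stand-in for Faker and NumPy so that it runs without dependencies. The domain keys (`job`, `compensation`, `org`), the column names, and the uniform date distribution are assumptions made for the example. Only the event types come from the list above.

```python
import csv
import io
import random
from datetime import date, timedelta

# Event types are taken from the data model list; the domain keys
# ("job", "compensation", "org") are hypothetical labels for this sketch.
EVENT_TYPES = {
    "job": ["hire", "promotion", "termination", "move"],
    "compensation": ["merit_increase", "market_adjustment", "equity"],
    "org": ["transfer", "relocation", "org_change"],
}


def generate_transactions(n_employees, n_transactions, start, seed=0):
    """Return transaction rows spread uniformly over 365 days from `start`."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    rows = []
    for _ in range(n_transactions):
        domain = rng.choice(sorted(EVENT_TYPES))
        effective = start + timedelta(days=rng.randrange(365))
        rows.append({
            "employee_id": rng.randint(1, n_employees),
            "domain": domain,
            "event_type": rng.choice(EVENT_TYPES[domain]),
            "effective_date": effective.isoformat(),
        })
    # Order chronologically, then assign sequential IDs in that order.
    rows.sort(key=lambda r: r["effective_date"])
    return [{"transaction_id": i, **row} for i, row in enumerate(rows, start=1)]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = generate_transactions(
    n_employees=10_000, n_transactions=40_000, start=date(2023, 1, 1)
)
csv_text = to_csv(rows)
```

A realistic generator would replace the uniform draws with weighted distributions, for example clustering merit increases around a review cycle, which is where NumPy's sampling functions would come in.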
Architecture
The pipeline follows a two-stage approach:
1. Data Generation: Python-based generator using Faker and NumPy to produce Workday-style CSV files with realistic organizational hierarchies (Business Unit → Division → Department → Team), compensation structures, and temporal patterns
2. AWS Deployment: CloudFormation-based infrastructure including encrypted S3 storage, Redshift Serverless with optimized table design, Glue ETL jobs using native COPY commands, and daily automated scheduling with full IAM role definitions
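To make the load step concrete, here is a minimal sketch of how a job might assemble a Redshift COPY statement for one of the generated CSV files. The helper name `build_copy_statement`, the table `hr.job_transactions`, the bucket path, and the role ARN are all hypothetical placeholders. The project's actual Glue scripts may be structured differently. The COPY options shown (`FORMAT AS CSV`, `IGNOREHEADER`, `DATEFORMAT 'auto'`) are standard Redshift syntax.

```python
import re

# Accepts "table" or "schema.table" made of plain identifiers only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_statement(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement for a CSV file with a header row."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got: {s3_uri!r}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1\n"      # skip the CSV header line
        "DATEFORMAT 'auto';"    # let Redshift infer the date format
    )


# All three arguments below are placeholders, not values from the project.
sql = build_copy_statement(
    table="hr.job_transactions",
    s3_uri="s3://example-hr-data/job_transactions.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(sql)
```

Loading through COPY lets Redshift read the file directly from S3 in parallel, which is generally faster than inserting rows one at a time from the ETL job.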
Tools & Technologies
Key Artifacts
Governance Notes
This project generates entirely synthetic HR data using statistical distributions modeled on publicly available workforce statistics. No real employee data, proprietary Workday configurations, or confidential organizational information was used. The organizational structure, compensation bands, and demographic distributions are fictional.