Overview
At RIIID (뤼이드), introduced the company’s first multi-GPU training setup and improved every stage of the pipeline.
Key Achievements
- GPU Utilization: Increased from 25% to 95%
- Initialization Time: Reduced from 1 hour to 10 seconds
- First Multi-GPU Training: Introduced the first multi-GPU training in the company
- CI/CD: Built CI/CD pipelines using GitHub Actions
Pipeline evolution: Single GPU (25% utilization) -> DDP setup -> Multi-GPU training -> Docker container -> Distributed training (95% utilization)
Technical Approach
- Introduced the company's first multi-GPU training, using distributed data-parallel training (PyTorch DDP)
- Improved every stage of the pipeline, raising GPU utilization from 25% to 95% and cutting initialization time from 1 hour to 10 seconds
- Set up CI/CD with GitHub Actions
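Distributed data-parallel training generally has each GPU process work on its own slice of every epoch's data. The following is a minimal, framework-free sketch of that sharding idea, not the code used at RIIID. The function `shard_indices` and its parameters are hypothetical names chosen for illustration. In an actual PyTorch DDP setup this role is normally filled by `torch.utils.data.distributed.DistributedSampler`.

```python
import math
import random


def shard_indices(num_samples, world_size, rank, epoch=0, shuffle=True):
    """Return the sample indices that one rank would process in one epoch.

    Hypothetical helper for illustration only.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if num_samples == 0:
        return []

    indices = list(range(num_samples))
    if shuffle:
        # Every rank seeds with the same epoch, so all ranks agree on the order.
        random.Random(epoch).shuffle(indices)

    # Pad by wrapping around so that every rank gets the same number of samples.
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    repeats = math.ceil(total / num_samples)
    indices = (indices * repeats)[:total]

    # Strided slice: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    world_size = 4  # assumed number of GPU processes
    for rank in range(world_size):
        print(f"rank {rank}: {shard_indices(10, world_size, rank, epoch=0)}")
```

Seeding the shuffle with the epoch number keeps all ranks in agreement on the ordering without any communication. The wrap-around padding means a few samples may be seen twice in an epoch, in exchange for every rank running the same number of steps.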
Tech Stack
- Training: Multi-GPU, Distributed Data-Parallel Training (PyTorch DDP)
- CI/CD: GitHub Actions
- Infrastructure: Docker, AWS
Period
June 2020 - December 2020 | RIIID (뤼이드)
