Introduction
Welcome to my third project in the DevOps Challenge! This time, I tackled the automation of collecting, storing, and analyzing NBA player data using AWS services. My goal was to fetch real-time NBA data from the Sportsdata.io API and create a scalable data lake on AWS. To take it a step further, I automated the entire workflow using GitHub Actions and set up AWS CloudWatch for logging and monitoring.
What This Project Does
This project automates the process of collecting NBA player data and storing it in an AWS data lake. Here’s what it accomplishes:
- Fetch NBA Data: Retrieves player stats from the Sportsdata.io API.
- Store Data in S3: Saves the fetched data to AWS S3 in JSON format.
- Create a Data Lake: Uses AWS Glue to structure and catalog the data.
- Enable SQL Queries: Configures AWS Athena to query the data using SQL.
- Log Everything: Implements AWS CloudWatch for logging and tracking all activities.
Tools and Technologies Used
To build this pipeline, I leveraged the following technologies:
- Programming Language: Python 3.8
- AWS Services: S3, Glue, Athena, CloudWatch
- API Provider: Sportsdata.io (NBA Data API)
- Automation: GitHub Actions
Setup Instructions
Step 1: Prerequisites
Before setting up this project, ensure you have:
- An AWS account.
- A Sportsdata.io API key.
- The necessary IAM role/permissions in AWS (an example policy sketch follows this list):
  - S3: `s3:CreateBucket`, `s3:PutObject`, `s3:ListBucket`
  - Glue: `glue:CreateDatabase`, `glue:CreateTable`
  - Athena: `athena:StartQueryExecution`, `athena:GetQueryResults`
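For reference, here is a minimal sketch of what an inline policy granting those actions could look like, attached with boto3 to the IAM user whose access keys the workflow uses. The user name, policy name, and wildcard resources are placeholders, and in practice Athena also needs read access to the underlying S3 data and Glue catalog (e.g. `s3:GetObject`, `glue:GetTable`), so treat this as a starting point rather than the project's exact policy.

```python
import json
import boto3

# Placeholder policy document; scope the Resource entries down for real use.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:CreateBucket", "s3:PutObject", "s3:ListBucket"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["glue:CreateDatabase", "glue:CreateTable"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
         "Resource": "*"},
    ],
}

iam = boto3.client("iam")
iam.put_user_policy(
    UserName="nba-data-lake-ci",          # placeholder IAM user
    PolicyName="nba-data-lake-minimal",   # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
```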
Add these secrets to your GitHub repository (Settings > Secrets and variables > Actions):
| Secret Name | Description |
|----------------------|----------------------------------|
| AWS_ACCESS_KEY_ID | AWS access key |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
| AWS_REGION | AWS region (e.g., `us-east-1`) |
| AWS_BUCKET_NAME | Your S3 bucket name |
| NBA_ENDPOINT | Sportsdata.io API endpoint |
| SPORTS_DATA_API_KEY | Sportsdata.io API key |
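Assuming the workflow exports these secrets as environment variables of the same names, the Python script can read them with a few lines like this:

```python
import os

# These names mirror the GitHub secrets in the table above; the workflow is
# assumed to export them as environment variables with identical names.
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
BUCKET_NAME = os.environ["AWS_BUCKET_NAME"]            # fail fast if missing
NBA_ENDPOINT = os.environ["NBA_ENDPOINT"]
SPORTS_DATA_API_KEY = os.environ["SPORTS_DATA_API_KEY"]
```

Note that `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are picked up by boto3 directly from the environment, so the script never has to handle them explicitly.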
Step 2: How It Works
Clone the Repository

```bash
git clone https://github.com/kingdave4/Nba_Data_Lake.git
cd Nba_Data_Lake
```
Project Breakdown
The project is structured to run a Python script within a GitHub Actions workflow.
The workflow YAML file (`.github/workflows/deploy.yml`) automates the execution of the script. The Python script handles:
- AWS service configuration and initialization (see the client setup sketch below)
- Fetching and processing NBA data
- Uploading data to S3 and cataloging it with Glue
- Enabling Athena queries for analysis
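As a rough illustration of the configuration and initialization step, the script only needs a handful of boto3 clients; boto3 reads the credentials and region that the workflow exports as environment variables, so nothing sensitive appears in the code:

```python
import os

import boto3

# boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and the region
# from the environment set by the GitHub Actions workflow.
region = os.getenv("AWS_REGION", "us-east-1")
s3 = boto3.client("s3", region_name=region)
glue = boto3.client("glue", region_name=region)
athena = boto3.client("athena", region_name=region)
```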
Order of Execution
Here's how the Python script executes step by step (a condensed sketch follows the list):
1. Create an S3 Bucket: The bucket is used to store raw NBA data.
2. Create a Glue Database: Organizes and catalogs the data.
3. Fetch NBA Data: Calls the Sportsdata.io API for player data.
4. Convert Data to JSON Format: Ensures compatibility with AWS services.
5. Upload Data to S3: Stores the JSON files in a structured folder.
6. Create a Glue Table (`nba_players`): Allows querying via Athena.
7. Enable Athena for SQL Queries: Sets up SQL-based analytics on the dataset.
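To make those steps concrete, here is a condensed, hypothetical sketch rather than the repository's exact code: the `nba_data_lake` database name, the `raw-data/` object key, the column list (matching the example query later in this post), and the use of JSON Lines for Athena's OpenX JSON SerDe are all assumptions.

```python
import json
import os

import boto3
import requests

region = os.getenv("AWS_REGION", "us-east-1")
bucket = os.environ["AWS_BUCKET_NAME"]
s3 = boto3.client("s3", region_name=region)
glue = boto3.client("glue", region_name=region)


def create_bucket():
    # us-east-1 rejects an explicit LocationConstraint, hence the branch.
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )


def create_glue_database(name="nba_data_lake"):
    glue.create_database(DatabaseInput={"Name": name})


def fetch_nba_data():
    # Sportsdata.io accepts the API key in the Ocp-Apim-Subscription-Key header.
    response = requests.get(
        os.environ["NBA_ENDPOINT"],
        headers={"Ocp-Apim-Subscription-Key": os.environ["SPORTS_DATA_API_KEY"]},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # list of player dicts


def upload_to_s3(players):
    # One JSON object per line (JSON Lines) so Athena's JSON SerDe can read it.
    body = "\n".join(json.dumps(p) for p in players)
    s3.put_object(Bucket=bucket, Key="raw-data/nba_player_data.jsonl", Body=body)


def create_glue_table(database="nba_data_lake"):
    glue.create_table(
        DatabaseName=database,
        TableInput={
            "Name": "nba_players",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "FirstName", "Type": "string"},
                    {"Name": "LastName", "Type": "string"},
                    {"Name": "Position", "Type": "string"},
                    {"Name": "Team", "Type": "string"},
                ],
                "Location": f"s3://{bucket}/raw-data/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
            },
        },
    )


if __name__ == "__main__":
    create_bucket()
    create_glue_database()
    upload_to_s3(fetch_nba_data())
    create_glue_table()
```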
GitHub Actions: Automating the Deployment
The GitHub Actions workflow is set up to trigger on every push to the repository. When executed, it:
- Installs dependencies
- Sets up AWS credentials
- Runs the Python script to fetch and store NBA data
- Configures AWS services automatically
This ensures that each code update automatically refreshes the pipeline, making it hands-free!
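The repository's actual `deploy.yml` may differ in its details, but a minimal workflow matching that description could look roughly like the sketch below; the action versions, the `requirements.txt` file, and the `src/nba_data_lake.py` script path are placeholders, and credentials are passed as plain environment variables for boto3 to pick up.

```yaml
# Hypothetical .github/workflows/deploy.yml -- a minimal sketch, not the repo's exact file.
name: Deploy NBA Data Lake

on: [push]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.8"

      - name: Install dependencies
        run: pip install -r requirements.txt   # assumes boto3 and requests are listed

      - name: Run the data lake pipeline
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: ${{ secrets.AWS_REGION }}
          AWS_BUCKET_NAME: ${{ secrets.AWS_BUCKET_NAME }}
          NBA_ENDPOINT: ${{ secrets.NBA_ENDPOINT }}
          SPORTS_DATA_API_KEY: ${{ secrets.SPORTS_DATA_API_KEY }}
        run: python src/nba_data_lake.py        # placeholder script path
```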
Results of the Pipeline Execution
Once the pipeline runs successfully:
- S3 Bucket: Stores all raw data in the `raw-data/` folder.
- AWS Glue: Manages the data schema.
- AWS Athena: Enables querying of the data using SQL.
Example SQL Query (Athena)
```sql
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'SG';
```
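The same query can also be issued programmatically, which is what the `athena:StartQueryExecution` and `athena:GetQueryResults` permissions are for. In this sketch the `nba_data_lake` database name and the results bucket prefix are assumptions:

```python
import time

import boto3

athena = boto3.client("athena")


def run_athena_query(sql, database="nba_data_lake",
                     output="s3://<your-bucket>/athena-results/"):  # placeholder output location
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


rows = run_athena_query(
    "SELECT FirstName, LastName, Position, Team FROM nba_players WHERE Position = 'SG'"
)
```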
Error Tracking & Logging with CloudWatch
To ensure smooth execution, I integrated AWS CloudWatch Logs to track key activities, including:
- API calls
- Data uploads to S3
- Glue catalog updates
- Athena query executions
If an error occurs (e.g., missing API keys or AWS permissions), CloudWatch provides insights for troubleshooting.
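The project's exact logging setup isn't reproduced here, but a simple way to push such events to CloudWatch Logs with boto3 looks roughly like this; the log group and stream names are placeholders:

```python
import time

import boto3

logs = boto3.client("logs")
LOG_GROUP = "/nba-data-lake/pipeline"   # placeholder log group name
LOG_STREAM = "github-actions"           # placeholder log stream name


def ensure_log_stream():
    # Creating a group or stream that already exists raises an exception we can ignore.
    for call, kwargs in (
        (logs.create_log_group, {"logGroupName": LOG_GROUP}),
        (logs.create_log_stream, {"logGroupName": LOG_GROUP, "logStreamName": LOG_STREAM}),
    ):
        try:
            call(**kwargs)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass


def log_event(message):
    logs.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=LOG_STREAM,
        logEvents=[{"timestamp": int(time.time() * 1000), "message": message}],
    )


ensure_log_stream()
log_event("Uploaded raw NBA data to S3")
```

A helper like `log_event` can then wrap the API call, the S3 upload, the Glue catalog updates, and the Athena queries listed above.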
Lessons Learned
This project reinforced several key DevOps and cloud computing concepts:
- Leveraging AWS services (S3, Glue, Athena, CloudWatch) to build scalable data pipelines.
- Automating workflows using GitHub Actions.
- Securing credentials and sensitive data using GitHub Secrets and `.env` files.
- Fetching and processing real-world data from an API.
- Using SQL with AWS Athena for data analysis.
- Implementing logging and monitoring with CloudWatch.
Future Enhancements
To improve the pipeline further, I plan to:
- Automate data ingestion with AWS Lambda: run the pipeline on a scheduled basis.
- Implement a data transformation layer with AWS Glue ETL: clean and enrich the data.
- Add advanced analytics and visualizations using AWS QuickSight: create dashboards for insights.
Final Thoughts
This project was an exciting challenge that combined DevOps, cloud computing, and data engineering.
Automating data collection and analysis using AWS tools has been a game-changer for me.
I'm eager to keep building and refining my skills in cloud automation and data engineering!