Overview
This project automates the provisioning of an end-to-end Azure Data Lake solution using Terraform and Python. It schedules a recurring refresh of NBA player data into Azure Data Lake Gen2 and enables querying with Synapse Analytics. The solution uses Azure Functions for scheduled data ingestion, Azure Key Vault for secret management, and Application Insights for observability.
🧰 Tools & Technologies
- Terraform: Automates deployment of all Azure resources.
- Azure Storage (Data Lake Gen2): Stores raw JSONL-formatted data.
- Azure Synapse Workspace: Enables querying data with serverless SQL pools.
- Azure Key Vault: Secures API keys and sensitive values.
- Azure Function (Python): Periodically fetches NBA data from Sportsdata.io and uploads it to the lake.
- Application Insights & Azure Monitor: Provides diagnostics, metrics, and custom logging for observability.
🧱 Infrastructure Diagram
💻 Project Structure
AzureDataLake/
├── Terraform/              # Terraform configuration
│   ├── main.tf             # Main infra config
│   ├── variables.tf        # Input variables
│   └── secrets.tfvars      # Sensitive input variables (excluded from repo)
└── myfunctionapp/          # Azure Function code
    ├── __init__.py         # Main function trigger
    ├── data_operations.py  # Handles data fetch and blob upload
    └── requirements.txt    # Python dependencies
🚀 Deployment Steps
1. Clone Repository
git clone https://github.com/kingdave4/AzureDataLake.git
cd AzureDataLake
2. Configure Secrets
Create Terraform/secrets.tfvars:
subscription_id = "<AZURE_SUBSCRIPTION_ID>"
sql_admin_password = "<SQL_ADMIN_PASSWORD>"
apikey = "<SPORTSDATA_API_KEY>"
nba_endpoint = "<NBA_API_ENDPOINT>"
sp_object_id = "<TERRAFORM_SP_OBJECT_ID>"
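These values need matching declarations in Terraform/variables.tf. A minimal sketch of what those declarations might look like (only the variable names come from the file above; the types, descriptions, and sensitive flags are assumptions):

```hcl
variable "subscription_id" {
  type        = string
  description = "Target Azure subscription."
}

variable "sql_admin_password" {
  type      = string
  sensitive = true # Keeps the value out of plan output.
}

variable "apikey" {
  type      = string
  sensitive = true # Sportsdata.io API key.
}

variable "nba_endpoint" {
  type        = string
  description = "Sportsdata.io endpoint for NBA player data."
}

variable "sp_object_id" {
  type        = string
  description = "Object ID of the service principal running Terraform."
}
```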
3. Provision Azure Resources
cd Terraform
terraform init
terraform plan -var-file="secrets.tfvars"
terraform apply -var-file="secrets.tfvars" -auto-approve
4. Deploy Azure Function
cd ../myfunctionapp
func azure functionapp publish datafunctionapp54
🧠 Azure Function Logic
Trigger: Every 10 Minutes
Defined in function.json:
"schedule": "0 */10 * * * *"
Function Code
import azure.functions as func

from data_operations import fetch_nba_data, upload_to_blob_storage

def main(mytimer: func.TimerRequest):
    # Fetch the latest player data; upload only if the fetch returned anything.
    data = fetch_nba_data("MyKeyVault", "SportsDataApiKey")
    if data:
        upload_to_blob_storage("MyKeyVault", data)
Data Operations
- Fetches player data via HTTP GET, passing the API key in a request header
- Uploads the result as JSONL to nba-datalake/raw-data/nba_player_data.jsonl (see the sketch below)
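A minimal sketch of what data_operations.py could look like, using the azure-identity, azure-keyvault-secrets, requests, and azure-storage-blob packages. The vault name, API-key secret name, container, and blob path follow this README; the NbaEndpoint and StorageConnectionString secret names and the Ocp-Apim-Subscription-Key header are assumptions:

```python
import json
import logging

import requests
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient


def _get_secret(vault_name: str, secret_name: str) -> str:
    # Resolve a secret from Key Vault using the Function App's identity.
    client = SecretClient(
        vault_url=f"https://{vault_name}.vault.azure.net",
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(secret_name).value


def fetch_nba_data(vault_name: str, api_key_secret: str):
    # HTTP GET against the Sportsdata.io endpoint with the API key header.
    api_key = _get_secret(vault_name, api_key_secret)
    endpoint = _get_secret(vault_name, "NbaEndpoint")  # assumed secret name
    response = requests.get(
        endpoint,
        headers={"Ocp-Apim-Subscription-Key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def upload_to_blob_storage(vault_name: str, data) -> None:
    # Serialize the records to JSONL and overwrite the raw-data blob.
    conn_str = _get_secret(vault_name, "StorageConnectionString")  # assumed
    jsonl = "\n".join(json.dumps(record) for record in data)
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(
        container="nba-datalake", blob="raw-data/nba_player_data.jsonl"
    )
    blob.upload_blob(jsonl, overwrite=True)
    logging.info("Uploaded %d records to nba-datalake/raw-data.", len(data))
```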
🔍 Querying with Synapse
Sample query:
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://<storage-account>.dfs.core.windows.net/nba-datalake/raw-data/nba_player_data.jsonl',
FORMAT = 'CSV',
FIELDTERMINATOR = '0x0b',
FIELDQUOTE = '0x0b'
)
WITH (
line varchar(max)
) AS [result];
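The 0x0b delimiters make the serverless SQL pool treat each JSON line as a single column, which JSON_VALUE can then unpack. An illustrative follow-up query (the property names are assumptions about the Sportsdata.io payload):

```sql
SELECT TOP 10
    JSON_VALUE(line, '$.FirstName') AS FirstName,
    JSON_VALUE(line, '$.LastName')  AS LastName,
    JSON_VALUE(line, '$.Team')      AS Team
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/nba-datalake/raw-data/nba_player_data.jsonl',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b'
) WITH (line varchar(max)) AS [result];
```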
Ensure the Synapse workspace firewall is configured to allow your client IP address.
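A rule can be added with the Azure CLI if needed (the workspace, resource group, and IP values below are placeholders):

```bash
az synapse workspace firewall-rule create \
  --name AllowMyClientIP \
  --workspace-name <synapse-workspace> \
  --resource-group <resource-group> \
  --start-ip-address <your-ip> \
  --end-ip-address <your-ip>
```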
📊 Monitoring and Logs
- Live Stream Logs:
func azure functionapp logstream datafunctionapp54
- Azure Monitor Dashboards: View traces, logs, and function metrics via Application Insights; a sample trace query is shown below.
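For example, recent function traces can be pulled in Application Insights with a KQL query such as the following (the cloud_RoleName filter assumes the function app name used above):

```kusto
traces
| where timestamp > ago(1h)
| where cloud_RoleName == "datafunctionapp54"
| order by timestamp desc
| take 50
```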
🧹 Cleanup
Tear down everything:
cd Terraform
terraform destroy -var-file="secrets.tfvars" -auto-approve
🗺️ Future Improvements
- CI/CD via Azure DevOps Pipelines
- Data Quality Checks and Validation Layer
- Additional sources like team stats and schedules
- Power BI integration for live dashboards
🔗 Repository
GitHub - kingdave4/AzureDataLake
💬 Contact
Feel free to reach out via GitHub if you have questions or suggestions!