Data Ingestion and Data Flows in ADF
This project focuses on using Azure Data Factory (ADF) to manage and process COVID-19 data. As you can see in this screenshot of the Azure portal, the resource group contains several components: ADF, an Azure SQL Database, Azure Databricks, a Log Analytics workspace, etc. Together these resources form a data pipeline architecture: ADF handles ETL, Azure SQL provides structured storage, and Databricks handles analytics.
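For reference, here is a minimal sketch (not part of the original project) of how the ADF resources can be inspected programmatically with the Azure SDK for Python; the subscription ID, resource group, and factory names below are placeholders, not the real ones.

```python
# Minimal sketch: list the pipelines inside the data factory with the Azure SDK.
# The subscription ID, resource group, and factory name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # placeholder
resource_group = "covid-reporting-rg"   # assumed name
factory_name = "covid-reporting-adf"    # assumed name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Print every pipeline defined in the factory (ingestion, data flows, copy to SQL, ...).
for pipeline in adf_client.pipelines.list_by_factory(resource_group, factory_name):
    print(pipeline.name)
```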
Framework/Architecture of this project
Data Flow
The first data flow (transform hospital admissions) focused on aggregating hospital admissions data, conditionally splitting it into weekly and daily streams, applying transformations such as pivoting and sorting, and exporting the processed data to its destination sinks.
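For illustration only, the following is a rough pandas equivalent of that logic; it is not the actual ADF data flow, and the file and column names (indicator, country, year_week, date, value) are assumptions based on an ECDC-style hospital admissions feed.

```python
# Rough pandas sketch of the "transform hospital admissions" data flow:
# conditional split into weekly vs. daily records, a pivot on the indicator
# column, a sort, and a write to the weekly and daily sinks.
import pandas as pd

df = pd.read_csv("hospital_admissions.csv")  # raw file ingested into the data lake

# Conditional split: weekly indicators vs. daily indicators.
is_weekly = df["indicator"].str.contains("Weekly", case=False)
weekly = df[is_weekly]
daily = df[~is_weekly]

# Pivot: one column per indicator, then sort before writing the weekly sink.
weekly_pivot = (
    weekly.pivot_table(index=["country", "year_week"],
                       columns="indicator",
                       values="value",
                       aggfunc="sum")
          .reset_index()
          .sort_values(["country", "year_week"])
)

weekly_pivot.to_csv("processed/hospital_admissions_weekly.csv", index=False)
daily.sort_values(["country", "date"]).to_csv(
    "processed/hospital_admissions_daily.csv", index=False)
```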
The second data flow (transform cases deaths) filtered the cases and deaths data down to Europe, performed lookups to enrich it with country-level information, and aggregated the data before exporting it to the final dataset. Key components included pivot transformations, conditional splits, lookups, and sorting, which kept the data well organized and ready for reporting and analysis.
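Similarly, a rough pandas sketch of this flow might look like the following; the column names (continent, country, indicator, daily_count, date) and the country_lookup.csv file are assumptions for illustration, not the project's exact schema.

```python
# Rough pandas sketch of the "transform cases deaths" data flow:
# filter to Europe, look up country-level details, pivot the indicator
# column, aggregate, and sort before writing the final dataset.
import pandas as pd

cases_deaths = pd.read_csv("cases_deaths.csv")
country_lookup = pd.read_csv("country_lookup.csv")  # assumed lookup file (country codes, etc.)

# Filter: keep only European records.
europe = cases_deaths[cases_deaths["continent"] == "Europe"]

# Lookup: enrich each row with country-level information from the lookup file.
enriched = europe.merge(country_lookup, on="country", how="left")

# Pivot the indicator column (e.g. confirmed cases / deaths), aggregate, and sort.
final = (
    enriched.pivot_table(index=["country", "date"],
                         columns="indicator",
                         values="daily_count",
                         aggfunc="sum")
            .reset_index()
            .sort_values(["country", "date"])
)

final.to_csv("processed/cases_deaths_europe.csv", index=False)
```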
Linked Services
Trigger
Copy processed data from data lake to MS SQL:
Once I process the data in the data lake and move it into a structured format in SQL for further analysis, reporting, and querying, it effectively becomes part of a data warehouse.
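In the project this step is an ADF Copy activity; purely as an illustration, an equivalent load with pandas and SQLAlchemy could look like the sketch below, with placeholder connection details and an assumed table name.

```python
# Illustrative sketch only: load a processed file from the data lake into
# Azure SQL so it can be queried like a small data warehouse.
# The connection string, database, and table names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://<user>:<password>@<server>.database.windows.net:1433/"
    "<database>?driver=ODBC+Driver+18+for+SQL+Server"  # placeholder connection string
)

processed = pd.read_csv("processed/cases_deaths_europe.csv")
processed.to_sql("cases_and_deaths", engine, if_exists="replace", index=False)
```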
Some problems I ran into while doing this project:
- Parameters & Variables
Parameters are external values passed into pipelines, datasets, or linked services. Their values cannot be changed inside a pipeline.
Variables are internal values set inside a pipeline. Their values can be changed within the pipeline using a Set Variable or Append Variable activity.
If we want to ingest data from the HTTP URLs listed in the JSON file, we need to set those URLs as variables (a minimal sketch of declaring parameters and variables follows this list).
- website column name changed
The data on the website keeps updating, and the columns and rows change every day. I did the project in 2022, and the column names of the ingested files have since changed.
- wrong name resulted in null data
I was confused about why the confirmed cases column was null for every row… but then I found that I had done something really silly: I had typed comnfirmed instead of confirmed.
Finally!!
- unable to access Azure HDInsight
Unfortunately, I could not create an Azure HDInsight cluster with the Student subscription. :(
- unable to get permission to create access tokens for the Databricks REST API
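To make the Parameters & Variables point above concrete, here is a minimal sketch using the azure-mgmt-datafactory models; the pipeline parameter, variable, and file names are placeholders, and this is not the project's actual pipeline definition.

```python
# Minimal sketch: a pipeline with an external parameter (read-only inside the
# pipeline) and an internal variable that a Set Variable activity can overwrite.
# All names and values are placeholders for illustration.
from azure.mgmt.datafactory.models import (
    ParameterSpecification,
    PipelineResource,
    SetVariableActivity,
    VariableSpecification,
)

pipeline = PipelineResource(
    parameters={
        # External value supplied when the pipeline is triggered; cannot be changed inside it.
        "sourceURL": ParameterSpecification(type="String")
    },
    variables={
        # Internal value; can be changed by Set Variable / Append Variable activities.
        "fileName": VariableSpecification(type="String", default_value="cases_deaths.csv")
    },
    activities=[
        # Overwrite the variable at run time.
        SetVariableActivity(
            name="SetFileName",
            variable_name="fileName",
            value="hospital_admissions.csv",
        )
    ],
)
```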