Data Ingestion and Data Flows in ADF

Azure interface: all resources

This project focuses on using Azure Data Factory (ADF) to manage and process COVID-19 data. As you can see in this screenshot of the Azure interface, there are several components: ADF, Azure SQL database, Azure Databricks, Log Analytics Workspace, etc. These resources align with a data pipeline architecture, combining ADF for ETL, SQL for structured storage, and Databricks for analytics.

Framework/Architecture of this project

Data Flow

The first data flow (transform hospital admissions) focused on aggregating hospital admissions data, conditionally splitting it into weekly and daily data streams, applying transformations such as pivoting and sorting, and exporting the processed data into destination sinks.

Data Flow: transform hospital admissions
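The split/pivot/sort logic of this data flow can be sketched in pandas. The sample rows, column names, and indicator values below are assumptions for illustration, not the actual ECDC schema:

```python
import pandas as pd

# Hypothetical sample mirroring a hospital admissions feed
# (schema and indicator names are assumptions, not the real source)
df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR"],
    "indicator": ["Weekly new hospital admissions", "Daily hospital occupancy",
                  "Weekly new hospital admissions", "Daily hospital occupancy"],
    "date": ["2022-01-03", "2022-01-03", "2022-01-03", "2022-01-03"],
    "value": [10.5, 2000.0, 8.1, 1500.0],
})

# Conditional split: weekly vs daily streams, like ADF's Conditional Split
weekly = df[df["indicator"].str.startswith("Weekly")]
daily = df[df["indicator"].str.startswith("Daily")]

# Pivot one column per indicator, then sort, like the Pivot + Sort transforms
weekly_pivot = (weekly.pivot_table(index=["country", "date"],
                                   columns="indicator", values="value")
                      .reset_index()
                      .sort_values(["country", "date"]))
print(weekly_pivot)
```

In ADF the same steps are configured visually, but the intermediate shapes are the same: two filtered streams, then a pivoted, sorted table per stream written to its sink.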

The second data flow (transform cases deaths) filtered case and death data specifically for Europe, performed lookups to enrich country-level information, and aggregated the data before exporting it to the final dataset. Key components included pivot transformations, conditional splitting, lookups, and sorting, which ensured efficient data organization and readiness for reporting and analysis.

Data Flow: transform cases deaths
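The filter/lookup/aggregate pattern of this flow maps to a join in pandas. The feed schema and the country-lookup columns below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical cases/deaths feed (assumed schema)
cases = pd.DataFrame({
    "country": ["Germany", "France", "Brazil"],
    "continent": ["Europe", "Europe", "America"],
    "cases": [100, 80, 120],
    "deaths": [5, 3, 7],
})

# Hypothetical country lookup file used to enrich the stream
country_lookup = pd.DataFrame({
    "country": ["Germany", "France"],
    "country_code_3_digit": ["DEU", "FRA"],
    "population": [83_000_000, 67_000_000],
})

# Filter: keep only Europe, like the ADF Filter transform
europe = cases[cases["continent"] == "Europe"]

# Lookup: enrich with country codes and population (left join)
enriched = europe.merge(country_lookup, on="country", how="left")

# Aggregate and sort before writing to the sink
summary = (enriched.groupby("country", as_index=False)[["cases", "deaths"]]
                   .sum()
                   .sort_values("country"))
print(summary)
```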

Linked Services

Trigger

trigger
trigger dependency

Copy processed data from data lake to MS SQL:


Once I process my data in a data lake and move it into a structured format in SQL for further analysis, reporting, and querying, it effectively becomes part of a data warehouse.
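The copy step is an ADF Copy activity. A minimal sketch of its JSON definition is shown below; the dataset references, source/sink types, and table name are assumptions for illustration, not the project's actual configuration:

```json
{
  "name": "CopyProcessedDataToSQL",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": {
      "type": "AzureSqlSink",
      "preCopyScript": "TRUNCATE TABLE dbo.cases_and_deaths"
    }
  },
  "inputs": [
    { "referenceName": "ds_processed_cases_deaths", "type": "DatasetReference" }
  ],
  "outputs": [
    { "referenceName": "ds_sql_cases_deaths", "type": "DatasetReference" }
  ]
}
```

The optional `preCopyScript` truncates the target table first so re-runs do not duplicate rows.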

Some problems I ran into while doing this project:

  • Parameters & Variables

Parameters are external values passed into pipelines, datasets, or linked services. Their values cannot be changed inside a pipeline run.

Variables are internal values set inside a pipeline. Their values can be changed inside the pipeline using the Set Variable or Append Variable activity.


If we want to ingest an HTTP URL defined in the JSON file, we need to set it in a variable.
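The distinction looks like this in a pipeline's JSON definition. This is a minimal sketch; the pipeline, parameter, and variable names are assumptions for illustration:

```json
{
  "name": "pl_ingest_ecdc_data",
  "properties": {
    "parameters": {
      "sourceBaseURL": { "type": "string" }
    },
    "variables": {
      "sourceRelativeURL": { "type": "String" }
    },
    "activities": [
      {
        "name": "Set relative URL",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "sourceRelativeURL",
          "value": {
            "value": "@item().sourceRelativeURL",
            "type": "Expression"
          }
        }
      }
    ]
  }
}
```

Here `sourceBaseURL` is a parameter fixed for the whole run, while `sourceRelativeURL` is a variable that a Set Variable activity can update for each file, e.g. inside a ForEach loop.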

  • website column name changed

The source website keeps updating, and its columns and rows change over time. I did the project in 2022, and the column names of the ingested files have since changed.
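One way to catch this kind of schema drift early is a quick column check before the pipeline runs its transformations. A minimal sketch, where the expected column names are assumptions for illustration:

```python
import pandas as pd

# Columns the 2022 pipeline expected (an assumed schema for illustration)
expected_columns = {"country", "date", "confirmed_cases", "deaths"}

# Simulate a newly ingested file whose schema has drifted
ingested = pd.DataFrame(columns=["country", "date", "daily_count", "deaths"])

missing = expected_columns - set(ingested.columns)
unexpected = set(ingested.columns) - expected_columns

if missing:
    # Fail fast instead of letting downstream mappings silently produce nulls
    print(f"Schema drift: missing {sorted(missing)}, new {sorted(unexpected)}")
```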

  • wrong name resulted in null data

I was confused about why the confirmed cases column was null for all rows… but then I found I had done something really silly: I had typed comnfirmed.
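The failure mode is easy to reproduce: mapping to a misspelled column name does not raise an error, it just yields nulls. A small pandas illustration (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"country": ["DE", "FR"], "confirmed": [100, 80]})

# Selecting a misspelled column name (like mapping "comnfirmed" in the sink)
# silently produces an all-null column instead of failing loudly.
out = df.reindex(columns=["country", "comnfirmed"])
print(out["comnfirmed"].isna().all())  # True: every row is null
```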


Finally!!

  • unable to access Azure HDInsight

Unfortunately, I cannot create an Azure HDInsight cluster with the Student subscription. :(

Reason: with an Azure Student subscription, you initially have access only to Azure services that offer a free tier.

  • unable to get permission to generate access tokens for the Databricks REST API