Azure Data Factory - Copy Activity
A beginner-friendly guide to copying and transforming data using Azure Data Factory
What is Azure Data Factory?
Azure Data Factory (ADF) is a cloud-based ETL (Extract, Transform, Load) and data integration service that enables users to create workflows for orchestrating data movement and transformation. It provides a largely code-free, visual interface, making it accessible to both technical and non-technical users. ADF is widely used for cloud-scale data migration, scheduled batch pipelines, and hybrid data integration scenarios.
One of the most widely used features of ADF is the Copy Activity, which transfers data between supported data stores. Whether you are moving data from an on-premises SQL Server to Azure Blob Storage or from one cloud service to another, the Copy Activity moves it reliably and efficiently.
Step 1: Create a Data Factory
- Go to the Azure Portal and search for "Data Factory".
- Click on Create and fill in the basic details like Subscription, Resource Group, and Data Factory Name.
- Choose your region and proceed with the default options or set up Git integration and network rules as required.
- Click Review + Create and then Create to deploy your Data Factory.
- Once deployed, click Go to Resource and launch the Data Factory Studio.
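If you prefer to script this step rather than click through the portal, the Azure management SDK for Python (azure-mgmt-datafactory, together with azure-identity) can create the same resource. The following is a minimal sketch, not a definitive recipe: the subscription ID, resource group, factory name, and region are placeholders you would replace with your own values.

```python
# A minimal sketch of creating a Data Factory with the Python management SDK.
# Assumes the azure-identity and azure-mgmt-datafactory packages are installed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-rg"                     # assumed existing resource group
factory_name = "my-data-factory"             # must be globally unique

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Deploy the Data Factory in a chosen region (the scripted equivalent of Review + Create).
factory = adf_client.factories.create_or_update(
    resource_group,
    factory_name,
    Factory(location="eastus"),
)
print(factory.provisioning_state)  # "Succeeded" once deployment completes
```

The adf_client object, resource group, and factory name defined here are reused in the later sketches.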
Step 2: Create a Pipeline
Pipelines in Azure Data Factory are used to group activities like the Copy Activity. To create one:
- In ADF Studio, go to the Author tab and click + New pipeline.
- Give your pipeline a meaningful name, e.g., CopyBlobToBlob.
- Add a Copy Data activity to the pipeline canvas.
- Remember: A pipeline can contain up to 80 activities (including the inner activities of container activities), and at least one activity is required to publish it.
This visual designer allows you to drag and drop components and easily configure them without writing code. You can define triggers, set schedules, and even parameterize your pipeline to increase reusability.
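For readers curious about what the drag-and-drop result looks like programmatically, here is a minimal sketch of the same pipeline built with the adf_client from Step 1. The dataset names SourceCsv and SinkCsv are hypothetical placeholders that are only defined in Step 4, and exact model class names can vary between SDK versions.

```python
# A minimal sketch of the CopyBlobToBlob pipeline with a single Copy Activity.
# SourceCsv and SinkCsv are hypothetical dataset names created in Step 4.
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference,
    DelimitedTextSource, DelimitedTextSink, DelimitedTextWriteSettings,
)

copy_activity = CopyActivity(
    name="CopyCsvBetweenContainers",
    inputs=[DatasetReference(reference_name="SourceCsv", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="SinkCsv", type="DatasetReference")],
    source=DelimitedTextSource(),  # read settings for the source dataset
    sink=DelimitedTextSink(        # write settings for the sink dataset
        format_settings=DelimitedTextWriteSettings(file_extension=".csv")
    ),
)

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(resource_group, factory_name, "CopyBlobToBlob", pipeline)
```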
Step 3: Configure Linked Services
Linked services define the connection information to your source and destination data stores. For example:
- To read from Azure Blob Storage, create a linked service that includes your storage account credentials.
- Repeat the process for your target data store (e.g., another Blob container or a SQL database).
Linked services act as bridges between your Data Factory and external data stores. ADF provides built-in connectors for popular services such as Amazon S3, Salesforce, and SQL Server, among many others.
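As a scripted counterpart, the sketch below registers a Blob Storage linked service using the same adf_client. The connection string is a placeholder; in a real project you would typically reference the secret from Azure Key Vault or use a managed identity rather than embedding an account key.

```python
# A minimal sketch of registering an Azure Blob Storage linked service.
# The connection string below is a placeholder, not a working secret.
from azure.mgmt.datafactory.models import LinkedServiceResource, AzureBlobStorageLinkedService

blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    )
)
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "BlobStorageLinkedService", blob_ls
)
```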
Step 4: Set Up Datasets
Datasets represent the data you want to move. Each dataset should be linked to one of the services you just defined.
- Create a source dataset (e.g., CSV from Blob Storage).
- Create a target dataset (e.g., another folder or container).
- Specify file format, container name, path, and file name pattern if needed.
You can also define schema, delimiter, encoding, and compression type to tailor the Copy Activity for your data.
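The sketch below defines the source dataset from the bullet points above as a delimited-text (CSV) dataset. The container, folder, and file names are placeholders; the sink dataset is created the same way with a different location.

```python
# A minimal sketch of a delimited-text (CSV) dataset pointing at a Blob container.
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureBlobStorageLocation, LinkedServiceReference,
)

source_ds = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="BlobStorageLinkedService", type="LinkedServiceReference"
        ),
        location=AzureBlobStorageLocation(
            container="input", folder_path="raw", file_name="data.csv"
        ),
        column_delimiter=",",       # field delimiter
        first_row_as_header=True,   # treat the first row as the header
    )
)
adf_client.datasets.create_or_update(resource_group, factory_name, "SourceCsv", source_ds)
# Repeat with a different location (e.g. container="output") to create the SinkCsv dataset.
```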
Step 5: Execute and Monitor
Once everything is set up:
- Click Debug to test your pipeline before publishing.
- After successful debugging, click Publish All to save your pipeline to the Data Factory service.
- Go to the Monitor tab to track pipeline executions and check for errors or performance metrics.
The Monitor tab offers detailed logs, including throughput, duration, and error traces, making it easier to debug and improve performance.
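Outside the Studio, the same run-and-monitor loop can be sketched with the SDK: trigger a run of the published pipeline, poll its status, and then query the activity-level results that the Monitor tab also displays. The time window and polling interval below are arbitrary illustrative choices.

```python
# A minimal sketch of triggering a published pipeline run and polling its status.
import time
from datetime import datetime, timedelta, timezone
from azure.mgmt.datafactory.models import RunFilterParameters

run = adf_client.pipelines.create_run(resource_group, factory_name, "CopyBlobToBlob", parameters={})

# Poll until the run finishes (the Monitor tab shows the same information).
while True:
    pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print(pipeline_run.status, pipeline_run.message)

# Activity-level details (duration, errors, copy metrics) can also be queried.
filter_params = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(hours=1),
    last_updated_before=datetime.now(timezone.utc) + timedelta(hours=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    resource_group, factory_name, run.run_id, filter_params
)
for act in activity_runs.value:
    print(act.activity_name, act.status)
```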
Tips for Beginners
- Always publish your changes to persist the configuration in Azure.
- Use naming conventions for pipelines, datasets, and linked services to keep your project organized.
- Choose the appropriate Integration Runtime (Azure, self-hosted, or Azure-SSIS) when activities need to run in different regions or reach private networks.
- Use parameters and variables to create dynamic and reusable pipelines.
- Take advantage of the activity dependency feature to control the execution flow.
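To illustrate the last two tips, the hypothetical sketch below adds a pipeline parameter and chains a second activity that only runs after the copy succeeds. The Wait activity and the sourceFolder parameter are purely illustrative; in a real pipeline you would reference the parameter in expressions such as @pipeline().parameters.sourceFolder.

```python
# A hedged sketch of pipeline parameters and activity dependencies.
# The Wait activity and the sourceFolder parameter are illustrative only.
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, WaitActivity, ActivityDependency,
)

wait_after_copy = WaitActivity(
    name="WaitAfterCopy",
    wait_time_in_seconds=30,
    # Run only if the copy activity from Step 2's sketch succeeds.
    depends_on=[ActivityDependency(activity="CopyCsvBetweenContainers",
                                   dependency_conditions=["Succeeded"])],
)

parameterized_pipeline = PipelineResource(
    parameters={"sourceFolder": ParameterSpecification(type="String")},
    activities=[copy_activity, wait_after_copy],  # copy_activity from Step 2's sketch
)
adf_client.pipelines.create_or_update(
    resource_group, factory_name, "CopyBlobToBlobParameterized", parameterized_pipeline
)
```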