Post on 29-May-2020
Data flows in Azure Data Factory –
finally transformations here too!
PASS Camp 2018
Stefan Kirner, 8.2.2018
Stefan Kirner, Team Lead Business Intelligence Solutions @ inovex GmbH
› More than 15 years of experience with the Microsoft Business Intelligence toolset
› MCSE for Data Management & Analytics & Cloud Infrastructure
› Microsoft P-TSP Data Platform
› RGV SQL PASS e.V. Community Karlsruhe
› Speaker at conferences and user groups on BI and cloud topics
› Mail: stefan.kirner@inovex.de
› Twitter: @KirnerKa
Agenda
1. Intro Data Factory v2
2. Control Flow & Triggers
3. Data Flow
4. Roadmap & Q+A
New Pipeline Model: rich pipeline orchestration
› Triggers – on-demand, schedule, event
Data Movement as a Service: cloud, hybrid
› 30 connectors provided
Data Flow as NEW data transformation layer
SSIS package execution in a managed cloud environment
› Use familiar tools, SSMS & SSDT
Author & Monitor: programmability (Python, .NET, PowerShell, etc.)
› Visual tools for Control Flow and NEW: Data Flow
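Under the hood, these pipelines are authored as JSON artifacts (whether you use the visual tools or the Python/.NET/PowerShell SDKs). A minimal sketch of what such a pipeline definition looks like, with a single Copy activity — all names and dataset references here are hypothetical placeholders, not from the talk:

```python
import json

# Minimal ADF v2 pipeline definition as a JSON-shaped dict.
# Pipeline, activity and dataset names are hypothetical.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobSalesDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlSalesDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Such a definition can be deployed via the REST API, ARM templates, or the SDKs mentioned above.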
Data Factory Essentials: artifacts in Data Factory
› Pipeline
› Data Flow (NEW activity!)
› SSIS package
New capabilities for data integration in the cloud, Mike Flasko at Ignite 2017, https://myignite.microsoft.com
ADFv2 Pipelines
[Diagram: pipelines such as "My Pipeline 1" and "My Pipeline 2" compose activities (Data Flow, For Each, "On Error" handlers) that pass params between each other; triggers (event, wall clock, on demand) start the pipelines]
Data Movement aka "Copy Activity"
Scalable:
› Per-job elasticity (cloud data movement units)
› Up to 1 GB/s
› Set parallelism of threads used in source and target
Simple:
› Visually author or via code (Python, .NET, etc.)
› Serverless, no infrastructure to manage
› Staged copy (compress/decompress in hybrid scenarios, SQL DW load using PolyBase, bypass firewall restrictions)
Access all your data:
› 30+ connectors provided and growing (cloud, on premises, SaaS)
› Data Movement as a Service: 17 points of presence worldwide
› Self-hostable Integration Runtime for hybrid movement
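The scalability options above correspond to settings on the Copy activity itself. A sketch of the relevant `typeProperties`, with purely illustrative values (assuming the later property naming, where "data integration units" replaced "cloud data movement units"):

```python
# Sketch of Copy activity typeProperties showing the tuning knobs
# mentioned above. All values are illustrative, not recommendations.
copy_type_properties = {
    "source": {"type": "BlobSource"},
    "sink": {"type": "SqlDWSink", "allowPolyBase": True},  # SQL DW load via PolyBase
    "dataIntegrationUnits": 32,  # per-job elasticity
    "parallelCopies": 8,         # parallelism of source/target threads
    "enableStaging": True,       # staged copy through an interim store
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStore",  # hypothetical linked service
            "type": "LinkedServiceReference",
        }
    },
}
```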
Activities: known from v1 – data transformation activities
Data transformation activity → compute environment:
› Hive → HDInsight [Hadoop]
› Pig → HDInsight [Hadoop]
› MapReduce → HDInsight [Hadoop]
› Hadoop Streaming → HDInsight [Hadoop]
› Spark → HDInsight [Hadoop]
› Machine Learning activities (Batch Execution and Update Resource) → Azure VM
› Stored Procedure → Azure SQL, Azure SQL Data Warehouse, or SQL Server
› U-SQL → Azure Data Lake Analytics
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity and others in How-To > Control Flow
Activities: New! Control Flow Activities (not exhaustive)
› Execute Pipeline Activity – allows a Data Factory pipeline to invoke another pipeline.
› ForEach Activity – iterates over a collection and executes the specified activities in a loop.
› Web Activity – calls a custom REST endpoint and can pass datasets and linked services.
› Lookup Activity – looks up a record/table name/value from any external source to be referenced by succeeding activities. Can be used for incremental loads!
› Get Metadata Activity – retrieves metadata of any data in Azure Data Factory, e.g. whether another pipeline finished.
› Until Activity – similar to a do-until looping structure in programming languages.
› If Condition Activity – does something based on a condition that evaluates to true or false.
› Execute SSIS Package Activity – executes an SSIS package.
https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/
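To illustrate how these control-flow activities compose, here is a hedged sketch of a ForEach activity that runs an inner Copy activity per item; the activity names, the `tableList` parameter and the query are hypothetical:

```python
# Sketch: a ForEach activity iterating a parameterized table list and
# running an inner Copy activity per item. All names are hypothetical.
for_each_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        # the collection to iterate, supplied as a pipeline parameter
        "items": {"value": "@pipeline().parameters.tableList", "type": "Expression"},
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        # @{item()} refers to the current loop item
                        "sqlReaderQuery": {
                            "value": "SELECT * FROM @{item().name}",
                            "type": "Expression",
                        },
                    },
                    "sink": {"type": "BlobSink"},
                },
            }
        ],
    },
}
```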
Activities: Concepts

Demo: Control Flow – get all data of a system by metadata
[Diagram: a "check metadata" pipeline executes a For Each activity over each table in the source, which in turn executes a Copy activity moving the data to Data Lake / Blob Store]
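The demo pattern above can be outlined as two chained activities: a Lookup that returns the table list, and a dependent ForEach that copies each table. A sketch with hypothetical names and a sample metadata query:

```python
# Outline of the metadata-driven demo pipeline: a Lookup activity fetches
# the table list, then a ForEach copies each table to Data Lake / Blob Store.
# All names are hypothetical.
activities = [
    {
        "name": "GetTableList",
        "type": "Lookup",
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "SELECT name FROM sys.tables",
            },
            "dataset": {"referenceName": "SourceSqlDataset", "type": "DatasetReference"},
            "firstRowOnly": False,  # return the whole table list, not one row
        },
    },
    {
        "name": "ForEachTable",
        "type": "ForEach",
        # dependsOn chains the ForEach after the Lookup succeeds
        "dependsOn": [{"activity": "GetTableList", "dependencyConditions": ["Succeeded"]}],
        "typeProperties": {
            "items": {"value": "@activity('GetTableList').output.value", "type": "Expression"},
            "activities": [{"name": "CopyTableToLake", "type": "Copy"}],
        },
    },
]
```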
Triggers: how do pipelines get started
1. On-demand
2. Wall-clock schedule
3. Tumbling window (aka time slices in v1)
4. Event on Blob Store
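As an example of the second kind, a wall-clock schedule trigger is itself a small JSON artifact that references the pipelines it starts. A sketch with illustrative names and times:

```python
# Sketch of a wall-clock schedule trigger that runs a pipeline daily.
# Trigger name, pipeline name and start time are illustrative.
schedule_trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-02-08T06:00:00Z",
                "timeZone": "UTC",
            }
        },
        # pipelines this trigger starts, with optional run parameters
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesPipeline",
                    "type": "PipelineReference",
                },
                "parameters": {},
            }
        ],
    },
}
```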
Visual Data Flows
ADF Data Flow Overview: code-free data transformation at scale
1. Does not require understanding of Spark, big data execution engines, clusters, Scala, Python …
2. Focus on building business logic and data transformation:
› Data cleansing
› Aggregation
› Data conversions
› Data prep
› Data exploration
ADF Data Flow Workstream: Data Sources → Staging → Transformations → Destination (Sort, Merge, Join, Lookup …)

Data Sources:
• Explicit user action
• User places data source(s) on the design surface, from the toolbox
• Select explicit sources

Staging:
• Implicit/explicit
• Data Lake staging area as default
• User does not need to configure this manually
• Advanced feature to set staging area options
• File formats / types (Parquet, JSON, txt, CSV …)

Transformations:
• Explicit user action
• User places transformations on the design surface, from the toolbox
• User must set properties for transformation steps and step connectors

Destination:
• Explicit user action
• User chooses destination connector(s)
• User sets connector property options
Data Flow started – what happens in Databricks?
Modern DWH Layers
[Diagram: Ingest → Store → Processing → Store Results → Presentation. Business/custom apps (structured) and logs, files and media (unstructured) are ingested by Data Factory into Azure Data Lake Store; structured data / master data lands in a SQL Database. Azure Databricks (Spark) processes the data; results are stored in a SQL Database data warehouse and, optionally (depending on the data), an Analysis Services semantic model. Power BI serves analytical dashboards, with Data Science workloads and use cases running alongside the data model.]
Visual Data Flow Key Tenets
1. Visual "Data Flow Builder" / "Data Mapping"
2. Extensible through scripting and expressions
3. Data Flow can be embedded into ISV / SaaS apps
› Embed UI
› Embed parameterized Data Flows
4. A graphical UI for building data transformation routines on Spark
5. Built for resiliency and operationalized environments
https://www.youtube.com/watch?v=vSTn_aGq3C8 – video on the feature by Mark Kromer
Handling Schema Changes
Data Flow will accept both columns (here in a Derived Column)
Use Auto Mapping in the Sink
Overview of current transformations (1/2)
› Source – lets you configure the sources you want to bring into the data flow. A data flow can have one or more sources, and should be able to connect to any source type supported by ADF. Note: the private preview only supports Blob.
› Select – lets you select a list of columns from the input stream to pass to the next transformation, and allows assigning alias names to columns.
› Derived Column – lets you create new columns or modify existing ones. It supports a wide range of data manipulation functions.
› Filter – restricts the rows based on a filter expression.
› Conditional Split – splits the input stream into n output streams based on condition expressions. Rows not matching any condition are routed to the default output.
› Sort – sorts the rows based on order rules. The output rows keep this order in the subsequent transformation. It has further options such as a "case sensitive" toggle, computed columns, etc.
› Aggregate – lets you define aggregation functions over group-by columns. You can also build custom expressions in this transform.
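Since Data Flows compile to Spark jobs, the row-level semantics of Derived Column, Filter and Aggregate can be modeled in a few lines of plain Python on a tiny in-memory dataset — a conceptual sketch only, not the ADF or Spark API:

```python
# Conceptual model of Derived Column, Filter and Aggregate semantics.
# Plain Python for illustration; the actual execution happens on Spark.
rows = [
    {"region": "EMEA", "amount": 100},
    {"region": "EMEA", "amount": 50},
    {"region": "US", "amount": 70},
]

# Derived Column: add or modify columns per row
derived = [{**r, "amount_eur": r["amount"] * 0.9} for r in rows]

# Filter: keep only rows matching an expression
filtered = [r for r in derived if r["amount"] >= 60]

# Aggregate: group by a column and apply an aggregation function
totals = {}
for r in filtered:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount_eur"]

print(totals)  # {'EMEA': 90.0, 'US': 63.0}
```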
Overview of current transformations (2/2)
› Exists – filters rows in one input stream based on another stream. You can apply either an "Exists" or a "Not Exists" condition.
› Lookup – looks up values from another stream based on a lookup condition. It works like an inner join.
› Joiner – joins two streams based on a condition. You can perform inner, left, right, outer and cross joins.
› Union – merges two streams into a single stream.
› New Branch – replicates the current stream, so you can create a new stream or a copy of a stream.
› Output – the Output sink writes the data to any kind of storage supported by ADF. Note: the private preview only supports Blob and SQL DW.
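Similarly, the semantics of Exists, Lookup/Joiner and Union can be modeled conceptually in plain Python (not the ADF API; the data is made up):

```python
# Conceptual model of Exists, Lookup/Joiner and Union semantics on two
# tiny in-memory streams. Plain Python for illustration only.
orders = [{"id": 1, "cust": "A"}, {"id": 2, "cust": "B"}, {"id": 3, "cust": "C"}]
customers = [{"cust": "A", "name": "Alice"}, {"cust": "B", "name": "Bob"}]

# Exists: keep order rows whose customer appears in the other stream
known = {c["cust"] for c in customers}
exists = [o for o in orders if o["cust"] in known]

# Lookup / inner Joiner: enrich matching rows with the other stream's columns
by_cust = {c["cust"]: c for c in customers}
joined = [{**o, "name": by_cust[o["cust"]]["name"]} for o in orders if o["cust"] in by_cust]

# Union: merge two streams into a single stream
union = orders + [{"id": 4, "cust": "D"}]

print(len(exists), len(joined), len(union))  # 2 2 4
```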
Data Flow Limited Preview Support & SLAs
• Azure SLAs do not apply to preview services (private or public preview) until GA of the service.
• Limited preview support: handled directly with the Azure engineering team via adfdataflowext@microsoft.com. Turn-around time on fixing issues during private preview depends on access to customer data sources and customer Databricks clusters for RCA and debugging.
• Public preview support: normal Azure customer service channels.
Demo: Data Flow – visual design of transformations in Spark
Conclusion: Data Flow in ADF is cool because…
• visual design for fast learning & understanding
• ETL using Spark technology in the background, which can scale (but does not have to in every case)
• Azure Databricks as an elastic processing engine for different workloads, tools and user groups
• integration into the Control Flow enables modelling dependencies and cost-efficient orchestration of Azure resources
Hands-on labs
[Decision flow: "Ever done anything with Data Factory before?" – "Sure, I'm a veteran" vs. "Data, what?"; "Subscription in the preview program for Data Factory?" – Yes! → HOL 4 Data Flow. Start: setting up Azure resources; then HOL 1 Copy Task, HOL 2 Trigger, HOL 3 Control Flow, HOL 4 Data Flow]
Step 1: Azure Basics – set up resources on Azure
HOL 1/2: Data Movement & Trigger – Copy Activity
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity
[Diagram: an on-demand trigger runs a pipeline whose Copy activity moves data from an input flat file to a SQL table; the source side uses a Blob Store linked service with a container + flat file dataset, the sink side an Azure SQL DB linked service with a table dataset]
HOL 3: Control Flow – get all data of a system by metadata
[Diagram: a "check metadata" pipeline executes a For Each activity over each table in the source, which in turn executes a Copy activity moving the data to Data Lake / Blob Store]
HOL 4: Data Flow – visual design of transformations in Spark
Attention: use Databricks Runtime 5.0, which includes Apache Spark 2.4.0 and Scala 2.11
Links and further information
1. Microsoft documentation: https://docs.microsoft.com/en-us/azure/data-factory/
2. Azure Data Factory – Data Flows preview documentation: https://github.com/kromerm/adfdataflowdocs
3. Cool screencasts about Data Flows: https://github.com/kromerm/adfdataflowdocs/tree/master/videos
4. Another good blog post about ADF Data Flows: https://visualbi.com/blogs/microsoft/azure/azure-data-factory-data-flow-activity/
5. Comparison of ADF Data Flows vs. SSIS vs. T-SQL: https://sqlplayer.net/2018/12/azure-data-factory-v2-and-its-available-components-in-data-flows/
inovex is an IT project house focused on digital transformation:
Product ownership · data products · web · apps · smart devices · BI · big data · data science · search · replatforming · cloud · DevOps · data center automation & hosting · trainings · coachings
We use technology to make our customers happy. And ourselves.
inovex has offices in Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg
And, of course, at www.inovex.de
Thank you very much
Stefan Kirner
Head of BI Solutions
inovex GmbH
Ludwig-Erhard-Allee 6
76131 Karlsruhe
s.kirner@inovex.de
0173 3181 012