Post on 29-May-2020
Data flows in Azure Data Factory –
finally transformations here too!
PASS Camp 2018
Stefan Kirner, 8.2.2018
Stefan Kirner, Team Lead Business Intelligence Solutions @ inovex GmbH
› More than 15 years of experience with the Microsoft Business Intelligence toolset
› MCSE for Data Management & Analytics & Cloud Infrastructure
› Microsoft P-TSP Data Platform
› RGV SQL PASS e.V. Community Karlsruhe
› Speaker at conferences and user groups on BI and cloud topics
› Mail: stefan.kirner@inovex.de
› Twitter: @KirnerKa
Agenda
1. Intro Data Factory v2
2. Control Flow & Triggers
3. Data Flow
4. Roadmap & Q+A
New Pipeline Model: rich pipeline orchestration
› Triggers – on-demand, schedule, event
Data Movement as a Service: cloud, hybrid
› 30 connectors provided
Data Flow as NEW data transformation layer
SSIS package execution in a managed cloud environment
› Use familiar tools, SSMS & SSDT
Author & Monitor: programmability (Python, .NET, PowerShell, etc.)
› Visual tools for Control Flow and NEW: Data Flow
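Under the hood, these pipelines are authored as JSON artifacts (whether you use the visual tools or the Python/.NET/PowerShell SDKs). A minimal sketch of what such a pipeline definition looks like, with a single Copy activity — all names and dataset references here are hypothetical placeholders, not from the talk:

```python
import json

# Minimal ADF v2 pipeline definition as a JSON-shaped dict.
# Pipeline, activity and dataset names are hypothetical.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",
                "inputs": [{"referenceName": "BlobSalesDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlSalesDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Such a definition can be deployed via the REST API, ARM templates, or the SDKs mentioned above.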
Data Factory Essentials: artifacts in Data Factory
› Pipeline
› Data Flow (NEW activity!)
› SSIS package
New capabilities for data integration in the cloud, Mike Flasko at Ignite 2017, https://myignite.microsoft.com
ADFv2 Pipelines
[Diagram: pipelines such as "My Pipeline 1" and "My Pipeline 2" compose activities (Data Flow, For Each, "On Error" handlers) that pass params between each other; triggers (event, wall clock, on demand) start the pipelines]
Data Movement aka "Copy Activity"
Scalable:
› Per-job elasticity (cloud data movement units)
› Up to 1 GB/s
› Set parallelism of threads used in source and target
Simple:
› Visually author or via code (Python, .NET, etc.)
› Serverless, no infrastructure to manage
› Staged copy (compress/decompress in hybrid scenarios, SQL DW load using PolyBase, bypass firewall restrictions)
Access all your data:
› 30+ connectors provided and growing (cloud, on premises, SaaS)
› Data Movement as a Service: 17 points of presence worldwide
› Self-hostable Integration Runtime for hybrid movement
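The scalability options above correspond to settings on the Copy activity itself. A sketch of the relevant `typeProperties`, with purely illustrative values (assuming the later property naming, where "data integration units" replaced "cloud data movement units"):

```python
# Sketch of Copy activity typeProperties showing the tuning knobs
# mentioned above. All values are illustrative, not recommendations.
copy_type_properties = {
    "source": {"type": "BlobSource"},
    "sink": {"type": "SqlDWSink", "allowPolyBase": True},  # SQL DW load via PolyBase
    "dataIntegrationUnits": 32,  # per-job elasticity
    "parallelCopies": 8,         # parallelism of source/target threads
    "enableStaging": True,       # staged copy through an interim store
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStore",  # hypothetical linked service
            "type": "LinkedServiceReference",
        }
    },
}
```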
Activities: known from v1 – data transformation activities
Data transformation activity → compute environment:
› Hive → HDInsight [Hadoop]
› Pig → HDInsight [Hadoop]
› MapReduce → HDInsight [Hadoop]
› Hadoop Streaming → HDInsight [Hadoop]
› Spark → HDInsight [Hadoop]
› Machine Learning activities (Batch Execution and Update Resource) → Azure VM
› Stored Procedure → Azure SQL, Azure SQL Data Warehouse, or SQL Server
› U-SQL → Azure Data Lake Analytics
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity and others in How-To > Control Flow
Activities: New! Control Flow Activities (not exhaustive)
› Execute Pipeline Activity – allows a Data Factory pipeline to invoke another pipeline.
› ForEach Activity – iterates over a collection and executes the specified activities in a loop.
› Web Activity – calls a custom REST endpoint and can pass datasets and linked services.
› Lookup Activity – looks up a record/table name/value from any external source to be referenced by succeeding activities. Can be used for incremental loads!
› Get Metadata Activity – retrieves metadata of any data in Azure Data Factory, e.g. whether another pipeline finished.
› Until Activity – similar to a do-until looping structure in programming languages.
› If Condition Activity – does something based on a condition that evaluates to true or false.
› Execute SSIS Package Activity – executes an SSIS package.
https://www.purplefrogsystems.com/paul/2017/09/whats-new-in-azure-data-factory-version-2-adfv2/
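To illustrate how these control-flow activities compose, here is a hedged sketch of a ForEach activity that runs an inner Copy activity per item; the activity names, the `tableList` parameter and the query are hypothetical:

```python
# Sketch: a ForEach activity iterating a parameterized table list and
# running an inner Copy activity per item. All names are hypothetical.
for_each_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        # the collection to iterate, supplied as a pipeline parameter
        "items": {"value": "@pipeline().parameters.tableList", "type": "Expression"},
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        # @{item()} refers to the current loop item
                        "sqlReaderQuery": {
                            "value": "SELECT * FROM @{item().name}",
                            "type": "Expression",
                        },
                    },
                    "sink": {"type": "BlobSink"},
                },
            }
        ],
    },
}
```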
Activities: Concepts

Demo: Control Flow – get all data of a system by metadata
[Diagram: a "check metadata" pipeline executes a For Each activity over each table in the source, which in turn executes a Copy activity moving the data to Data Lake / Blob Store]
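The demo pattern above can be outlined as two chained activities: a Lookup that returns the table list, and a dependent ForEach that copies each table. A sketch with hypothetical names and a sample metadata query:

```python
# Outline of the metadata-driven demo pipeline: a Lookup activity fetches
# the table list, then a ForEach copies each table to Data Lake / Blob Store.
# All names are hypothetical.
activities = [
    {
        "name": "GetTableList",
        "type": "Lookup",
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "SELECT name FROM sys.tables",
            },
            "dataset": {"referenceName": "SourceSqlDataset", "type": "DatasetReference"},
            "firstRowOnly": False,  # return the whole table list, not one row
        },
    },
    {
        "name": "ForEachTable",
        "type": "ForEach",
        # dependsOn chains the ForEach after the Lookup succeeds
        "dependsOn": [{"activity": "GetTableList", "dependencyConditions": ["Succeeded"]}],
        "typeProperties": {
            "items": {"value": "@activity('GetTableList').output.value", "type": "Expression"},
            "activities": [{"name": "CopyTableToLake", "type": "Copy"}],
        },
    },
]
```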
Triggers: how do pipelines get started
1. On-demand
2. Wall-clock schedule
3. Tumbling window (aka time slices in v1)
4. Event on Blob Store
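As an example of the second kind, a wall-clock schedule trigger is itself a small JSON artifact that references the pipelines it starts. A sketch with illustrative names and times:

```python
# Sketch of a wall-clock schedule trigger that runs a pipeline daily.
# Trigger name, pipeline name and start time are illustrative.
schedule_trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-02-08T06:00:00Z",
                "timeZone": "UTC",
            }
        },
        # pipelines this trigger starts, with optional run parameters
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesPipeline",
                    "type": "PipelineReference",
                },
                "parameters": {},
            }
        ],
    },
}
```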
Visual Data Flows
ADF Data Flow Overview: code-free data transformation at scale
1. Does not require understanding of Spark, big data execution engines, clusters, Scala, Python …
2. Focus on building business logic and data transformation:
› Data cleansing
› Aggregation
› Data conversions
› Data prep
› Data exploration
ADF Data Flow Workstream: Data Sources → Staging → Transformations → Destination (Sort, Merge, Join, Lookup …)

Data Sources:
• Explicit user action
• User places data source(s) on the design surface, from the toolbox
• Select explicit sources

Staging:
• Implicit/explicit
• Data Lake staging area as default
• User does not need to configure this manually
• Advanced feature to set staging area options
• File formats / types (Parquet, JSON, txt, CSV …)

Transformations:
• Explicit user action
• User places transformations on the design surface, from the toolbox
• User must set properties for transformation steps and step connectors

Destination:
• Explicit user action
• User chooses destination connector(s)
• User sets connector property options
Data Flow started – what happens in Databricks?
Modern DWH Layers
[Diagram: Ingest → Store → Processing → Store Results → Presentation. Business/custom apps (structured) and logs, files and media (unstructured) are ingested by Data Factory into Azure Data Lake Store; structured data / master data lands in a SQL Database. Azure Databricks (Spark) processes the data; results are stored in a SQL Database data warehouse and, optionally (depending on the data), an Analysis Services semantic model. Power BI serves analytical dashboards, with Data Science workloads and use cases running alongside the data model.]
Visual Data Flow Key Tenets
1. Visual "Data Flow Builder" / "Data Mapping"
2. Extensible through scripting and expressions
3. Data Flow can be embedded into ISV / SaaS apps
› Embed UI
› Embed parameterized Data Flows
4. A graphical UI for building data transformation routines on Spark
5. Built for resiliency and operationalized environments
https://www.youtube.com/watch?v=vSTn_aGq3C8 – video on the feature by Mark Kromer
Handling Schema Changes
Data Flow will accept both columns (here in a Derived Column)
Use Auto Mapping in the Sink
Overview of current transformations (1/2)
› Source – lets you configure the sources you want to bring into the data flow. A data flow can have one or more sources, and should be able to connect to any source type supported by ADF. Note: the private preview only supports Blob.
› Select – lets you select a list of columns from the input stream to pass to the next transformation, and allows assigning alias names to columns.
› Derived Column – lets you create new columns or modify existing ones. It supports a wide range of data manipulation functions.
› Filter – restricts the rows based on a filter expression.
› Conditional Split – splits the input stream into n output streams based on condition expressions. Rows not matching any condition are routed to the default output.
› Sort – sorts the rows based on order rules. The output rows keep this order in the subsequent transformation. It has further options such as a "case sensitive" toggle, computed columns, etc.
› Aggregate – lets you define aggregation functions over group-by columns. You can also build custom expressions in this transform.
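Since Data Flows compile to Spark jobs, the row-level semantics of Derived Column, Filter and Aggregate can be modeled in a few lines of plain Python on a tiny in-memory dataset — a conceptual sketch only, not the ADF or Spark API:

```python
# Conceptual model of Derived Column, Filter and Aggregate semantics.
# Plain Python for illustration; the actual execution happens on Spark.
rows = [
    {"region": "EMEA", "amount": 100},
    {"region": "EMEA", "amount": 50},
    {"region": "US", "amount": 70},
]

# Derived Column: add or modify columns per row
derived = [{**r, "amount_eur": r["amount"] * 0.9} for r in rows]

# Filter: keep only rows matching an expression
filtered = [r for r in derived if r["amount"] >= 60]

# Aggregate: group by a column and apply an aggregation function
totals = {}
for r in filtered:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount_eur"]

print(totals)  # {'EMEA': 90.0, 'US': 63.0}
```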
Overview of current transformations (2/2)
› Exists – filters rows in one input stream based on another stream. You can apply either an "Exists" or a "Not Exists" condition.
› Lookup – looks up values from another stream based on a lookup condition. It works like an inner join.
› Joiner – joins two streams based on a condition. You can perform inner, left, right, outer and cross joins.
› Union – merges two streams into a single stream.
› New Branch – replicates the current stream, so you can create a new stream or a copy of a stream.
› Output – the Output sink writes the data to any kind of storage supported by ADF. Note: the private preview only supports Blob and SQL DW.
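Similarly, the semantics of Exists, Lookup/Joiner and Union can be modeled conceptually in plain Python (not the ADF API; the data is made up):

```python
# Conceptual model of Exists, Lookup/Joiner and Union semantics on two
# tiny in-memory streams. Plain Python for illustration only.
orders = [{"id": 1, "cust": "A"}, {"id": 2, "cust": "B"}, {"id": 3, "cust": "C"}]
customers = [{"cust": "A", "name": "Alice"}, {"cust": "B", "name": "Bob"}]

# Exists: keep order rows whose customer appears in the other stream
known = {c["cust"] for c in customers}
exists = [o for o in orders if o["cust"] in known]

# Lookup / inner Joiner: enrich matching rows with the other stream's columns
by_cust = {c["cust"]: c for c in customers}
joined = [{**o, "name": by_cust[o["cust"]]["name"]} for o in orders if o["cust"] in by_cust]

# Union: merge two streams into a single stream
union = orders + [{"id": 4, "cust": "D"}]

print(len(exists), len(joined), len(union))  # 2 2 4
```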
Data Flow Limited Preview Support & SLAs
• Azure SLAs do not apply to preview services (private or public preview) until GA of the service.
• Limited preview support: handled directly with the Azure engineering team via adfdataflowext@microsoft.com. Turn-around time on fixing issues during private preview depends on access to customer data sources and customer Databricks clusters for RCA and debugging.
• Public preview support: normal Azure customer service channels.
Demo: Data Flow – visual design of transformations in Spark
Conclusion: Data Flow in ADF is cool because…
• visual design for fast learning & understanding
• ETL using Spark technology in the background, which can scale (but does not have to in every case)
• Azure Databricks as an elastic processing engine for different workloads, tools and user groups
• integration into the Control Flow enables modelling dependencies and cost-efficient orchestration of Azure resources
Hands-on labs
[Decision flow: "Ever done anything with Data Factory before?" – "Sure, I'm a veteran" vs. "Data, what?"; "Subscription in the preview program for Data Factory?" – Yes! → HOL 4 Data Flow. Start: setting up Azure resources; then HOL 1 Copy Task, HOL 2 Trigger, HOL 3 Control Flow, HOL 4 Data Flow]
Step 1: Azure Basics – set up resources on Azure
HOL 1/2: Data Movement & Trigger – Copy Activity
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity
[Diagram: an on-demand trigger runs a pipeline whose Copy activity moves data from an input flat file to a SQL table; the source side uses a Blob Store linked service with a container + flat file dataset, the sink side an Azure SQL DB linked service with a table dataset]
HOL 3: Control Flow – get all data of a system by metadata
[Diagram: a "check metadata" pipeline executes a For Each activity over each table in the source, which in turn executes a Copy activity moving the data to Data Lake / Blob Store]
HOL 4: Data Flow – visual design of transformations in Spark
Attention: use Databricks Runtime 5.0, which includes Apache Spark 2.4.0 and Scala 2.11
Links and further information
1. Microsoft documentation: https://docs.microsoft.com/en-us/azure/data-factory/
2. Azure Data Factory – Data Flows preview documentation: https://github.com/kromerm/adfdataflowdocs
3. Cool screencasts about Data Flows: https://github.com/kromerm/adfdataflowdocs/tree/master/videos
4. Another good blog post about ADF Data Flows: https://visualbi.com/blogs/microsoft/azure/azure-data-factory-data-flow-activity/
5. Comparison of ADF Data Flows vs. SSIS vs. T-SQL: https://sqlplayer.net/2018/12/azure-data-factory-v2-and-its-available-components-in-data-flows/
inovex is an IT project house focused on digital transformation:
Product ownership · data products · web · apps · smart devices · BI · big data · data science · search · replatforming · cloud · DevOps · data center automation & hosting · trainings · coachings
We use technology to make our customers happy. And ourselves.
inovex has offices in Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg
And, of course, at www.inovex.de
Thank you very much
Stefan Kirner
Head of BI Solutions
inovex GmbH
Ludwig-Erhard-Allee 6
76131 Karlsruhe
s.kirner@inovex.de
0173 3181 012