Date post: 28-May-2020
Analysis of Large Datasets in the Life Sciences and Bioinformatics (19403201), Tim Conrad, Session 15
Transcript
Page 1: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks

Analysis of Large Datasets in the Life Sciences and Bioinformatics (19403201)

Tim Conrad

Session 15


• Technologies / Frameworks for Big Data analysis
• ETL in the Cloud


Data Streams = continuous flows of data

Example Analyses:


https://databricks.com/session/a-platform-for-large-scale-neuroscience


Neuroscience @ Freeman Lab, Janelia Farm


Analyzing STREAM DATA:

Ingest, Process, Store


http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration/

Common Pipeline: Ingest, Process, Store

Processing Stack
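The ingest, process, store pipeline can be illustrated with a minimal sketch in plain Python. It assumes a toy stream of numeric readings and a moving-average analysis; the function names `ingest`, `process`, and `store` and the in-memory `sink` list are illustrative stand-ins, not the API of any of the systems named on the slides.

```python
from collections import deque
from statistics import mean
from typing import Iterable, Iterator


def ingest(readings: Iterable[float]) -> Iterator[float]:
    """Ingest stage: hand records downstream one at a time as they arrive."""
    for value in readings:
        yield value


def process(stream: Iterable[float], window: int = 3) -> Iterator[float]:
    """Process stage: emit the mean of the most recent `window` records."""
    recent = deque(maxlen=window)  # old records fall out automatically
    for value in stream:
        recent.append(value)
        yield mean(recent)


def store(stream: Iterable[float], sink: list) -> None:
    """Store stage: append every processed record to the sink."""
    for value in stream:
        sink.append(value)


sink: list = []  # hypothetical stand-in for a database or file store
store(process(ingest([1.0, 2.0, 3.0, 4.0]), window=3), sink)
print(sink)  # [1.0, 1.5, 2.0, 3.0]
```

Because each stage is a generator, records flow through one at a time and the full stream never has to fit in memory.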


How to process big streaming data



Stream Ingestion Systems


Stream Processing Systems


Stream Storage Systems


Frameworks / Technologies for Big Data Analysis


https://www.elastic.co/de/products/kibana


Technologies "in the field"


Cloud ETL Tools

AWS Glue: Serverless ETL

Azure Data Factory: Visual Cloud ETL

Google Cloud Dataflow: Unified Programming Model for Data Processing


AWS Glue architecture

Source: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html


AWS Glue

Components:
• Data Catalog
• Crawlers
• ETL jobs/scripts
• Job scheduler

Useful for…
• …running serverless queries against S3 buckets and relational data
• …creating event-driven ETL pipelines
• …automatically discovering and cataloging your enterprise data assets


Data catalog - the central component

Can act as metadata repository for other Amazon services

Tables - Added to “databases” using the wizard or a crawler

Data sources: Amazon S3, Redshift, Aurora, Oracle, PostgreSQL, MySQL, MariaDB, MS SQL Server, JDBC, DynamoDB

Crawlers connect to one or more data stores, determine the data structures, and write tables into the Data Catalog
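What a crawler does (connect to data stores, determine the structure, write tables into the catalog) can be sketched conceptually in plain Python. This is not the AWS Glue API: the data stores are assumed to be in-memory lists of records, the catalog is a plain dict, and `infer_schema` and `crawl` are hypothetical helper names.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, str]:
    """Determine the data structure: map each column to a type name."""
    schema: dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            type_name = type(value).__name__
            # Fall back to "string" when records disagree on a column's type.
            if schema.setdefault(column, type_name) != type_name:
                schema[column] = "string"
    return schema


def crawl(data_stores: dict[str, list[dict[str, Any]]],
          catalog: dict[str, dict[str, str]]) -> None:
    """Visit every data store and write one table definition into the catalog."""
    for table_name, records in data_stores.items():
        catalog[table_name] = infer_schema(records)


catalog: dict[str, dict[str, str]] = {}  # stand-in for the Data Catalog
crawl(
    {
        "samples": [{"id": 1, "tissue": "liver"}, {"id": 2, "tissue": "brain"}],
        "reads": [{"sample_id": 1, "count": 10.5}, {"sample_id": 2, "count": 7.25}],
    },
    catalog,
)
print(catalog)
```

Only metadata lands in the catalog; the records themselves stay in their data stores, which is why the catalog can serve as a shared metadata repository.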


Jobs
• PySpark or Scala scripts, generated by AWS Glue
• Visual dataflow can be generated, but not used for development
• Execute ETL using the job scheduler, events, or manual invocation
• Built-in transforms used to process data

ApplyMapping
• Maps source and target columns

Filter
• Load new DynamicFrame based on filtered records

Join
• Joins two DynamicFrames

SelectFields
• Output selected fields to new DynamicFrame

SplitRows
• Split rows into two new DynamicFrames based on a predicate

SplitFields
• Split fields into two new DynamicFrames
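The behaviour of the built-in transforms can be sketched in plain Python. This is only a conceptual model and not the `awsglue` library: a DynamicFrame is approximated as a list of dict rows, all function names and signatures are illustrative, and details such as dropping unmapped columns or joining on equal keys are assumptions of this sketch.

```python
def apply_mapping(frame, mappings):
    """Map source columns to target columns; mappings is a list of (source, target)."""
    return [{tgt: row[src] for src, tgt in mappings if src in row} for row in frame]


def filter_rows(frame, predicate):
    """Build a new frame from the rows that satisfy the predicate."""
    return [row for row in frame if predicate(row)]


def select_fields(frame, fields):
    """Output only the selected fields to a new frame."""
    return [{f: row[f] for f in fields if f in row} for row in frame]


def split_rows(frame, predicate):
    """Split rows into two new frames: matching and non-matching."""
    matched = [row for row in frame if predicate(row)]
    rest = [row for row in frame if not predicate(row)]
    return matched, rest


def split_fields(frame, fields):
    """Split fields into two new frames: the chosen fields and all others."""
    rest = [{k: v for k, v in row.items() if k not in fields} for row in frame]
    return select_fields(frame, fields), rest


def join(left, right, left_key, right_key):
    """Join two frames on equal key values."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


samples = [{"id": 1, "tissue": "liver", "reads": 120},
           {"id": 2, "tissue": "brain", "reads": 40}]
renamed = apply_mapping(samples, [("id", "sample_id"), ("tissue", "tissue"),
                                  ("reads", "read_count")])
deep, shallow = split_rows(renamed, lambda r: r["read_count"] >= 100)
joined = join(deep, [{"sample_id": 1, "donor": "A"}], "sample_id", "sample_id")
ids, others = split_fields(renamed, ["sample_id"])
print(joined)
```

Each function returns new frames instead of modifying its input, mirroring the "new DynamicFrame" wording in the descriptions above.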


Google Cloud Dataflow

Source: https://cloud.google.com/dataflow/


Google Cloud Dataflow overview

Unified programming model for batch (historical) and streaming (real-time) data pipelines and distributed compute platform
• Reuse code across both batch and streaming pipelines
• Java or Python based

Programming model is an open source project - Apache Beam (https://beam.apache.org/)
• Runs on multiple different distributed processing back-ends: Spark, Flink, Cloud Dataflow platforms

Fully managed service
• Automated resource management and scale-out

Source: https://cloud.google.com/dataflow/
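The idea of reusing code across batch and streaming pipelines can be illustrated without any framework. The sketch below is plain Python, not the Apache Beam SDK: `count_above` is a hypothetical piece of pipeline logic, and `sensor` is an assumed unbounded source that simply repeats three readings forever.

```python
import itertools
from typing import Iterable, Iterator


def count_above(values: Iterable[float], threshold: float) -> Iterator[int]:
    """Shared pipeline logic: running count of values above `threshold`."""
    count = 0
    for value in values:
        if value > threshold:
            count += 1
        yield count


def sensor() -> Iterator[float]:
    """Hypothetical unbounded source: cycles through three readings forever."""
    return itertools.cycle([0.2, 0.9, 0.7])


# Batch (historical): the input is bounded, so keep the final running count.
batch_result = list(count_above([0.2, 0.9, 0.7, 0.1], threshold=0.5))[-1]

# Streaming (real-time): the input never ends, so read counts as they appear.
stream_result = list(itertools.islice(count_above(sensor(), threshold=0.5), 6))

print(batch_result)   # 2
print(stream_result)  # [0, 1, 2, 2, 3, 4]
```

The same `count_above` function serves both cases; only the source and the way results are consumed differ.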


Dataflow I/O transforms


Azure Data Factory

Build data pipelines using a visual ETL user interface
• Visual Studio Team Services (VSTS) Git integration for collaboration, source control, and versioning

Drag, drop, link activities
• Copy Data: Source to Target
• Transform: Spark, Hive, Pig, streaming on HDInsight, Stored Procedures, ML Batch Execution, etc.
• Control flow: If-then-else, For-each, Lookup, etc.

Source: https://azure.microsoft.com/en-us/blog/continuous-integration-and-deployment-using-data-factory/

