© 2014 NetApp, Inc. All rights reserved. NetApp Proprietary – Limited Use Only
Big Data: From Experiment to Production
Mario Vosschmidt
Consulting Systems Engineer
Agenda: Big Data or Smart Data?
1) What is “Big Data”?
2) Requirements and challenges
3) Which scenarios do we focus on?
4) What do solution approaches look like?
5) How do I implement these solutions?
6) Summary
The Big Data Landscape
The 3V Paradigm: Big Data
Variety: multiple data sources, multiple data formats
Velocity: high-speed processing, fast-changing requirements
Volume: huge amounts of data, process and persist
NetApp Confidential - Internal Use Only
Entering a New Era of Scale
Big Data Solution Portfolio: The ABCs of Big Data at NetApp
Insight from extremely large datasets
Performance for data-intensive workloads
Secure, boundless data storage
Not Even to The “Peak”
35 zettabytes: estimated size of the digital universe in 2020
5 billion smartphones
30 billion pieces of new content added to Facebook per month
80% unstructured data
[Figure: hype cycle, visibility over time: Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
A Lot of Hype and Buzz – Everyone is Jumping In
Big Data Vendor Landscape
Market is expected to grow from $3.2 billion in 2010 to $16.9 billion in 2015
Most firms are taking a pragmatic approach
Big data is in the very early stages of maturity
Best practices are not mature
IDC Big Data Survey
[Figure: Funding for Hadoop and NoSQL, Jan-08 to Nov-11; vertical axis 0 to 400, units not shown. Labeled funding rounds: Cloudera series B, C and D; MapR series A and B; 10gen series D; DataStax series B; Neo Technology series A; Opera Solutions series A; Platfora series A; Couchbase series C]
"The Big Data market is expanding rapidly … For technology buyers, opportunities exist to use Big Data technology to improve operational efficiency and to drive innovation. Use cases are already present across industries and geographic regions." Dan Vesset, Vice President, IDC
451 Research
Data Growth Impact on Business
“Big Data” refers to datasets whose size is beyond the ability of typical tools to capture, store, manage and analyze
The Big Data Opportunities
Financial Services
– Fraud detection & prevention
– Anti-money laundering
– Risk management
Manufacturing
– Supply chain optimization
– Defect tracking
– Root cause analysis
– RFID correlation
Healthcare
– Drug development
– Patient records
– Evidence-based medicine
Government
– Law enforcement
– Counter-terrorism
– Research and education
Why Should You Care? It’s the Value of Your Data
Top-line revenue
– Leverage data assets into business advantage
Bottom-line savings
– Lower the cost of compliance
– Manage ever-growing data efficiently
Over 1PB of data
Growth of 175% YOY
90 days of data within 24 hours of a failure
5 billion records
Anywhere, anytime
Faster time to market
50% increase in revenue
NetApp Big Data
Why NetApp? Practical solutions that solve today’s problems
Get Control
NetApp helps you turn your exploding data from threat to opportunity. Manage your data effectively and affordably.
Break Through
Break through the limits. With NetApp, you can take on even the most massive and complex data projects.
Gain Insight
Turn insight to action. NetApp helps you get to clarity and insight faster and more reliably.
Experience Managing Data at Scale
NetApp’s Largest Customer
100 Customers
50 Customers
10 Customers
4 Customers
100 PB
50 PB
20 PB
10 PB
NetApp Big Data Strategy
Best-of-breed storage for Big Data applications
Built on open standards with best-in-class partnerships
Validated with ecosystem leaders
Complete server, network and storage “racks”
Delivered via trusted high-value partners
Open
Best-of-Breed
Choice
Analytics Smart Data
Big Analytics Strategy: Smart Data
DSS / DW (traditional analytics)
– Solutions partners include IBM, Oracle, Microsoft, ParAccel, Exasol and SAND
Big Analytics
– Enterprise-class Hadoop-based solutions: MapR, Hortonworks, Cloudera
Leverage partners to complete the Big Analytics stack
– Solutions for validated server, network and storage
Big Analytics Solutions
Data Warehouse
Fast, space-efficient backup and recovery with storage utilization up to 90%. Less raw capacity with modular scalability
Mixed Use Database, Cubes
Optimized for IBM, Oracle and Microsoft. Simplified data management and protection. Zero down time
Hadoop
Enterprise-class Hadoop with lower total cost of ownership, based on open standards
The Value Proposition: Some problems require an Enterprise-Class Hadoop Solution
Enterprise-Class Hadoop: Packaged ready-to-deploy modular Hadoop cluster
– The data has intrinsic value $$$
– Usable capacity must expand faster than compute
– Higher storage performance
– Real human consequences if the system fails (threats, treatments, financial losses)
– System has to allow for asymmetric growth
White Box Hadoop: Values associated with early adopters of Hadoop
– Social media space
– Contributors to Apache
– Strong bias to JBOD
– Skeptical of ALL vendors
Enterprise-Class Hadoop: Packaged ready-to-deploy modular compute/memory-intensive Hadoop cluster
– Compute-intensive applications
– Tick data analysis
– Extremely tight service-level expectations
– Severe financial consequences if the analytic run is late
Enterprise-Class Hadoop: Bounded compute algorithm / memory-intensive Hadoop cluster
– Compute-intensive applications
– Additional CPUs do not improve run time
– Extremely tight service-level expectations
– Severe financial consequences if the analytic run is late
– Need for deeper storage per DataNode
[Figure axes: Compute Power versus Storage Capacity]
Challenges with Hadoop in Enterprise
Cisco and NetApp Confidential. For Internal Use Only. Do Not Distribute.
Operations
Implementation
Requires three copies of data, larger footprint, and more storage
Limited flexibility; storage and servers tied together affects scalability
Low cluster efficiency, higher network congestion
NameNode is a single point of failure
Slow recovery from disk drive failure
Expensive process to replace failed disks online
Most common Hadoop support issue is disk drive failure
Availability
Need to keep up with fast-paced patches, projects of open source platform
Need to decide on distribution of Hadoop
Skills are not common
Integration with existing IT infrastructure can be difficult
Tuning expertise needed to make Hadoop perform optimally
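The storage-footprint point above (three copies of data) can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the HDFS default replication factor of 3; the 25% headroom for temporary and intermediate data and the function name are illustrative assumptions, not figures from the slides.

```python
def raw_capacity_tb(usable_tb: float, replication: int = 3, headroom: float = 0.25) -> float:
    """Estimate the raw disk capacity needed to store `usable_tb` of data.

    replication: number of copies kept per block (HDFS defaults to 3).
    headroom: fraction of raw capacity kept free for temporary data
              (assumed value, not taken from the slides).
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return usable_tb * replication / (1 - headroom)


if __name__ == "__main__":
    # Compare the footprint of 100 TB of data at different replication factors.
    for copies in (3, 2):
        print(f"{copies} copies: {raw_capacity_tb(100, copies):.1f} TB raw")
```

Lowering the replication factor is only safe when something else protects the data; the comparison is shown purely to illustrate how strongly the copy count drives the footprint.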
Why Big Data and Analytics as a service is important!
FlexPod Converged Infrastructure Family
FlexPod® Express (MSB/Branch Office)
– For smaller, less-dynamic requirements and VAR velocity
– Cisco UCS C-Series
– Nexus® 3K
– FAS2xx0
– Two fixed pod sizes
– Cisco UCS Director, VMware®, and Microsoft®
FlexPod Data Center (Enterprise/Service Provider)
– Massively scalable shared virtual data center infrastructure
– Cisco UCS C-Series/B-Series, Nexus® 5k
– FAS storage
– Flexible pod sizes
– FlexPod validated management and ecosystem
FlexPod Select (Dedicated)
– Big data analytics, scientific, HPC
– Cisco UCS C-Series
– Nexus, Catalyst®, MDS
– E-Series, FAS
– Reference architecture and/or designs
– Application-based management
[Diagram: Express and Data Center run multiple applications on shared storage, network and compute pools; Select dedicates storage, network/direct connections and compute nodes to one application. The two groups are labeled as distinct architectures.]
NetApp Reference Architecture
Example: FlexPod Select with Cloudera
* NetApp 50% Storage Guarantee http://www.netapp.com/us/solutions/infrastructure/virtualization/guarantee.html
Converged big data platform from NetApp and Cisco for Hadoop
Enterprise-class Hadoop: Innovative storage, servers, networking validated with leading Hadoop distributions
Faster time to value: Prevalidated configuration accelerates deployment
High availability: Less downtime, higher serviceability to meet tight SLAs around data applications and processes
Flexible scaling: Independently scale servers and storage; modular design for scaling as data needs grow
Cisco UCS® C-Series Rack Mount Servers
NetApp® FAS Storage Systems
NetApp E-Series Storage Array
Cisco UCS Manager
Cisco UCS Fabric Interconnect
FlexPod Select with Hadoop: NetApp and Cisco deliver enterprise-class Hadoop for high availability, performance, scalability
Architected for the enterprise
– Superior NameNode protection
– Faster recovery from failover
– Lower cluster downtime
Faster time to value
– Validated, presized configurations
– Low-latency, high-bandwidth networking
– 12 DataNodes in master, 16 in expansion
Coexistence with current applications and infrastructure
– Supports existing applications from SAP, Microsoft, Oracle
– Data management and monitoring with Cloudera Manager, Cisco UCS® Manager
[Diagram: master and expansion racks running the Cloudera or Hortonworks Distribution of Hadoop]
Service-Level Expectations Around Data: High-Value, Time-Sensitive Problems
Accelerate time to insights
– Fast deployment with validated, preconfigured reference designs
– Store, process, analyze all data for new opportunities and business impact
– More time to focus on data analysis rather than dealing with cluster downtime
Making the Hadoop experience better
– Optimized, tuned, fully configured cluster
– Hadoop integrated with storage, compute, networking
– Monitoring and management tools with SANtricity® and from partners (Cloudera Manager, Cisco UCS® Manager)
– High density and capacity reduce data center footprint
Reduce risk in an open ecosystem
– Compatibility with existing infrastructure and applications
– Best-in-class partnerships, not entire stack from one vendor
– Future-proof against lock-in and benefit from evolving ecosystem
FlexPod Select for Hadoop with Cloudera
Ease of Setup and Deployment: Preconfigured, Prevalidated
NO Architectural effort to design balanced server-net-storage hardware
NO network design effort
NO RAID level decisions, logical volume, block sizing, stripe sizing
NO effort to assemble, install and cable
NO software stack design, minimal effort for Hadoop installation
NO design negotiations with multiple vendors or IT groups
NO hardware compatibility list or supportability list to work
NO O/S version efforts, no patching required
NO Hadoop tuning or performance testing effort
Significantly simpler sizing
Simpler cluster management with built-in tools
End-to-end compatibility
Professionally designed and supported, with documentation and training
Unified support
Delivery of fully configured cluster
Use Case Example: NetApp AutoSupport
Phone-home data representing information about the status of NetApp storage controllers
Correlate disk latency (hot) with disk type
– 24 billion records
– 4 weeks to run query
– Hadoop implementation: 10.5 hours
Bug detection through pattern matching
– 240 billion records
– Too large to run
– Hadoop implementation: 18 hours
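The latency-by-disk-type correlation above is the kind of query that splits naturally into a map step and a reduce step. The sketch below imitates that split in plain Python on a handful of records. The record layout `(disk_type, latency_ms)` is hypothetical, since the slides do not describe the AutoSupport schema, and this is not the implementation the use case refers to. The `speedup` helper only restates the runtimes quoted on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (disk_type, latency_ms).
Record = Tuple[str, float]


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, Tuple[float, int]]]:
    """Emit (disk_type, (latency, 1)) so latencies and counts can be summed per key."""
    for disk_type, latency_ms in records:
        yield disk_type, (latency_ms, 1)


def reduce_phase(pairs: Iterable[Tuple[str, Tuple[float, int]]]) -> Dict[str, float]:
    """Sum latency and count per disk type, then return the mean latency per type."""
    totals: Dict[str, List[float]] = defaultdict(lambda: [0.0, 0])
    for key, (latency, count) in pairs:
        totals[key][0] += latency
        totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}


def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of the old runtime to the new runtime."""
    return before_hours / after_hours


if __name__ == "__main__":
    sample = [("SATA", 12.0), ("SAS", 4.0), ("SATA", 8.0)]
    print(reduce_phase(map_phase(sample)))
    # Four weeks expressed in hours, compared with the 10.5-hour Hadoop run.
    print(speedup(4 * 7 * 24, 10.5))
```

On a real cluster the map step runs in parallel on the nodes holding the data and only the per-key partial sums travel over the network, which is what makes a scan of billions of records feasible.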
Wireless Service Provider
Provides wireless voice and data services globally
Telco Industry
The solution consists of an eight node Hadoop cluster at the core site. All the data from the remote sites are transported over WAN into the central site. The data gets collected, ingested, compressed and archived into the Hadoop cluster via HDFS. The data is then categorized, put into separate containers, and indexed based on its record keeping tags.
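The ingest flow described above (collect, compress, archive, categorize into containers, index by tag) can be sketched in a few lines. This is a toy, in-memory stand-in, assuming gzip compression and a dictionary-based tag index. The `Archive` class, its method names and the sample categories are hypothetical, and a real deployment would write to HDFS rather than to Python dictionaries.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class Archive:
    """In-memory stand-in for the archive plus tag index (hypothetical)."""

    def __init__(self) -> None:
        # category -> record_id -> compressed payload ("separate containers")
        self.containers: Dict[str, Dict[str, bytes]] = defaultdict(dict)
        # tag -> list of (category, record_id) ("indexed by record-keeping tags")
        self.index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def ingest(self, record_id: str, category: str, tags: Iterable[str], payload: bytes) -> None:
        """Compress a record, store it in its category container, and index its tags."""
        self.containers[category][record_id] = gzip.compress(payload)
        for tag in tags:
            self.index[tag].append((category, record_id))

    def search(self, tag: str) -> List[bytes]:
        """Return the decompressed payloads of all records carrying `tag`."""
        hits = self.index.get(tag, [])
        return [gzip.decompress(self.containers[cat][rid]) for cat, rid in hits]


if __name__ == "__main__":
    archive = Archive()
    archive.ingest("r1", "voice", ["site-a", "2014-01"], b"call record 1")
    archive.ingest("r2", "data", ["site-a"], b"session record 2")
    print(archive.search("site-a"))
```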
NetApp Hadoop Solution
[Diagram: agent servers (AS) at each remote site send data to collector servers (CS) at the central site; archiving and indexing tools sit on top of the Hadoop Distributed File System (HDFS) with eight DataNodes (DN); a user interface and search tool provides access]
Analytics & Enterprise Apps Environment
[Diagram labels: data sources (sensors, applications, logs, location/GPS, mobile devices, other data sources); ETL, OLAP, OLTP; content repositories; shared storage infrastructure (storage, file systems, data management; NFS/sNFS/pNFS); storage (all other storage, i.e. internal DAS); analytics; applications; reporting/dashboard/visualization]
Bandwidth
Big Bandwidth Solutions
Full Motion Video
Scalable density and performance to ingest and simultaneously analyze UAV and satellite video data
Video Storage for Surveillance
High bandwidth & density supporting hundreds or thousands of HD cameras
Media Content Management
High ingest & play-out rates with support for media and entertainment workflows
HPC: Lustre, GPFS, BeeGFS
Massively parallel distributed file systems for large-scale cluster computing and O&G seismic processing
Big Bandwidth Solutions
E-Series Storage
Storage File Systems
Applications
Performance
Density
Reliability
Efficiency
Flexibility
Modularity
Full-Motion Video Storage Solution
Turnkey solution in a 40U industry-standard rack
Single architecture for ingest, exploitation and dissemination
1.8PB raw capacity
– 4000+ hours of uncompressed 720p HD video
>20 GB/s R/W performance, >30 GB/s peak performance
Scale to multiple Petabytes in a single data container
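The capacity and hours figures above imply a per-stream data rate, which can be sanity-checked against the raw rate of uncompressed 720p video. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal units (1 PB = 10^15 bytes), and the pixel format (2 bytes per pixel, as in 8-bit 4:2:2) and 60 frames per second are illustrative assumptions, not values stated on the slide.

```python
def implied_stream_rate_mb_s(raw_capacity_pb: float, hours: float) -> float:
    """Average data rate in MB/s if `hours` of video fill `raw_capacity_pb` (decimal units)."""
    total_bytes = raw_capacity_pb * 1e15
    return total_bytes / (hours * 3600) / 1e6


def uncompressed_rate_mb_s(width: int, height: int, bytes_per_pixel: float, fps: float) -> float:
    """Raw video data rate in MB/s for a given frame size, pixel depth and frame rate."""
    return width * height * bytes_per_pixel * fps / 1e6


if __name__ == "__main__":
    # Rate implied by the slide: 1.8 PB raw for 4000 hours of video.
    print(f"implied: {implied_stream_rate_mb_s(1.8, 4000):.1f} MB/s")
    # Assumed format: 1280x720, 2 bytes per pixel, 60 frames per second.
    print(f"720p raw: {uncompressed_rate_mb_s(1280, 720, 2, 60):.1f} MB/s")
```

Usable capacity is lower than raw capacity once RAID and file system overhead are deducted, so the implied rate is an upper bound on what a single hour of stored video can consume.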
Full-Motion Video Built on E-Stack
High-bandwidth HD video ingest
• Satellite
• UAV
Quantum® StorNext File System
E5460 Stack
Multi-stream video playout
• Processing
• Exploitation
• Analyst viewing
Massively scalable single data container
HPC: Lustre
NetApp Confidential – Limited Use
Performance to meet the needs of the world’s fastest Supercomputers
High bandwidth & density
– 1.8PB & 30GB/s per 40U rack
Highly available
– No single points of failure
– Extensive RAS features
NetApp provided 7x24 Lustre Support
NetApp Professional Services
Lawrence Livermore National Lab
Supercomputer storage to support twenty thousand trillion arithmetic operations per second with access speeds up to 1 TB/sec
55PB of usable storage
Simulations for nuclear weapons viability
Counter Terrorism
Energy Security
Understanding Climate Change
Sequoia – announced as the fastest supercomputer and storage combination on the planet at ISC 2012
Press Release: http://www.netapp.com/us/company/news/news-rel-20110928-990734.html
Video Surveillance Storage
Enhance public safety with better physical security
Industry trends are exploding storage
– Analog to digital
– SD to HD
– 7 days to 30+ days
Open platform solution
– Best-of-breed industry partners
– Flexible deployments
– Modular scalability
– 99.999% uptime
Unique Out-of-Band Recording
No servers required between cameras and storage
Saves hardware, software, licensing and footprint; very robust; needs far less network cabling; easy to scale.
Media Content Management
Highly scalable digital repository
Consolidates collaborative production
Multi-format distribution workflows
Industry-leading bandwidth per rack to reduce bottlenecks
Highest capacity density to minimize power and cooling
Single namespace for multi-petabyte repositories
Unmatched breadth of production client support
Content Management
Big Content Solutions
File Services
– Multi-application workloads
– Non-disruptive operation
– Integrated data protection, efficiency
Enterprise Content Repository
– Infinite container
– Fixed content
– Non-disruptive operation
– Integrated data protection, efficiency
Distributed Content Repository
– Large, multi-site repository
– Policy-based data management
– Metadata-enabled object storage
File Services: ONTAP Cluster Mode
Heterogeneous cluster:
– A mix of controller types in a single cluster per workload needs
– Entry, mid, and high-end platforms
– Native and third-party storage (FAS and V-Series)
– Multiprotocol: NFS, pNFS, CIFS, iSCSI, FCP
– Integrated data protection
Virtual storage tier:
– Match data to disk price and performance
– Manage multiple tiers in the same namespace or many
Enterprise Content Repositories: ONTAP Cluster Mode with Infinite Volume
Single large content repository
– Scales to PBs and billions of files across cluster
– Native storage efficiency
Simplified operations
– Multi-tenancy
– Simplifies application workflows
– Load balances data at ingest
– Start small, grow granularly
High availability
– Protects against disk and hardware failures
– Snapshots & replication for quick recovery
– Manage & upgrade non-disruptively
Object Storage Insights: Content Repository
Flat namespace
– No filesystem hierarchy
Metadata separated
– Not within data space
Metadata serve as descriptors
– Can change over time
– However, data is persistent
Objects referenced by ID
– Index
Write once, read many
– Similar to library
– Objects do not change
– Single writer, multiple readers
Less data management overhead
High Metadata rates
Less space management
Data are replicated across Geos
Simplified rights management
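The object storage properties listed above (flat namespace, objects referenced by ID, metadata kept apart from data, write once read many) can be illustrated with a small in-memory model. This is a conceptual sketch only. The `ObjectStore` class and its methods are hypothetical, and deriving the ID from a SHA-256 hash of the content is one possible choice, not something the slides specify.

```python
import hashlib
from typing import Dict, List


class ObjectStore:
    """Toy object store: flat ID namespace, immutable data, mutable metadata."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}           # object ID -> immutable payload
        self._meta: Dict[str, Dict[str, str]] = {}  # object ID -> descriptive metadata

    def put(self, payload: bytes, **metadata: str) -> str:
        """Store a payload once and return its ID (assumed here: SHA-256 of the content)."""
        object_id = hashlib.sha256(payload).hexdigest()
        self._data.setdefault(object_id, payload)   # write once: never overwritten
        self._meta.setdefault(object_id, {}).update(metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        """Read an object by ID; there is no directory path to resolve."""
        return self._data[object_id]

    def tag(self, object_id: str, **metadata: str) -> None:
        """Metadata can change over time while the data stays the same."""
        if object_id not in self._data:
            raise KeyError(object_id)
        self._meta[object_id].update(metadata)

    def find(self, **criteria: str) -> List[str]:
        """Return the IDs of objects whose metadata matches all criteria."""
        return sorted(
            oid for oid, meta in self._meta.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


if __name__ == "__main__":
    store = ObjectStore()
    oid = store.put(b"image bytes", owner="radiology")
    store.tag(oid, status="archived")
    print(store.find(status="archived") == [oid])
```

Because the payload never changes after the first write, copies at other sites stay valid indefinitely, which is part of why replication across geographies needs less coordination than in a mutable file system.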
Distributed Content Repositories: StorageGRID
Large content repository for big, unstructured data
– Billions of data sets, dozens of petabytes
Create, manage and consume content globally
– Predictable access to data independent of location
– Policy-controlled data stores at each site
Intelligent data classification and access
– Metadata-based management
StorageGRID Functional Diagram
HTTP API / CDMI
Location-Transparent Distributed Object Store
Global Object Namespace
Object-Level Data Management
Metadata Tagging and Query
Storage Systems
NAS Protocols (SG 9)
NAS I/O
Object Ingest and Retrieval
Policy-Driven Data Placement
Media Content Repository
High-performance, scalable storage infrastructure built to support 17 million revenue-generating transactions annually
100% uptime even during peak holiday access, when transactions increase 6 to 10 times
3PB of rich media data
Consumer access to 950 million digital images
20,000 worldwide retail locations, online fulfillment partners and in-store kiosks
– Wal-Mart Canada, Costco, Sam’s Club, Tesco, CVS/pharmacy, and Kodak
NetApp FAS6280 and FAS3200, Data ONTAP, and FlashCache
PNI Digital Media
“We’ve increased the number of retail partners we work with from 2,000 to almost 20,000 in just a few years. In the past 6 years, we’ve seen a 1,900% increase in transactions. This plus the massive increase in digital images uploaded by consumers demanded a more robust and highly scalable storage infrastructure.” – Zach Wickes, Vice President of Technology, PNI
Health in the Cloud STaaS offering for healthcare providers
Medical Image Archive Cloud
– Two sites with ~1PB each
– 2TB+ local cache at each edge site
– 8x growth in capacity in the last 12 months
– 100% uptime since start of service
– “Forever” retention policies
– ~60% of customers use the hybrid cloud model
Solution offers a proven 100% up-time with automated data movement from on-premise to off-premise public clouds with “keep forever” retention policy and indefinite growth
Press Release: http://www.netapp.com/us/company/news/news-rel-20111128-36413.html
Big Data System Integrators Solutions Built on NetApp®
Integrated Big Data solutions and expertise
Planning and implementation expertise for Big Data
Turn-key solution stacks and Big Data services
Reference Material
Common Architecture
FlexPod Select
Software Solution
Validated Architecture & SKUs
Infrastructure Integration & Distribution
Solution Rack
Operational Integration & System Integrators
Application Packaging
Appliance + Services & Support
Efficiency
Management
Integration
Analytics
Visualization
Big Data Summary
Enable enterprise customers to gain business advantage
Practical solutions proven to reduce complexity, increase efficiency and lower cost of ownership
Open standards based with best-in-class partnerships
For more information: http://www.netapp.com/us/company/leadership/big-data/
Next Steps - Team with the Experts
Strategic Assessment
– Business goals
– Data growth needs
– Use case discovery (partner delivery)
Consult
– Solution architecture and design (NetApp delivery)
Deploy
– Installation and implementation (NetApp delivery)
– Solution implementation (partner delivery)
Support options:
Global support available from NetApp and partners
Thank You