S3: Components and
Recommendations for an Infrastructure
Ecosystem
Peter Wittenburg
Max Planck Computing and Data Facility
Executive Director RDA Europe
2
Peter Wittenburg: Why are we talking about Common Components (CoCos) and
recommendations? What are typical recommendations?
Stefan Winkler-Nees: The need for CoCos and recommendations from a funder's perspective
Ralph Müller-Pfefferkorn: What are typical CoCos and how does one define CoCos?
Stephan Kindermann: Typical CoCos from the perspective of the climate modelling
infrastructure
Heinz Pampel: re3data.org - Registry of Research Data Repositories
David Schiller: Typical CoCos from the perspective of the social sciences
Holger Mickler: Building institutional repositories at TU Dresden and TU
Bergakademie Freiberg
Hagen Peuckert: Requirements profile, architecture and implementation of a
document server for the humanities
Peter Baumann: "Science SQL" as a building block for flexible,
standards-based data infrastructures
Poster - Rainer Siegers: SowiDataNet - research data repository for the
social sciences and economics
S3 Agenda: Components and Recommendations
for an Infrastructure Ecosystem
3
Data (Intensive) Science is
dependent on efficient ways to
find, access and re-use/combine
data from different sources across
communities, countries & projects!
DIS Credo
4
IDC reports a shortage of 190,000 data professionals,
indicating where the bottleneck is.
Same holds for Business
5
But we are in an exploratory phase
and funders let 1000 flowers
blossom.
This is the best way given the many
open questions.
State
6
[Diagram. Labels: Digital Science, Open Science, eScience, Data Intensive Science; algorithms, data, knowledge; Society & Technology Dynamics. Arrows: Generating, Using]
A Phase of High Dynamics
Key Terms
- Data
- Grand Challenges
- Openness
- Collaboration
- Innovation
- Jobs
7
[Diagram. Labels: Digital Science, Open Science, eScience, Data Intensive Science; Information Infrastructure Ecosystem; algorithms, data, knowledge; Society & Technology Dynamics. Arrows: Enabling, Accelerating, Influencing, Populating, Defining, Generating, Using]
Need Information Infrastructures
8
Organizations
Countries
Disciplines
eInfrastructures (starting to offer services)
Industry (starting proprietary services)
• ESFRI: much awareness raising in Europe, many young people trained,
a variety of approaches tested, gaps in the service landscape
identified, etc.
• eInfra: starting to change towards service orientation; need more stable
services and clarification of costs
Does all this make sense, given that the costs are huge?
Let 1000 Flowers blossom
9
We have excellent projects, but they all do things
differently at almost every level.
Heterogeneity and a large solution space
tend to become obstacles to efficient work
and, even worse, to investment.
Result
10 EUDAT Federation
• merely replicating data (incl. PIDs, metadata, rights, relations, etc.)
into the EUDAT federation requires a different software stack
for every request
• same experience in DataONE
• no one can fund this
11
A consolidation phase is needed to
reduce the solution space and thus
increase interoperability
Towards ecosystem of II
12
G8/FAIR/FORCE11/etc. – data should be
searchable -> create useful metadata
accessible -> deposit in trusted repository and use PIDs
interpretable -> create metadata, register schema and semantics
re-usable -> provide contextual metadata
persistent -> provide persistent repositories
Unified Requirements from Funders
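The mapping from each requirement to a concrete action can be sketched as a simple checklist. This is only an illustration: the `DatasetDescription` fields and the `unmet_requirements` function are hypothetical names chosen here, not part of any funder's or RDA specification.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """Hypothetical, minimal description of a deposited data set."""
    descriptive_metadata: dict = field(default_factory=dict)
    pid: str = ""
    trusted_repository: str = ""
    schema_registered: bool = False
    contextual_metadata: dict = field(default_factory=dict)
    repository_is_persistent: bool = False


def unmet_requirements(d: DatasetDescription) -> list:
    """Return the names of the requirements that the description does not meet."""
    checks = {
        # searchable -> create useful metadata
        "searchable": bool(d.descriptive_metadata),
        # accessible -> deposit in trusted repository and use PIDs
        "accessible": bool(d.pid) and bool(d.trusted_repository),
        # interpretable -> create metadata, register schema and semantics
        "interpretable": bool(d.descriptive_metadata) and d.schema_registered,
        # re-usable -> provide contextual metadata
        "re-usable": bool(d.contextual_metadata),
        # persistent -> provide persistent repositories
        "persistent": d.repository_is_persistent,
    }
    return [name for name, ok in checks.items() if not ok]


# A data set that has metadata and a PID but nothing else yet.
draft = DatasetDescription(descriptive_metadata={"title": "x"}, pid="11304/example")
print(unmet_requirements(draft))
```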
13 Towards an optimal Ecosystem of II
• All together: too many boundaries, too inefficient, too expensive,
hampering software investment, hampering new businesses
• Improve our shared understanding of the II ecosystem
without making major errors
Glue
Need actions towards harmonization and sustainability
14
Reduce heterogeneity & costs
Make solutions stronger
Achieve sustainability
Scientific Analytics
Scientific Creation
Management
Curation
Access
PID, AAI, MD, WF, registries, repositories, meta-semantics, etc.
Identifying Common Components
15
• X*10 suggestions
• Hampering openness, innovation, investments, collaboration
• Little job creation
IBMnet, DECnet, ISO-OSI, X25, MPSNet, TCP/IP, Usenet, Ethernet
Finding the right level
Agreeing on one standard as a community process (IETF)
Global TCP/IP opened new area
- New industries
- New businesses
- New jobs
Mosaic browser, many others
Can we learn from the Internet example?
WHY?
HOW? Rough consensus, running code
16 DFIG Spinoff – Repository Registry
Domain of Trusted Repositories
Safe deposit
Scientists, publishers, funders
Trusted re-use, valid references, reproducible science, machine usage
Registry (humans, machines)
17
Starting in US, JP, CN, EU
[Diagram: cycle. Labels: common components, RDA process, tests, testbeds, experience, feedback, recommendations; II landscape, ecosystem]
Cost reductions, sustainability, trust, interoperability:
this is the real interest of researchers, funders
and perhaps also industry and society
• How to accelerate this cycle?
• How to increase density of debate?
• Need a lot of ideas & tests
• Gabriel/SAP CIO/etc. at IT Summit
• need to speed up
• need space for ideas & tests
• don’t need more paper
• what do we lack in DE/EU
Accelerate the CoCo Process
18
Some slides from Stefan Winkler-Nees
DFG
19 Stefan Winkler-Nees DFG
• Infrastructures are commonly described as a kind of
"ecosystem". The image fits insofar as they should not act as
isolated units but must be able to connect to one another
organizationally and technically.
• The development paths of the "ecosystem" and its components
are as diverse as the requirements, the content types and
ultimately science itself. This means that there is probably
no single path to a networked and mutually aligned "ecosystem".
20 Stefan Winkler-Nees DFG
• Nevertheless, in their own interest, the developers and operators
of infrastructures should take the overall development into account,
and above all the ability to connect with other infrastructures.
This is demanding and not easy, and above all it cannot be
implemented quickly. But it will only become harder later on.
• The basis of all infrastructure development is the (future?) need
in science. This need usually first shows itself as a concrete
information deficit that is to be remedied right away. In principle,
that deserves support. However, no technical or organizational
dead end should be taken that offers no "evolutionary" perspective.
21 Stefan Winkler-Nees DFG
• This "thinking ahead" can already be observed in some areas:
systems that have set themselves the task of bringing together
larger units of a scientific community and beyond ("community
building", reaching agreements, drawing up rules, etc.) and finally
building technical solutions that can offer information services to
a large number of different scientific fields. What is special here
is that in such undertakings the effort of technical implementation
tends to recede into the background compared with the complexity of
project organization and governance.
22
Next is Ralph telling us how to do it
23
~120 Interviews/Interactions (incl. Radieschen Results)
3 Workshops with Leading Scientists (RDA EU, US)
there are positive project examples, but in general ...
too much work is done manually or via ad hoc scripts
hardly any use of automated workflows, and a lack of reproducibility
DM and DP are inefficient and too expensive
(one biologist spends 75% of his time as a data manager)
federating data, incl. virtual information, is much too expensive
Reality
24
• pressure towards DI research is high, but only some
departments are fit for the challenges
• DI research is only available for Power-Institutes
• Senior Researchers: can’t continue like this!
• the need to move towards proper data organization and
automated workflows is evident
• but changes now are risky:
• lack of trained experts,
• lack of guidelines and support
Seniors want a change but …
25
[Diagram: layered architecture. Layers: Value Added Services, Persistent Identifiers, Data Sources]
Persistent Reference; Analysis, Citation; Apps, Custom Clients, Plug-Ins
Resolution System, Typing, PID
Local Storage, Cloud, Computed; Data Sets, RDBMS, Files
Digital Objects
[Diagram: digital object model]
PID record (attributes): points to instances, describes properties
bit sequence (instance)
metadata (attributes): describes properties & context
PID record and metadata point to each other
DONA Foundation: guaranteeing a worldwide PID service
Handle Service Providers (DataCite/DOI, EPIC, etc.)
Russian Institute to become MPA
One such CoCo: PIDs
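The relation between PID record, bit sequence and metadata can be sketched in a few lines. This is a minimal in-memory illustration, not the Handle System API: the class names, the `21.EXAMPLE` prefix, the example URL and the choice to give the metadata record its own PID are all assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class PIDRecord:
    pid: str
    locations: list       # points to the stored instances (bit sequences)
    metadata_pid: str     # points to the metadata description
    attributes: dict = field(default_factory=dict)  # describes properties


@dataclass
class MetadataRecord:
    pid: str
    describes: str        # points back to the PID of the described object
    properties: dict = field(default_factory=dict)  # properties & context


class InMemoryResolver:
    """Toy stand-in for a PID resolution system (hypothetical)."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.pid] = record

    def resolve(self, pid):
        return self._records[pid]


resolver = InMemoryResolver()
resolver.register(PIDRecord(
    pid="21.EXAMPLE/data-1",
    locations=["https://repo.example.org/objects/data-1"],
    metadata_pid="21.EXAMPLE/md-1",
))
resolver.register(MetadataRecord(
    pid="21.EXAMPLE/md-1",
    describes="21.EXAMPLE/data-1",
    properties={"title": "Example data set"},
))

# Resolve the data PID, then follow the pointer to its metadata and back.
resolved = resolver.resolve("21.EXAMPLE/data-1")
metadata = resolver.resolve(resolved.metadata_pid)
print(resolved.locations, metadata.describes == resolved.pid)
```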
26 Infrastructure Landscape
27
The DFT core model is deliberately simple,
and so are its messages.
We need to learn to speak with one voice.
If all software builders adhered to it,
we would gain a lot!
Data Foundation & Terminology
28
result: a registry for data types
Linking structure/semantics with functions
you get an unknown file, pass it to the DTR,
and its content is visualized
You find a tag and know how to interpret it
no free lunch: someone needs to register and define the type
The PIT demo already works with the DTR
Various sciences make use of it
[Diagram: a federated set of type registries connects data sets (bit sequences, terms, rights) with a domain of services (visualization, data processing, interpretation, dissemination) for human or machine consumers; steps numbered 1 to 4]
Data Type Registries
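The idea of linking a registered type to a function that can interpret it might look as follows. This is a toy sketch under stated assumptions: `TypeRegistry`, `lookup_federated` and the type identifier `example/temperature-csv` are hypothetical and do not reflect the actual DTR interface.

```python
import csv
import io


class TypeRegistry:
    """Toy registry that links a type identifier to a description and a function."""

    def __init__(self):
        self._types = {}

    def register(self, type_id, description, interpreter):
        # No free lunch: someone has to define and register the type.
        self._types[type_id] = (description, interpreter)

    def knows(self, type_id):
        return type_id in self._types

    def interpret(self, type_id, payload: bytes):
        description, interpreter = self._types[type_id]
        return interpreter(payload)


def lookup_federated(registries, type_id):
    """Return the first registry in the federation that knows the type."""
    for registry in registries:
        if registry.knows(type_id):
            return registry
    raise KeyError(f"type {type_id!r} is not registered anywhere")


def parse_temperature_csv(payload: bytes):
    rows = csv.DictReader(io.StringIO(payload.decode("utf-8")))
    return [{"station": r["station"], "temp_c": float(r["temp_c"])} for r in rows]


local, remote = TypeRegistry(), TypeRegistry()
remote.register(
    "example/temperature-csv",
    "CSV rows with station name and temperature in degrees Celsius",
    parse_temperature_csv,
)

# An otherwise unknown bit sequence becomes interpretable via its registered type.
unknown_file = b"station,temp_c\nA,1.5\n"
registry = lookup_federated([local, remote], "example/temperature-csv")
print(registry.interpret("example/temperature-csv", unknown_file))
```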
29
result: a generic API and a set of basic attributes
a PID record is like a passport (number, photo, expiry date, etc.)
if all PID service providers agree on one API and speak the same language
(registered terms), software development becomes easy
The climate community is using it together with the DTR
EPIC will adapt its API
LOC location, path
CKSM checksum
CKSM_T checksum type
RoR owning repository
MD path to MD
PID Information Types
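A PID record carrying the attributes listed on this slide could be used to verify a retrieved copy. The sketch below is only illustrative: the helper names, the example location, the repository name and the choice of SHA-256 are assumptions, and a real record would live in a PID service rather than a Python dictionary.

```python
import hashlib


def make_pid_record(loc, data: bytes, owning_repository, md_path, checksum_type="sha256"):
    """Build a record with the basic attributes LOC, CKSM, CKSM_T, RoR and MD."""
    return {
        "LOC": loc,                                                # location, path
        "CKSM": hashlib.new(checksum_type, data).hexdigest(),      # checksum
        "CKSM_T": checksum_type,                                   # checksum type
        "RoR": owning_repository,                                  # owning repository
        "MD": md_path,                                             # path to metadata
    }


def verify(record, data: bytes) -> bool:
    """Recompute the checksum with the recorded type and compare it to CKSM."""
    return hashlib.new(record["CKSM_T"], data).hexdigest() == record["CKSM"]


original = b"model output, run 42"
record = make_pid_record(
    loc="https://repo.example.org/objects/run-42",    # hypothetical location
    data=original,
    owning_repository="example-repository",           # hypothetical repository
    md_path="https://repo.example.org/md/run-42",     # hypothetical metadata path
)

print(verify(record, original))       # an unchanged copy passes
print(verify(record, b"tampered"))    # a modified copy fails
```

Because every consumer reads the same attribute names, the same `verify` routine works for any record that follows the agreed terms.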
30
Practical Policies = executable Workflow Statements
result: a set of Best Practice PPs for a number of typical DM/DP tasks
(Integrity Check, Replication, etc.)
a large collection of PPs is currently being evaluated
Policy inventory: replication policy X, replication policy Y, integrity policy A, integrity policy B, integrity policy C, MD extraction policy l, MD extraction policy k, etc.
Repository
Data manager: selection, implementation, execution
Practical Policies
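The notion of a practical policy as an executable statement that a data manager selects from an inventory can be sketched as follows. This is a simplified illustration: the policy identifiers, the dictionary-based stores and the function names are hypothetical and stand in for real repository systems and policy engines.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def integrity_check(store: dict, name: str, expected_checksum: str) -> bool:
    """Integrity policy: the stored object must exist and match its checksum."""
    return name in store and sha256_of(store[name]) == expected_checksum


def replicate(source: dict, target: dict, name: str) -> bool:
    """Replication policy: copy the object, then verify the copy."""
    target[name] = source[name]
    return integrity_check(target, name, sha256_of(source[name]))


# Policy inventory: the data manager selects a policy by identifier.
POLICY_INVENTORY = {
    "integrity/sha256": integrity_check,
    "replication/copy-and-verify": replicate,
}


def execute(policy_id, *args):
    """Look up the selected policy in the inventory and run it."""
    return POLICY_INVENTORY[policy_id](*args)


primary = {"obj-1": b"observations"}   # toy stand-ins for two repositories
replica = {}

replicated = execute("replication/copy-and-verify", primary, replica, "obj-1")
intact = execute("integrity/sha256", replica, "obj-1", sha256_of(b"observations"))
print(replicated, intact)
```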
31 Data Fabric Interest Group
The Data Fabric IG looks at the data
production and consumption cycle in the labs
Other WGs/IGs look at data
publication workflows and citation
32 Data Fabric Discussions
results achieved after 18 months!
33
Recent paper by a number of colleagues engaged in RDA:
Data Management Trends, Principles and Components - What Needs to
be Done Next?
The co-authors do not claim ownership of any ideas but want to kick off a broad discussion
We need to accelerate the solution-finding and convergence process
Doc: http://hdl.handle.net/11304/992fe6a0-fe34-11e4-8a18-f31aa6f4d448
Data Fabric Wiki: https://rd-alliance.org/node/44520/all-wiki-index-by-group
Position Paper “Paris.doc”
8 Common Trends: partly stable, some still in debate
G8+ Principles: widely agreed
Consequences of Principles: not really thought through
19 Components: to be discussed now
Organizational Approaches: to be discussed now
34
RDA: http://rd-alliance.org
RDA Europe: http://europe.rd-alliance.org
Data Management Trends, Principles and Components - What Needs to be
Done Next? V6.1: http://hdl.handle.net/11304/992fe6a0-fe34-11e4-8a18-f31aa6f4d448
Principles for Data Sharing and Re-use: are they all the same?
http://hdl.handle.net/11304/1aab3df4-f3ce-11e4-ac7e-860aa0063d1f
Living with Data Management Plans
http://hdl.handle.net/11304/ea286e5a-f3d1-11e4-ac7e-860aa0063d1f
RDA Europe: Data Practices Analysis
http://hdl.handle.net/11304/6e1424cc-8927-11e4-ac7e-860aa0063d1f
DFT: https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
Data Fabric: https://rd-alliance.org/group/data-fabric-ig.html
Data Fabric Wiki: https://rd-alliance.org/node/44520/all-wiki-index-by-group
References