Large Scale Processing for Semantic Web Technologies
Seminar
Dr. Harald Sack / Dr. Peter Tröger
Jörg Waitelonis / Magnus Knuth / Nadine Ludwig
Hasso-Plattner-Institut für Softwaresystemtechnik
Universität Potsdam
Winter semester 2010/11
Non-commercial reproduction, distribution, and adaptation of these slides is permitted (license terms: CC-BY-NC).
Tuesday, 19 October 2010
1. Lecturers / Tutors
2. Semantic Web and Linked Data
3. Large Scale Processing in the FutureSOC Lab
4. Administrative Matters
Seminar: Large Scale Computing 4 Semantic Web Technologies, Dr. Harald Sack et al., Hasso-Plattner-Institut, Universität Potsdam
Lecturers / Tutors
Dr. Harald Sack
■ Senior Researcher at HPI since 1 January 2009 and head of the research group 'Semantic Technologies'
■ Research focus:
□ Semantic Web technologies
□ Multimedia retrieval
□ Knowledge representation
■ Video search engine yovisto.com
Dr. Peter Tröger
■ Senior Researcher at HPI since February 2010 in the area 'Dependable Many-Core Systems'
■ Research focus:
□ Dependable systems, failure prediction
□ Scalable programming of parallel systems
■ Intel Single Chip Cloud Computer (SCC)
■ CiteMaster.net
[Figure: Intel SCC block diagram: memory controllers MC0 to MC3, system interface, VRC; each tile contains a router, two IA-32 cores (Core0, Core1) with 256 KB L2 cache each, and a 16 KB MPB; 2-core clusters arranged in a 6x4 2-D mesh]
Dipl.-Inform. Jörg Waitelonis
■ Studied computer science at the University of Jena until 2006
■ 2006-2007: EXIST-SEED project Osotis
■ Founder of yovisto.com (since 2007)
■ Developer of REPLAY (ETH Zürich)
■ Research: Semantic Web, multimedia retrieval, search engine technologies
Dipl.-Inf. Nadine Ludwig
■ Studied computer science at TU Ilmenau until 2005
■ 2005-2010 TU Berlin:
□ Cooperative learning scenarios
□ Integration of Semantic Web technologies into cooperative learning platforms
■ At HPI since 05/2010:
□ Semantic analysis, entity mapping, disambiguation
Dipl.-Inform. Magnus Knuth
■ Studied computer science at the University of Leipzig until 2007
■ 2007-2010: Institute for Medical Informatics, Statistics and Epidemiology, Leipzig
■ Research: Semantic Web, multimedia retrieval, search engine technologies
Making scientific presentations available on the Internet
yovisto.com
• Video search engine focused on academic lectures
• Currently more than 10,000 lectures and scientific talks from around the world
• Automatic segmentation and video analysis
• User-generated co-annotation
• Social tagging
• Discussions
• Reviews
• Wikis
• Learning materials
• Pinpoint access to the video content being searched for
www.yovisto.com
■ THESEUS research programme: a new Internet-based knowledge infrastructure.
■ Use case Contentus: technologies for the media library of the future.
■ Project Mediaglobe: efficient work with media data in media archives and broadcasting companies.
■ Efficient search for and within AV content in media archives and broadcasting companies
■ Workflow solution for the efficient capture, preparation, and exploitation of AV content
2. Semantic Web and Linked Data
The Web is huge...
To be more precise, the WWW is rather huge...
• More than 25 × 10^9 documents in search engine indexes (TNL Blog: Google has 24 billion items index, considers MSN search nearest competitor, September 2005)
• The Google web crawler found more than 10^12 documents (The Official Google Blog: We knew the Web was Big....., July 25, 2008)
• The new Google search index Caffeine comprises 100 million gigabytes of data, i.e. 10^17 bytes (SMX Video: Google's Matt Cutts On Caffeine Launch, June 9, 2010, http://searchengineland.com/smx-video-googles-matt-cutts-on-caffeine-launch-43933)
• And then there is also the Deep Web (Darkweb) ... and it is supposed to be up to 500 times larger than the Surface Web (Bergman, 2001)
The Web is growing...
Multimedia, Real-Time Data, Sensor Data, ...
in 06/2010: 7 TB/day
in 05/2010:
• 24 h of video uploaded per minute
• 2 billion streamed videos per day
How to find something on the Web?
The 'Web of Data'
Semantic Web Technologies
• Interoperable and machine-understandable data semantics
• Based on formal knowledge representations
• Creating a 'Web of Data'
Semantic Web and Linked Data
From World Wide Web to Web of Data
'The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help…'
Tim Berners-Lee, Semantic Web Roadmap, Sept 1998
Prerequisite:
• Content can be read and interpreted correctly (= understood) by machines
Semantic Web
• (Natural language) web content is explicitly annotated with semantic metadata
• Semantic metadata encode the meaning (semantics) of web content and can be read and interpreted correctly by machines
Natural Language Processing
• Technology from traditional Information Retrieval (WWW search engines)
Understanding Web Content - I
Natural Language Processing
• Technology from traditional Information Retrieval (WWW search engines)
Entity mapping / disambiguation: what does the text 'FAB' refer to?
• 'fabulous'?
• Fabio Capello, manager of the England national football team?
• David James, goalkeeper of the England national football team?
• ...?
Understanding Web Content - II
Entity mapping: the text 'FAB' is mapped to the entity Fabio Capello
• Fabio Capello is a Soccer Manager
• Soccer Manager is a Person
Understanding Web Content - III
• Fabio Capello (entity) is a Soccer Manager (class): class membership ('has type')
• Soccer Manager (subclass) is a Person (superclass): 'is subclass of'
Understanding Web Content - IV
Classes:
• Soccer Manager is a Person
• Person hasBirthPlace Place
• Person hasBirthDate Date
Entities:
• Fabio Capello is a Soccer Manager
• Fabio Capello hasBirthDate 1946-06-18 (1946-06-18 is a Date)
• Fabio Capello hasBirthPlace San Canzian d'Isonzo (San Canzian d'Isonzo is a Place)
Fabio Capello: http://dbpedia.org/resource/Fabio_Capello
URI - Uniform Resource Identifier
http://dbpedia.org/resource/Fabio_Capello
http://en.wikipedia.org/wiki/Fabio_Capello
RDF - Resource Description Framework
:Fabio_Capello dbpp:birthPlace :San_Canzian_d%27Isonzo .
:Fabio_Capello dbpp:birthDate "1946-06-18" .
:Fabio_Capello rdf:type dbpo:SoccerManager .
:Fabio_Capello rdf:type dbpo:Person .
...
RDF triple: :Fabio_Capello (RDF subject) rdf:type (RDF property) dbpo:SoccerManager (RDF object)
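To make the subject / property / object structure concrete, the statements can be held as plain 3-tuples. The following is a minimal Python sketch under a simplifying assumption: each statement sits on one line in the form "subject property object ." as on the slide. The `Triple` type and the `parse_triple` helper are hypothetical names introduced for illustration; this is not a real Turtle parser.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def parse_triple(line: str) -> Triple:
    """Split one simplified 'subject predicate object .' statement."""
    body = line.strip()
    if not body.endswith("."):
        raise ValueError("statement must end with '.'")
    # Split on the first two runs of whitespace only, so that a quoted
    # literal containing spaces stays in the object position.
    subject, predicate, obj = body[:-1].strip().split(None, 2)
    return Triple(subject, predicate, obj.strip())


statements = [
    ':Fabio_Capello dbpp:birthPlace :San_Canzian_d%27Isonzo .',
    ':Fabio_Capello dbpp:birthDate "1946-06-18" .',
    ':Fabio_Capello rdf:type dbpo:SoccerManager .',
]
triples = [parse_triple(s) for s in statements]
```

Each element of `triples` then exposes `.subject`, `.predicate` and `.obj`, mirroring the three labelled parts of an RDF triple.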
http://dbpedia.org/ontology/SoccerManager
RDF Schema
dbpo:SoccerManager rdf:type owl:Class .
dbpo:SoccerManager rdfs:subClassOf dbpo:Person .
dbpo:SoccerManager rdfs:label "Soccer Manager" .
dbpp:birthPlace rdf:type rdf:Property .
dbpp:birthPlace rdfs:domain dbpo:Person .
dbpp:birthPlace rdfs:range dbpo:Place .
dbpp:birthDate rdf:type rdf:Property .
dbpp:birthDate rdfs:domain dbpo:Person .
dbpp:birthDate rdfs:range xsd:date .
...
[Diagram: Soccer Manager is a Person; Person hasBirthPlace Place; Person hasBirthDate Date]
Understanding Web Content - V
• Fabio Capello is a LivingPeople; LivingPeople is a Person
• DeadPeople is a Person
• Fabio Capello hasBirthDate 1946-06-18 (1946-06-18 is a Date)
• Logical constraint: LivingPeople ∩ DeadPeople = ∅
+ Rules (Description Logics)
∀x.∃y.hasDeathDate(x,y) ∧ Person(x) ∧ Date(y) → DeadPeople(x)
SELECT DISTINCT ?l ?l2 ?g
FROM <http://dbpedia.org>
WHERE {
  ?s dbpp:nationalteam ?o .
  ?s rdfs:label ?l FILTER langMatches( lang(?l), "EN" ) .
  ?s dbpp:nationalgoals ?g FILTER(?g > 10) .
  ?s dbpp:nationalteam ?nat .
  ?nat rdfs:label ?l2 FILTER langMatches( lang(?l2), "EN" ) .
}
ORDER BY DESC(?g)
Select all players of a national soccer team who have scored more than 10 goals while in the team
Linked Data
■ Term was originally coined by Tim Berners-Lee (Tim Berners-Lee, Linked Data, 2006, http://www.w3.org/DesignIssues/LinkedData.html)
The Web of Data is about a data model (RDF) and a naming model (URI) on the Web
M. Hausenblas, Quick Linked Data Introduction, http://www.slideshare.net/mediasemanticweb/quick-linked-data-introduction
Linked Data
■ Technical Principles
□ Use URIs to identify things uniquely (not only documents...)
□ Use HTTP URIs (URLs) so that these things can be referred to and looked up ("dereferenced") by people and user agents
□ Use RDF as a universal data model to provide useful information about these things
□ Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web
Linked Data
□ The application of the Linked Data principles leads to the creation of a 'Web of Data'
Linking Open Data
■ Publicly available structured data should be published as Linked Data
■ Various data sources should be interlinked
LOD wiki page: http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/
Linked Data Challenges
■ Coherence: relatively few, expensively maintained links
■ Quality: partly low-quality data and inconsistencies
■ Performance: still substantial penalties compared to relational database technologies
■ Data consumption: large-scale processing, schema mapping, and data fusion still in their infancy
■ Usability: missing direct end-user tools and network effect
Sören Auer: "Linked Data: Now what?", ESWC 2010 Panel Discussion
Problems and Experiments
A. Hogan et al.: Weaving the Pedantic Web, LDOW 2010
Experiment Summary
(1) Crawling the Semantic Web
(2) Structural Analysis
(3) Content-based Analysis
(4) Data Cleansing
(5) Augmenting Semantic Web Infrastructure
So what?
■ Interesting facts to find out about the Semantic Web & Linked Data
■ How big is the Semantic Universe?
■ # triples
■ # documents
■ # interlinks
■ Linking Open Data covers only the vocabularies/data registered in the LOD wiki → 14b RDF triples
■ What else is out there ... and how much of it?
■ ... and how do we get it?
(1) Crawling the Semantic Web
■ Of course we are not the first to be out there...
■ Swoogle: Li Ding et al.: Finding and Ranking Knowledge on the Semantic Web, ISWC 2005
■ Scutter/Slug: Leigh Dodds: Slug: A Semantic Web Crawler, 2006
■ Sindice: Giovanni Tummarello et al.: Sindice.com - Weaving the Open Linked Data, ISWC 2007 → 2.1b RDF triples
■ SWSE: Andreas Harth et al.: SWSE: Objects before Documents, Semantic Web Challenge 2008, ISWC 2008 → 1.1b RDF triples
■ Falcons: G. Cheng et al.: Falcons: Searching and Browsing Entities on the Semantic Web, WWW 2008 → 2.9b RDF triples
(2) Analyzing the Semantic Web I - Structural Analysis
■ Again we are not the first to be out there...
■ Structural analysis of the 'early' WWW
[Figure: bow-tie structure of the Web graph: IN (44m nodes), SCC (56m nodes), OUT (44m nodes), plus tunnels, appendices and unconnected components]
A. Broder et al.: Graph structure in the Web. In Comput. Netw. 33, 1-6 (Jun. 2000), 309-320.
(2) Analyzing the Semantic Web I - Structural Analysis
■ Again we are not the first to be there...
■ Structural analysis of the 'early' Semantic Web
Weiyi Ge et al.: Object Link Structure in the Semantic Web, ESWC 2010
■ Experimental setup
■ 18m RDF documents (Falcons crawl 2009)
■ 110m nodes with 190m edges
■ Analysis of RDF link graph
■ Average node degree: ≈3.4
■ Effective diameter: ≈11.5
■ Largest connected component: ≈88% of all nodes
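The reported average node degree can be checked against the node and edge counts. The sketch below assumes the link graph is treated as undirected, so each edge contributes to the degree of two nodes, and it uses the rounded figures from the slide (110m nodes, 190m edges). How the cited paper actually counts degrees is not stated here, so this is only a plausibility check; `average_degree` is a hypothetical helper name.

```python
def average_degree(num_nodes: int, num_edges: int) -> float:
    """Average degree of an undirected graph: every edge has two endpoints."""
    if num_nodes <= 0:
        raise ValueError("graph must contain at least one node")
    return 2 * num_edges / num_nodes


# Rounded figures from the slide: 110m nodes, 190m edges.
avg = average_degree(110_000_000, 190_000_000)
print(f"average degree: {avg:.2f}")
```

With these rounded inputs the result is about 3.45, which is in line with the ≈3.4 given on the slide.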
(3) Analyzing the Semantic Web II - Content-Based Analysis
■ Again we are not the first to be there...
http://pedantic-web.org/
A. Hogan et al.: Weaving the Pedantic Web, LDOW 2010
■ 150k documents with more than 12m RDF triples
■ Discovered categories of symptoms:
■ incomplete → dead links
■ incoherent → no correct interpretation (local)
■ hijack → no correct interpretation (remote)
■ inconsistent → contradictions
(3) Analyzing the Semantic Web II - Content-Based Analysis
■ Again we are not the first to be there...
Urbani et al.: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples, ESWC 2010
■ Artificial benchmark dataset used: Lehigh University Benchmark (LUBM) with 100b RDF triples
■ Computing the transitive closure (= reasoning)
■ Making implicit knowledge explicit
Example: given the schema statement 'Person hasBirthPlace Place' and the fact 'Fabio Capello hasBirthPlace San Canzian d'Isonzo', the class membership 'Fabio Capello is a Person' can be deduced.
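The deduction in the example can be sketched as a naive fixpoint computation over three RDFS-style rules (domain, range, subclass). This is an illustrative single-machine sketch only; it is not the distributed approach of the cited WebPIE paper, and the function name `rdfs_closure` and the tuple encoding of triples are assumptions made for this example.

```python
def rdfs_closure(triples):
    """Apply three RDFS-style rules until no new triple appears (fixpoint)."""
    closure = set(triples)
    while True:
        domain = {(s, o) for s, p, o in closure if p == "rdfs:domain"}
        range_ = {(s, o) for s, p, o in closure if p == "rdfs:range"}
        subclass = {(s, o) for s, p, o in closure if p == "rdfs:subClassOf"}
        derived = set()
        for s, p, o in closure:
            # Domain rule: the subject of property p belongs to p's domain.
            derived |= {(s, "rdf:type", c) for prop, c in domain if prop == p}
            # Range rule: the object of property p belongs to p's range.
            derived |= {(o, "rdf:type", c) for prop, c in range_ if prop == p}
            # Subclass rule: members of a class belong to its superclasses.
            if p == "rdf:type":
                derived |= {(s, "rdf:type", sup)
                            for sub, sup in subclass if sub == o}
        if derived <= closure:
            return closure
        closure |= derived


facts = {
    ("dbpp:birthPlace", "rdfs:domain", "dbpo:Person"),
    ("dbpp:birthPlace", "rdfs:range", "dbpo:Place"),
    (":Fabio_Capello", "dbpp:birthPlace", ":San_Canzian_d%27Isonzo"),
}
inferred = rdfs_closure(facts) - facts
```

Here `inferred` holds the triples that were only implicit in `facts`: that Fabio Capello is a person and that his birthplace is a place.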
(4) Analyzing the Semantic Web III - Data Cleansing
■ Trying to clean out Linked Open Data and possibly also (partially) the Semantic Web...
(1) Identify inconsistencies and ambiguities by (automated) content-based analysis
(2) Solve inconsistencies & ambiguities
■ if possible by reasoning
■ else by crowdsourcing (game-based evaluation, etc.)
Cleaning out the Augean stables...
AUGEAN STABLES: extremely nasty and smelly warehouses of filth, straw and manure
(5) Semantic Web Infrastructure - Triple Stores
■ RDF(S) data is stored in triple stores
■ Basic idea:
■ Use 1 table with 3 columns (s, p, o)
■ For every column / column combination, create index structures for fast access (spo, sop, pos, pso, ops, osp)
■ Drawback: many self-joins needed (memory consumption)
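The indexing idea can be sketched with nested dictionaries. The toy store below is an assumption-laden illustration: it keeps only three of the six orderings named on the slide (spo, pos, osp), lives entirely in memory, and the class name `TripleStore` and its `add` / `match` methods are hypothetical. Real triple stores use persistent index structures and are considerably more involved.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory triple store with three of the six possible indexes."""

    def __init__(self):
        self._spo = defaultdict(lambda: defaultdict(set))
        self._pos = defaultdict(lambda: defaultdict(set))
        self._osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Every triple is written once per index (the memory cost of indexing).
        self._spo[s][p].add(o)
        self._pos[p][o].add(s)
        self._osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Yield all triples matching the pattern; None acts as a wildcard."""
        if s is not None:
            # Subject bound: walk the spo index.
            for p2, objs in self._spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            # Predicate bound, subject free: walk the pos index.
            for o2, subs in self._pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subs:
                        yield (s2, p, o2)
        elif o is not None:
            # Only the object is bound: walk the osp index.
            for s2, preds in self._osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            # Nothing bound: full scan.
            for s2, po in self._spo.items():
                for p2, objs in po.items():
                    for o2 in objs:
                        yield (s2, p2, o2)


store = TripleStore()
store.add(":Fabio_Capello", "rdf:type", "dbpo:SoccerManager")
store.add(":Fabio_Capello", "dbpp:birthDate", '"1946-06-18"')
store.add("dbpo:SoccerManager", "rdfs:subClassOf", "dbpo:Person")
managers = sorted(store.match(p="rdf:type", o="dbpo:SoccerManager"))
```

A query with several triple patterns that share a variable would call `match` once per pattern and join the results, which is where the self-joins mentioned above come from.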
3. Large Scale Processing in the FutureSOC Lab
'Large Scale' Processing for Semantic Web Technologies
■ Three ways of doing anything faster [Pfister]
■ Work harder
■ Work smarter
■ Get help
■ 'A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem.' (Foster 1995)
■ Linear speedup: n times more resources solve the same task in 1/n of the time
■ Linear scaleup: n times more resources solve an n times larger problem in the same time
[Figure: CPUs 1-3 on Node 1 and CPUs 4-6 on Node 2; one axis is labelled 'Scaling Up', the other 'Scaling Out']
FutureSOC Lab
■ Collaboration with industry for software research on next-generation x86 hardware
■ Hewlett Packard DL980 G7: 8 x X7560 (8 cores), 2048 GB RAM
■ Fujitsu RX600 S5 1: 4 x E7530 (6 cores), 256 GB RAM
■ Fujitsu RX600 S5 2: 4 x X7550 (8 cores), 1024 GB RAM
■ Fujitsu RX200 S5 1: 2 x E5540 (4 cores), 36 GB RAM
■ Fujitsu RX300 S5 1: 1 x E5540 (4 cores), 12 GB RAM
FutureSOC Lab
■ Large-scale processing of Semantic Web data:
■ Speeding up the processing of single work items through parallelization
■ Scaling up the number of processed triples through distribution
■ Seminar participants will work on parallelization strategies for Semantic Web processing, practically tested on FutureSOC hardware
■ Parallelized XML parsers
■ Parallelized graph processing
■ I/O-optimized web crawling
■ Speed-up and scale-up of triple stores
The Problem of Speedup through Parallelization
■ Well-researched problem in parallel databases (D. DeWitt, J. Gray)
□ Start-up: initialization of parallel activity, synchronization of results
□ Interference: conflicts through access to shared data
□ Dispersion: overall execution time depends on the slowest process
□ All problems increase with the number of processors
■ Amdahl's Law
□ P is the portion of the program that benefits from parallelization
□ Maximum speedup by N processors: $s = \frac{(1-P)+P}{(1-P)+\frac{P}{N}} = \frac{1}{(1-P)+\frac{P}{N}}$
□ Maximum speedup tends to 1 / (1-P) as N grows
□ Parallelism is only useful with a small number of processors or a small (1-P)
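Amdahl's Law is easy to evaluate numerically. The following short Python sketch computes the speedup s = 1 / ((1-P) + P/N) for a few processor counts; the function name `amdahl_speedup` and the chosen values of P and N are illustrative assumptions.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Maximum speedup with n processors if a fraction p can be parallelized."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Speedup for P = 95% over a growing number of processors.
for n in (1, 10, 100, 1000, 10_000):
    print(f"N={n:>6}: speedup {amdahl_speedup(0.95, n):.2f}")
```

For P = 95% the values approach but never reach 1 / (1 - 0.95) = 20, no matter how many processors are added.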
[Figure: Amdahl's Law: speedup (axis up to 20) versus number of processors (1 to 10^4, logarithmic axis) for P = 10%, 25%, 50%, 75%, 90% and 95%]
4. Administrative Matters
Seminar Large Scale Processing for Semantic Web Technologies
□ Weekly semester hours: 4
□ ECTS: 6
□ Assessment:
□ Written report on the presentation topic (approx. 20 pages)
□ Implementation of an assigned programming task in a team
□ Presentation of the results (interim presentation, final presentation, weekly meetings)
□ Project teams of 2-3 students each work on one of the proposed tasks
■ Dates:
□ Weekly seminar group meeting
□ Date by arrangement
□ Interim presentation of the project results
□ Final presentation of the results
□ Date in the last week of the semester
Literature
• P. Hitzler, M. Krötzsch, S. Rudolph, Y. Sure: Semantic Web - Grundlagen, Springer, 2007.
• Basic materials via BibSonomy bookmarks
• http://www.bibsonomy.org/user/lysander07/lsc1011