
Large Scale Computing for Semantic Web Technologies

Date post: 10-Feb-2017
Upload: harald-sack
Large Scale Processing for Semantic Web Technologies

Seminar

Dr. Harald Sack / Dr. Peter Tröger

Jörg Waitelonis / Magnus Knuth / Nadine Ludwig

Hasso-Plattner-Institut für Softwaresystemtechnik, Universität Potsdam

Winter semester 2010/11

Non-commercial reproduction, distribution, and adaptation of these slides is permitted (license terms: CC-BY-NC).

Tuesday, 19 October 2010

1. Lecturers / Tutors

2. Semantic Web and Linked Data

3. Large Scale Processing in the FutureSOC Lab

4. Administrative Matters

Seminar: Large Scale Computing 4 Semantic Web Technologies, Dr. Harald Sack et al., Hasso-Plattner-Institut, Universität Potsdam

Lecturers / Tutors

Dr. Harald Sack

■ Senior Researcher at HPI since 1 January 2009 and head of the research group 'Semantic Technologies'

■ Research focus:

□ Semantic Web technologies

□ Multimedia retrieval

□ Knowledge representation

■ Video search engine yovisto.com

Dr. Peter Tröger

■ Senior Researcher at HPI since February 2010, in the area of "Dependable Many-Core Systems"

■ Research focus:

□ Dependable systems, failure prediction

□ Scalable programming of parallel systems

■ Intel Single Chip Cloud Computer (SCC)

■ CiteMaster.net

Figure: Intel SCC architecture. Each tile contains two IA-32 cores (Core 0, Core 1), each with a 256 KB L2 cache, plus a 16 KB message-passing buffer (MPB) and a router. The two-core tiles are arranged in a 6x4 2-D mesh attached to four memory controllers (MC0 to MC3), the system interface, and the VRC.

Dipl.-Inform. Jörg Waitelonis

■ Studied computer science at the University of Jena until 2006

■ 2006-2007: EXIST-SEED project Osotis

■ Since 2007: founder of yovisto.com

■ Developer of REPLAY (ETH Zürich)

■ Research: Semantic Web, multimedia retrieval, search engine technologies

Dipl.-Inf. Nadine Ludwig

■ Studied computer science at TU Ilmenau until 2005

■ 2005-2010 TU Berlin:

□ Cooperative learning scenarios

□ Integration of Semantic Web technologies into cooperative learning platforms

■ Since 05/2010 at HPI:

□ Semantic analysis, entity mapping, disambiguation

Dipl.-Inform. Magnus Knuth

■ Studied computer science at the University of Leipzig until 2007

■ 2007-2010: Institute for Medical Informatics, Statistics and Epidemiology, Leipzig

■ Research: Semantic Web, multimedia retrieval, search engine technologies

Making scientific presentations available on the Internet

yovisto.com

• Video search engine focused on academic lectures

• currently more than 10,000 lectures and scientific talks from all over the world

• automatic segmentation and video analysis

• user-generated co-annotation: social tagging, discussions, reviews, wikis, learning materials

• pinpoint access to the video content being sought

www.yovisto.com

■ THESEUS research programme: a new Internet-based knowledge infrastructure.

■ Use case Contentus: technologies for the media library of the future.

■ Project Mediaglobe: efficient work with media data in media archives and broadcasting companies.

■ Efficient search for and within AV content in media archives and broadcasting companies

■ Workflow solution for the efficient capture, preparation, and exploitation of AV content


The Web is huge...

To be more precise, the WWW is rather huge...

• more than 25 × 10^9 documents in search engine indexes (TNL Blog: Google has 24 billion items index, considers MSN search nearest competitor, September 2005)

• Google's web crawler found more than 10^12 documents (The Official Google Blog: We knew the Web was Big..., July 25, 2008)

• The new Google search index Caffeine comprises 100 million gigabytes of data, i.e. 10^17 bytes (SMX Video: Google's Matt Cutts On Caffeine Launch, June 9, 2010, http://searchengineland.com/smx-video-googles-matt-cutts-on-caffeine-launch-43933)

• And then there is also the Deep Web (Darkweb)... which is supposed to be up to 500 times larger than the Surface Web (Bergman, 2001)

The Web is growing...

Multimedia, Real-Time Data, Sensor Data, ...

in 06/2010: 7 TB/day

in 05/2010:

• 24 h of video uploaded per minute

• 2 billion streamed videos per day

How to find something on the Web?


The 'Web of Data'

Semantic Web Technologies

• Interoperable and machine-understandable data semantics

• Based on formal knowledge representations

• Creating a 'Web of Data'

Semantic Web and Linked Data

From World Wide Web to Web of Data

"The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help…"

Tim Berners-Lee, Semantic Web Roadmap, Sept 1998

Prerequisites:

• Content can be read and interpreted correctly (= understood) by machines

Semantic Web

• (natural language) web content is explicitly annotated with semantic metadata

• semantic metadata encode the meaning (semantics) of web content and can be read and interpreted correctly by machines

Natural Language Processing

• Technology from traditional Information Retrieval (WWW search engines)

Understanding Web Content - I

Natural Language Processing

• Technology from traditional Information Retrieval (WWW search engines)

text: "FAB" → ?

• fabulous

• Fabio Capello, manager of the UK national football team

• David James, goalkeeper of the UK national football team

• ...

Entity Mapping / Disambiguation

Understanding Web Content - II

text: "FAB" → (Entity Mapping) → Fabio Capello

Fabio Capello is a Soccer Manager

Soccer Manager is a Person

Understanding Web Content - III

Fabio Capello (entity) is a Soccer Manager (class): class membership, "has type"

Soccer Manager (class, subclass) is a Person (class, superclass): "is subclass of"

Understanding Web Content - IV

Classes:

• Soccer Manager is a Person

• Person hasBirthPlace Place

• Person hasBirthDate Date

Entities:

• Fabio Capello is a Soccer Manager

• Fabio Capello hasBirthDate 1946-06-18 (1946-06-18 is a Date)

• Fabio Capello hasBirthPlace San Canzian d'Isonzo (San Canzian d'Isonzo is a Place)


Fabio Capello → http://dbpedia.org/resource/Fabio_Capello

URI - Uniform Resource Identifier

http://dbpedia.org/resource/Fabio_Capello

http://en.wikipedia.org/wiki/Fabio_Capello

RDF - Resource Description Framework

http://dbpedia.org/resource/Fabio_Capello

:Fabio_Capello dbpp:birthPlace :San_Canzian_d%27Isonzo .
:Fabio_Capello dbpp:birthDate "1946-06-18" .
:Fabio_Capello rdf:type dbpo:SoccerManager .
:Fabio_Capello rdf:type dbpo:Person .
...

RDF triple:

:Fabio_Capello rdf:type dbpo:SoccerManager .
(RDF subject)  (RDF property)  (RDF object)
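The triples on this slide can be modelled as plain (subject, property, object) tuples. The following is a minimal sketch, assuming prefixed names are kept as plain strings and no RDF library is used; `Triple` and `objects_of` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; prefixed names are kept as plain strings (assumption)."""
    subject: str
    prop: str
    obj: str


# The four statements about Fabio Capello shown on the slide.
triples = [
    Triple(":Fabio_Capello", "dbpp:birthPlace", ":San_Canzian_d%27Isonzo"),
    Triple(":Fabio_Capello", "dbpp:birthDate", "1946-06-18"),
    Triple(":Fabio_Capello", "rdf:type", "dbpo:SoccerManager"),
    Triple(":Fabio_Capello", "rdf:type", "dbpo:Person"),
]


def objects_of(subject, prop):
    """Return all objects stated for the given subject and property."""
    return [t.obj for t in triples if t.subject == subject and t.prop == prop]


classes = objects_of(":Fabio_Capello", "rdf:type")
```

Here `classes` collects both class memberships stated for `:Fabio_Capello`, which is the same lookup a triple pattern with a fixed subject and property performs.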

RDF Schema

http://dbpedia.org/ontology/soccer_manager

dbpo:SoccerManager rdf:type owl:Class .
dbpo:SoccerManager rdfs:subClassOf dbpo:Person .
dbpo:SoccerManager rdfs:label "Soccer Manager" .
dbpp:birthPlace rdf:type rdf:Property .
dbpp:birthPlace rdfs:domain dbpo:Person .
dbpp:birthPlace rdfs:range dbpo:Place .
dbpp:birthDate rdf:type rdf:Property .
dbpp:birthDate rdfs:domain dbpo:Person .
dbpp:birthDate rdfs:range xsd:date .
...

Diagram: Soccer Manager is a Person; Person hasBirthPlace Place; Person hasBirthDate Date

Understanding Web Content - V

• Fabio Capello is a LivingPeople; LivingPeople is a Person; DeadPeople is a Person

• Fabio Capello hasBirthDate 1946-06-18; 1946-06-18 is a Date; Person hasBirthDate Date

• Logical constraint: LivingPeople ∩ DeadPeople = ∅

+ Rules (Description Logics)

∀x.(∃y.hasDeathDate(x,y) ∧ Person(x) ∧ Date(y)) → DeadPeople(x)
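How the rule and the disjointness constraint interact can be sketched in a few lines. This is an illustrative sketch only, not a description-logic reasoner: class memberships are (class, individual) pairs, and the individual `Person_X` with its death date is a hypothetical, deliberately inconsistent record that does not appear on the slides.

```python
def apply_death_rule(types, has_death_date):
    """Add DeadPeople(x) for every x with Person(x), hasDeathDate(x, y), Date(y)."""
    persons = {x for cls, x in types if cls == "Person"}
    dates = {y for cls, y in types if cls == "Date"}
    inferred = set(types)
    for x, y in has_death_date:
        if x in persons and y in dates:
            inferred.add(("DeadPeople", x))
    return inferred


def disjointness_violations(types):
    """Return individuals that break LivingPeople ∩ DeadPeople = ∅."""
    living = {x for cls, x in types if cls == "LivingPeople"}
    dead = {x for cls, x in types if cls == "DeadPeople"}
    return living & dead


types = {
    ("Person", "Fabio_Capello"),
    ("LivingPeople", "Fabio_Capello"),
    ("Person", "Person_X"),        # hypothetical individual
    ("LivingPeople", "Person_X"),  # deliberately inconsistent with the death date
    ("Date", "1900-01-01"),        # hypothetical date
}
has_death_date = {("Person_X", "1900-01-01")}

inferred = apply_death_rule(types, has_death_date)
conflicts = disjointness_violations(inferred)
```

Only `Person_X` ends up in `conflicts`: the rule derives its membership in DeadPeople, which collides with the stated membership in LivingPeople. Fabio Capello has no death date, so nothing is derived for him.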

Select all players of a national soccer team who have scored more than 10 goals while in the team:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpp: <http://dbpedia.org/property/>

SELECT DISTINCT ?l ?l2 ?g
FROM <http://dbpedia.org>
WHERE {
  ?s dbpp:nationalteam ?nat .
  ?s rdfs:label ?l .
  FILTER langMatches( lang(?l), "EN" )
  ?s dbpp:nationalgoals ?g .
  FILTER(?g > 10)
  ?nat rdfs:label ?l2 .
  FILTER langMatches( lang(?l2), "EN" )
}
ORDER BY DESC(?g)

Linked Data

■ Term was originally coined by Tim Berners-Lee (Tim Berners-Lee, Linked Data, 2006, http://www.w3.org/DesignIssues/LinkedData.html)

■ The Web of Data is about a data (RDF) and naming (URI) model on the Web (M. Hausenblas, Quick Linked Data Introduction, http://www.slideshare.net/mediasemanticweb/quick-linked-data-introduction)

Linked Data

■ Technical Principles

□ use URIs to identify things uniquely (not only documents...)

□ use HTTP URIs (URLs) so that these things can be referred to and looked up ("dereferenced") by people and user agents

□ use RDF as a universal data model to provide useful information about these things

□ include links to other, related URIs in the exposed data to improve discovery of other related information on the Web
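The second principle, looking up ("dereferencing") an HTTP URI, can be sketched with Python's standard library. The sketch assumes the server supports HTTP content negotiation via the `Accept` header and can return `application/rdf+xml`; the slides do not say how the RDF is served, and `build_rdf_request` is a hypothetical helper name. The request is only built here, not sent.

```python
import urllib.request


def build_rdf_request(uri, mime="application/rdf+xml"):
    """Build (but do not send) an HTTP GET request that asks for an RDF representation."""
    return urllib.request.Request(uri, headers={"Accept": mime})


req = build_rdf_request("http://dbpedia.org/resource/Fabio_Capello")

# To actually dereference the URI (needs network access):
#   with urllib.request.urlopen(req) as response:
#       rdf_document = response.read()
```

A user agent that receives RDF this way can then follow the URIs found in the returned triples, which is what the fourth principle relies on.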

Linked Data

□ The application of the Linked Data principles leads to the creation of a 'Web of Data'

Linking Open Data

■ Publicly available structured data should be published as Linked Data

■ Various data sources should be interlinked

LOD wiki page: http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/

Linked Data Challenges

■ Coherence: relatively few, expensively maintained links

■ Quality: partly low-quality data and inconsistencies

■ Performance: still substantial penalties compared to relational database technologies

■ Data consumption: large-scale processing, schema mapping, and data fusion are still in their infancy

■ Usability: missing direct end-user tools and network effect

Sören Auer: "Linked Data: Now what?", ESWC 2010 panel discussion

Problems and Experiments

A. Hogan et al.: Weaving the Pedantic Web, LDOW 2010

Experiment Summary

(1) Crawling the Semantic Web

(2) Structural Analysis

(3) Content-based Analysis

(4) Data Cleansing

(5) Augmenting Semantic Web Infrastructure

So what?

■ Interesting facts to find out about the Semantic Web & Linked Data

■ How big is the Semantic Universe?

□ # triples

□ # documents

□ # interlinking

□ Linking Open Data covers only the vocabularies/data registered in the LOD wiki → 14b RDF triples

■ What else is out there... and how much of it?

■ ...and how do we get it?

(1) Crawling the Semantic Web

■ Of course, we are not the first to be out there...

■ Swoogle: Li Ding et al.: Finding and Ranking Knowledge on the Semantic Web, ISWC 2005

■ Scutter/Slug: Leigh Dodds: Slug: A Semantic Web Crawler, 2006

■ Sindice: Giovanni Tummarello et al.: Sindice.com - Weaving the Open Linked Data, ISWC 2007 → 2.1b RDF triples

■ SWSE: Andreas Harth et al.: SWSE: Objects before Documents, Semantic Web Challenge 2008, ISWC 2008 → 1.1b RDF triples

■ Falcons: G. Cheng et al.: Falcons: Searching and Browsing Entities on the Semantic Web, WWW 2008 → 2.9b RDF triples

(2) Analyzing the Semantic Web I - Structural Analysis

■ Again, we are not the first to be out there...

■ Structural analysis of the 'early' WWW

Figure: bow-tie structure of the Web graph with IN (44m nodes), SCC (56m nodes), and OUT (44m nodes), plus tunnels, appendices, and unconnected components.

A. Broder et al.: Graph structure in the Web. Comput. Netw. 33, 1-6 (Jun. 2000), 309-320.

(2) Analyzing the Semantic Web I - Structural Analysis

■ Again, we are not the first to be there...

■ Structural analysis of the 'early' Semantic Web

Weiyi Ge et al.: Object Link Structure in the Semantic Web, ESWC 2010

■ Experimental setup

□ 18m RDF documents (Falcons crawl 2009)

□ 110m nodes with 190m edges

■ Analysis of the RDF link graph

□ average node degree: ≈3.4

□ effective diameter: ≈11.5

□ largest connected component: ≈88% of all nodes
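Two of the reported figures can be related to simple graph computations. The sketch below assumes the link graph is treated as undirected, so that every edge contributes to the degree of two nodes; the cited paper may define its measures differently. The function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * num_edges / num_nodes


def largest_component_share(edges):
    """Fraction of nodes in the largest connected component (iterative DFS)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best / len(adjacency)


# Rounded counts from the slide: 110m nodes, 190m edges.
degree = average_degree(110_000_000, 190_000_000)

# Toy graph (hypothetical): one component of 3 nodes, one of 2 nodes.
share = largest_component_share([("a", "b"), ("b", "c"), ("d", "e")])
```

With the rounded counts, `degree` comes out at about 3.45, which is in line with the reported value of ≈3.4. At the scale of the real crawl, the component search is exactly the kind of graph processing that needs parallelization.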

(3) Analyzing the Semantic Web II - Content-Based Analysis

■ Again, we are not the first to be there...

A. Hogan et al.: Weaving the Pedantic Web, LDOW 2010 (http://pedantic-web.org/)

■ 150k documents with more than 12m RDF triples

■ Discovered categories of symptoms:

□ incomplete → dead links

□ incoherent → no correct interpretation (local)

□ hijack → no correct interpretation (remote)

□ inconsistent → contradictions

(3) Analyzing the Semantic Web II - Content-Based Analysis

■ Again, we are not the first to be there...

Urbani et al.: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples, ESWC 2010

■ Artificial benchmark dataset used: Lehigh University Benchmark (LUBM) with 100b RDF triples

■ Computing the transitive closure (= reasoning)

■ Making implicit knowledge explicit

Example: given the statement "Fabio Capello hasBirthPlace San Canzian d'Isonzo" and the schema statement "Person hasBirthPlace Place", the class membership "Fabio Capello is a Person" can be deduced.
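The deduction in the example corresponds to the RDFS domain rule. The following naive fixpoint loop is a sketch of that idea over an in-memory set of triples; it is not the MapReduce-based method of WebPIE, it covers only three RDFS rules (domain, range, subclass), and `rdfs_closure` is a hypothetical name.

```python
def rdfs_closure(triples):
    """Apply the RDFS domain, range, and subclass rules until nothing new appears."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # Domain rule: (p rdfs:domain C) and (s p o) => (s rdf:type C)
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))
                # Range rule: (p rdfs:range C) and (s p o) => (o rdf:type C)
                if p2 == "rdfs:range" and s2 == p:
                    new.add((o, "rdf:type", o2))
                # Subclass rule: (s rdf:type C) and (C rdfs:subClassOf D) => (s rdf:type D)
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if not new <= closure:
            closure |= new
            changed = True
    return closure


explicit = {
    (":Fabio_Capello", "dbpp:birthPlace", ":San_Canzian_d%27Isonzo"),
    ("dbpp:birthPlace", "rdfs:domain", "dbpo:Person"),
    ("dbpp:birthPlace", "rdfs:range", "dbpo:Place"),
}
closure = rdfs_closure(explicit)
implicit = closure - explicit
```

`implicit` then holds the two deduced class memberships: Fabio Capello as a `dbpo:Person` and San Canzian d'Isonzo as a `dbpo:Place`. The nested loop is quadratic in the number of triples, which is why closure computation at the scale of billions of triples needs a distributed approach.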

(4) Analyzing the Semantic Web III - Data Cleansing

■ Trying to clean out Linked Open Data and possibly also (partially) the Semantic Web...

(1) Identify inconsistencies and ambiguities by (automated) content-based analysis

(2) Solve inconsistencies and ambiguities

■ if possible by reasoning

■ else by crowdsourcing (game-based evaluation, etc.)

Cleaning out the Augean stables...

AUGEAN STABLES: extremely nasty and smelly warehouses of filth, straw and manure

(5) Semantic Web Infrastructure - Triple Stores

■ RDF(S) data is stored in triple stores

■ Basic idea:

□ Use 1 table with 3 columns (s, p, o)

□ For every column / column combination, create index structures for fast access (spo, sop, pos, pso, ops, osp)

■ Drawback: many self-joins needed (memory consumption)
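The basic idea of one (s, p, o) table plus one index per column ordering can be sketched in memory. This is a toy illustration under simplifying assumptions (nested dictionaries instead of on-disk index structures, no query planner); `TinyTripleStore` and its methods are hypothetical names.

```python
from collections import defaultdict
from itertools import permutations


class TinyTripleStore:
    """One triple table plus six nested-dict indexes (spo, sop, pso, pos, osp, ops)."""

    def __init__(self):
        self.triples = set()
        self.indexes = {
            "".join(order): defaultdict(lambda: defaultdict(set))
            for order in permutations("spo")
        }

    def add(self, s, p, o):
        """Store the triple and register it in every index ordering."""
        self.triples.add((s, p, o))
        row = {"s": s, "p": p, "o": o}
        for order, index in self.indexes.items():
            first, second, third = (row[column] for column in order)
            index[first][second].add(third)

    def lookup(self, order, first, second):
        """Return the values of the third column for fixed first and second columns."""
        return set(self.indexes[order].get(first, {}).get(second, set()))


store = TinyTripleStore()
store.add(":Fabio_Capello", "rdf:type", "dbpo:SoccerManager")
store.add(":Fabio_Capello", "rdf:type", "dbpo:Person")
store.add(":Fabio_Capello", "dbpp:birthDate", "1946-06-18")

managers = store.lookup("pos", "rdf:type", "dbpo:SoccerManager")
```

Every triple is stored seven times here (once in the table and once per index), which illustrates the memory cost of fast access in any column order. Queries with several triple patterns still have to join the results of such lookups, which is the self-join drawback named above.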


"Large Scale" Processing for Semantic Web Technologies

■ Three ways of doing anything faster [Pfister]

□ Work harder

□ Work smarter

□ Get help

■ "A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem." (Foster 1995)

■ Linear speedup: n times more resources solve the same task in 1/n of the time

■ Linear scaleup: n times more resources solve an n times larger problem in the same time

Figure: CPUs 1-3 on Node 1 and CPUs 4-6 on Node 2. Adding CPUs within a node is scaling up; adding nodes is scaling out.

FutureSOC Lab

■ Collaboration with industry for software research on next-generation x86 hardware

■ Hewlett Packard DL980 G7: 8 x X7560 (8 cores), 2048 GB RAM

■ Fujitsu RX600 S5 1: 4 x E7530 (6 cores), 256 GB RAM

■ Fujitsu RX600 S5 2: 4 x X7550 (8 cores), 1024 GB RAM

■ Fujitsu RX200 S5 1: 2 x E5540 (4 cores), 36 GB RAM

■ Fujitsu RX300 S5 1: 1 x E5540 (4 cores), 12 GB RAM

FutureSOC Lab

■ Large-scale processing of Semantic Web data:

□ Speeding up the processing of single work items through parallelization

□ Scaling up the number of processed triples through distribution

■ Seminar participants will work on parallelization strategies for Semantic Web processing, practically tested on FutureSOC hardware

□ Parallelized XML parsers

□ Parallelized graph processing

□ I/O-optimized web crawling

□ Speed-up and scale-up of triple stores

The Problem of Speedup through Parallelization

■ Well-researched problem in parallel databases (D. DeWitt, J. Gray)

□ Start-up: initialization of parallel activity, synchronization of results

□ Interference: conflicts through access to shared data

□ Dispersion: overall execution time depends on the slowest process

□ All problems increase with the number of processors

■ Amdahl's Law

□ P is the portion of the program that benefits from parallelization

□ Maximum speedup by N processors:

$$ s = \frac{(1-P) + P}{(1-P) + \frac{P}{N}} = \frac{1}{(1-P) + \frac{P}{N}} $$

□ For N → ∞, the maximum speedup tends to 1 / (1-P)

□ Parallelism is only useful with a small number of processors or a small (1-P)

Figure: Amdahl's Law. Speedup (vertical axis, ticks from 5 to 20) versus number of processors (logarithmic axis, 1 to 1×10^4), with one curve each for P = 10%, 25%, 50%, 75%, 90%, and 95%.
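Amdahl's Law, s = 1 / ((1-P) + P/N), can be evaluated directly for the P values and processor counts of the plot. A minimal sketch; the function name `amdahl_speedup` is an assumption made for illustration.

```python
def amdahl_speedup(p, n):
    """Maximum speedup with n processors when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)


processor_counts = (1, 10, 100, 1000, 10000)

for p in (0.10, 0.25, 0.50, 0.75, 0.90, 0.95):
    speedups = [round(amdahl_speedup(p, n), 2) for n in processor_counts]
    limit = 1.0 / (1.0 - p)  # upper bound for n -> infinity
    print(f"P={p:.0%}: {speedups}  limit={limit:.1f}")
```

Each row approaches its limit 1 / (1-P) as the processor count grows. For P = 95% the bound is 20, so even 10,000 processors cannot deliver more than a 20-fold speedup.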


Administrative Matters

Seminar Large Scale Processing for Semantic Web Technologies

□ Weekly semester hours (SWS): 4

□ ECTS: 6

□ Assessment:

□ Written report on the presentation topic (about 20 pages)

□ Implementation of an assigned programming task in a team

□ Presentation of the results (interim presentation, final presentation, weekly meetings)

□ Project teams of 2-3 students each work on one of the proposed tasks

■ Dates:

□ Weekly seminar group meeting (date by arrangement)

□ Interim presentation of the project results

□ Final presentation of the results (in the last week of the semester)

Literature

• P. Hitzler, M. Krötzsch, S. Rudolph, Y. Sure: Semantic Web Grundlagen, Springer, 2007.

• Basic materials via BibSonomy bookmarks

• http://www.bibsonomy.org/user/lysander07/lsc1011