10/21/13
1
15383: txt proc!
15383: Intro to Text Processing
Behrang Mohit
15383: txt proc!
Welcome! !लकम !مرحبا
• Welcome To 15383!
• Instructor: Behrang Mohit – Office: 1004 – first-‐[email protected]
2
10/21/13
2
15383: txt proc!
The AI Dream
• CreaKng intelligent systems capable of simulaKng humans
3
15383: txt proc!
Language and Text
• Has been present since early days of human civilizaKon.
4
10/21/13
3
15383: txt proc!
21st Century: So Much Text!
• Problem: InformaKon overload!
5
15383: txt proc!
21st Century: So Much Text!
• ExponenKal growth of text in the surface web and also the deep web. – 400m tweets/day
6
10/21/13
4
15383: txt proc!
Generate, Organize and Process
• Need to generate, organize and process text: – Different topics and genres • News, science, sport, film subKtles, children stories, jokes,…
– Different languages – Different pla_orms and mediums • prints, desktop, mobile device, TV, … • Internet
– Official channels (government and corporate webpages) – Personal pages, social media
7
15383: txt proc!
Examples of Text Processing Tasks
• Searching and categorizing • ExtracKng informaKon from text – Who is doing what to whom when
• Summarize text and answer quesKons
• Translate • Understand text • Chat and counsel humans (psychotherapy)
8
10/21/13
5
15383: txt proc!
15383: What Will We Learn Here?
• Features of text: Corpus, Encoding
• Basics of staKsKcs • Text OrganizaKon:
– Document ClassificaKon
– Search: InformaKon Retrieval
9
15383: txt proc!
15383: What Will We Learn Here?
• Features of text: Corpus, Encoding
• Basics of staKsKcs • Text OrganizaKon:
– Document ClassificaKon
– Search: InformaKon Retrieval • LinguisKcs in text processing
– a.k.a. Natural Language Processing
• Python programming (NLTK)
• Course project: SenKment Analysis on Social Media
10
10/21/13
6
15383: txt proc!
Text As A Medium
• Natural Language Processing – Speech – Text • Focus on natural language text
11
15383: txt proc!
Today
• An overview of topics
12
10/21/13
7
15383: txt proc!
Text And Its Encodings
• Text: Words, sentences, documents, … • Processing and organizing large volumes of text – Corpus (corpora) • For building and evaluaKng text processing systems
• Might include extra linguisKcs informaKon
• Encoding the text
13
15383: txt proc!
StaKsKcs In Text Processing
• Rule-‐based systems vs. staKsKcal systems • ProbabiliKes • StaKsKcal learning – Supervised learning
14
10/21/13
8
15383: txt proc!
Text OrganizaKon
• Large volumes of text organized text • Document classificaKon – Sport, poliKcs, science, … – Email classificaKon • Work, Fun, Spam, …
• Searching documents – Ask, Google, Bing, etc.
15
15383: txt proc!
Natural Language Processing is …
• NLP or – ComputaKonal LinguisKcs
– Human Language Technologies
• Goal: Making computers capable of using human language as their input or output, performing intelligent tasks.
16
10/21/13
9
15383: txt proc!
NLP and ArKficial Intelligence
• NLP is the fundamental problem of ArKficial Intelligence (AI).
• Turing test for the intelligence of a machine – If a human judge can not disKnguish between a machine and human in a conversaKon framework, the machine passes the Turing test.
17
15383: txt proc!
The Dream of Talking Machines!
• 2001 Space Odyssey – Dave (human): Open the door Hal
– HAL (machine): I’m sorry Dave, I can’t do that.
• HAL: An intelligent system capable of: – Understanding and genera2ng human language
18
10/21/13
10
15383: txt proc!
LinguisKc Layers
• PhoneKcs • Phonology • Morphology
• Syntax • SemanKcs
• PragmaKcs
• Discourse
19
15383: txt proc!
LinguisKcs Layers: Morphology
• What are building blocks of words? – goes go + es
– premest Preny + est
• Different levels of complicaKon in morphology – English – Arabic, Finnish, Turkish • wsyaktobun w + s + yaktob + un • And will write they and they will write
20
10/21/13
11
15383: txt proc!
LinguisKc Layers: Syntax
• How do words come together to form more complex units? – Phrases, sentences, relaKonship between phrases – Mostly at the sentence level – Zeinab bought a book .
Noun Verb Det Noun PunctuaKon Subject Verb Object
21
15383: txt proc!
LinguisKc Layers: SemanKcs
• What is the meaning of terms in a sentence – Suhail bought a book. Commercial transacKon:
Buyer: Suhail AcKon: buying Commodity: book
22
10/21/13
12
15383: txt proc!
LinguisKc Layers: PragmaKcs and Discourse
• Going beyond a sentence-‐level analysis – Ahmad arrived in Doha. He was accompanied by his family. They went directly to a wedding from the airport.
23
15383: txt proc!
LinguisKcs Layers
• PhoneKcs • Phonology • Morphology
• Syntax • Seman2cs
• PragmaKcs
• Discourse
24
10/21/13
13
15383: txt proc!
Two Major NLP Challenges
• Challenge 1: Gemng the proper linguisKc representaKon of the input – From sound waves to text – From the sentence to syntacKc tree • Mariem bought a book Mariem: sub, book: Obj
– ….
25
15383: txt proc!
Two Major NLP Challenges
• Challenge 2: Ambiguity in language
• A language understanding example – “At last, a computer that understands you like your mother!” • Ad from Microsop (in early 1980s)
• Example by Stuart Shiebert
26
10/21/13
14
15383: txt proc!
A Computer That Understands…
• At last, a computer that understands you like your mother!
• Computer understands you as well as your mother understands you
• Computer understands that you like your mother • Computer understands you as well as it understands your mother
• Problem: Ambiguity in human expressions
27
15383: txt proc!
A Computer That Understands…
• At last, a computer that understands you like your mother!
• Computer understands you as well as your mother understands you
• Computer understands that you like your mother • Computer understands you as well as it understands your mother
• Problem: Ambiguity in human expressions
28
10/21/13
15
15383: txt proc!
A Computer That Understands…
• At last, a computer that understands you like your mother!
• Computer understands you as well as your mother understands you
• Computer understands that you like your mother • Computer understands you as well as it understands your mother
• Problem: Ambiguity in human expressions
29
15383: txt proc!
The Ambiguity Problem
• Humans use common-‐sense, bits of culture, world knowledge in their expressions! – Do computers understand all of those?
30
10/21/13
16
15383: txt proc!
Levels of Ambiguity: AcousKcs
• Understands you like your mother
• Understands you lie cured mother
• It is hard to recognize speech. • It is hard to wreck a nice beach.
31
15383: txt proc!
Levels of Ambiguity: Syntax
• Different sentence structure (syntax):
– Computer that understands you (like your mother
[does])
– Computer that understand ([that] you like your mother)
32
10/21/13
17
15383: txt proc!
Levels of Ambiguity: SemanKcs
• … knows you like your mother • The female parent
– Most probably
• A vat (dish) for making vinegar
• We put our money in the bank – Money bury under the mud (river bank)! – Financial insKtuKon • Most probably
33
15383: txt proc!
Levels of Ambiguity: Discourse
• Leila says they are selling a computer that knows you like your mother. But she doesn’t seem to be happy about it.
– Who does she refer to? • mother, computer, Leila?
34
10/21/13
18
15383: txt proc!
Let’s Disambiguate
• I saw her duck with a telescope
35
15383: txt proc!
Let’s Disambiguate
• I saw her duck with a telescope
• I used a telescope to see her duck • I saw her duck that was carrying a telescope. • I used a telescope to see her ducking • I saw her ducking using a telescope • I cut her duck with a telescope • ….
36
10/21/13
19
15383: txt proc!
Let’s Disambiguate
• I saw her duck with a telescope
• I used a telescope to see her duck • I saw her duck that was carrying a telescope. • I used a telescope to see her ducking • I saw her ducking using a telescope • I cut her duck with a telescope • ….
37
15383: txt proc!
Let’s Disambiguate
• I saw her duck with a telescope
• I used a telescope to see her duck • I saw her duck that was carrying a telescope. • I used a telescope to see her ducking • I saw her ducking using a telescope • I cut her duck with a telescope • ….
38
10/21/13
20
15383: txt proc!
Let’s Disambiguate
• I saw her duck with a telescope
• I used a telescope to see her duck • I saw her duck that was carrying a telescope. • I used a telescope to see her ducking • I saw her ducking using a telescope • I cut her duck with a telescope • ….
39
15383: txt proc!
LinguisKcs Layers
• PhoneKcs • Phonology • Morphology
• Syntax • Seman2cs
• PragmaKcs
• Discourse
40
10/21/13
21
15383: txt proc!
Course Plan
1. Text 2. Text organizaKon 3. Processing with three linguisKc layers – Words (morphology), Syntax and SemanKcs
4. NLP applicaKons
41
15383: txt proc!
ApplicaKon: Spell Checking
• Ali arrived at scool – scull – school – cool – spool
• Idea: Look at the previous words to decide between the given correct opKons. – Use staKsKcs
• Pr(arrived at school) • Pr(arrived at cool) • Pr(…)
42
10/21/13
22
15383: txt proc!
ApplicaKon: Named EnKty RecogniKon
• Names of Persons, LocaKons, OrganizaKon, …
• George Washington ruled America for two terms.
• George Washington University announced …
• As George was walking in Washington, he …
43
15383: txt proc!
ApplicaKon: Text summarizaKon
• Summarizing large volumes of text – Locate the important parts of the text and form sentences with them. • Natural language generaKon
– Useful for governments, companies, etc.
– Word Processing and browser offer the service
44
10/21/13
23
15383: txt proc!
ApplicaKon: Machine TranslaKon
• Text translaKon from one language to another – Dealing with differences in two languages • English: Subject-‐verb-‐object • Arabic: Verb Subject Object
– AmbiguiKes in two languages – SemanKc differences: • Concept of cousin
45
15383: txt proc!
ApplicaKon: SenKment Analysis
• Imagine – Your company (e.g. Apple) has released a new product (e.g. iphone) and wants esKmate the iniKal reacKon of customers
– You’re campaigning for a poliKcian and you want to esKmate people’s reacKon to his last night speech.
46
10/21/13
24
15383: txt proc!
ApplicaKon: SenKment Analysis
• DisKnguish between objecKve and subjecKve statements. – News vs. Opinion
• Find polarity of statements – Product reviews:
• The new laptop is hot! • The new laptop gets very hot!
• Example: Organizing hundreds of film reviews – “This is a feel-‐good blockbuster produc?on with an excellent
technical setup.” – Bonom-‐line: Does this author likes the movie?
47
15383: txt proc!
PosiKve, NegaKve or …?
• You’ll see a tweet and will say if it’s – SubjecKve or ObjecKve • Does it carry any opinion/senKment?
– If subjecKve: Posi2ve, Nega2ve or Neutral?
48
10/21/13
25
15383: txt proc!
PosiKve, NegaKve or …?
• “AuthoriKes are only too aware that Kashgar is 4,000 kilometres (2,500 miles) from Beijing but only a tenth of the distance from the Pakistani border, and are desperate to ensure instability or militancy does not leak over the fronKers.”
49
15383: txt proc!
PosiKve, NegaKve or …?
• friday evening plans were great, but saturday's plans didnt go as expected -‐-‐ i went dancing & it was an ok club, but terribly crowded :-‐(
50
10/21/13
26
15383: txt proc!
PosiKve, NegaKve or …?
• Iran and 5+1 walk in to new rounds of negoKaKons on the nuke program. #iran #us
51
15383: txt proc!
PosiKve, NegaKve or …?
• “obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies what we had. #Coward #Traitor.”
52
10/21/13
27
15383: txt proc!
PosiKve, NegaKve or …?
• “My graduaKon speech: "I'd like to thanks Google, Wikipedia and my computer! :D #iThingteens”
53
15383: txt proc!
PosiKve, NegaKve or …?
• WHY DO YOU GUYS ALL HAVE MR. KENNEDY! HE’S A F…G DOUCHE!
54
10/21/13
28
15383: txt proc!
Required knowledge
• Language: – Words: desperate, leak, impeach, TREASON
– Sarcasm • My graduaKon speech: "I'd like to thanks Google, Wikipedia and my computer!
– Social media’s non-‐formal and erroneous language • didnt, :-‐) :-‐( omg thanx gr8 b4
• Medium: – Hash tags in tweets
55
15383: txt proc!
Problem: Analyzing emoKons (Contd.)
• Mass analysis of linguisKc emoKons – On Social Networks
56
10/21/13
29
15383: txt proc!
CommunicaKon
• Behrang Mohit – CMUQ 1004
• Course webpage: – hnp://www.qatar.cmu.edu/~behrang/15383/
57
15383: txt proc!
AdministraKve
• This is mostly a project-‐based course – Most of your grade is decided by the programming assignments and the final project.
58
AEendance and Par2cipa2on
5%
Python Labs 10%
Quizzes 10%
Programming assignments
30%
Final Project 45%
10/21/13
30
15383: txt proc!
Resources
• Lecture notes • Natural Language Processing with Python
• Further Reading: Speech and Language Processing – By: Dan Jurafsky and Jim MarKn, PrenKce Hall 2009.
– Three copies on reserve at the library.
59
15383: txt proc!
NLTK
• Natural Language Toolkit – Python library for natural language processing
– @library – Online: hnp://www.nltk.org/book
60