Karin Haenelt 14.5.2010

transcript

Überführung regulärer Ausdrücke in endliche Automaten

Der Algorithmus von Glushkov und McNaughton/Yamada

Karin Haenelt

14.5.2010

Inhalt

Quelle Prinzip des Algorithmus Algorithmus

Parsing des regulären Ausdrucks Erkennung Syntaxbaum Funktionen nullable, firstpos, lastpos und followpos

Konstruktion des Automaten Ein linguistisches Beispiel Implementierung Komplexität

Quelle

Der Algorithmus wurde unabhängig entwickelt von Viktor M. Glushkov (1960, 1961) Robert McNaughton und Hisao Yamada (1960)

3© Karin Haenelt,RegEx → FSA: Glushkov und MacNaughton/Yamada 14.5.2010

Inhalt

Prinzip des Algorithmus

bestimmte Positionen des regulären Ausdrucksstehen inbestimmten Folgerelationen in den zugehörigen Zeichenketten

die Folgerelationen sind eindeutig bestimmt über die Abfolge und Definition der Operatoren des regulären Ausdrucks

die Folgerelationen legen die Menge der möglichen Zustände und die zeitliche Ordnung der Zustände fest

es geht also darum die Folgerelationen zu berechnen aus dem Ergebnis der Berechnung den DEA zu konstruieren

Prinzip des AlgorithmusBeispiel -1-

wie sehen dieZeichenketten aus,die dieser Ausdruck beschreibt?

welchem Zeichen kann welches Zeichen folgen?

RE = dete? adje* nomn REp = dete?1 adje*2 nomn3 #4

mögliche Startpositionen

mögliche Folgepositionen

adje2, nomn3

adje2 (da dete1 optional ist)nomn3 (dete1 und adje2 opt.)

1,2,3 2,3 4dete1

Start- und Folgepositionen sind eindeutig bestimmt durch die Semantik der Operatoren

Wenn E= dete? adje* nomn in eine Zeichenkette aus L(E) umgewandelt wird, kann das erste Zeichen der Zeichenkette sein dete weil es vorne steht adje wegen des ?-Operators über dete nomn wegen des Stern-Operators über adje

Startzustand aus Positionsmenge {1,2,3}

Prinzip des Algorithmus

Wie berechnet man die Start- und Folgepositionen? Erkennung des Ausdrucks mit dem Algorithmus von

Hopcroft/Ullman 1988 Aufbau eines binären Syntaxbaumes

Blätter: Positionen innere Knoten: Operatoren

Berechnung der Abfolgen während der bottom-up-Konstruktion des Syntaxbaumes

Wie erzeugt man den Automaten? auf der Basis der Abfolgeinformation nach dem Algorithmus

von Glushkov bzw. MacNaughton/Yamada

Inhalt

Algorithmus

Parsing des regulären Ausdrucks

Algorithmus von Hopcroft/Ullman 1988 Erweiterung 1

Konstruktion des Syntaxbaumes

- binärer Baum - bottom-up-Konstruktion - Blätter: Positionen - innere Knoten: Operatoren Erweiterung 2

Funktionen nullable, firstpos, lastpos und followpos

- Berechnung der Positionsfolge über die Struktur des Syntaxbaumes

- bottom-up-Anwendung der Funktionen auf die Operator-Knoten im Syntaxbaum

Erzeugung eines DEA

Algorithmus von Glushkov bzw. McNaughton/Yamada

Der Syntaxbaum

(a1|b2)*a3b4b5#6

Aho/Sethi/Ullmann 1986: 139

Die Funktionennullable, firstpos, lastpos und followpos

Der Automat wird aufgebaut aus den Ergebnissen von firstpos(root) und followpos(i)

Zur Ermittlung von followpos(i)wird firstpos(n) und lastpos(n) benötigt

Zur Ermittlung von firstpos(n) und lastpos(n)wird nullable(n) benötigt

nullable(n) ist knotenspezifisch definiert Die Funktionen werden bottom-up auf den Syntaxbaum

angewendet

Die Funktionennullable, firstpos, lastpos und followpos

Def Sei E der Teilausdruck, der durch den Syntaxbaum mit Wurzel n repräsentiert wird, und E’ ein regulärer Ausdruck, in dem jedes Symbol durch seine Position ersetzt ist, dann ist

nullable(n) die Feststellung, ob auch die leere Kette eine Zeichenkette aus L(E’) sein kann. nullable(n) liefert den Wert true, wenn n nullable ist, andernfalls den Wert false.

firstpos(n) die Menge der Positionen, die am Anfang einer Kette aus L(E’) stehen können

lastpos(n) die Menge der Positionen, die am Ende einer Kette aus L(E’) stehen können

followpos(i) die Menge der Positionen in L(E’), die Position i folgen können. ■

Die Berechnung der Funktionennullable, firstpos, lastpos und followpos

Knoten n nullable(n) firstpos(n) lastpos(n) terminaler ε-Knoten

terminaler Knoten mit Position i

false {i} {i}

nullable(c1) OR nullable(c2)

firstpos(c1) firstpos(c2)

lastpos(c1) lastpos(c2)

nullable(c1) AND nullable(c2)

if nullable(c1) then firstpos(c1) firstpos(c2) else firstpos(c1)

if nullable(c2) then lastpos(c1) lastpos(c2) else lastpos(c2)

firstpos(c1)

lastpos(c1)

Der annotierte Syntaxbaum

(a1|b2)*a3b4b5#6 {1,2,3}

{1} {1}

{2} {2}

{1,2} {1,2}

{3} {3}

{1,2,3} {3}

{4} {4}

{5} {5}

{6} {6}

{1,2} {1,2}

{1,2,3}

firstpos lastpos

nullable

Knoten n followpos(i) nur auf Positionen definiert

definiert keine Reihenfolge, keine Auswirkung auf followpos(i)

:)( 1clastposi )()()( 2cfirstposifollowposifollowpos

:)(nlastposi )()()( nfirstposifollowposifollowpos

Berechnung von followpos

Beispiel zur Berechnung von followpos:Stern-Knoten

{1} {1}1 2

{2} {2}

{1,2} {1,2}

{1,2} {1,2}n

followpos(1) = {1,2}followpos(2) = {1,2}

followpos(1).addAll( {1,2})followpos(2).addAll ({1,2})

)(nlastposi

:)(nlastposi)()()( nfirstposifollowposifollowpos

)(nfirstpos

(a|b)*

Beispiel zur Berechnung von followpos:Konkatenations-Knoten

:)( 1clastposi)()()( 2cfirstposifollowposifollowpos

(a|b)*a

{1} {1}1 2

{2} {2}

{1,2} {1,2}

{3} {3}

{1,2} {1,2}

n{1,2,3} ●

followpos(1) = {1,2,3}followpos(2) = {1,2,3}

followpos(1).addAll( {3})followpos(2).addAll ({3})

)( 1clastposi )( 2cfirstpos

Konstruktion der Automaten aus den followpos-Mengen

1,2,3,6

1,2,3,4

1,2,3,5

23 4 5 6

ab b #b

Knoten followpos 1 {1,2,3} 2 {1,2,3} 3 {4} 4 {5} 5 {6} 6 -

Aho/Sethi/Ullmann 1986: 136-140

Algorithmus von Gluskov bzw. McNaughton/Yamada: Konstruktion des DEA

Eingabe Ein Syntaxbaum für einen regulären Ausdruck E#, annotiert durch die Funktionen firstpos(n) und followpos(i)

Ausgabe Ein DEA, der L(E) erkennt Methode initial ist firstpos(root) der einzige unmarkierte Zustand in

Dstates while (es gibt einen unmarkierten Zustand T in Dstates) do begin markiere T; for jedes Eingabesymbol a do begin sei U die Menge der Positionen in followpos(p) für eine Position p in T, und das Symbol in Position p ist a if (U ist nicht leer und nicht in Dstates) then füge U als unmarkierten Zustand zu Dstates hinzu Dtran[T,a] := U end end

Aho/Sethi/Ullmann 1986: 141© Karin Haenelt,RegEx → FSA: Glushkov und MacNaughton/Yamada 14.5.2010

DStates Position Eingabesymbole a b T0 firstpos(root)

= {1,2,3} 1 (followpos(1) = {1,2,3}) - 2 - (followpos(2) = {1,2,3}) 3 (followpos(3) = {4}) - = {1,2,3,4} = {1,2,3}

T1 {1,2,3,4} 1 (followpos(1) = {1,2,3}) - 2 - (followpos(2) = {1,2,3}) 3 (followpos(3) = {4}) - 4 - followpos(4) = {5}) = {1,2,3,4} = {1,2,3,5}

T2 {1,2,3,5} 1 (followpos(1) = {1,2,3}) - 2 - (followpos(2) = {1,2,3}) 3 (followpos(3) = {4}) - - 5 - (followpos(5) = {6}) = {1,2,3,4} = {1,2,3,6}

T3 {1,2,3,6} 1 (followpos(1) = {1,2,3}) - 2 - (followpos(2) = {1,2,3}) 3 (followpos(3) = {4}) - 6 - - = {1,2,3,4} = {1,2,3}

23© Karin Haenelt,

RegEx → FSA: Glushkov und MacNaughton/Yamada 14.5.2010

DStates Pos Eingabesymbole a b T0 first (root)

= {1,2,3} 1 follow (1) = {1,2,3} - 2 - follow (2) = {1,2,3} 3 follow (3) = {4} - = {1,2,3,4} = {1,2,3}

T1 {1,2,3,4} 1 follow (1) = {1,2,3} - 2 - follow (2) = {1,2,3} 3 follow (3) = {4} - 4 - follow (4) = {5} = {1,2,3,4} = {1,2,3,5}

T2 {1,2,3,5} 1 follow (1) = {1,2,3} - 2 - follow (2) = {1,2,3} 3 follow (3) = {4} - - 5 - follow (5) = {6} = {1,2,3,4} = {1,2,3,6}

T3 {1,2,3,6} 1 follow (1) = {1,2,3} - 2 - follow (2) = {1,2,3} 3 follow (3) = {4} - 6 - - = {1,2,3,4} = {1,2,3}

Position i

mögliche Nachfolgepositionen follow(i)

a1 {a1,b2,a3} b2 {a1,b2,a3} a3 {b4} b4 {b5} b5 {#6} #6 -

1,2,3,6

1,2,3,4

1,2,3,5

3 a1,a3

b2,b4 b2,b5

Inhalt

Ein linguistisches Beispiel, 1a

{1,2,3}

*{3} {3}

{1,2,3}

{4} {4}

{5} {5}{1,2,3,4}

{1,2,3,4}

{ } { }

{1,2} {1,2}

{3} {3}adje

dete? adje* nomnVariante 1:• Darstellung von a? durch (a|ε)• bei Automatenkonstruktion

werden für ε keine Kanten konstruiert (nur für Eingabe- Symbole a)

2nullablefirstposlastpos

{i}{i}

Ein linguistisches Beispiel, 1b

*{2} {2}

{3} {3}

{4} {4}{1,2,3}

{1,2,3}

{2} {2}adje

dete? adje* nomnVariante 2:direkte Angabe von Regeln für ?: nullable(n) := true firstpos (n) := firstpos(left) lastpos(n) := lastpos(left)

nullablefirstposlastpos

{i}{i}

Ein linguistisches Beispiel, 2a+b

Knoten Symbol followpos Symbole 1 dete1 {2,3} adje2, nomn3 2 adje2 {2,3} adje2, nomn3 3 nomn3 {4} #4 4 #4 - -

dete? adje* nomn

2,3 4start

Knoten Symbol followpos Symbole 1 dete1 {3,4} adje3, nomn4 2 ε2 {3,4} adje3, nomn4 3 adje3 {3,4} adje3, nomn4 4 nomn4 {5} #4 5 #5 - -

1,2,3,4

3,4 5start

Variante 2 für ? Variante 1 für ?

Inhalt

Implementierung

Einfügen der Konstruktionsschritte für einen Parsebaum bzw. Syntaxbaum

Zuordnung von Baumkonstruktoren zu den Elementen des regulären Ausdrucks Terminalsymbole (Zeile 21) Klammern (Zeile 23, 27) unäre Operatoren (*,+,?) (Zeile 31)

binäre Operatoren (·, |) (Zeile 4, 13)

class TreeNode { String info; int type; TreeNode left;

TreeNode right; boolean nullable; HashSet<TreeNode> firstpos; HashSet<TreeNode> lastpos; HashSet<TreeNode> followpos; public TreeNode(int type, TreeNode left, TreeNode right) { this.type = type;

this.left = left; this.right = right; switch(type) { case OR:

nullable = left.nullable() || right.nullable(); firstpos = new HashSet<TreeNode>(left.firstpos()); firstpos.addAll(right.firstpos()); lastpos = new HashSet<TreeNode>(left.lastpos()); lastpos.addAll(right.lastpos()); break;

case … } } }

Implementierung: TreeNode

Definition der Knoten, 1

n ist ein Knoten mit Kennung : nullable(n) := false; firstpos(n) := ; lastpos(n) := ; n ist ein Knoten mit Kennung : nullable(n) := true; firstpos(n) := ; lastpos(n) := ; n ist ein Knoten mit Kennung i: nullable(n) := false; firstpos(n) := {i}; lastpos(n) := {i}; followpos(i) := ;

Brüggemann-Klein, 1992

n ist ein Knoten mit Kennung | nullable(n) := nullable(left) or nullable(right); firstpos(n) := firstpos(left) firstpos(right); lastpos(n) := lastpos(left) lastpos(right); n ist ein Knoten mit Kennung • nullable(n) := nullable(left) and nullable(right); for each i in lastpos(left) do followpos(i) := followpos(i) firstpos(right); if nullable(left) then firstpos(n) := firstpos(left) firstpos(right); else firstpos(n) := firstpos(left); if nullable(right) then lastpos(n) := lastpos(left) lastpos(right); else lastpos(n) := lastpos(right);

n ist ein Knoten mit Kennung * nullable(n) := true; for each i in lastpos(left) do followpos(i) := followpos(i) firstpos(right); firstpos(n) := firstpos(left); lastpos(n) := lastpos(left);

Inhalt

Komplexität: Platz

maximale Anzahl der Zustände: 2p+1,p = Anzahl der Positionen

(Anzahl möglicher Teilmengen, die aus p Objekten gebildet werden können, beträgt 2p)

max. Anzahl wird in der Praxis kaum erreicht: der reg. Ausdruck schränkt Anzahl der möglichen

Teilmengen sehr stark ein Algorithmus konstruiert nur die tatsächlich vorkommenden

Teilmengen(lazy implementation der Teilmengen)

minimale Anzahl der Zustände: p+1(falls nur Konkatenation von Symbolen vorkommt)

Glushkov, 1961McNaughton/Yamada, 1960

Komplexität: Zeit

Implementierungen mit kubischer Zeit Aho/Sethi/Ullman, 1986 Berry/Sethi, 1986

Implementierung mit quadratischer Zeit für allgemeine Ausdrücke (d.h. auch ambige) Brüggemann-Klein, 1992

Implementierung mit linearer Zeit für deterministische Ausdrücke Brüggemann-Klein, 1992

Literatur

Quellen Glushkov, Viktor M.(1961). The abstract theory of automata. In: Russian Mathematical

Surveys 16, S. 1-53. Glushkov, Viktor M.(1960). On Synthesis Algorithm for Abstract Automata, Ukr.

Mathem. Zhurnal, 12(2), S.147-156 (in Russisch) McNaughton, Robert und Hisao Yamada (1960). Regular expressions and state

graphs for automata. In IEEE Transactions on Electronic Computers 9 (1): S. 39-47. Erläuterungen

Aho, Alfred V.; Sethi, Ravi und Jeffrey D. Ullman (1986). Compilers. Principles, Techniques and Tools. Addison-Wesley Publishing Company.

weitere Literatur Berry, Gérard und Ravi Sethi (1986). From regular expressions to deterministic

automata. In: Theoretical Computer Science 48 (3), S. 117-126. Brüggemann-Klein, Anne (1992). Regular Expressions into Finite Automata,

Proceedings of Latin '92, 1st Latin American Symposium on Theoretical Informatics, São Paulo, Brazil, April 6-10, 1992. (Lecture Notes in Computer Science 583) Heidelberg: Springer Verlag.

Versionen

14.05.2010, 9.5.2010, 14.09.2007, 05.05.2007, 21.05.2006, 20.03.2005

Karin Haenelt 14.5.2010

Documents