Adaptive Audio and Video Processing for Electronic Chalkboard Lectures

Dissertation

zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften

am Fachbereich Mathematik und Informatik der Freien Universität Berlin

vorgelegt von

Gerald Friedland

30. Oktober 2006


Adaptive Audio and Video Processing for Electronic Chalkboard Lectures, doctoral dissertation, Department of Mathematics and Computer Science, Freie Universität Berlin, Germany, October 2006.

Copyright © 2006 by Gerald Friedland. All rights reserved.

ISBN: 978-1-4303-0388-6
DDC: 006.7784 [DDC22ger]; 371.33467 [DDC22ger]

This distribution version differs from the original. For technical reasons the content has been scaled from A4 to letter format and Appendix G shows no images. Current versions of the software described herein can be downloaded at http://www.echalk.de.

Reviewers:
Prof. Dr. Raúl Rojas, Freie Universität Berlin
Prof. Dr. Ulrich Kortenkamp, Pädagogische Hochschule Schwäbisch Gmünd


Acknowledgements

The work presented in this dissertation has been strongly influenced by a lot of feedback we received from the users of World Wide Radio, the E-Chalk system, and the GIMP implementation of SIOX, so there are a number of people who actually deserve credit. Unfortunately, time and space are limited, which forces me to cut this list down to a bare handful.

First of all I would like to thank Stiftung Jugend forscht e.V. for the continuous encouragement they give to young people to start scientific projects. I still draw on the motivation and inspiration I received during the few days of the “Bundeswettbewerb” in Munich in 1998. In this context I would like to thank the early inspirers and helpers of the World Wide Radio System, namely Olav Surawski and Tobias Lasser. I would like to thank Maximilian Benker for letting me encode the GIOVE lectures using WWR2 and Bernhard Frötschl for his helping hand on the early implementation of WWR2.

I want to thank all the contributors to the E-Chalk system who are listed in Appendix A and all those who I might have forgotten. Of course, I would like to thank my colleagues Kristian Jantz, Benjamin Jankovic, Ernesto Tapia, Karl Pauls, Tobias Lenz, and also Richard S. Hall, who has been an important mentor to me. I thank Lars Knipping with whom I shared both trouble and success. I would like to thank my colleague Christian Zick for his support whenever I needed it and the preparation of numerous test photos and video sequences during the entire time. I would like to thank Sven Neumann, maintainer of GIMP, as well as many other open source developers for their time-consuming efforts in optimizing the SIOX algorithm.

Credit also goes to Theda Radtke and Ulrich Kernbach at the “Deutsches Museum”, Guido Reuter, and the MOSES team at Technische Universität Berlin. I would also like to credit CSEM and PMD Technologies for providing us with time-of-flight 3D cameras and Opticom for providing us with the Opera software. Special thanks go to Hans-Ulrich Kobialka who provided us with a stereo camera. I would like to thank Peter Monnerjahn, who spent many days polishing my English, as well as Robert Mertens, Kristian Jantz, Christian Zick, Benjamin Jankovic, and David Schneider for reviewing several chapters.

I would like to thank my family and most importantly my wife, Yvonne. Her love and support and her patience during the development of this project have given me the strength to carry out the work that resulted in this dissertation. Finally, I would like to thank my supervisor, Raúl Rojas, for always having an open door and providing me with a lot of freedom and a working environment that allowed me to be creative.


Contents

1 Introduction
  1.1 Motivation
  1.2 Overview of this Document

2 Automated Lecture Recording
  2.1 Distance Education
  2.2 Technology-Augmented Classroom Teaching
  2.3 Lecture Recording Without Automation
  2.4 Classroom 2000/eClass
  2.5 LecCorder
  2.6 Authoring on the Fly
  2.7 Lecturnity
  2.8 Lectopia/iLecture
  2.9 Camtasia
  2.10 tele-TASK
  2.11 Classroom Presenter
  2.12 A Minimalistic Automated Lecture Recording System
  2.13 Other Systems
  2.14 Conclusion

3 The E-Chalk System
  3.1 E-Chalk's Philosophy
  3.2 The Software System
  3.3 Usage Scenarios
  3.4 Distance Teaching
  3.5 Editing Lectures

4 Server Architecture
  4.1 Preliminary Considerations
  4.2 Existing Multimedia Architectures
  4.3 Architecture Overview
  4.4 Java as Execution Platform
  4.5 The Component Framework
  4.6 Component Discovery
  4.7 Component Assembly
    4.7.1 Processing Nodes
    4.7.2 The Processing Graph
    4.7.3 Resolving the Media Graph
    4.7.4 Identifying Media Formats
    4.7.5 Synchronization
    4.7.6 Top-Level Application
  4.8 Limits of the Approach
  4.9 Practical Usage Examples
  4.10 Conclusion

5 Client Architecture
  5.1 Preliminary Considerations
  5.2 The Java Client
    5.2.1 Board Client
    5.2.2 Audio Client
    5.2.3 Video Client
    5.2.4 Slide-show Client
    5.2.5 Console
  5.3 Playback as Video
  5.4 MPEG-4 Replay
    5.4.1 Encoding E-Chalk Lectures in MPEG-4
    5.4.2 Practical Experiences
  5.5 A Note on Bandwidth Requirements
  5.6 Conclusion

6 Audio Storage and Transmission
  6.1 Evolution of E-Chalk's Audio System
    6.1.1 The World Wide Radio System
    6.1.2 World Wide Radio 2
    6.1.3 The E-Chalk Audio System
  6.2 E-Chalk's Default Audio System
    6.2.1 Encoding
    6.2.2 Live Transmission and Archiving
  6.3 Tools
    6.3.1 Lecture Repair Tool
    6.3.2 Audio Format Converter
    6.3.3 E-Chalk Broadcaster

7 Active Audio Recording
  7.1 Audio Recording in Classrooms
    7.1.1 Usability Problems
    7.1.2 Distortion Sources
    7.1.3 Other Issues
    7.1.4 Ideal Audio-Recording Conditions
  7.2 Improving Audio-Recording Quality
  7.3 Before the First Lecture
    7.3.1 Detection of Sound Equipment
    7.3.2 Recording of Floor Noise
    7.3.3 Dynamic-Range Adjustment
    7.3.4 Measuring Signal-to-Noise Ratio
    7.3.5 Fine-Tuning and Simulation
    7.3.6 Summary and Report
  7.4 During Lecture Recording
    7.4.1 Mixer Monitor
    7.4.2 Mixer Control
    7.4.3 Filtering
    7.4.4 Final Processing
  7.5 Practical Experiences
  7.6 Limits of the Approach
  7.7 Conclusion

8 Video Storage and Transmission
  8.1 Preliminary Considerations
  8.2 Overview
  8.3 Configuring the Video Server
  8.4 Video Encoding

9 Merging Video and Blackboard
  9.1 Split Attention
  9.2 Related Approaches
    9.2.1 Transmission of Gestures and Facial Expressions
    9.2.2 Segmentation
  9.3 Setup
  9.4 Initial Experiments
    9.4.1 Simple Approaches
    9.4.2 Motion-Based Segmentation
    9.4.3 A Combined Approach
    9.4.4 Conclusion
  9.5 Robust Real-Time Instructor Extraction
    9.5.1 Conversion to CIELAB
    9.5.2 Gathering Background Samples
    9.5.3 Building a Model of the Background
    9.5.4 Postprocessing
    9.5.5 Board Stroke Suppression
  9.6 Example Results
  9.7 Limits of the Approach
  9.8 Conclusion

10 Generalizing the Instructor Extraction
  10.1 The State of the Art
  10.2 Algorithm Description
    10.2.1 Construction of Color Signatures
    10.2.2 Classification of Unknown Pixels
    10.2.3 Post-processing
  10.3 Segmentation of Still Images
  10.4 Sub-pixel Refinement
  10.5 Extraction of Multiple Objects
  10.6 Video Segmentation
  10.7 Evaluation
    10.7.1 Benchmarking and Tuning of Thresholds
    10.7.2 Testing Assumptions
    10.7.3 Other Means of Evaluation
  10.8 Limits of the Approach
  10.9 Conclusion

11 Hardware-Supported Instructor Extraction
  11.1 The Time-of-Flight Principle
  11.2 Setup
  11.3 Technical Issues
  11.4 Segmentation Approach
  11.5 Conclusion

12 Conclusion
  12.1 Summary
  12.2 Future Work
  12.3 Final Note

A E-Chalk: Project Overview

B SOPA: Technical Details
  B.1 DTD of SOPA Graph Serialization
  B.2 LDAP Query Syntax
  B.3 SOPA Command Line Commands
  B.4 A Minimal PipeNode

C Board-Event Encoding
  C.1 The E-Chalk Board Format
  C.2 Mapping E-Chalk Events to MPEG-4 BIFS

D E-Chalk's Audio Format
  D.1 Events
  D.2 Zipped packets
  D.3 Codecs

E Audio Recording Tools
  E.1 VU Meter
  E.2 Graphic Equalizer
  E.3 Assessment of the Audibility of Noise
  E.4 Equipment Grading

F E-Chalk's Video Format
  F.1 Header
  F.2 Packet

G SIOX Benchmark Results

Bibliography

Web References

List of Figures


Chapter 1

Introduction

This thesis describes my research on improving the production of distance lectures directly in the classroom. The work is conducted with the concrete example of E-Chalk, a software system for recording and transmitting electronic whiteboard lectures over the Internet. Several tasks that formerly required technical personnel are now performed by the computer. A novel combination of video and board content is proposed that solves well-known ergonomic problems and achieves a higher degree of media integration. Although developed within the scope of a concrete system, the results are general and the realized methods may be used in a variety of applications. In essence, this dissertation shows that computers can automate the creation of multimedia content far better than the current state of the art by making the following contributions:

• The architectural base for E-Chalk is provided by a novel multimedia software framework. Based on state-of-the-art solutions for component-based software development, the system simplifies the implementation and the configuration of multimedia processing applications. It facilitates the integration of diverse multimedia content already on the architecture level.

• An Internet audio broadcasting system is presented that makes studio-less voice recording easier by automating several tasks usually handled by technicians. The solution described in this document measures the quality of the sound hardware used, monitors possible hardware malfunctions, prevents common user mistakes, and provides online sound-enhancement mechanisms.

• Using a novel segmentation algorithm, the lecturer image is extracted from a video and then pasted on top of the vector-graphics image of the board. This makes it possible to transmit the facial expressions and gestures of the instructor in direct correspondence to the vector-based handwritten board content. The resulting lecture lets viewers watch the instructor acting in front of the board without occluding any board content (a minimal compositing sketch follows this list).

• Finally, the instructor video segmentation approach is generalized. The method can be used for interactive object extraction in generic image or video processing software. This thesis presents benchmark results and shows how the approach has been implemented in several popular open-source applications.
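To make the third contribution concrete, the following is a minimal Java 2D sketch, not E-Chalk's actual renderer: it assumes the segmentation step has already produced an instructor frame whose background pixels are fully transparent, and blends that frame over the rendered board image with an adjustable global opacity so that board strokes remain visible.

import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Minimal compositing sketch (illustrative only): paint the rendered board,
// then blend the segmented instructor frame over it. An alpha below 1.0
// keeps board strokes readable through the instructor's silhouette.
public class OverlaySketch {

    public static BufferedImage overlay(BufferedImage board,
                                        BufferedImage instructor,
                                        float alpha) {
        BufferedImage out = new BufferedImage(board.getWidth(),
                board.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(board, 0, 0, null);                    // board as background
        g.setComposite(AlphaComposite.getInstance(
                AlphaComposite.SRC_OVER, alpha));          // semi-transparent overlay
        g.drawImage(instructor, 0, 0, null);
        g.dispose();
        return out;
    }
}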


1.1 Motivation

Not even three decades ago, texts were regularly produced with mechanical typewriters. While it was difficult at best to write a typo-free business letter, the creation of a book or a professional article from scratch involved numerous steps that were usually handled in collaboration with several professionals, such as the “markup persons”. Today, word processors take over almost all of these steps automatically. It is possible to create text documents from scratch without being a typography expert or having to rely on the help of one. Images, tables, or mathematical formulas can be easily integrated into the document. Spell and grammar checkers even automatically underline suspicious words while typing. What took several people in the past is one step today. But that is not everything. A text can contain clickable references to other documents, or interactive forms can be filled out directly on the screen. In other words, the computer has even extended the definition of text in several ways, making new types of documents possible.

However, when creating documents that contain sound, video, or other data directly encoded for human sensory perception, the abilities of the computer are more comparable to those of the typewriter. It is very difficult to create a decent audio recording in one step and without professional help. Combining several types of content, for example, video and vector data, is not easily possible for a non-expert and usually involves several steps. Although many computer manufacturers have recently begun to advertise their products with slogans about video and audio processing capabilities, the computer is almost exclusively used for archiving, transmitting, and playing back multimedia content. The integrated creation of content that has become commonplace in the text domain is not yet available for multimedia data. One of the reasons for this is that computers have just started to be able to display, store, and transmit the huge amounts of data required for encoding information that can directly be perceived by the human senses, and the full potential of computer-based multimedia content creation has yet to be explored.

Even where the automated creation of audiovisual content is a pressing need, as in the live broadcasting and recording of university lectures, most technicians and engineers not only rely on traditional video and audio hardware; the work flows are also still similar to those established decades ago. Lectures are usually just captured with a camera, and the use of the computer, for example for digital storage and Internet broadcasting, is in most cases just an additional step after manual recording, digitization, and cutting. As a result, tremendous effort and money have to be put into the creation of distance lectures, and/or the resulting presentation quality is far from optimal. The main reason for this is that it is not yet well understood how multimedia content creation can be automated.

Using the example of E-Chalk, this thesis provides evidence that computers are able to facilitate the creation of multimedia content substantially and that we have not reached the final step in recording automation yet. Furthermore, this dissertation shows how computers can provide new ways of presenting multimedia content without manual editing, which go well beyond the capabilities of mere audio and video replay.


Figure 1.1: Conceptual overview of the structure of this dissertation.

1.2 Overview of this Document

This dissertation is structured as shown in Figure 1.1. Chapter 2 starts with a review of related work on the automated creation of distance-education courses from the classroom. Then the philosophy of the E-Chalk system as well as an overview of the entire project is presented in Chapter 3. Chapter 4 discusses software-architectural considerations for the design of multimedia systems for classroom teaching and introduces the component-based framework that is the base of the E-Chalk server system. Technical considerations for the transmission and playback of E-Chalk lectures on the client side are presented in Chapter 5. Based on the observations presented in the previous chapters, Chapter 6 describes audio capturing, transmission, and archiving. Chapter 7 then describes the Active Recording approach to facilitate the production of voice recordings by simulating several tasks usually handled by audio technicians. Video capturing, transmission, and archiving are described in Chapter 8. Chapter 9 continues with the extraction of the lecturer image out of a video in order to paste it on top of the vector-graphics image of the board. SIOX, a generalization of the method used for the instructor video segmentation, is presented in Chapter 10. Chapter 11 finally suggests how SIOX might be combined with range sensors to provide another solution for the extraction of the lecturer image from a video. In the end, Chapter 12 concludes this thesis with a summary and a brief presentation of future work. Appendix A provides an overview of the components of the E-Chalk system along with all contributors. Appendices B to G provide further technical details on the component framework and E-Chalk's audio and video subsystems, as well as the details of the benchmark described in Chapter 10.


Chapter 2

Automated Lecture Recording

2.1 Distance Education

The term distance education was introduced into the English language as part of a new institutionalisation of remote education. In 1969, the Open University [89] was founded in the UK as the first university where no student could attend a class physically. Soon after, the “Universidad Nacional de Educación a Distancia” (UNED) [90] in Spain was founded (1972), as well as the “FernUniversität in Hagen” in Germany (1974) [79]. Even though the roots of the Internet in the 1960s and 1970s are closely connected to universities, videotapes, the telephone, and in the 1980s also cable and satellite delivery remained the transmission media for distance courses until the end of the 1980s. It took until the 1990s for the Internet to be taken seriously as a distance-education medium. Then, however, the popularity of distance education started to grow at an unprecedented rate. Today, each of the above-mentioned universities has more students than could be reached by attendance teaching.

Moreover, the ubiquitous availability of connected computers has added a whole new dimension to distance education as regular, i. e., non-distance-education, universities began to see opportunities for a new way of teaching. This approach was given the name e-learning. Today, almost every university has its own e-learning project that aims to utilize the opportunities provided by the Internet or other media that allow easy deployment of content to improve education. Among these projects, one can differentiate between pure distance learning (a synonym for distance education) approaches and blended learning. Blended learning is a hybrid approach that combines e-learning with traditional education methods.

2.2 Technology-Augmented Classroom Teaching

During the 1990s, digital technology also set foot in the classrooms of attendance universities, which began to enhance traditional teaching methods by using computers. Although usage varies from subject to subject and from teacher to teacher, three approaches can be observed to have reached a predominant position in the field of computer-supported education: intensive use of slide-show presentations, educational mini-applications (e. g., specialized software, dynamic web pages, or Java applets), and video recording of lectures and their transmission via the Internet.

Slide-show presentations have long since replaced overhead projector slides. The structure of the presentation is entirely planned in advance, taking into account all required resources. Visual means like tables, diagrams, images, or even animations can be directly presented to the audience. For distance education use, computer-generated slides may be printed out or put onto web sites, so that students do not have to copy the content for later recall.

Specialized educational mini-applications are used for presentations in classrooms as well as for individual training of a student who sits in front of his own computer. Pedagogical software like this is particularly common in K-12 education because there is a wide range of commercially available programs. Research universities usually prefer to develop their own solutions, often targeted at the audience of a single course.

Distance education has always been a way to deploy educational content to more people than would otherwise be reachable through attendance teaching. As a result, many attendance universities have welcomed the growing popularity of the Internet and started to think about enhancing classroom teaching by providing students with additional distance teaching lessons. Existing solutions either focus on recording and transmitting a session or on using video conferencing tools to establish a bidirectional connection (i. e., a feedback channel) [Knipping, 2005]. Although this approach does not support the teaching process in the classroom, recording a video of the entire lecture containing the picture of the board, the lecturer, and an audio track enables students to follow a lecture remotely or to replay previous sessions. A reason often given for combining distance education with classroom teaching is that universities have to cope with an increasing number of students while, at the same time, more often than not facing drastic cuts in their funding.

With computer technology already being used in classroom teaching, it seems to be but a small step to automate the processes of creating distance learning material directly from classroom teaching. The following sections describe a number of projects and commercial products that were created to automate the creation of distance education material in attendance universities.

2.3 Lecture Recording Without Automation

Often, standard audio/video encoding and broadcasting applications are used to record and transmit lectures over the Internet. The reason for this is primarily their commercial availability and the straightforward handling of state-of-the-art Internet broadcasting software. The downside is that they require manual operation by technical staff as they are not particularly designed for automated recording. Figure 2.1 shows the typical work flow needed for creating web casts with such software. RealNetworks, Inc. write in the “Helix Universal Server Administration Guide” [43]:

Encoding a media clip or broadcast is the last step of a process that involves capturing, digitizing, editing, and optimizing audio or video data. A streaming media author uses various production tools to accomplish these jobs. These tools typically include video cameras, microphones, recording media such as tapes or CDs, mixing hardware, and audio and video editing software. You can use any tools you want to capture and edit audio and video input. You just need to ensure that your tools can save digitized files in formats that your encoding tools can accept.

Figure 2.1: Typical work flow for streaming media with commercially available web casting software. Picture taken from the Helix Universal Server Administration Guide [43].

However, most of the automation projects described in this chapter are built on top of standard commercial Internet broadcasting systems. It is therefore useful to take a more detailed look at the three most important ones, namely the Windows Media Platform [34] by Microsoft, Inc., the products from RealNetworks, Inc. [42], and QuickTime [8] by Apple, Inc. When looking at the smaller details, there are differences between the three systems. However, these are not important enough to merit discussion of each of them individually. Instead, this section will discuss them all at once. The main scope of commercial encoders is the transmission of audio and video data via the Internet and their digital archiving in files. Each of the systems consists of a three-layer architecture which contains an encoder part, a server part, and a client part. Figure 2.2 illustrates a typical configuration.

The encoder captures live audio and video content and delegates compression to a codec provided by the operating system. Pre-recorded content is handled by using the codec as a converter. Current encoders feature flexible encoding modes with constant and variable bit rates. In addition to stream-capturing devices (such as sound cards, video cards, or FireWire interfaces), encoders are also able to read still images and capture screen shots. Most encoders can capture the entire desktop screen, individual windows, or a region and broadcast or encode it to files. Encoders normally provide a user interface to control everything necessary for live event production, such as pre-defining playlists and switching between live and pre-recorded sources. Several encoders also support controlling conventional hardware devices attached to the computer: playlists can also include commands like rewind, play, or pause to be sent to digital video cameras and video tape recorders. Mostly, time-code data is also captured from the original source for frame-accurate seeking. Several encoders are also able to integrate presentation slides; for example, Microsoft Producer also encodes Microsoft PowerPoint [32] presentations.


Figure 2.2: The general software architecture of commercial broadcasting servers: The captured data is compressed by encoder software and then broadcast by a server program that may reside on a different computer. Specialized client software receives the stream and plays it back on different hardware platforms.

The server part is able to deliver either a live stream or pre-encoded content over the Internet. For live streaming, a so-called broadcast publishing point connects to an encoder and sends the stream to compatible clients. A publishing point is a computer with a properly configured webserver and a running streaming server. Today, most servers can also stream files that were encoded by encoders of different origin. Servers may also get their input stream from the output of another server. This allows for load balancing when many people connect to a certain broadcast at once. Essential runtime parameters such as server load or the number of connections are logged and can be tracked online with the help of a user interface or via SNMP [Case et al., 2002]. The streaming servers can be administered using different front-ends.

The client program is able to receive a stream sent by a server and to play back files stored on the local harddisk. The Windows Media Player is part of the operating system Windows. Apple's QuickTime Player is part of the operating system Mac OS X. The RealPlayer is part of several operating systems in handheld devices, such as mobile phones. Both the QuickTime Player and the RealPlayer are available on several platforms. Usually, the players install themselves into the web browser. They are integrated as plug-ins and are invoked by the web browser when a page returns a certain MIME type.

Several universities (for example, UC Los Angeles [60], Stanford University [50], or UC Berkeley [59]) maintain an infrastructure for the recording and web casting of university lectures or special events. Others also use teleconferencing systems like PolyCom [40] to record and stream lectures or to hold conferences; one example is the University of Indiana [61], which uses a Polycom conferencing system to record lectures before making them available via Apple's Podcasting [6].


Figure 2.3: The prototype lecture room of the eClass project (left) and the media rack (right). Pictures taken from [74].

2.4 Classroom 2000/eClass

One of the earliest projects to use computer support in the classroom to generate distance learning as a side effect is the Classroom 2000 project [Abowd, 1999, Brotherton, 2001], developed at the Georgia Institute of Technology. In 2000, the project was renamed eClass. The purpose of the research project, which ended in 2001, was “to study the impact of ubiquitous computing on education”. Classroom 2000 consists of a prototype classroom environment and a software system with the goal to “capture the rich interaction that occurs in a typical university lecture”.

The instructor uses an electronic board system such as Smartboard [49], where the computer screen content is projected either from behind or from the front. A pen tracking system is used to simulate mouse movements. Audio recording is done using two dynamic microphones attached to the ceiling of the classroom. However, the instruction manual for Classroom 2000 [74] recommends using an optional wireless lapel microphone for better audio quality. The audio signal is picked up by a pre-amplifier and an audio mixer rack before it is forwarded to the sound card of the encoding computer. Video recording is done using a front camera for the instructor, a rear camera for the classroom, and a document camera to capture non-electronic documents. According to a video downloadable from the project website [74], the estimated minimum cost of equipment suitable for the system is about $15,000. The prototype room cost about $200,000 including all projectors, computers, and the SmartBoard. The prototype classroom also featured a radio tuner, a VCR, and a DAT player/recorder so that the instructor could use non-computer-based media during the lecture. Figure 2.3 shows the prototype classroom and the hardware rack.

The recording software (conceptual name ZenStar) technically consists of several components that are briefly described in the following. A presentation component called “ZenPad” is used to present pre-specified slides during the lecture. The program also allows for simple freehand annotation of slides. For use as a whiteboard, the instructor adds an empty slide. The web browser used during the lecture is configured to use a custom proxy server keeping track of every URL visited. A program called “StudPad” allows for student interaction in the classroom. The program distributes the ZenPad content of the presentation computer to any number of student computers. The students can then add private notes. The so-called “ZenStarter” program was used to integrate the different components. The program triggers the “RealProducer” simultaneously with the ZenPad to record optional audio and video of the lecture. After the recording session, a program called “StreamWeaver” builds HTML pages including links to the timestamped slide positions that enable navigation inside the recording. The program converts all presented slides, including the last state of the annotation, to GIF images and creates a list of links from the logs of the custom proxy server.

Figure 2.4: A screenshot of a lecture replay with eClass.

The created lecture can be replayed remotely using a web browser. The system replays audio, a small video, any presented slides with static handwritten annotations, and all web links visited during the lecture. RealPlayer is required to listen to the lecture audio and to view the lecture video. Figure 2.4 shows an example lecture recorded with eClass.

2.5 LecCorder

LecCorder [26] is an early commercial product developed and distributed by CollabWorx, Inc. It is “a lecture capture and publishing system designed for corporate training managers, professional trainers, consultants, and presenters, as well as for academic use”. It is available as an integrated system in either a desktop or a portable version, or as a stand-alone software package. The LecCorder software system is a Windows-based program that relies on special MPEG-1 [ISO/IEC JTC1, 1993] encoder hardware (namely the MovieMaker Express card by OptiBase, Inc.). CollabWorx gives a detailed specification of the recommended camera and microphone configuration to achieve proper encoding quality. The integrated LecCorder systems include applicable microphones and cameras, as well as camera lighting. LecCorder uses slides in the form of HTML pages, for example exported by PowerPoint. The presenter uses the “LecCorder system setup tool” to configure several settings such as the output directory for the exported lecture, the input directory of the previously HTML-exported PowerPoint slides, and some meta-data. The lecture slides are then copied to a folder that is accessible from the web via an HTTP server, together with several scripts. During the presentation, the lecturer uses a web browser to show the slides to the audience while the LecCorder software captures audio and video. After the presentation, the audio and video are compressed using the hardware MPEG-1 encoder. The presentation can be replayed using a Java- and JavaScript-enabled web browser. The video is shown at about four frames per second. The bandwidth required for smooth replay is about 64 kbit per second (a single ISDN channel). In addition to the client shown in Figure 2.5 there is also a replay variant where the remote viewer is able to add private annotations to each slide. The personalized slides can be saved to local files.

Figure 2.5: A slide presentation recorded with LecCorder.

2.6 Authoring on the Fly

The Authoring on the Fly (AOF) project [Bacher et al., 1997, Hurst and Muller, 2001] initiated by Thomas Ottmann at the University of Freiburg was developed to “merge the three activities presentation recording, teleteaching and production of multimedia courseware into one system” [70]. Mostly, the system is used for the annotation of slide presentations. Electronic whiteboards or digitizer tablets are used as input devices. Using a microphone and a camera, the voice and an additional small video of the instructor can be recorded and/or transmitted. AOF uses different software for live transmissions than for archiving lectures. Live transmission is mainly done using standard Unix-based MBone [85] tools. Originally, it used the programs “vic” for video transmission, “vat” for audio transmissions, and “wb” for the transmission of whiteboard content and slides. The tools are all part of the MBone tool set [86] and were originally developed for conferencing. The programs were replaced by “AOFwb” and later by a program called “Media Lecture Board (mlb)” running on Windows. The tool handles board strokes as vector graphics and can import slides, images, and also video files in different formats. Audio is transmitted and archived without compression using a sampling rate of either 8 or 16 kHz (8 bit mono).

Figure 2.6: A screenshot of a lecture replay with AOFSync. Picture taken from [70].

Live lectures are received using a special client program called “AOFrec”. The client program is a small standalone Unix program that is not coupled to a web browser. Closing “mlb” results in the creation of HTML pages that enable asynchronous replay of the lecture using a client called “AOFSync”, of which there is also a Java variant called “AOFJSync”. Figure 2.6 shows a screenshot. Both clients require the download of the entire recorded material to a local harddisk prior to replaying the lecture because the program is not embedded into a web browser either. AOF provides several tools to automate packaging of lectures and to burn CD-ROMs so that lectures can be directly replayed off the disk. There are also tools for indexing lectures with keywords and for the creation of directories over a set of recorded lectures. A number of additional tools allow for post-production of lectures and for direct import of PowerPoint slides into mlb.
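To put the uncompressed audio format mentioned above into perspective, here is a back-of-envelope calculation (my own arithmetic, not a figure given by the AOF authors): at the higher sampling rate, the audio track alone grows to tens of megabytes per lecture, which is why downloading the entire recording before replay is a real cost.

\[
  16\,000\ \tfrac{\mathrm{samples}}{\mathrm{s}} \times 1\ \tfrac{\mathrm{byte}}{\mathrm{sample}}
  = 16\ \mathrm{kB/s} = 128\ \mathrm{kbit/s},
  \qquad
  16\ \mathrm{kB/s} \times 90 \times 60\ \mathrm{s} \approx 86\ \mathrm{MB}
  \ \text{for a 90-minute lecture.}
\]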

2.7 Lecturnity

Lecturnity [25] is a commercial spin-off of the Authoring on the Fly project (see Section 2.6). However, it does not provide live transmission of lectures. Lecturnity integrates into PowerPoint and mainly consists of three programs. “Lecturnity Assistant” is the main recording component. Lecture recording is controlled by a simple console that provides start/stop/pause functionality. The slides can be annotated using different painting and drawing facilities. Capturing of slide content and annotations is actually done by screen grabbing. The entire desktop screen or only a part of it is captured and recorded along with audio and an additional video feed. The video signal is mostly used for recording a small video of the instructor or the classroom environment. As screen grabbing forms the principal recording component, lecturers can also record the use of any other application and are not limited to the use of PowerPoint.

Figure 2.7: A screenshot of a lecture replay with Lecturnity.

The “Lecturnity Editor” allows for post-production of recorded lectures. It provides the conventional audio and video editing and filtering facilities and allows cutting, copying, and pasting entire lecture parts. A program called “Lecturnity Publisher” is provided to package a recorded lecture for distribution on CD or DVD, as well as for publication on the Internet. The packaged lectures are embedded into HTML and can be played back with RealPlayer, Windows Media Player, or Macromedia Flash. Lectures can be navigated using a slider or by selecting a specific slide from a thumbnail index. Figure 2.7 shows a screenshot from an example lecture replay created with Lecturnity.

2.8 Lectopia/iLecture

The iLecture system [William and Fardon, 2005], recently renamed Lectopia in several countries, is a commercially available system initially developed at the University of Western Australia. In 1998, the first version of Lectopia was developed for the purpose of replacing a small lecture tape library service that supported the automated capture and processing of lectures from audio tape to help out part-time arts students who had difficulties attending lectures. According to [82], “the main driving force behind the development of the iLecture System was the desire to make lectures and associated material available to all students with Internet access, at the time of their choice”. The system assumes a pre-configured classroom, with at least a camera and a microphone. It may also be used for screen capturing. For the recording of blackboard and paper content, the authors recommend the use of a document camera. This is a regular camera, entirely focused on the chalkboard or a sheet of paper. Figure 2.8 shows a sample lecture produced with such a camera and a replay of a lecture recorded with a typical setup.

Figure 2.8: Screenshot of a lecture replay with the iLecture system using a document camera (left) and using a camera for the instructor and additional unsynchronized slides (right). The bandwidth required for the examples is 1 Mbit/s.

The core of the software is the “iLecture Administration Tool”. This is a web-based information system that enables administrators to schedule lecture recordings at certain times. Once a recording has been scheduled, Lectopia takes care of capturing, publishing, and notification of availability of the recording by using various off-the-shelf components, such as RealProducer, Windows Media Encoder, and QuickTime. The actual lecture recording is started by microphone activity and stopped when the time schedule ends (regardless of microphone activity). The lecture can be broadcast live and/or stored in a database that is accessed by remote viewers through the content management system Lasso [37] or the learning management system WebCT [69]. Slides have to be uploaded separately and are not synchronized with the stream, i. e., the user must click on a link to open each slide manually while looking at the recorded video of the lecture. Lectopia also supports the automatic conversion of lectures to formats that are optimized for small devices, such as PDAs or mobile phones. The Lectopia Server also provides an API that allows external information systems to search, retrieve, and change information in the system programmatically. The API is based on the Simple Object Access Protocol (SOAP) [66].
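How such a microphone-activity trigger can work is illustrated by the following sketch. It is not Lectopia's implementation, whose details are not published here; it is a minimal Java example, assuming a plain RMS threshold on the default capture line, with the threshold and buffer size chosen purely for illustration.

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

// Hypothetical microphone-activity trigger: read raw PCM from the default
// capture line, compute the RMS level per 100 ms buffer, and report when
// the level exceeds a threshold (the point at which a recorder would start).
public class MicActivityTrigger {

    public static void main(String[] args) throws Exception {
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false); // 16 kHz, 16-bit mono, little-endian
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();

        byte[] buffer = new byte[3200];          // 100 ms of audio in this format
        double threshold = 0.02;                 // assumed activity threshold (full scale = 1.0)

        while (true) {
            int n = line.read(buffer, 0, buffer.length);
            double sum = 0;
            for (int i = 0; i + 1 < n; i += 2) { // decode 16-bit little-endian samples
                int sample = (buffer[i + 1] << 8) | (buffer[i] & 0xff);
                double s = sample / 32768.0;
                sum += s * s;
            }
            double rms = Math.sqrt(sum / (n / 2.0));
            if (rms > threshold) {
                System.out.println("Microphone activity detected; recording would start here.");
                break;
            }
        }
        line.close();
    }
}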

2.9 Camtasia

Camtasia [12] by TechSmith Corporation is a commercial screen grabbing tool. Although not explicitly developed for lecture recording, it is regularly used [Burcham, 2003] for this purpose by several universities (see, for example, lecture recordings at RWTH Aachen, the University of Freiburg, or Stanford University). The reason for this is that Camtasia's functionality is very similar to Lecturnity (see Section 2.7). The software allows screen grabbing at different frame rates synchronized with audio and an optional video stream, integrated picture-in-picture. The result is a video file that can be encoded in different standard formats, like Windows Media AVI, QuickTime MOV, RealMedia, Macromedia Flash, or animated GIF (without audio). Camtasia provides editing facilities and tools that ease the production of video DVDs. Figure 2.9 shows a replay of a university tutorial recorded with Camtasia from [62].

Figure 2.9: A screenshot of a university tutorial recorded with Camtasia played back in a web browser.

2.10 tele-TASK

Tele-TASK (tele-Teaching Anywhere Solution Kit) [Ma et al., 2003, Meinel et al., 2005] is a university project directed by Christoph Meinel at the Hasso-Plattner-Institut at Potsdam. The main goal of the project is to minimize the efforts required for lecture recording by integrating the required hard- and software into one device, the so-called tCube. Similar to other systems presented in this chapter, the Tele-TASK system grabs the screen content combined with audio and an additional video feed of the presenter. The system assumes a presentation computer, a microphone, and an optional video camera, all connected to the tCube. The project homepage [41] recommends the use of an electronic blackboard to project the presentations onto “in order to be able to add handwritten remarks”. The software of the tCube includes a RealEncoder, which encodes the audio and video stream into RealVideo format. The screen content is captured at one frame every two seconds. The system is able to transmit the recordings live using an additional computer acting as server. However, when transmitting live, the screen content is not compressed and requires a high-bandwidth connection (at least 1 Mbit/s). After the recording session has been terminated, a post-processing step compresses the screen content using differential frames in PNG format. The slides are synchronized with video and audio using SMIL [67]. The compressed recordings can be viewed with a modem (56 kbit/s). Figure 2.10 shows a lecture replay recorded with the tCube.


Figure 2.10: A lecture recorded with tele-TASK.
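The idea of differential frames can be made concrete with a small sketch. This is not tele-TASK's post-processing code; it is a minimal Java example, assuming consecutive screen captures are available as image files, that keeps only the pixels that changed since the previous frame, marks everything else fully transparent, and stores the result as a PNG.

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

// Illustrative differential-frame generator: pixels identical to the previous
// frame become fully transparent; only changed pixels are stored.
public class DifferentialFrame {

    public static BufferedImage diff(BufferedImage previous, BufferedImage current) {
        int w = current.getWidth(), h = current.getHeight(); // assumes equal dimensions
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int p = current.getRGB(x, y);
                out.setRGB(x, y, p == previous.getRGB(x, y) ? 0 : p); // 0 = transparent
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        BufferedImage prev = ImageIO.read(new File("frame0.png")); // hypothetical captures
        BufferedImage curr = ImageIO.read(new File("frame1.png"));
        ImageIO.write(diff(prev, curr), "png", new File("diff1.png"));
    }
}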

2.11 Classroom Presenter

Conference XP is a project conducted by Microsoft Research. The initiative is to explore “how to make wireless classrooms, collaboration, and distance learning a compelling, rich experience by assuming the availability of emerging and enabling technologies, such as high-bandwidth networks, wireless devices, Tablet PCs” [33]. Conference XP is a software development kit (SDK) based on Windows XP that is to serve as a research bed for exploring the creation of distributed applications using Tablet PCs and wireless networks. Conference XP is based on a four-layer architecture: an application layer, a capability layer, a conference API, and a network transport layer. Several complete applications are already included in the SDK; these are mainly configuration wizards and non-specialized, extensible tools, such as the Conference XP client. The client is a generic peer-to-peer video and audio conferencing application. The capability layer provides graphical user interface components that can be reused by applications using the SDK. Both the Conference XP applications and the capabilities layer use the conference API. The conference API provides standard protocols to transfer documents in different formats to enable remote interoperability of applications. It also allows the vector-based transmission of ink strokes. Microsoft Research claims that the API conforms to the IMS/SCORM interchange specification [2]. In addition to interoperability, the API also eases access to the DirectShow and Windows Media API (see Section 2.3) that provide the operating system's audio and video features in Windows. The network transport layer provides access to low-level network transport protocols and is the bottom-most layer in Windows XP. Microsoft Research has teamed up with research organizations and universities in several research projects based on this SDK. One of these projects is Classroom Presenter [Anderson et al., 2006]. It aims to improve “computer-based presentation systems [that] severely limit flexibility in delivery, hindering instructors' extemporaneous adaptation of their presentations to match their audiences. One major limitation of computer-based systems is lack of support for high-quality handwriting over slides, as with overhead projectors and other manual presentation systems” [44]. The system is to combine “the advantages of existing computer-based and manual presentation systems and build on these systems, introducing novel affordances”. Technically, the system is an extension to MS PowerPoint that allows writing on top of slides. When lecturing using Classroom Presenter, the instructor writes on top of images of the slides which are directly projected from a Tablet PC and shown on networked remote computers. Using their own networked Tablet PCs, students can also write on the slides and send these notes to the instructor PC. Figure 2.11 shows sample screenshots of the system.

Figure 2.11: Screenshots of Classroom Presenter: instructor view (left), classroom presentation (right), student view (below). Students are able to make personal annotations; these can also be sent to the instructor computer. Pictures taken from [63].

2.12 A Minimalistic Automated Lecture Recording System

In order to see what can be done using only off-the-shelf components, the FU PowerPoint Recorder was developed by Damian Schmidt. The task was to create a PowerPoint plug-in that allows the recording of video and audio feeds synchronized with the slides. The required user interaction overhead was to be minimized to one mouse click. The recordings had to be replayable in a web browser. The system consists of about 2,000 lines of Visual Basic code and embeds itself into PowerPoint as a macro. When the user enters PowerPoint's presentation mode, an additional dialog screen is shown. The user can press the “cancel” button to proceed with a normal presentation or “ok” to record the presentation. The script initiates the capturing of audio and/or video, provided the operating system detects a sound card and/or a camera. Operating system codecs are used to compress the captured streams. Slide transitions are reported to the script by PowerPoint itself. After the presentation has finished, the script generates a batch program that controls Microsoft Producer. Microsoft Producer converts the recorded data into Windows Media Video format and generates suitable web pages for replay. Figure 2.12 shows a screenshot of a resulting web presentation. It can be replayed using a web browser and Windows Media Player. The entire process does not require any user interaction. The required bandwidth depends on the codecs that are used. A low-quality version of a slide presentation can be replayed with audio and additional video using a 64 kbit/s connection. The experiment shows that a slide-show recorder can be written in a few thousand lines of script code. Animations can be replayed and the system is able to play back handwritten annotations on the slides. Of course, this “poor man's slide recorder” has several drawbacks. It is a proprietary solution that only works for Microsoft PowerPoint in combination with Microsoft Producer. There is no means to integrate any other application. The recorded lectures can only be played back using versions of Internet Explorer and Windows Media Player, which are only available on Windows XP. At least a DSL or cable connection is required to receive such a lecture remotely in adequate quality. Live transmissions are not possible. The encoding process after the lecture takes about 1.5 times the lecture length. This makes it impossible to record two lectures in immediate succession on the same computer.

Figure 2.12: A screenshot of the replay of a presentation automatically generated by the FU PowerPoint Recorder.


2.13 Other Systems

There are numerous other systems that cannot be explained in detail here. One of them is DyKnow, which is commercially distributed by DyKnow, Inc. [15]. The system is very similar to Classroom Presenter, with the exception that it is not an extension to MS PowerPoint but brings its own core program, which looks like a standard vector-drawing application. VCPlayer by [Sheng et al., 2005] is also very similar to Classroom Presenter, but it allows embedding a small video of the instructor and is not specialized for Tablet PCs. Tegrity by Tegrity, Inc. [55] is very similar to iLectures. The main differences are that it allows students to add their own handwritten notes to a recorded lecture and that the recording system is sold as an all-in-one hardware solution. The TeleTeachingTool (TTT) [Ziewer and Seidl, 2004], developed at the University of Trier, is very similar to AOF. A Java-based client application plays back annotated slides together with the voice and a small video of the instructor. The Virtual Director [Machnicki and Rowe, 2002], developed at UC Berkeley, helps to automate the process of producing Internet webcasts. It saves manpower by enabling several webcasts to be run by a single technician. The system selects which streams to broadcast and controls other equipment, such as moving cameras, to track the speaker. The automation of multimedia productions is also researched in [Davis, 2003a, Davis, 2003b]. A “proactive capture device” interacts with the user and the environment in order to create annotations and footage for every sequence of pictures captured. The picture sequences are then collected into a pool and can serve as building blocks for different movies. The underlying motivation is very similar, namely making multimedia content creation easier for everybody. The work, however, aims more at automating video direction in a cinematographic sense.

2.14 Conclusion

Internet broadcasting systems were created to perform traditional radio and TV broadcasting over the Internet. They are designed to fit into the work flow of a commercial radio or TV broadcasting station. Using them in a university for lecture broadcasting actually requires technical personnel for operation. Because such personnel is seldom available, several university projects try to automate lecture recording and transmission by building automation mechanisms on top of these systems. The biggest downside of this approach is that all content has to be transmitted and archived as a video or audio track. Slides as well as desktop activity are often captured by a screen grabber and then encoded as video. Often, this results in poorly encoded material that needs much more bandwidth than necessary. Even when proper screen codecs are used that avoid some encoding problems, simply starting several monolithic stream encoders simultaneously is not an optimal solution for recording and archiving lectures. The multimedia content creation work flow assumed by Internet broadcasting systems (illustrated in Figure 2.1) is short-cut: There is no audio or video technician supervising the recording of the lecture and there is no editing step between encoding and transmission. This results in quality degradation which is especially noticeable in the audio track (Chapter 7 will discuss this topic in detail). Mapping any content to video format results in a huge semantic gap.


It does not allow for domain-specific encoding and makes post-editing and partial reuse of archived content cumbersome. Using video formats for replaying text or handwriting that could otherwise be played back in vector-graphic formats is also bandwidth-inefficient4. When using traditional Internet broadcasting systems, the computer is only used for transmission, storage, and replay of audio and video. Without manual editing, the integration of different content streams is reduced to a simple parallel playback in different windows on the desktop. Chapter 9 will report on several problems that result from this lack of integration.

4An interesting early comment on this topic on the web can be found at [20].


Chapter 3

The E-Chalk System

In 2000, upon the initiative of Raul Rojas, the E-Chalk project [73] started with the idea of “creating an update of the traditional chalkboard” [Rojas et al., 2001a, Rojas et al., 2001b]. Since then, the system has evolved in response to feedback from its users. In the computer science department at Freie Universitat Berlin alone, about 300 lecture recordings1 have been created since 2001. The following sections summarize both the ideas that motivated the creation of E-Chalk and the thoughts that evolved in response to the user feedback.

3.1 E-Chalk’s Philosophy

Main Target: Classroom Teaching

A teacher’s main priority is classroom teaching, not least because that is what he or she is paid for today. Teachers will not accept a tool with which their experience and practical knowledge, gained through a lifetime of teaching, cannot be reused. An excellent teacher should remain an excellent teacher no matter whether he or she makes use of electronic aids or not. There should be as little overhead as possible involved in becoming familiar with a given technology. Hence, any tool has to ensure that it conforms to the teacher’s established working habits while offering added value. Since every teacher has different ideas about what a good lecture looks like, teachers need to be able to customize the tools according to their preferences and ideas.

Support Both Prepared and Ad-hoc Lectures

A smooth lecture performance directly depends on the quality of the preparation. Preparation consists of gathering content, structuring the lecture, and preparing other material such as charts, figures, or pictures. There is no doubt that multimedia elements are very useful teaching tools. Consequently, the increasing use of such elements leads to an even higher preparation effort. In order to avoid redundant work (e. g., pre-sketching the lecture on paper), computer-based education tools must support this process in a convenient way. Convenient here means preserving as much freedom as possible while allowing as much structure as needed.

1At the time of writing this document, I count 294 recorded courses (not including experimental recordings, showcases, and special events).


Figure 3.1: This sketch, drawn by Raul Rojas, illustrates the general idea of the E-Chalk system. A bald-headed professor walking with a cane is able to work with the intelligent electronic chalkboard just as he used to with a traditional chalkboard. The lecture is archived and transmitted via the Internet and can be received on a variety of devices, including mobile phones and PDAs.

A teacher should still be able to control the amount and order of elements that are to be included in the lecture. An experienced professor, for example, is able to give an excellent lecture spontaneously, backed by a lifetime of experience only.

Support Distance Teaching as a By-product

Synchronous remote teaching, such as video conferencing, can allow for courses that, due to limited resources, would otherwise be impossible to implement. In addition to helping students recall past lecture content, asynchronous remote teaching assists them in reviewing past lectures in order to catch up on missed content or to prepare for examinations and, if they are physically handicapped, gives them greater overall access to lectures. Additionally, institutions promote lecture recording and broadcasting because they anticipate that these archives will expand the institution’s knowledge base and enhance its prestige. Synchronous and asynchronous teaching offer a valuable enhancement for students, but teachers will implement them only if doing so entails as little overhead as possible.

Focus on the Chalkboard, not on Slides

The main advantage of slides (discussed, for example, in [Holmes, 2004]) is their reusability, which is at the same time their biggest downside: Slide-show presentations often appear static because everything has to be planned in advance, leaving little room for the teacher to adapt the content in interaction with the students. This often means that information is either delivered out-of-band, i. e., not presented in the slides (for example, many lecturers prefer to draw their diagrams by hand instead of spending hours of preparation in advance), or that students are overwhelmed by the huge number of slides displayed in rapid succession.


Usually, slides present content as bullet-point lists. This dramatically restricts the expressiveness of the lecturer. A famous, but also controversial [9], critique of PowerPoint was published by Edward Tufte [Tufte, 2003]. Among the most important points he raises: Slides are used more to guide and reassure the presenter than to enlighten the audience; they encourage simplistic thinking because ideas are forced into bulleted lists; and the presenter and the audience are forced to progress linearly through the presentation, sometimes causing them to miss the big picture.

Slide-show presentations are used very frequently. However, in several subjects, especially in the sciences, it is often said that they do not meet the instructor’s requirements. Especially in mathematics and physics, the journey is the reward: one is not interested in results presented on slides but in the development of the thoughts that led to them. The chalkboard has been an established teaching tool for many decades. The lecturer thinks aloud while writing on the board. The students have enough time to understand an idea or formula, ask questions, and reflect on the contents of the lecture.

Integrated Concept

Figure 3.1 is a sketch drawn by Raul Rojas that illustrates the overall concept of the E-Chalk system. In essence, the idea was to create a tool that substantially facilitates the preparation and holding of lectures, allows the integration of various media, and supports (a)synchronous remote teaching with little or no overhead for the lecturer. The most important policy is to provide as much assistance as possible while permitting the greatest degree of freedom during class. The primary purpose of E-Chalk is to support the teacher in the classroom. Creating distance-learning courses for the World Wide Web [Raffel, 2000] is only a secondary consideration. E-Chalk is actually a conceptual idea that integrates many components, involving hardware as well as software. However, for the rest of this text, the word “E-Chalk” or the words “E-Chalk system” denote the E-Chalk software system.

E-Chalk is representative of the idea of technically augmented classroom teaching and, due to its additional distance-teaching features, can also be considered an implementation of blended learning.

3.2 The Software System

The next sections are mainly a summary of [Friedland et al., 2004e]. They provide an overview of the features of the E-Chalk software. The design of the chalkboard simulation and the integration of different interactive content on the electronic chalkboard is described in [Knipping, 2005]. An overview of the project and a list of contributors is presented in Appendix A.

E-Chalk transforms the computer screen into a black surface where one can paint using different colors and pen widths. The screen becomes a visual output tool that delivers content to more than one person. This requires that everything shown on the display can be watched and understood by many people. The keyboard loses its importance and needs to be used only occasionally. Any required use of the keyboard is to be avoided since it breaks the flow of the lecture.


Figure 3.2: An E-Chalk setup for larger lecture halls: The instructor is writing on a pen-sensitive whiteboard and a second projector is used to enlarge the content for the students.

Furthermore, the mouse is not an adequate input device in a classroom scenario because the lecturer is standing in front of an audience rather than sitting in front of a desktop. These considerations dominate the entire design of the chalkboard simulation software and any software that is to be run during the lecture.

The board can be scrolled up and down vertically, providing the lecturer with an unlimited surface to write on. The user can also use an eraser to delete parts or all of the board content. The board software allows the placement of diverse visual and interactive material on the board. The material may be imported from the Internet during a lecture but is usually chosen from a set of predefined bookmarks in order to avoid having to use the keyboard during a lecture. The elements that are most prominently used to enrich an E-Chalk-based lecture are diagrams and photographs. These may also be retrieved from the Internet during a lecture using an interface to Google Image Search [77]. Images inserted in the board can be directly annotated. Bookmarked sets of images may be used to present slides one after another on the electronic board. The advantage here is that the slides can be annotated, which enables the combined use of slides and the chalkboard. The E-Chalk system also allows the insertion of certain2 Applets. This allows the integration of educational mini-applications, as many of them are available as Applets. Access to CGI scripts [68] has been implemented as a way of interfacing web services. The board shows both textual and graphical responses. E-Chalk also provides an interface to the algebra systems Mathematica, Maple, and MuPAD. It also integrates a small self-made algebra system called Russmans Minimatica. The latter is able to solve simple arithmetic expressions and provides 2D and 3D function plots, including trigonometric functions.

2For a detailed discussion see [Knipping, 2005], Section 4.10.


Figure 3.3: A demonstration of the four-screen interactive datawall construction at Freie Universitat Berlin for middle-sized seminar rooms.

To enter mathematical problems to be solved by the interfaced systems, E-Chalk provides integrated mathematical handwriting recognition [Tapia, 2005, Friedland et al., 2004d].

Finally, the E-Chalk board system provides a unique interface for stroke-based applications, called Chalklets. These applications interact exclusively through strokes, i. e., they are able to recognize drawings and gestures from the screen and respond by drawing their results on the board. Several Chalklets have already been developed, among them a Chalklet that simulates logic circuits [Liwicki and Knipping, 2005], a chess interface [Block et al., 2004a], a Python programming environment, and a Chalklet that can be used to log geometry proofs.

In order to allow the advance preparation of chalkboard lectures, the board software supports so-called macros. Macros are pre-recorded series of events that a lecturer may call up and replay on the board during a lecture. To record a macro, the instructor draws in advance the portions of the lecture which he or she wants to store. During the lecture, macros are replayed either at original speed or using an acceleration factor. Automatically generated macros can be used for visualization purposes [Arguero, 2004].
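The replay mechanism can be pictured as a loop over time-stamped board events whose inter-event pauses are divided by the chosen acceleration factor. The following sketch illustrates the idea in plain Java; the BoardEvent interface, the MacroReplayer class, and the drawing callback are invented for this example and do not reflect the actual E-Chalk implementation.

    import java.awt.Graphics2D;
    import java.util.List;

    // Illustrative sketch only: replays a list of time-stamped board events,
    // optionally accelerated. All names are hypothetical.
    public class MacroReplayer {

        public interface BoardEvent {
            long timestamp();               // milliseconds since the macro started
            void drawOn(Graphics2D board);  // apply the event (stroke, erase, image, ...)
        }

        public void replay(List<BoardEvent> events, Graphics2D board, double acceleration)
                throws InterruptedException {
            long previous = 0;
            for (BoardEvent event : events) {
                long delay = (long) ((event.timestamp() - previous) / acceleration);
                if (delay > 0) {
                    Thread.sleep(delay);    // wait the (possibly shortened) original pause
                }
                event.drawOn(board);
                previous = event.timestamp();
            }
        }
    }

With acceleration = 1.0 the macro is replayed at original speed; larger factors shorten all pauses proportionally.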

3.3 Usage Scenarios

The E-Chalk software works with a variety of hardware that instructors can use to replace the traditional chalkboard. The ideal electronic chalkboard would be a large, pen-sensitive screen with high display and sensor resolution. The display must offer good contrast, so that the visual quality is comparable to that of a real chalkboard; e. g., it should not be necessary to darken the room for the lecture. However, an idea like E-Chalk depends on the instructor’s and the students’ preferences. [Friedland et al., 2003] studies different types of hardware setups that people have found useful.

Having many students in a lecture room requires a very big display surface. The practical solution is to use a digital whiteboard (for example from GTCO CalComp, Inc. [28], Hitachi, Inc. [24], Numonics Corporation [36], or Smart Technologies, Inc. [49]) or a digitizer tablet (for example from Wacom Corporation [91]) as a writing surface, plus an extra projector that projects the board content at large size.


Figure 3.4: Sample setup where the E-Chalk system is combined with a conferencing system for bidirectional synchronous remote lecturing.

The Technische Universitat Berlin, for example, uses this setup regularly for mathematics lectures for undergraduate engineering students [Friedland et al., 2005f]. Figure 3.2 shows this setup.

Because large pen-sensitive screens were not available, we built a low-cost, scalable, interactive datawall in a dedicated seminar room at Freie Universitat Berlin [Friedland et al., 2005g]. Figure 3.3 shows a photograph of the setup. The datawall is operated by two off-the-shelf PCs, one to actually work on and a second one for the pen-tracking system. A multi-head graphics card controls four screens with a total projection area of 1.15 m × 6.13 m and a resolution of 4096 × 1024 pixels. In order to keep the depth of the datawall compartment small and to avoid using expensive wide-angle projectors, the screen images are reflected by mirrors. The instructor uses a special stylus, a laser pointer with a touch-sensitive tip, to write on the board. When the pen touches the screen, the laser lights up. A vision system uses four regular web cams to capture a video image from the back of the datawall in order to track the red laser spot. The laser pointer’s position is mapped into the coordinates of the display system, which runs a Java-based client program emulating a mouse device [Jantz et al., 2006].
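The mouse emulation on the display PC can be sketched with the standard java.awt.Robot class, which injects synthetic mouse events. The following code is only an illustration under that assumption; the class name, the scaling logic, and the callbacks are invented and do not show the actual tracking client.

    import java.awt.AWTException;
    import java.awt.Robot;
    import java.awt.event.InputEvent;

    // Hypothetical sketch: maps a tracked laser-spot position (camera
    // coordinates) to screen coordinates and emulates pen-down/pen-up.
    public class LaserMouseEmulator {
        private final Robot robot;
        private final double scaleX, scaleY;   // camera-to-screen scaling
        private boolean penDown = false;

        public LaserMouseEmulator(double scaleX, double scaleY) throws AWTException {
            this.robot = new Robot();
            this.scaleX = scaleX;
            this.scaleY = scaleY;
        }

        // Called by the vision system for every frame in which the laser spot is visible.
        public void spotDetected(int camX, int camY) {
            robot.mouseMove((int) (camX * scaleX), (int) (camY * scaleY));
            if (!penDown) {                     // pen touched the screen: press the button
                robot.mousePress(InputEvent.BUTTON1_MASK);
                penDown = true;
            }
        }

        // Called when no laser spot is visible, i.e. the pen was lifted.
        public void spotLost() {
            if (penDown) {
                robot.mouseRelease(InputEvent.BUTTON1_MASK);
                penDown = false;
            }
        }
    }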

For smaller seminars, a setup with several digitizer tablets is often used, where students are able to participate directly in the lecture. A geology lecturer, for example, often uses a rear projector while the students use small digitizer tablets to work on geographical maps. A handicapped professor of Arabic linguistics was glad to be able to give a chalkboard lecture while seated, using a digitizer tablet himself instead of writing on the rear-projection screen. This scenario is also useful for a German high school: The computer room of this school is so small that neither a chalkboard nor student tables big enough for both a computer and writing space fit in. The teacher uses E-Chalk with a digitizer tablet and projects the board content onto the wall.


Figure 3.5: This screenshot shows Exymen [Friedland, 2002a] editing an archived E-Chalk lecture that contains audio, board content, and an additional video.

The generated PDF of the lecture is printed out at the end of the class, so that the students can get a copy of the lesson even though they do not have enough space for writing.

Because everything is recorded for the web (see Section 3.4), it is also possible to give a lecture at home and present it later. Many high school teachers reported that they find this a practical feature because they can easily create classes for the students to review at home. This makes it possible to teach more content.

E-Chalk is also often used in combination with video conferencing systems, which enables having audiences in several remote locations at the same time. An example application for this scenario is following chalkboard lectures by important personalities that are held at a different place. Figure 3.4 shows a cooperative lecture between FU Berlin and Kyoto University.

3.4 Distance Teaching

When an E-Chalk lecture is closed, a PDF transcript of the board content is generated automatically. The transcript can be generated both in color and in black and white. The lecture can also be transmitted live over the Internet and can be synchronized with video-conferencing systems (such as Polycom ViaVideo [40]) for student feedback. Remote users connect to the E-Chalk server to view everything as seen in the classroom. They can choose to receive the audio and, optionally, a small video of the teacher. The connection speed needed for a complete E-Chalk lecture with blackboard image, audio, and video is roughly 128 kbit/s. Without the video stream the required connection bandwidth does not exceed 64 kbit/s. The most convenient way for users to follow a lecture is to use the Java-based playback. In this case nothing but a Java-enabled web browser is required. When viewing archived lectures, the remote user sees a control console that enables controlling the playback of the content in the same way as with a VCR, i. e., pause, fast-forward, and rewind.


There is no need to manually install any plug-in or client software. Other options include following the lecture in MPEG-2 format on a DVD, or using a Java-enabled PDA, a third-generation mobile phone with RealPlayer, or an Apple iPod capable of playing MPEG-4 videos. Chapter 5 provides a detailed discussion of E-Chalk lecture playback.

3.5 Editing Lectures

Although E-Chalk has been conceived as a tool for capturing live and spontaneous lectures, it is only natural that instructors would sometimes like to edit the recording of a lecture, in order to erase and correct errors, or just to eliminate superfluous material. For editing E-Chalk lectures, the program Exymen (Extend Your Media Editor Now!) was developed [Friedland, 2002a, Friedland, 2002b].

Exymen allows editing audio, video, slide-show, and board content in any format that E-Chalk supports (see Section 4.9). Additionally, E-Chalk lectures may be enhanced with external material. The editor allows for spatial and temporal cut, copy, and paste of pieces of lectures. It is able to handle audio, video, and board streams separately, for example for dubbing lectures for a different audience. The editor contains a set of audio and video filters that can be applied to a lecture. Exymen may also be used to convert E-Chalk lectures to different formats, such as MPEG video. Figure 3.5 shows a screenshot of the editor.


Chapter 4

Server Architecture

A software system that is to automatically integrate different types of content streams also needs an architecture that fundamentally supports this. This chapter provides a conceptual overview of the architectural approach of the E-Chalk server system. Further details can be found in [Friedland and Pauls, 2004], [Friedland and Pauls, 2005a], [Friedland and Pauls, 2005b], and Appendix B.

4.1 Preliminary Considerations

As could be seen in the discussions of the previous chapter, schools and universities are a heterogeneous playground. A software system that wants to achieve sustained success in more than a few institutions has to be able to survive in an environment consisting of different software and hardware configurations. It should be able to adapt to different software ideologies (e. g., it should not interfere with political discussions about operating systems). The software has to fit into the existing hardware infrastructure and should readily combine with other multimedia applications. Structural modifications, such as the addition of new media, changes in technical formats, upgrades to new hardware, or functional enhancements, should not cause tremendous administrative overhead. A teacher must be able to step into the classroom and start lecturing as usual. Reliability should also be considered. It is important to be able to continue working when individual parts of the system fail, at least on the level of switch-over and/or backup facilities.

In the beginning, E-Chalk basically consisted of three monolithic servers: the chalkboard simulation and server, the audio server, and the video server. The three servers were started simultaneously by the E-Chalk Startup Wizard (see [Friedland et al., 2002]), a GUI wizard that is shown on startup to handle the configuration of the E-Chalk server before the beginning of a lecture. Figure 4.1 illustrates the setup.

As E-Chalk was being used in more and more universities, the heterogeneity of university hardware and software prerequisites required more and more special solutions. A common case was the following: After the initial introduction of some lecture recording system, professors wanted to give chalkboard-based lectures while the students and the infrastructure of the university had been optimized for a lecture recording system that only supported slide-show presentations.


Figure 4.1: An overview of the old architecture of the E-Chalk server: The E-Chalk Startup Wizard starts the monolithic audio, video, and board servers.

E-Chalk had to fit into already established software configurations and work flows in different departments and subject areas. For example, E-Chalk had to be combined with other universities’ lecture recording systems or with commercial Internet broadcasting systems. Users wanted to use codecs different from those built into the system. The early monolithic architecture did not allow proper integration into many different systems. Any update of the software system not only required a manual patch of the source code, it also required a complete re-installation by the administrators of the “client university”. Lectures archived in formats other than the built-in E-Chalk formats could not be edited using Exymen. Last but not least, E-Chalk is a research project and thus constantly subject to change. This forced us to create an architecture that would provide us with both system stability and the possibility of rapidly integrating new ideas.

Even though most commercial multimedia streaming systems provide extensibility through an SDK (see Section 2.3), these SDKs are not only complicated to use but also too specialized and proprietary. A common subset, like a compatibility layer, is missing. Updates of codecs often force customers to re-install certain components (sometimes without them even knowing about it). Introducing a new medium results in administration work and also in an update of clients and associated tools, see for example [Bacher et al., 1997].

This chapter presents a system called SOPA, the Self-Organizing Processing and Streaming Architecture, that was built as a reaction to these different demands [Friedland and Pauls, 2005b]. The system eases the development pains of applications in need of an extensible streaming and processing layer while decreasing the administrative maintenance workload. It tries to provide a rounded solution that serves as an extensible framework for managing software components (sometimes also called plug-ins). The system allows the synchronization of different independent streams, such as slides and video streams, and proposes a format-independent notion to describe the handling of concrete content, for example, to convert from one multimedia format into another. SOPA encourages the development of compatible codecs and filters in a community, makes it easy for system administrators to use and to search for available codecs, and supports changing the server configuration when a client connects instead of requiring the client to download a plug-in.


Exymen also builds on SOPA, which allows it to edit content that was created with newly introduced codecs.

4.2 Existing Multimedia Architectures

Although none of the following architectural approaches could be directly used as a bottom-layer architecture for E-Chalk, many concepts of E-Chalk’s server architecture had already been introduced in a similar way before. This section therefore provides a short survey of the existing solutions for multimedia systems and explains their differences to SOPA.

Indiva [Ooi et al., 2000] stands for INfrastructure for DIstributed Video and Audio and is based on the Open Mash project. It is a middleware layer providing a unified set of abstractions and operations for hardware devices, software processes, and media data in a distributed audio and video environment. These abstractions use a file-system metaphor to access resources and high-level commands to simplify the development of Internet webcast and distributed collaboration control applications. It uses soft-state protocols for communication between individual processes. Indiva focuses very much on distributed programming. The smallest elements Indiva handles are processes, not classes. The configuration is simply static and there are no automatic stream-synchronization mechanisms.

Mac OS X’s audio core by Apple contains a package called Audio Toolbox [Apple Inc, 2001]. Inside the toolbox, one can find the AUGraph SDK. An AUGraph is a high-level representation of a set of so-called AudioUnits, along with the connections between them. AudioUnits are used to generate, process, receive, or otherwise manipulate streams of audio. They are building blocks that may be used in isolation or connected together to form the audio signal graph. Information to and from AudioUnits is passed via properties. AudioUnits are identified by a string-based, proprietary, hierarchic identification mechanism. One can use the API to construct arbitrary signal paths through which audio may be processed, i. e., a modular routing system. The API deals with large numbers of AudioUnits and their relationships. AUGraphs allow realtime routing changes, which means connections can be created and broken while audio is being processed. The API is restricted to audio and – although available in Java – can only be used on Mac OS X. The AUGraph SDK has no concept of self-configuration or distributed programming.

Microsoft DirectShow [31] is a component architecture for multimedia streaming and processing which is part of DirectX, a support package for the Windows operating systems. DirectShow features dynamic assembly of stream-processing graphs. However, DirectShow contains no layer on the administrative end, i. e., assembly has to be done in source code. DirectShow is not platform independent and does not feature a remote discovery mechanism for components.

Sun Microsystems delivers a quite general and platform-independent framework that hides the implementation details of several media formats: the Java Media Framework (JMF) [84]. JMF is a Java API that supports capture, playback, streaming, and transcoding of audio, video, and other time-based media.


It also provides a plug-in architecture that enables developers to support custom data sources and sinks, effect plug-ins, and codecs. The architecture is deduced from the properties and technological restrictions of the supported hardware and the implemented formats. Although its plug-in loading mechanism can load classes at runtime, it does not offer package dependency checking or automatic updating, as one would expect from a well-defined component management. It is therefore not suitable as a base for a dynamically configurable system.

The demands discussed above are very similar to the problems that have to be solved by the manufacturers of digital TV set-top boxes, the difference being that the software inside a set-top box can often rely on predefined hardware. HAVi [80] stands for Home Audio Video Interoperability and is a standard for networking home entertainment devices defined by several major electronics companies. It specifically focuses on the transfer of digital audio/video (AV) content between HAVi devices, as well as the processing (rendering, recording, playback) of this content by these devices. HAVi provides a Java API for stream management and device control. However, HAVi is targeted at consumers’ home audio/video networks and dictates the use of Firewire (IEEE-STD-1394-1995) as a transport mechanism. It aims at connecting hardware devices and is a protocol layer on top of IEEE 1394. Standards similar to HAVi also exist from ISO (International Organization for Standardization) and ETSI (European Telecommunications Standards Institute). ETSI standardized an open middleware system called Multimedia Home Platform (MHP), which is part of the Digital Video Broadcasting (DVB) specifications [72]. ISO’s Home Electronic System (HES) aims to “standardize software and hardware so that manufacturers might offer one version of a product that could operate on a variety of home automation networks” [Milutinovic, 2002].

Although not specifically a multimedia architecture, a related component-assembly system is Gravity [R.S. Hall and H. Cervantes, 2003]. Gravity is a research project investigating the dynamic assembly of applications and the impact of building applications from components that exhibit dynamic availability, i. e., components may appear or disappear at any time. Gravity provides a graphical design environment for building applications using drag-and-drop techniques. Using Gravity, an application is assembled dynamically and the end user is able to switch between design and execution modes at any time. The architecture presented here is driven by the same idea but specializes in stream processing. Gravity, however, implements automatic service binding, meaning that additional meta-data is used in order to specify the dependencies of a service. The service binder makes sure that component dependencies are satisfied and binds the services provided by the components automatically to the application. The architecture presented here relies on a model where services are bound upon user request. For this reason, Gravity cannot be used directly because it does not have a means of letting the user specify dependencies dynamically.

4.3 Architecture Overview

In the proposed architecture, the end-user application is built on top of a component-assembly mechanism, which in turn uses a component framework as a plug-in mechanism and a component search engine as a deployment mechanism.


Figure 4.2: Conceptual diagram of the proposed service-oriented architecture for multimedia applications (left) and its concrete implementation in the E-Chalk server system (right).

Figure 4.2 shows both the general model of the architectural approach described here and E-Chalk’s server architecture. The architecture is based on an execution platform – the operating system or a virtual machine. The component framework provides mechanisms for installing, updating, and deleting components. It also takes care of component dependencies and manages their lifecycles. E-Chalk uses the Java Virtual Machine by Sun Microsystems as its execution platform. The Java Virtual Machine runs on a variety of hardware and operating systems and therefore provides platform independence at the price of a slight memory and execution performance overhead. On top of the virtual machine, Oscar is used as the component-management framework. Oscar is an open-source implementation of the OSGi standard [3]. Oscar manages the installation, update, and removal of special Java archives, so-called Bundles. Oscar also handles the lifecycle of Bundles as well as dependencies between them. Bundles can be searched for on the Internet using Eureka, an Apple Bonjour-based component-discovery and deployment engine. SOPA manages the assembly of a set of specialized Bundles, so-called media nodes, to form one or more media graphs. A media graph connects different codecs, filters, or other stream-processing units in order to perform a certain operation. On top of the processing graphs, E-Chalk’s Startup Wizard as well as the board application control the life cycle of the entire system. In the following sections, E-Chalk’s concrete implementation will serve as an example of how the approach can solve several of the problems discussed in Section 4.1.

4.4 Java as Execution Platform

A thorough discussion of why Java may be favored over a native implementation would go well beyond the scope of this dissertation. This section only provides a brief summary of the relevant experiences that played a role when developing the E-Chalk system. The decision to use the Java Virtual Machine as the primary execution platform for E-Chalk was made very early in the project. The primary reason, back in the beginning of the project, was platform independence. Java allows the system to run on any available machine in the university.


E-Chalk’s capability to run on Windows, Linux, Mac OS, Mac OS X, or Sun OS was actually one of the main reasons why other educational institutions found it easy to try out the system. A second important reason for using Java was the possibility of having one rendering engine for both the board server and the board client. The board client could be written as a Java Applet and the board server as a Java application. The same code is used for displaying content in the classroom and in the web browser [Raffel, 2000]. This approach did not only save development work, it also helped to minimize differences between the classroom view and the remote view. Slight differences due to varying hardware properties (such as screen resolution) and different virtual machine versions might still be perceived, though. Some computer scientists might object to using the Java Virtual Machine as an execution platform because of its poor performance – especially when it comes to audio and video processing. The Java developer community has been discussing and comparing the execution speed of Java programs versus native programs for several years now. The actual numbers vary, but even older articles (see for example [Prechelt, 2000, Shirazi, 2003, Mathew et al., 1999], [46, 52]) agree that the speed penalty and memory overhead are tolerable when weighed against programming productivity. A disadvantage of using a platform-independent virtual machine is that special features of a certain system cannot be utilized. A platform-independent virtual machine can only implement a common subset of the functionality of the platforms it supports. Special features, like using the pressure sensitivity of digitizer tablets, are for example not part of the Java Virtual Machine. To get around this problem, native code sometimes had to be used and encapsulated using the so-called Java Native Interface (JNI). Of course, native code has to be implemented individually for all targeted platforms. Java is a purely object-oriented language and supports dynamic class loading. This feature allowed the implementation of many of the board’s dynamic functionalities, like integrating third-party Applets or Chalklets. It has also spawned many implementations of component-management frameworks. Last but not least, Java binaries are rather small. None of the components described in this chapter is larger than 1 MB. The next section will introduce Oscar, the component-management framework that is used as the underlying layer in E-Chalk.
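As an illustration of the JNI approach mentioned above, a platform-specific feature can be hidden behind a small Java wrapper that loads a native library. The class and library names below are hypothetical and only sketch the general pattern, not the actual E-Chalk code.

    // Hypothetical sketch of wrapping a platform-specific tablet feature via JNI.
    public class TabletPressure {

        static {
            // Loads libtabletpressure.so, tabletpressure.dll, etc., which must be
            // implemented separately in native code for every target platform.
            System.loadLibrary("tabletpressure");
        }

        // Implemented in native code against the platform's tablet driver API.
        public static native float currentPressure();
    }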

4.5 The Component Framework

Component orientation is becoming increasingly popular in modern applications and is more and more being discussed for multimedia architectures [Nahrstedt and Balke, 2004, Nahrstedt and Balke, 2005]. The concept of a component is broad and includes plug-ins and other units of modularization. In this text, a software component is defined as “a binary unit of composition with contractually specified interfaces and explicit context dependencies only” [C. Szyperski, 1998]. The notion of component orientation is strongly connected to the idea of object orientation. Compositions of components are usually created by an actor (either the user or another software program) that instantiates some components through a managing framework. The instances are then appropriately connected to each other by this actor.


Component models and frameworks include the Component Object Model (COM) [Box, 1998], JavaBeans [Sun Microsystems Inc, 1997], Enterprise Java Beans (EJB) [Sun Microsystems Inc, 2000], the CORBA Component Model (CCM) [Object Management Group (OMG), 1999], the OSGi standard, Jini [83], and Avalon [4]. EJB and CCM support non-functional aspects such as persistence, transactions, and distribution. OSGi, Jini, and Avalon are so-called service-oriented platforms. Service orientation shares the component-orientation idea in that applications are assembled from independent building blocks. However, the essential building blocks are not components but the services they are providing. In other words, a component can provide more than one service. A service is a functionality that is contractually defined in a service description, for example as a Java interface. The idea of service orientation is that application assembly is based only on these service descriptions and the actual components are located and integrated into the application later, either prior to or during execution of the program. A more detailed discussion of component- and service-oriented programming can be found in [Cervantes and Hall, 2004].
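To make the notion of a service description concrete, the following hypothetical Java interface could serve as the contract for an audio-encoding service; the name and methods are invented for illustration, and any number of components could provide competing implementations of it.

    // Hypothetical service description: the contract only, no implementation.
    public interface AudioEncoder {

        /** Returns the MIME type of the output this encoder produces. */
        String outputFormat();

        /** Encodes one block of raw PCM samples and returns the compressed bytes. */
        byte[] encode(byte[] pcmSamples);
    }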

The OSGi specification defines a framework that is an execution environment for services. Compared to related specifications, the core framework is very small. The specification does not refer to too many concepts, and the OSGi initiative is trying to keep it compact since it is still targeted at restricted environments, such as embedded devices. Implementations are therefore kept small and efficient. OSGi can, however, be used in other domains, for example as a support infrastructure underlying the Eclipse IDE [The Eclipse Foundation, 2003]. Even before Eclipse, Exymen had introduced the use of the OSGi specification on the desktop for building an extensible multimedia editing application [Friedland, 2002a]. Both Exymen and E-Chalk are implemented on top of the Open Service Container Architecture (Oscar) [Hall and Cervantes, 2004] [23], an open-source implementation of the OSGi specification [The Open Services Gateway Initiative, 2003] that has recently been renamed to Felix. However, the approaches are not limited to this particular OSGi framework implementation and should also be deployable to any other standard OSGi framework.

Oscar was created with the goal of providing a compliant and completely open OSGi framework implementation. Work on the Oscar project was started in December 2000 by Richard S. Hall. Technically, the OSGi service framework can be seen as a custom, dynamic Java class loader and a service registry that is globally accessible within a single Java Virtual Machine. The custom class loader maintains a set of dynamically changing Bundles that share classes and resources with each other and interact via services published in the global service registry. Oscar is almost fully compliant with the OSGi specification releases 1 and 2 and largely compliant with release 3. The OSGi specification is a document that contains more than 600 pages [The Open Services Gateway Initiative, 2003]; therefore the next paragraphs will summarize only the most important facts that are relevant to E-Chalk.

The OSGi framework defines a unit of modularization, called a Bundle. Physically, a Bundle is a Java JAR file that groups all classes, together with their resources (native libraries, icons, help files), into a component. Archive attributes, among them the dependencies on other Bundles, are described in the JAR file’s Manifest (a minimal example is sketched after the list of states below). Every Bundle contains a Java class that implements the org.osgi.framework.BundleActivator interface. A BundleActivator provides two methods: start() and stop(). The framework provides dynamic deployment mechanisms for Bundles, including installation, removal, update, and activation. Figure 4.3 shows that a Bundle can be in one of the following states:


Figure 4.3: A state diagram of the life-cycle management provided by Oscar. Drawing after [The Open Services Gateway Initiative, 2003].

• active - the Bundle is now running,

• installed - the Bundle is installed but not yet resolved,

• resolved - the Bundle is resolved and is able to be started,

• starting - the Bundle is in the process of starting,

• stopping - the Bundle is in the process of stopping, or

• uninstalled - the Bundle is uninstalled and may not be used.
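As announced above, a Bundle Manifest might, for instance, contain headers like the following. The concrete values are made up for this example, but the header names (Bundle-Name, Bundle-Version, Bundle-Activator, Export-Package, Import-Package) are standard OSGi manifest attributes.

    Bundle-Name: Example Audio Codec
    Bundle-Version: 1.0.0
    Bundle-Activator: org.example.codec.ExampleActivator
    Export-Package: org.example.codec
    Import-Package: org.osgi.framework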

Oscar mainly provides the following operations to manage Bundles:

• Install a Bundle from a URL. The Bundle is downloaded and archived in a local repository. The Bundle enters the installed state.

• Start a Bundle. If all dependencies of the installed Bundle can be resolved, the start() method of the Bundle is called and the Bundle enters the active state.

• Stop a Bundle. The stop() method of the Bundle is called and the Bundle changes to the resolved state.

• Uninstall a Bundle. This stops the Bundle and tags it as uninstalled. The Bundle is removed at the next refresh.

• Update a Bundle. This puts a new JAR file in place without refreshing it. The Bundle enters the installed state.


• Refresh a Bundle. Refresh causes all Bundles that depend on installed or uninstalled Bundles to stop. Oscar resolves any updated Bundles and then restarts them all, if possible, effectively creating new instances for every dependent Bundle.

After a Bundle is installed, it can be activated if all of its Java package dependencies are satisfied. Bundles can export and/or import Java packages to and/or from each other. The OSGi framework automatically manages the package dependencies of locally installed Bundles. After a Bundle is activated, it is able to provide service implementations or use the service implementations of other Bundles within the framework. A service is a Java interface with externally specified semantics. This separation between interface and implementation allows for the creation of any number of implementations of a given service. When a component implements a service, the service object is placed in the service registry provided by the OSGi framework so that other Bundles can discover it. When a Bundle uses a service, this creates an instance-level dependency on a provider of that service. When the top-level application (e. g., E-Chalk) is exited, the states of all Bundles are saved and restored at the next startup. This decreases start-up time by avoiding having to resolve all Bundles again. Bundles can be uninstalled or updated while the application is running without ever requiring a restart of the application.
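For illustration, a BundleActivator that publishes a service into the OSGi service registry might look as follows. The service interface (the hypothetical AudioEncoder sketched above) and the implementation are invented, but the BundleActivator and BundleContext calls are part of the standard OSGi API.

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    // Hypothetical Bundle that publishes an AudioEncoder service on start
    // and removes it again on stop.
    public class ExampleActivator implements BundleActivator {

        /** Trivial stand-in implementation, for illustration only. */
        static class PassThroughEncoder implements AudioEncoder {
            public String outputFormat() { return "audio/raw"; }
            public byte[] encode(byte[] pcmSamples) { return pcmSamples; }
        }

        private ServiceRegistration registration;

        public void start(BundleContext context) throws Exception {
            registration = context.registerService(
                    AudioEncoder.class.getName(),   // service description (the interface)
                    new PassThroughEncoder(),       // this Bundle's implementation
                    null);                          // optional service properties
        }

        public void stop(BundleContext context) throws Exception {
            registration.unregister();              // remove the service from the registry
        }
    }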

Using the diagram in Figure 4.3, one can see what happens if Bundle A is uninstalled while it is needed by another Bundle B: Bundle A stays physically where it is stored until the next refresh operation. At the refresh operation, A is deleted and B is stopped. Stopped Bundles cannot be used but are still installed. If the user installs an update of Bundle A that B can use, B enters the resolved state and can be used again after the next refresh. A refresh command is invoked automatically after every install command by the SOPA framework.

4.6 Component Discovery

The OSGi specification allows the installation of Bundles from any URL. However, it is not able to remotely discover Bundles. In order to be able to query and locate Bundles and the services they are providing from remote locations, E-Chalk integrates the Eureka system.

Eureka [Pauls, 2003, Pauls and Hall, 2004] is a network-based resource-discovery service to support deployment and run-time integration of components into extensible systems. Eureka is based on Apple Inc.’s Bonjour networking technology [5], formerly named Rendezvous. Bonjour is an open protocol that enables automatic discovery of computers, devices, and services in ad-hoc, IP-based networks. The DNS/Bonjour infrastructure has features that fit well with the requirements of a component- or service-discovery service. For example, clients of a DNS/Bonjour-based resource-discovery service only produce network traffic when they actually make a query, and they do not need to know the specific server that hosts a given component in order to discover it. Domain names under which components are registered provide an implicit scoping effect (e. g., a query for components under the scope inf.fu-berlin.de produces a list of all components available in the computer science department of the Freie Universitat Berlin). A scope hosted by a server can be either open or closed.


An open scope allows arbitrary providers to publish their components into that scope. A closed scope requires a user name and password. To submit a component, meta-data and a URL from which the component archive file is accessible are provided to Eureka. The developer may also submit the component archive file itself. Then, the Eureka server will store it in its component repository and use its own HTTP server to make the submitted component accessible.

Eureka also provides a garbage-collection mechanism for component meta-data. A Eureka server periodically checks whether all components referenced by the meta-data in its associated DNS server are accessible via their given URL. If a component cannot be accessed, its meta-data is removed from the server.

The E-Chalk server system uses the scope sopa.inf.fu-berlin.de. This results in each service being referenced as 〈eurekaid〉.sopa.inf.fu-berlin.de. The identifier 〈eurekaid〉 is generated by Eureka – with the exception of www, which is reserved for the website of the project. The meta-data for media nodes is automatically generated by the SOPA framework and consists mainly of a set of Java properties. The publishing and unpublishing of nodes is reduced to specifying a download URL and a node name (see Appendix B).

4.7 Component Assembly

The SOPA framework is based on Oscar and Eureka. It uses them to achieve its goals and provides several services on top of them. The next sections provide a detailed explanation of the SOPA framework as integrated into E-Chalk.

4.7.1 Processing Nodes

Building a graph that combines individual filtering units for stream processing appears in many systems and can be considered canonical (see Section 4.2). The basic units in SOPA are called media nodes. There are six basic types of nodes: generic, sources, targets, forks, mixers, and pipes. The conceptual difference between them is their semantics, which is determined by the number of inputs and outputs.

• Source nodes have one outgoing edge and no incoming edges. A source node generates data or gets its data from anywhere outside the graph. This node type is typically used for accessing sound devices or video cards, or for file readers.

• Target nodes only have one incoming edge and no outgoing edges. A target node acts as a sink: It takes the incoming data and writes it somewhere. This node type is typically used for playback or for file writers.

• Pipe nodes inherit the properties of source nodes and target nodes. They have one incoming edge and one outgoing edge. This type of node is the most frequently used because it can be used to implement filters, converters, measuring devices, and much more.

• Fork nodes are pipe nodes that have two or more outgoing edges. They are helpful for branching the data flow in order to have several processing chains for the same input.


• Mixer nodes unify two or more incoming processing branches into one. Consequently, they have several incoming edges and only one outgoing edge.

• Generic nodes have neither incoming nor outgoing edges. However, they can communicate with the rest of the nodes in the graph via events. Generic nodes are meant to be extended for different purposes. Their typical use is as a receptor that captures information from outside the graph and reacts by rebuilding the graph as necessary.

Technically, the nodes are defined as abstract classes inside the SOPA framework. The developer has to define the final semantics of each node by inheriting from one of the six superclasses. To use the framework, a developer has to learn only a limited number of concepts in order to create his or her own nodes. Appendix B shows the overhead required for implementing a PipeNode. Since this node acts both as a target and as a source, the methods of both types of nodes have to be overwritten. In other words, the overhead doubles for this type of node. Nevertheless, no more than ten methods have to be implemented by the developer, most of them being one-liners.

The framework already defines a set of standard nodes. Standard nodes include testing methods and default implementations for frequently used functions. Examples include a pipe node that introduces bitwise noise into any stream that is passed through, a bandwidth delimiter, a traffic-measurement pipe, a compression pipe node that applies ZIP compression to any byte stream, a buffer pipe that caches any incoming data before it is passed through, a file-reading source, a source node that generates zeros as output, and a file-writing target. Several predefined nodes are useful for component assembly. The so-called IdentityPipe just outputs the incoming data, the BlackHoleTarget is a target node that discards all incoming data, the GenericFork copies any incoming data to all of its output ports, and the GenericMixer node mixes the multiple incoming contents into a single sequence on a first-in, first-out basis.
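The core of such a ZIP-compression pipe can be pictured with the standard java.util.zip classes; the following sketch shows only the compression step itself, not the SOPA node interface, and the class and method names are invented for this illustration.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DeflaterOutputStream;

    // Illustrative only: compresses one block of stream data with the standard
    // java.util.zip deflater, as a ZIP-compression pipe node might do internally.
    public final class ZipBlockCompressor {

        public static byte[] compress(byte[] data) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DeflaterOutputStream deflater = new DeflaterOutputStream(buffer);
            deflater.write(data);   // feed the uncompressed bytes
            deflater.finish();      // flush the remaining compressed output
            return buffer.toByteArray();
        }
    }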

Every class that inherits from MediaNode is a service as defined by the OSGi standard. This way, Oscar actually takes care of the component administration but is hidden from developers who do not want to fiddle with the OSGi system. Every node has a name and a version that identify it uniquely. It comes along with a set of properties as well as a preference-ordered list of processable formats. The formats are described using a FormatDescriptor, as explained in Section 4.7.4. Every node class provides methods that describe the properties needed for graph assembly. There is no need to create any extra file that contains meta-data for a particular node.

Nodes are configured via Java properties. SOPA features a central property-management system that builds upon Oscar’s property management, which in turn builds on Java’s property management. Properties can be made persistent and can quickly be restored upon a restart of the system. The E-Chalk Startup Wizard also generates property files that are read in by SOPA. Any node’s parameters, for example file names or sampling rates, are set this way. Nodes communicate via property-change events: Whenever a property changes, the property manager notifies all nodes. There are several predefined properties for certain events. Properties are set, for example, when new media nodes are initialized or new processing paths are started.
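The notification pattern behind such property-change events can be sketched with the standard java.beans classes; this is not the SOPA property manager itself, only an illustration of the mechanism, and the class and property names used here are made up.

    import java.beans.PropertyChangeListener;
    import java.beans.PropertyChangeSupport;
    import java.util.Properties;

    // Illustrative property manager: interested nodes register as listeners
    // and are notified whenever a property value changes.
    public class SimplePropertyManager {

        private final PropertyChangeSupport support = new PropertyChangeSupport(this);
        private final Properties properties = new Properties();

        public void addListener(PropertyChangeListener listener) {
            support.addPropertyChangeListener(listener);
        }

        public void set(String key, String value) {
            Object old = properties.setProperty(key, value);
            support.firePropertyChange(key, old, value);  // notify all registered listeners
        }
    }

    // Usage sketch: manager.set("audio.samplingRate", "44100") would notify
    // every registered listener of the changed value.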


Figure 4.4: A screenshot of a visualized media graph inside SOPA’s graphical node composition editor. The visualization is updated at runtime as the graph is updated. The editor can read and write XML graph serializations independently of the framework. The visualization shows a typical processing graph in the E-Chalk system containing independent audio and video paths.

At any time a node is in one of three states: constructed, initialized, or running. In the constructed state, a node has been constructed by first instantiating the class and then calling the start() method as required by OSGi. Unlike in the OSGi standard, in SOPA a service can be instantiated multiple times. This allows the use of several instances of the same node in different locations of the graph. In the initialized state, a node has been initialized after the graph has been resolved and is about to be started. During initialization, a node has to prepare everything for the immediate receipt of data. This state is introduced for synchronization; Section 4.7.5 explains synchronization in detail. Finally, in the running state, pipes, targets, mixers, and forks immediately receive data from their predecessor nodes. If stop() is called on a media node, the node goes back to the initialized state. This saves object disposal and construction (rather costly operations in Java), since a node may be reused in another graph location.

4.7.2 The Processing Graph

Component assembly is performed by arranging the media nodes into one or more directed acyclic graphs. In reality, the framework is able to handle several unconnected graphs at once, but for simplicity this text will refer to them as "the graph". The actual assembly and resolution algorithm of the graph is described in Section 4.7.3. Multimedia PC hardware, such as sound or video cards, mostly enforces a push paradigm (hard disks, however, are accessed using a pull paradigm). For this reason, data flows from a source node to a target node, with the source nodes pushing the data.

The media graph can be created and changed in two ways. A media node can use the methods provided by the framework, or the framework itself can load a serialized version of the graph. The framework can load or serialize the structure of the graph at any time. The serialization is stored in a simple XML format (see Appendix B).


Additionally, the framework provides a graphical editor for visualizing and building graph descriptions (Figure 4.4 shows a screenshot) and a command-line console for developers. The shell gives access to Oscar and Eureka functionality, such as installing and publishing Bundles, as well as to a few Java debugging features. Upon startup of the framework, the initial graph is always loaded from the XML description.

Each node is described by a temporary label (that can be chosen freely or is assigned randomly by the framework), its type, and an LDAP query [Howes, 1996]. The framework searches for nodes matching the LDAP query, first locally in Oscar's Bundle repository and then remotely using Eureka. The specification of nodes using LDAP queries allows incomplete descriptions, which enables system administrators to specify only the important properties and to include wildcards. Appendix B shows the grammar of the LDAP query language. The following listing shows some sample node descriptions.

<service label="source"
         match="(&(&(author=Friedland)(version>=1))(outputs=*RGB*))"
         type="&source;"
         target="display">
</service>
<service label="display"
         match="(&(&(author=Friedland)(version>=1))(name=TVPipe))"
         type="&pipe;"
         target="sink">
</service>
<service label="sink"
         match="(name=BlackHoleTarget)"
         type="&target;">
</service>

To deploy a running media graph, it suffices to copy an XML graph description, the framework itself, and optionally a few property files. Usually a Bundle repository is also included, so that the end user is not required to have an Internet connection at startup. Media nodes can be removed or replaced dynamically at any time as long as they are not in the running state. Forks and mixers, however, can handle new connections even when in the running state; an active path can therefore be tapped into by connecting a media node to a fork or a mixer. A media graph is in one of three states: defined, resolved, or active. In the defined state the graph only consists of node descriptions. In the resolved state the graph has been resolved as described in Section 4.7.3. When a graph is resolved, at least one path exists from a source to a target, where all LDAP queries have been evaluated to match certain media nodes and the input and output formats have been set for each media node in such a way that they build a processing chain. If a media graph description led to a resolved graph, the active state is reached by first initializing all non-sources of valid paths. Then the sources are activated in order to start delivering data. If there is an error during initialization of a node and there are still alternatives that also match the LDAP query, they replace the erroneous node. When running, sources continuously push the stream of data through the pipes to the target. A path of the media graph is deactivated by stopping its source. An event is then propagated saying that no further data is available, which makes the remaining nodes of the path shut down, too. After all activated paths have shut down, the media graph goes back to the resolved state.


4.7.3 Resolving the Media Graph

The graph resolution algorithm is the core of the framework. Because of this importance, developers might want to change its behavior. SOPA's graph resolution algorithm can therefore easily be exchanged by third-party developers who want to provide their own resolution. Apart from a class that implements an interface with the new resolution method, a new resolution algorithm might also need its own serialization, which can easily be changed by creating a new XML DTD (and providing the methods for reading and writing the serialization). Several different resolution algorithms have been implemented. One of the questions was whether the connecting edges have to be specified manually by the user. In the end, the following method seemed to be the most practical one and became the default behavior in the E-Chalk system. The DTD shown in Appendix B is used by this resolution method; it has been extended for better user editability, see Section 4.9. In the beginning, the graph consists only of a set of connected node descriptions. The graph is then resolved in two main steps.

In the first step, the SOPA framework tries to match the LDAP queries. Each query is matched against the properties that each media node propagates. Media nodes are searched locally and on the Internet using Eureka. If no node is found, the corresponding path cannot be resolved. If several nodes are found, they are stored as a list of alternatives considered in the next step.

In the second step, a list of matching media nodes belongs to each node description. The framework now tries to create a processing chain by substituting each node description by the media node that matches best with respect to its input and output format. Since the format list is preference ordered, the best fit is defined as the minimum index in the list. The source's output format and the target's input format are considered more important than the format preferences of other nodes. If there is an ambiguity, newer versions of media nodes are preferred.

Technically, the following steps are performed during resolution. The input is a serialization of the graph that already contains all edges.

1. Count the incoming and outgoing edges of all node descriptions. If a specified node type does not match the incoming or outgoing edges, discard the description and notify the user.

2. Use union-find to separate unconnected graphs. The following steps are performed for each connected set of nodes.

3. Try to match each LDAP query using the properties defined in the system and by the nodes, both locally and in the defined Eureka search scopes.

4. Associate each set of matching nodes to the node description.

5. Find a path from a source to a target such that:

• The sum of the preference-ordered format list indices of each node's used format is minimal. The index numbers of the source node and the target node are each weighted twice (a simplified cost computation is sketched after this list).

• If there is ambiguity, use the media node that has a higher version number.


• If there is still ambiguity, use the node that was found first (these are usually the local ones).

6. Test whether the path from source to target is complete. If yes, tag all nodes in the path as startable.
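To make the cost criterion of step 5 concrete, the following sketch computes the score of one candidate path. The data structures are hypothetical and far simpler than those used in the framework; only the weighting rule itself is taken from the description above.

// Hypothetical sketch: cost of a candidate path as the sum of the
// preference-list indices of the chosen formats, with source and target
// weighted twice. Lower cost is better; ties are broken by version number.
import java.util.List;

class CandidateNode {
    List<String> formatPreferences;   // preference-ordered, index 0 = best
    String chosenFormat;              // format selected for this path
    int version;
}

class PathCost {
    static int cost(List<CandidateNode> path) {
        int sum = 0;
        for (int i = 0; i < path.size(); i++) {
            CandidateNode n = path.get(i);
            int index = n.formatPreferences.indexOf(n.chosenFormat);
            boolean endpoint = (i == 0 || i == path.size() - 1);  // source or target
            sum += endpoint ? 2 * index : index;
        }
        return sum;
    }
}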

4.7.4 Identifying Media Formats

Data can only be exchanged between two nodes if they speak the same language, that is, if they work on data in the same format. Sometimes it suffices to describe a format using basic data types. For example, a node that applies ZIP compression to any incoming data can work on a byte-per-byte basis. Usually, however, nodes need to exchange a more detailed description of the structure of the data they have to deal with. The degree of detail provided, and the way this structural information must itself be structured, is difficult to standardize because it depends heavily on the format per se and on how it is to be handled. In the end, if a totally different, new media format is to be handled, a group of node developers will have to agree on an appropriate description.

In SOPA, media formats are distinguished by so-called format descriptors. Format descriptors are actually Java interfaces that provide get and set methods for certain format properties. The mechanism had already been established in Exymen; [Friedland, 2002a] discusses it in detail. The SOPA framework provides several default implementations of format and content descriptors. Examples include a generic descriptor for byte streams, a descriptor for uncompressed video formats, and a descriptor for raw audio formats. Each descriptor supplies several standard methods typically used to describe media content, such as the average frame rate, the duration of an individual frame, the name, time and space coordinates of a frame, and a so-called FormatID. Exymen uses FormatIDs as a mechanism to uniquely identify media formats, because file extensions are ambiguous and unreliable. Magic bytes in headers are not always used by formats, and it is sometimes unclear how to read them since they tend to be machine-dependent (for example, little- and big-endian representations). MIME types [Freed and Borenstein, 1996b] contain too little information because their primary intention is to define a mapping between format and application, not the classification of format types. Exymen's FormatIDs group compatible formats on the SDK/API level. Formats that are basically different but are handled by the same SDK/API are given the same ID. For example, a node that uses Microsoft's Windows Media SDK [34] can handle all audio codecs supported by the Audio Codec Manager (ACM). A list of the FormatIDs defined so far is available at [76].
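As an illustration of the kind of information such a descriptor carries, the following interface sketches a descriptor for raw audio. The method names are assumptions chosen for this example; they do not reproduce the actual Exymen or SOPA descriptor API.

// Hypothetical sketch of a format descriptor for raw audio data.
interface RawAudioDescriptor {
    String getFormatID();       // unique ID grouping API-compatible formats
    float getSamplingRate();    // samples per second
    int getBitsPerSample();
    int getChannels();
    boolean isBigEndian();      // byte order of the samples
}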

4.7.5 Synchronization

When different media are to be processed in parallel, they need to be synchronized in most cases. If content flows continuously at a constant data rate, synchronization is trivial since at any point in the stream the time position is clear. Given a certain stream position, any time position in the past or in the future can easily be extrapolated. When started at the same time, a video stream and an audio stream grabbed directly from a camera and a sound card can easily be kept in sync this way. Whenever there is a time difference, one stream can either wait or skip bytes.


However, if bit rates vary or a medium delivers event-based data, such as strokes from a chalkboard, one cannot easily get and interpolate the time position of a stream at any time. It is particularly impossible to predict any future time position. Since components may be combined freely, node developers are, of course, not required to handle synchronization on their own.

The first type of synchronization has already been described in Section 4.7.2. When the media graph is activated, all non-sources are initialized first. Then the sources are activated in order to start delivering data. This is done in order to be able to handle all possible errors as soon as possible and to achieve a roughly simultaneous start. Deactivation works the other way round: First all sources are stopped, then an event is propagated saying that no further data is available. Thus the remaining nodes can process the remaining data before they shut down.

In addition, the SOPA framework provides a content-independent synchronization scheme that works on a node-to-node level. Several nodes can be grouped together into a synchronization group. Nodes can be added to or removed from a synchronization group at runtime. The synchronization mechanism provided by SOPA has been derived from the well-known barrier synchronization scheme, which is described in detail in [Tanenbaum and van Steen, 2002]: "A barrier is a synchronization mechanism that prevents any process from starting phase n+1 of a program until all processes have finished phase n. When a process arrives at a barrier, it must wait until all other processes get there as well."

SOPA deals with threads instead of processes and extends the notion of a barrier in two respects.

1. Instead of a fixed number of barriers, the mechanism assumes an infinite number of barriers, all of which are ordered and identified by a natural number.

2. A thread is only blocked by barriers of the same synchronization group, and only if other threads declare themselves dependent on it at a certain barrier.

In the framework this new mechanism is called Progress-Constrained Threads, since it offers a notion for threads to dynamically depend on other threads' progress. Each Progress-Constrained Thread implements a special interface and registers with a central Clearance Manager. Each Progress-Constrained Thread implements a method which allows the Clearance Manager to obtain the dependencies of this thread on any other threads at a certain barrier. Threads have to request permission for progress by asking the Clearance Manager to proceed over the next barrier. The Clearance Manager asks all registered threads of a synchronization group for dependencies on other threads at this barrier. If a requesting thread depends on another, it is blocked until that other thread requests the same or a higher barrier. To avoid deadlocks, the Clearance Manager forces the requested barrier numbers to be monotonically increasing. The dependencies have to be constant for a certain barrier, but may change at later barriers. That means, if a certain barrier has been requested by a thread (and thus marked as reached), the synchronization group cannot be changed for this barrier or any previous barrier.
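The following is a much-simplified, hypothetical sketch of this blocking rule: a thread requesting barrier n waits as long as it depends on another thread that has not yet requested barrier n or higher. It ignores deadlock avoidance and dynamic group changes and does not reproduce the actual SOPA classes.

// Much-simplified sketch of the barrier idea; names are hypothetical.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ClearanceManager {
    // highest barrier requested so far by each thread
    private final Map<String, Long> requested = new HashMap<>();

    // deps: names of the threads that "name" depends on at this barrier
    synchronized void requestBarrier(String name, long barrier, Set<String> deps)
            throws InterruptedException {
        requested.put(name, barrier);
        notifyAll();                        // others may be waiting for this thread
        for (String other : deps) {
            while (requested.getOrDefault(other, -1L) < barrier) {
                wait();                     // block until the other thread catches up
            }
        }
    }
}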


Any set of media nodes can be grouped into a synchronization group, for example in the XML graph description. When a node is put into a synchronization group, the framework automatically registers the media node with the Clearance Manager. By default, the system associates barriers with time stamps using a granularity of 10 ms. The path to an ancestor node is blocked while a given media node has processed more data than the others according to the time stamps. However, other behavior can easily be implemented by overriding the dependency-definition method. Nodes have to decide for themselves what to do while blocked: Either they discard the incoming data, or they accumulate it in a buffer for later flushing.

The following practical example illustrates the approach: Assume that an audio track is to be synchronized with a pipe that presents a GUI with a pause button. The pause pipe overrides the default behavior so that it reports itself dependent on the other nodes of the synchronization group only while the button is pressed. The audio stream is then blocked until pause mode is released and the pipe requests progress again.

4.7.6 Top-Level Application

The SOPA framework is integrated into the E-Chalk system and is invisible to the user. As described in [Knipping, 2005], Section 3.10.1, any Java application that implements the de.echalk.util.Launchable interface can be started by the E-Chalk Startup Wizard. The framework is started and stopped by a wrapper class that implements this interface. Property files are generated by the E-Chalk Startup Wizard, and an initial serialization of a video and an audio graph is deployed with the E-Chalk system. In addition to the functions discussed here, the framework also provides some minor features that are not discussed in detail, for example an error and debug message management and a node update management. The update management is configured to determine when new versions of a node should be favored over ones already downloaded into the local repository during graph resolution. Usually update policies are defined inside the LDAP queries, but using the update manager they can also be enforced or forbidden. Further information on the details of the system can be found in SOPA's developer documentation [17].

4.8 Limits of the Approach

The presented architectural approach facilitates the maintenance and configuration of streaming and processing components, which are usually organized in graphs. The validity of the approach has been time-tested since its integration into the official E-Chalk distribution at the beginning of 2003. Still, several issues remain unimplemented or unsolved. This section discusses some of them.

The resolution algorithm depicted here works well with several dozen nodes. It was never intended nor tested for a system with more than a hundred nodes in one graph. Apart from efficiency problems concerning the graph resolution, the real-time performance of the sum of filters in a media graph is not guaranteed. Up to now the system installs components on demand without any knowledge of how much CPU time is consumed. Therefore, the combination of certain components may go beyond the limits of the underlying computer system.


The result would be a denial of service. One solution is to let each developer implement a benchmark for their components which returns a value relative to components that come along with the framework. On the very first run of SOPA, a large benchmark using built-in components would be run to figure out what the underlying system is able to handle. When running, SOPA would refuse to integrate further components into the graph if the sum of the benchmark results of the already integrated components exceeds the maximum. One problem of this approach is that CPU time consumption may depend on the semantics of the concrete content.

Security is a concern in any environment that supports the execution of arbitrary dynamically downloaded code. From the point of view of safety, the environment must be protected from dynamically integrated components causing harm (such as deleting files) to the underlying resources. From the point of view of privacy, the environment must be protected from components snooping or spying, such as inventorying all services being used. The OSGi framework actually provides a technique for dealing with security: It executes dynamically integrated components within a security sandbox. The sandbox is used to prevent unauthorized access to the underlying resources and to control the visibility of other installed services and resources. External rules enable the creation of security policies that can assign default access rights to dynamically integrated components, as well as assigning different levels of rights to components from known sources using a public-key cryptography approach. The approach, however, is rarely used because of performance problems. The most difficult aspect of security is finding mechanisms that are very simple or support automated decision making, since the typical end user is not very knowledgeable about security-related issues. End-user involvement in security-related decisions should be kept to a minimum to avoid confusion and mistakes.

Due to the generic approach of components that provide their functionality by means of services, the most important thing a node developer has to do is to define proper service contracts in the form of properties describing the node. An inherent problem that occurs when arbitrary parties want to share their component implementations is that all service contracts must be understood by everybody and must provide enough semantics to describe the service in all contexts and for all purposes. SOPA reduces this problem by restricting the context to a specific domain, namely multimedia signal processing. Due to this restriction, the syntactic interface descriptions enforced by the architecture, combined with the meta-data encouraged by the framework plus a few properties, are sufficient in most practical cases. In theory, however, it is very easy to describe the same component with totally different service contracts. The result is that components that could fit into a particular position of the graph might not be considered because service contracts are incompatible.

4.9 Practical Usage Examples

The following section presents a few practical scenarios that illustrate the usefulness of the architectural approach discussed in the preceding sections. The SOPA framework was integrated into E-Chalk in 2003. Since then, about a hundred nodes have been developed for both experimental and productive use. Both the audio system presented in this dissertation and the video system have been created using the framework and have been in use for several years now.


Figure 4.5: A screenshot of the audio panel of the E-Chalk Startup Wizard. The wizard outputs property files that directly affect the graph assembly.

Some of the presented advantages result from mere component orientation, others result from the functionality of remote component discovery, and a third class results from automatic graph resolution. In the end, the combination of the features discussed earlier offers even more opportunities.

As discussed in Chapter 2 of [Knipping, 2005], E-Chalk is mainly configured through the E-Chalk Startup Wizard. Figure 4.5 shows a screenshot. The output of the Startup Wizard consists of Java property files. Property files are hash tables that map a unique key, encoded as a string, to a value. Before SOPA was integrated into the E-Chalk system, each property value resulted in a case distinction somewhere in the Java code. Using the LDAP queries, many of these case distinctions are now handled by the graph resolution. In order to promote this feature, LDAP queries can also be used for case distinctions. Nodes that contain disabled functionality are simply not considered any more. The following code snippet illustrates this functionality for the VU-Meter checkbox:

<on match="(!(audio.tools.vumeter=true))">
  <service label="audioprocessors2"
           match="(name=BlackHoleTarget)"
           type="&target;">
  </service>
</on>

<on match="(audio.tools.vumeter=true)">
  <service label="audioprocessors2"
           match="(name=VU-Meter)"
           type="&pipe;"
           target="audiosink">
  </service>
  <service label="audiosink"
           match="(name=BlackHoleTarget)"
           type="&target;">
  </service>
</on>


As can be seen from the DTD shown in Appendix B, the standard XML serialization is extended for user editability. Several commands that are usually part of the LDAP queries have been externalized and have their own XML tags. This helps system administrators to edit the file directly. This turned out to be a useful feature when supporting users: They can individualize the processing graph according to their needs. By developing new nodes, the functionality of E-Chalk can be extended using the SOPA framework without fiddling with any E-Chalk source code. These extension capabilities go beyond traditional plug-in extensibility since even basic functionality can be exchanged.1 But not only third-party development is made easier: Students in our department found it useful to be able to integrate their own extensions directly into the system. The integration of Eureka into the framework, for example, facilitates deployment of updates and customized nodes. Assume an educational institution has several installations of E-Chalk that have been individually customized by a dedicated developer. The institution can now define its own scope in Eureka and deploy the newly developed nodes individually to all installations in the institution. Scopes can be defined so that experimental nodes are not searched for by E-Chalk installations in production environments. Nodes provided by the E-Chalk developers can still be searched for in the home scope sopa.inf.fu-berlin.de. The institution just extends the search domain to include its own institution.

The audio expert system presented in Chapter 7 guides the user through a systematic setup and test of the sound equipment. The result is a modified initial XML media graph description that contains a set of filtering services needed for pre- and postprocessing audio recordings. During the lecture recording, the system monitors and controls important parts of the sound hardware. This example shows that, using SOPA, not only users and system administrators but also programs can easily alter the configuration of processing graphs.

Both the configurability using XML and the possibility of reusing components facilitate fast prototyping and debugging of filter processing chains. For example, many of the experiments on the instructor extraction presented in Chapter 9 were conducted in a testing environment that allows media nodes to be discovered and executed. It was used to test experimental video nodes frame by frame. Figure 4.6 shows a screenshot.

In a typical live streaming scenario, a fork node connecting to different codec nodes can be used to convert a media stream into different formats. A receptor service (for example a generic node) receives the incoming connection request. It then assembles a media graph containing converter nodes that convert the format of the captured media to the format playable by the client software. Instead of forcing the user to install the right plug-in, the server adapts itself to the needs of the user.

Exymen shares the local repository with E-Chalk. Since the format description mechanisms in SOPA and Exymen are identical, a plug-in could easily be developed that acts as a wrapper between Exymen's plug-in API and SOPA's API. The plug-in enables Exymen to import and export any media format as soon as a conversion path can be built between the requested import format and a format Exymen can edit. For exporting, conversion pipes must exist for the currently edited format and the requested output format.

1One of the examples where this extensibility was appreciated was a request by researchers at the University of Regensburg who wanted to combine E-Chalk lectures with RealVideo.


Figure 4.6: A screenshot of the testing environment for rapid prototyping and debugging of video nodes on a frame-by-frame basis.

This way, Exymen adapts itself automatically to newly introduced streaming formats in E-Chalk.

4.10 Conclusion

The SOPA framework manages multimedia processing and streaming components organized in a flow graph. Based on state-of-the-art solutions for component-based software development, the system simplifies the assembly of multimedia streaming applications and associated tools. Components are discovered from interconnected remote repositories and integrated autonomously at runtime. Stream synchronization is handled format-independently. Most parts of the E-Chalk server system, namely the audio system and the video system, are based on the SOPA framework. The next chapter describes the implementation of the client system.


Chapter 5

Client Architecture

A key concept of E-Chalk is remote teaching over the Internet. This chapter is devoted to technical considerations on E-Chalk's distance teaching facilities and their remote replay. Currently, there are basically three ways to replay E-Chalk lectures: using Java-based client software, using traditional video players, and using MPEG-4.

5.1 Preliminary Considerations

Successful distance teaching requires awareness of the technical abilities and prerequisites of the targeted students. For example, when a participant has to download and install client software, the "psychological barrier" for following a remote lecture for the first time can be very high. Furthermore, it might not be a good idea to assume that all students have an Internet connection, let alone one with high bandwidth. A survey among engineering students in Berlin revealed that while 93 % had Internet access at home, more than half of them had to dial in with a modem [Friedland et al., 2004c]. Consequently, it is advisable to broadcast at different levels of quality and/or to split the content into different streams, providing the remote viewer with the choice to turn off individual streams.

5.2 The Java Client

In E-Chalk, remote listeners receive the board content and listen to the lecturer. In some of the lectures a small video of the teacher was transmitted, too. The small video screen does not deliver very much content-related information, but provides an impression of the classroom in order to achieve a certain "psychological closeness" to the classroom (see Chapter 8). Later, the small-video approach was abandoned in favor of a semi-transparent transmission of the segmented lecturer in front of the board. The reasons for this and the details of this approach form a large part of this dissertation, starting with Chapter 9.

Upon starting the E-Chalk project in 2000, there were three main reasons for the decision to create a purely Java-based Applet client. First, the rendering engine of the server could be reused in the client, thus making it easier to guarantee that the replay looked exactly like the server presentation [Raffel, 2000].


Figure 5.1: Conceptual diagram of the Java-based client replay architecture for live streaming. The content is transmitted directly over different TCP sockets.

Second, the World Wide Radio 2 Java-based audio client could be used as an audio replay facility (see Chapter 6). Last but not least, as discussed in Chapter 2, most lecture-recording tools require the remote learner to install special receiving software, usually designed as a browser plug-in. This introduces a psychological barrier for first-time users, compare for example [Nielsen, 1999]. Moreover, remote learners often do not have the skills or even the permissions (for example on campus computers) to install such client software.

As explained in [Knipping, 2005], Section 2.4.1, E-Chalk generates a directory tree containing the Java client and other resources for replay, including the web page that is to be displayed by the web browser. The E-Chalk client system internally has three modes of operation. In live mode, each client connects to its corresponding server through a socket connection. In on-demand mode, clients use an HTTP connection to receive the files and no E-Chalk server is needed. In local mode, the client just reads files from the local hard disk as they were stored in the directory tree. This last mode is often used by students to save connection costs when they watch a lecture at home. For easier download, the generated files are collapsed into one archive file.

The Java-based replay system consists of a set of independent receiver Applets that synchronize by communicating through the Media Applet Synchronization Interface (MASI). MASI can be used to synchronize several media Applets located in one web page to make them cooperate and deliver the same multimedia content. The E-Chalk system uses this to synchronize the audio, video, and board clients. The underlying concept of this interface is the notion of a frame. All methods that address offsets use frames as their basic unit. A frame is a certain amount of time. Frames are atomic in the sense that there are no fractions of frames. Every Applet has to provide methods to convert the abstract unit frame into concrete amounts of time. The amounts of time should be chosen to be as small as possible to allow optimal control of the streams. MASI provides around 26 methods for synchronization and concurrent stream control; these include methods for pausing, fast-forwarding, and rewinding as well as communication methods and error handling. For a detailed description of MASI see [16]. Although the text refers to "the client", the replay system consists of a board client, an audio client, a video client (for both small-window replay and overlaid instructor replay as explained in Chapter 9), a slide-show client, and a console client that provides VCR-like GUI control.
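To illustrate the frame-based synchronization idea, the following interface sketches the kind of methods such a contract could contain. These names are illustrative assumptions only; the actual MASI interface and its roughly 26 methods are documented in [16].

// Hypothetical illustration of a frame-based synchronization contract.
interface FrameSynchronized {
    long getFrameDurationMicros();   // length of one frame in microseconds
    long getCurrentFrame();          // current replay position in frames
    void seekToFrame(long frame);    // random seek to an absolute frame
    void pause();
    void resume();
}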


Figure 5.2: Conceptual diagram of the Java-based client replay architecture for remote on-demand replay. The content is streamed from files over HTTP.

Figures 5.1 and 5.2 illustrate the basic architectures for live and on-demand operation. Local mode works similarly to on-demand mode, with the exception that files are read directly from a local storage device. Figure 5.3 shows the Java-based replay client running in a web browser. All clients are backward compatible with Java 1.1 to achieve maximum compatibility. For example, on PDAs that use the Pocket PC operating system, Insignia's Jeode Java Virtual Machine can be installed, which implements most features of Java 1.1 Standard Edition. This already allows a Java-based playback of E-Chalk lectures. However, a lot of scrolling is required, since scaling has not yet been implemented in the board client. Figure 5.4 shows an example.

A detailed description of most of the features of the client Applets can be found in [Knipping, 2005], Chapter 7. The description here only summarizes the key aspects.

5.2.1 Board Client

The board client shares much of its code with the board server. The board client opens a window of the same size as the board server in the classroom. The strokes are rendered by the same code as in the server. If the resolution of the remote screen is smaller than the server screen, the user is still able to see the entire content by scrolling the board. The board can be scrolled manually in all four directions. The server also sends scroll events that may interfere with a client's scroll action. Therefore the user can choose between server-only scrolling, client-only scrolling, and a combined mode. The board server and client use a textual format to encode events; the bandwidth consumed depends on the sampling rate of the drawing device. In practice it varies between 2.5 kbit/s and 6 kbit/s, not counting any image or Applet data. Random seek to a specific time position is implemented using a fast redraw of the events from time position 0 to the desired position. An overview of the format can be found in Appendix C; a detailed discussion can be found in [Knipping, 2005], Section 4.11.


Figure 5.3: Java-based replay of an E-Chalk lecture in a regular browser. The additional video window is used to make the lecture appear more personal. Presenting the lecturer in a second window has several disadvantages that are discussed in Chapter 9.

5.2.2 Audio Client

The current audio client is able to decode the World Wide Radio 2 (WWR2) and World Wide Radio 3 (WWR3) formats. The format is a packetized ADPCM variant. The bandwidth needed depends on the selected quality and varies between 15 kbit/s and 256 kbit/s. The details of the audio format are presented in Appendix D. Depending on the installed Java version, the client chooses to play back the audio at a sampling rate of either 8 kHz (Java 1.1) or 16 kHz (Java 1.3 and higher) with 16 bits per sample.1 Random seek depends on the mode of operation. In live mode, random seek is impossible; the client just plays the stream back as it comes in. In local mode, random seek is provided using operating system file I/O operations. Because the sizes of the compressed audio packets are not constant, WWR3 uses an index file that contains a list of offsets, each pointing to the beginning of a packet. If this index file is missing (or an old WWR2 file is played back), the client has to start from the beginning of the compressed audio file, iterate over each packet header reading the packet sizes, and then skip over the actual packets. When an audio file is to be played back remotely over HTTP, sometimes even skipping is not possible. The only option is to transmit the entire compressed file from position 0 to the desired position. Even though browser caching might help, this random seek method is very inefficient. The reason is that even though it had already been specified for HTTP 1.1 [Fielding et al., 1999], it took until about 2003 for HTTP server daemons to implement the GET command with the ability to specify a start offset (so-called partial GET). In other words, only entire files could be retrieved. Today, all the frequently used HTTP servers allow specifying byte-range offsets for file retrieval, thus enabling an efficient random seek. The audio client then uses the index file to directly retrieve the desired packet. The audio client serves as a synchronization master, which means that audio replay is never suspended unless bandwidth restrictions or network congestion interfere with audio replay.

1Java 1.2, for a short time, did not support any audio playback in Applets.


Figure 5.4: Java-based replay of an E-Chalk lecture on a mobile device running the Pocket PC operating system (formerly Windows CE).

Board, video, and slide-show clients pause when they are too fast or skip (if possible) when they are too slow. An interruption of the audio replay is perceived to be more distracting than most visual distortions [Tellinghuisen and Nowak, 2003].
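The byte-range seek can be illustrated with a short sketch: the offset of the desired packet is looked up in the WWR3 index file, and the HTTP request is issued with a Range header so that the server starts delivering at that packet boundary (a partial GET). The class and the way the offset is obtained are hypothetical; only the HTTP mechanism itself is standard.

// Hedged sketch of a random seek over HTTP using a byte-range request.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

class RangeSeek {
    // byteOffset would be taken from the WWR3 index file for the desired packet
    static InputStream openAtOffset(String url, long byteOffset) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Range", "bytes=" + byteOffset + "-");
        conn.connect();                    // a capable server answers 206 Partial Content
        return conn.getInputStream();      // stream starts at the requested offset
    }
}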

5.2.3 Video Client

The video client is very similar to the audio client. The compression uses JPEG compression in combination with a simple lossy motion compensation. The implementation of the client is straightforward because Java Applets rely on browser functionality, and one of the main capabilities of web browsers is to decompress JPEG files. As the Java Applet SDK directly uses the browser's rendering engine to decode JPEGs, they can be decompressed very efficiently.

A small technical intricacy concerns memory usage. Like the other clients, the video client needs a buffering strategy to guarantee continuous operation. However, many browsers restrict memory usage for Applets such that the available memory would not last for caching several seconds of video data. A dynamic cache was implemented that grows and shrinks with the amount of memory available.

The client works in two modes. Either it opens a window to play back the video, or it gets a DrawPanel from the board client. The latter mode enables the replay of a semi-transparent instructor video in front of the board (compare Chapter 9). Figure 5.5 shows a screenshot. The transparency of the instructor can be adjusted using an HTML parameter. When the instructor is overlaid on the board, the color black is interpreted as transparent and the video is scaled up to fit the resolution of the board. Semi-transparent drawing, however, seems to be a Java bottleneck at the time of writing this dissertation. For example, using a board resolution of 1024 × 768, the client does not achieve frame rates of more than five frames per second on a 3-GHz Intel Pentium 4.
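The overlay idea can be sketched as follows: black pixels of the instructor frame are treated as fully transparent, the remaining pixels are blended onto the board image with a global alpha value, and the frame is scaled to the board resolution. This is only an illustration using standard Java2D calls, not the actual client code.

// Hedged sketch of overlaying the instructor video onto the board.
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class InstructorOverlay {
    static void drawOverlay(BufferedImage board, BufferedImage instructor, float alpha) {
        BufferedImage keyed = new BufferedImage(
                instructor.getWidth(), instructor.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < instructor.getHeight(); y++) {
            for (int x = 0; x < instructor.getWidth(); x++) {
                int rgb = instructor.getRGB(x, y) & 0xFFFFFF;
                // black acts as the transparency key
                keyed.setRGB(x, y, rgb == 0 ? 0 : (0xFF000000 | rgb));
            }
        }
        Graphics2D g = board.createGraphics();
        g.setComposite(AlphaComposite.getInstance(AlphaComposite.SRC_OVER, alpha));
        // scale the instructor frame up to the board resolution while drawing
        g.drawImage(keyed, 0, 0, board.getWidth(), board.getHeight(), null);
        g.dispose();
    }
}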

Since video streams consume CPU and connection resources, the client can close the video stream at any time in order to save bandwidth and processor time.


Figure 5.5: The instructor video played in front of the board using the Java client. The approach is discussed in Chapter 9.

The required bandwidth for video transmission ultimately depends on the contents of the video stream. A 192 × 144 video stream at four frames per second needs roughly 64 kbit/s, which means approximately 16 kbit per picture. Random seek is implemented by directly skipping to a specified position (if possible, see Section 5.2.2) and beginning to play back from the next frame found. The codec has been built such that motion compensation is rather self-repairing (see Chapter 8 for details).

5.2.4 Slide-show Client

The E-Chalk slide-show client is a part of the E-Chalk system that has no corresponding server component. The component had originally been integrated for historical reasons as it was part of the World Wide Radio 2 system. Due to user demand it has remained part of the E-Chalk system. E-Chalk slide-shows can be generated by using Exymen to convert an HTML-exported PowerPoint slide-show into an E-Chalk slide-show. Using Exymen, slide-shows can also be generated from scratch and then edited [Friedland, 2002a]. The slide-show client gets a list of events that contain a timestamp, a URL, and a target frame. A slide-show event at some point during a lecture triggers the browser to open a URL. The URL can point to any content, such as HTML pages, images, animations, or video files. The target frame is used to specify the frame in the browser where the web page should pop up. The slide-show client, like all others, synchronizes itself with the other clients using the MASI interface. An example of an E-Chalk slide-show lecture generated with Exymen can be seen in Figure 5.6.
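In an Applet context, acting upon such an event amounts to asking the browser to open the URL in the given target frame. The following sketch is hypothetical (the field names and the fire method are invented for illustration); only the AppletContext.showDocument call is standard Java.

// Hedged sketch of a slide-show event and how it could be acted upon.
import java.applet.Applet;
import java.net.URL;

class SlideShowEvent {
    final long timestampMillis;   // when to fire, relative to lecture start
    final URL url;                // content to open (HTML page, image, ...)
    final String targetFrame;     // browser frame in which to open it

    SlideShowEvent(long timestampMillis, URL url, String targetFrame) {
        this.timestampMillis = timestampMillis;
        this.url = url;
        this.targetFrame = targetFrame;
    }

    void fire(Applet applet) {
        // ask the surrounding browser to open the URL in the target frame
        applet.getAppletContext().showDocument(url, targetFrame);
    }
}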


Figure 5.6: An example of a slide-show lecture played back with the E-Chalk client.

5.2.5 Console

The console is only available for on-demand or local replay. The Applet is a GUI interface that uses MASI to provide VCR-like operations to the user. The console allows a direct seek to a certain time position as well as a relative seek with operations similar to fast-forward and fast-rewind. The console also allows a lecture to be paused and continued. Although a lecture replay can easily be terminated by closing the browser or going to a different URL, closing the console also terminates all clients. The console also provides feedback to the user: It shows the current time position, the length of the lecture, and the time remaining until the end of the lecture. In order to allow universities to customize the display panel to their corporate identity, the console supports different GUI themes [Knipping, 2005].

5.3 Playback as Video

In addition to Java replay, E-Chalk lectures can also be stored as regular video files by instructing the SOPA framework to use the appropriate nodes during lecture recording. Lectures can also be converted offline using the E-Chalk2Video converter. Both methods use the Java Media Framework to encode the content using codecs that are provided either by the framework or by the operating system. Depending on the chosen format, lectures can be played back using any standard multimedia player such as the QuickTime Player, RealPlayer, or Windows Media Player. The content is encoded by internally rendering the board content (plus the optional video of the instructor) using a frame-buffer client. Each internally rendered frame is then passed to a video codec.

When only board data (plus audio) is to be encoded, most video formats provide only very bandwidth-inefficient storage. Video codecs use a frame-by-frame encoding. This results in the stroke data being converted from vector format to pixel format. Even though motion compensation accounts for redundancies, the required storage space is still several orders of magnitude larger (see Section 5.5).


Figure 5.7: Original rendered chalkboard picture (left) and image showing the typical artifacts resulting from quantizing the higher-frequency coefficients of a DCT-transformed image for low-bandwidth transmission (right).

Vector format storage is not only smaller, it is also favorable because the stroke semantics is preserved. After a lecture has been converted to video, it is, for example, not possible to delete individual strokes or to insert a scroll event without recalculating and re-rendering large parts of the video. Another disadvantage concerns the way most traditional video codecs work. Most often, lossy image-compression techniques are used that are based on a DCT or wavelet transformation. The output coefficients representing higher-frequency regions are mostly quantized because higher-frequency parts of images are assumed to be perceptually less relevant than lower-frequency parts (see for example [ISO/IEC JTC1, 1993, ISO/IEC JTC1 and ITU-T, 1996, ISO/IEC JTC1 and ITU-T, 1999]). These and similar techniques (for example vector quantization as in [19]) work for most images and videos showing natural scenes, where a slight blurring is perceptually negligible. For vector drawings, such as electronic chalkboard strokes, however, blurred edges are clearly disturbing. Figure 5.7 shows the typical artifacts resulting from frequency quantization applied to an electronic chalkboard drawing. Specialized screen capture codecs (such as the one used in Section 5.5) are able to compress board strokes well with few artifacts. However, images or colorful Applets inserted into the board are either reproduced in an unacceptably low quality or result in an unacceptably low compression rate.
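For reference, the quantization step responsible for these artifacts can be written as in standard DCT-based coders (a generic textbook formulation, not specific to any of the codecs named above):

    F_q(u,v) = round( F(u,v) / Q(u,v) )

where F(u,v) is the DCT coefficient at spatial frequency (u,v) and Q(u,v) is the corresponding entry of the quantization matrix, which grows with frequency; the coarse quantization of high-frequency coefficients is exactly what blurs the sharp edges of board strokes.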

In order to work around the disadvantages of inefficient compression, disturbing artifacts, and the loss of semantics, Stephan Lehmann has developed a converter and a plug-in for the Windows Media Player that allow the replay of board data simultaneously with audio. The converter is an experimental command-line tool that encapsulates the board events into ASF files. ASF stands for Advanced Systems Format and is a proprietary container format which is part of the Windows Media Platform (see Section 2.3). The audio data is encoded using Windows Media Audio. The generated file can be played back with Windows Media Player using the plug-in developed by Stephan Lehmann. Random seek is implemented by the player and works just as in the board client. The implementation (Figure 5.8) only served as a proof of concept and replay is limited to strokes and images. Live streaming is not possible.

When the board data is to be transmitted in combination with an overlaid instructor, however, traditional video codecs provide an alternative. Although the instructor image usually takes up only about one third of the board surface, it consumes most of the bandwidth.


Figure 5.8: A plug-in developed for Windows Media Player makes it possible to view E-Chalk lectures without losing the vector-based storage format.

Since the instructor image has the properties of a natural scene, applying the strategies of traditional video encoders results in an adequate bandwidth reduction (see Section 5.5). Still, the board semantics is lost, and editing board contents in such a lecture is not straightforward. Deleting or inserting board events, such as strokes or scroll events, in an E-Chalk lecture with overlaid instructor may result in a confusing replay because the lecturer's arm movements no longer match the created strokes. Synchronized timeline-based editing can still be done, but this functionality is also provided by standard video editing tools. Moreover, encoding E-Chalk lectures using an MPEG-4 video codec is not yet possible in real time, and in my experiments many players had problems showing a 1024 × 768 video with overlaid instructor at 25 frames per second (again an Intel Pentium 4 with 3 GHz was used).

A general advantage of video-based replay versus Java-based client replay is that many tools are available for conversion and processing. This is especially useful for the replay of E-Chalk lectures on handheld devices. PDAs, mobile phones, and iPods are able to play back different types of video formats. Manufacturers often ship the appropriate conversion and processing tools for their device along with other accessories. The tools encode and scale any operating-system-supported video format down for playback on the small device. The quality of the final replay depends on the quality of the video scaling and on the properties of the device's display. The same is true of the ability to randomly seek into the lecture. Live replay does not yet seem possible, and the conversion speed is far from real time. My experiments have shown that the tools provided around the ITU-T video standard H.263-2000 [ITU, 2000] and the RealPlayer on Symbian OS work very well for E-Chalk. The bandwidth consumed for a lecture with audio and board (no instructor video) is about 16 kbit/s. A video containing audio, board, and overlaid instructor needs 64 kbit/s. This means that a 90-minute lecture takes about 10 MB or 40 MB, respectively, and can easily be stored on a 64 MB or 128 MB SD memory card.


Figure 5.9: E-Chalk replay using the video capabilities of handheld devices. Left: A Symbian OS-based mobile phone. The resolution is 176 × 144 pixels. Right: A video iPod.

The display resolutions of mobile phones are usually very low. As a consequence, complex images or small strokes sometimes disappear. Random seek is mostly not supported. Apple's iPod supports playback of several movie profiles of MPEG-4 [ISO/IEC JTC1 and ITU-T, 1999] using a screen resolution of 320 × 240. The converted E-Chalk lectures have a fairly good quality, and a 90-minute lecture (including audio and overlaid instructor) uses about 90 MB of storage. This makes them easily portable on this device (current video iPods come with 60 GB of storage space). Random seek is supported, too. Figure 5.9 shows E-Chalk lectures played back on a mobile phone and on a video iPod.

5.4 MPEG-4 Replay

When new techniques for video storage and compression are discussed, the video standard that is most often mentioned is MPEG-4. MPEG-4 [ISO/IEC JTC1 and ITU-T, 1999] is the successor of the MPEG-1 [ISO/IEC JTC1, 1993] and MPEG-2 [ISO/IEC JTC1 and ITU-T, 1996] standards and extends them in many ways. For the purpose of encoding chalkboard events, it contains an interesting part called Binary Format for Scenes (BIFS) [ISO/IEC JTC1 and ITU-T, 2005]. BIFS includes support for the vector-based storage of 2D and 3D scenes, as well as some interactivity. The following paragraphs provide a few technical details on the storage format before discussing the advantages and disadvantages of using MPEG-4 for the replay of E-Chalk lectures. Figure 5.10 shows E-Chalk lectures played back using MPEG-4 players. For a short description of the board format and some sample mappings to MPEG-4, please refer to Appendix C. Further work on the conversion of E-Chalk lectures to MPEG-4 is presented in [Jankovic et al., 2006].


Figure 5.10: Two lectures played back using MPEG-4, using Osmo-Player (left) and using the IBM Java-based player (right).

5.4.1 Encoding E-Chalk Lectures in MPEG-4

As its name implies, BIFS is a binary representation that has to be compiled from a user-editable source format called Extensible MPEG-4 Textual (XMT). XMT has two levels: the XMT-A format and the XMT-Ω format. XMT-A is a low-level representation that can easily be mapped to BIFS, while XMT-Ω is a high-level format that uses a subset of the tags defined by SMIL [67]. XMT-Ω compiles to XMT-A and then to BIFS. Although XMT is specified as the BIFS source format by the ISO standard, its biggest downside is that it is an XML-based format. XML files can only be parsed entirely, since the document opening tag has to be closed by the document ending tag at the end of the file. This makes it impossible to compile XMT files incrementally for live streaming, although BIFS is by itself a streamable format. A solution has been provided by the authors of the GPAC framework [78], developed at the Ecole Nationale Superieure des Telecommunications (ENST) in Paris. The format is called BIFS Text (bt) and is a non-XML-based exact transcription of the BIFS stream. Some users also prefer this format for better readability, as the bt document architecture is very similar to XMT-A and the syntax is close to VRML [ISO/IEC JTC1, 1997].

BIFS uses trees as basic structures. As a consequence, the basic structures are nodes and there are two types of them: group nodes and leaf nodes. Group nodes link to a subtree, and leaf nodes are final. Once a node has been defined, its properties can be changed at any time or the entire node can be removed. Mapping the E-Chalk board events to BIFS is straightforward in most cases. Timestamps are directly supported by BIFS: Every node definition or command can be preceded by an At <timestamp> command which has the same semantics as E-Chalk's event timestamps. After an obligatory InitialObjectDescriptor, which defines some basic parameters such as the video size (in this case the board size), the main scene is described by a group node of type Transform2D that forms the root. All other nodes, including a node that defines the background color, are added to this group node. The node has a property called translation. Changing the value of this property results in a change of the top-left position of the group node. Since all other nodes are children of this node, their position is also changed. Because the viewpoint of the player stays constant, this is a very easy way to implement E-Chalk's Scrollbar event. RemoveAll events can be implemented by deleting the group node defining the scene, which also triggers the deletion of all its child nodes.


Figure 5.11: E-Chalk client-based replay (left) and the MPEG-4 replay in Osmo-Player (right). The output differs because MPEG-4 players use stronger antialiasing and draw angles of connected line segments differently.

After this, a new group node is defined. Undo events can be implemented by deleting the last inserted node, redo by again defining and adding the last deleted node. Images can be placed directly into the board by adding a rectangle object at the appropriate position and then placing the image on top of it as a texture. Later image updates are directly supported by BIFS as any node can be updated, so Applet replay can be easily implemented. Text events can be inserted using a TextNode. However, it is impossible to guarantee that the appearance is identical to the board server appearance, because Java fonts may look different from the fonts the MPEG-4 player is using. Typing text is mapped by consecutively changing the string presented by the text node. In order to draw strokes, connected line segments defined by the Form$Line events are consolidated into a polyline and then drawn as an IndexedLineSet2D. However, the output in the player is not pixelwise identical to the rendering output of the Java-based E-Chalk client. The main reason is that MPEG-4 players use stronger antialiasing and draw angles of connected line segments differently. Figure 5.11 shows a comparison. Finally, the output differs from player to player because every player uses slightly different rendering methods. Video and audio tracks are added by adding appropriate nodes to the root of the tree (Appendix 5.4 shows an example). Overlaying the instructor video is directly supported since layers of different content are a key concept in MPEG-4. The video can be scaled or put at different positions on the board. The transparency of the video can be either hard-coded or, using a TouchSensor, interactively controlled by the user during playback.

5.4.2 Practical Experiences

In theory, the MPEG-4 format seems to provide a very good representation for the storage and transmission of E-Chalk lectures. In our experiments, however, we encountered several disadvantages. Although many programs are available that are described as capable of playing back MPEG-4 content, most of them, for example the QuickTime Player, RealPlayer, Windows Media Player, or Apple's Video iPod, only support movie profiles and are not able to play back BIFS content.


We identified three players that are capable of playing back BIFS content in combination with audio and video: the Osmo-Player, which is part of the GPAC Framework; a Java-based player called M4Play, which is part of the IBM MPEG-4 Toolkit [81]; and a plug-in for Windows by Envivio [75] that adds this functionality to the Windows Media Player and the QuickTime Player.

Neither the Osmo-Player nor IBM's Java-based player supports random seek for BIFS content. Fast-forwarding and rewinding can only be implemented manually using a TouchSensor and XMT-Ω's accelerate, decelerate, or reverse commands. Video codecs generally do not support so-called α-transparency, which means that they work with a three-byte RGB representation for each pixel and not with a four-byte α-RGB encoding like the graphics hardware. For this reason, encoding transparency in the instructor video itself is currently impossible. However, MPEG-4 supports tagging colors as transparent. Shades of transparency, as needed for sub-pixel-accurate segmentation (see Chapter 10), cannot be used. For encoding E-Chalk lectures, black is tagged as transparent. Osmo-Player, however, does not yet support the transparency tag. The player uses a different strategy: The video is overlaid onto the board by pixelwise mixing the colors of the two layers. This results in a darkening of the board strokes (mixture with black) in the areas not occluded by the instructor and other undesired effects. M4Play supports transparency; however, as it is Java-based, the IBM player drops many frames when playing back such a video. It has the same problems as the Java-based E-Chalk client (see Section 5.2.3).

Streaming of MPEG-4 content over HTTP requires the conversion from plain MPEG-4 to a format called m4x. These files contain so-called “hints” that enable partial playback of MPEG-4 files. The conversion itself, however, does not work incrementally. In other words, streaming of MPEG-4 files is only possible after lecture recording has been completed. Although the GPAC Framework allows incremental compilation of BIFS content using BIFS text, it does not yet allow for incremental creation of the audio and video track. Live streaming is usually done using the Realtime Transport Protocol (RTP) [Schulzrinne et al., 2003], and MPEG-4 supports the streaming of BIFS content using so-called BIFS commands. Although the computational needs for the generation and conversion of MPEG-4 BIFS would easily allow it, there is no program or framework available yet that supports live encoding and streaming of content that consists of BIFS, audio, and video. One reason for this might be the license policy [88] connected to the MPEG-4 standard, which requires every application generating MPEG-4 content to pay a royalty fee. The policy has often been criticized as being the primary reason for the slow adoption of the standard, see for example [87].

5.5 A Note on Bandwidth Requirements

The following short experiment is intended to support the discussion of the possible approaches for E-Chalk lecture replay. A sample lecture [45] was converted into different formats. The lecture contains only chalkboard strokes, no images, no texts, and no Applets. The lecture was held using a digitizer tablet with a high resolution and a relatively high sampling rate, thus generating many events (see [Knipping, 2005], Section 4.11, for a discussion). The experiment was conducted in two runs, the first run containing no video and the second run containing an overlaid instructor video. The resolution of the board is 1024×768, and the total length of the lecture is 1 hour, 37 minutes. The lecture was converted from E-Chalk's regular event format (see Appendix C) to MPEG-4 vector format (BIFS), to Windows Media Video using a frame-by-frame screen-capture codec, and to pure MPEG-4 video format. All videos were converted using 10 frames per second. In our experience, this frame rate mostly provides an appropriate tradeoff between bandwidth requirements and video quality when playing back electronic chalkboard lectures using video, because board drawings usually do not appear at a very high speed. Moreover, recent studies have confirmed that animations appear smooth even at lower frame rates when they are supported by an audio track (see for example [Mastoropoulou et al., 2005]). The video files were converted with the lowest possible bandwidth that did not result in any visible artifacts during replay. Of course, conclusions drawn from the presented figures have to take into account that “no visible artifacts” is sometimes a subjective measure. The presented figures provide an indication of the amount of data that is generated when encoding the same content in different formats, in terms of orders of magnitude.

Format                            Board only    Video overlay
E-Chalk format                    1,861 kB      277,102 kB
MPEG-4 BIFS                       1,662 kB      57,839 kB
Windows Media ScreenCapture v9    2,873 kB      inapplicable
Microsoft MPEG-4 v2               44,251 kB     147,367 kB

Table 5.1: Comparison of file sizes resulting when representing electronic chalkboard content alone and with overlaid instructor in different formats. Please refer to the text for a description of the experiment.

Table 5.1 shows the results of the two runs. When only the chalkboard is encoded, it can be seen that, although the original E-Chalk board event file is encoded in user-readable ASCII format without any entropy coding, it is still a very efficient format. Giving up user readability and applying entropy encoding would bring the storage requirements of a 97-minute lecture down to less than half a megabyte (for example, a zipped version of the file resulted in a file size of 358 kB). Representing the board strokes using MPEG-4 BIFS (as described in Section 5.4) results in storage demands of about the same order of magnitude. The output of screen-capture codecs is only a bit larger for this lecture. In general, their quality does not match the other formats described here. Images in particular show many artifacts, and animations are sometimes messed up when the screen is captured in the middle of a redraw. Using state-of-the-art movie codecs results in file sizes that are several orders of magnitude higher. It is evident that for big parts of the video file the perceptually relevant information between two frames actually only consists of a change in several pixels. However, even if the representation is sampled down to ten frames per second, the techniques used by traditional video encoders do not yield acceptable compression results.

In the second run of the experiment, the same lecture was encoded again, but with a semi-transparent, overlaid instructor. As described in Chapter 9, the instructor is recorded using a resolution of 640×480 and then extracted. Both the E-Chalk Java client and MPEG-4 players that are able to play back BIFS content receive the segmented video in the original resolution and scale it up to fit the board replay. For replay using the Windows Media ScreenCapture codec or regular MPEG-4 movie players, the segmented instructor video has to be scaled up during conversion. As a result, more data has to be stored and transmitted for replay (because the codecs are not scaling-invariant). As can be seen from the figures in Table 5.1, adding an overlaid video of the instructor gives different results because the instructor video takes up the biggest part of the data. The E-Chalk format takes up the most storage space because it is a very simple format (described in Chapter 8) that is aimed at computational efficiency. The result of encoding MPEG-4 BIFS plus video depends mainly on the video codec used. For this experiment, Microsoft MPEG-4 v2 was used. The resulting file is smaller than when the movie codec is also used for board transmission, because the video is encoded at 640×480 and the players scale the video up. The results of the Windows Media ScreenCapture codec were of intolerably low quality because it applies a very strong color quantization. The coding strategies of traditional video encoders, such as the MPEG-4 video profile, offer good tools for compressing the data when an instructor is to be transmitted too. The table shows the results without audio track. The audio track adds another 8 to 46 MB to each file (about 10-64 kbit/s), depending on the codec that was used. As a result, each of the alternatives shown can easily be played back using a DSL or cable connection.

Figure 5.12: MPEG-4 players generally allow scaling replay content to any size. The example shows the Osmo-Player scaled to three different sizes.

5.6 Conclusion

Most distance education projects still use traditional video encoders for both chalkboard and slide contents (compare Chapter 2). Although screen-capture codecs give better compression results for board-only lectures, vector-format storage is still more efficient, does not produce any compression artifacts, and preserves board stroke semantics. The required video players as well as the codecs are not always installed and force listeners to perform downloads and installations. The only reason to convert E-Chalk lectures without overlaid instructor into traditional video formats is enhanced compatibility with a broad range of players, especially on handheld devices, or to distribute lectures on DVD. E-Chalk lectures with overlaid lecturer can be compressed better using movie codecs. In practice, however, the effect is negligible since a DSL or cable connection is required for receiving a lecture with overlaid instructor, no matter which format is used. A separate transmission of the three streams still allows switching off individual streams for connections that do not provide the required data rates, as well as other features that need lecture semantics, for example dimming the transparency of the lecturer or scrolling the board independently of the replay. MPEG-4 BIFS allows encoding E-Chalk lectures properly without any loss of event semantics. MPEG-4 BIFS players support scaling (see Figure 5.12), and MPEG-4 BIFS editing tools would even allow for post-processing of lectures. In theory, MPEG-4 BIFS would render both the Java-based E-Chalk client and Exymen's E-Chalk support obsolete. In practice, however, there are too few implementations of the standard and they still have too many technical problems. The Osmo-Player by GPAC comes as a source package and has to be compiled prior to first use because of licensing issues, and Envivio's plug-in is a commercial product and not freely available. The IBM toolkit appears promising at first glance because M4Play is available as a Java Applet that would not require the user to perform any downloads or installations. However, a license fee has to be paid for regular use and distribution of the toolkit or parts of it. Because of the lack of support for random seek, the player is just not sophisticated enough to be an alternative to the E-Chalk client. At the time of writing, a completely self-developed MPEG-4 encoder, server, and replay client seems to be the only viable solution for using the MPEG-4 standard properly. The main disadvantage of the self-developed E-Chalk client is that it has to be maintained and kept compatible with any future browsers and Java versions. The advantage, on the other hand, is that the underlying formats can be kept simple and the operational requirements for the user can be kept low. The remote viewer can turn off individual streams, and the minimal bandwidth requirements can be fulfilled by analog modems. The Java client does not require any explicit download or installation, and random seek is efficiently supported. When E-Chalk's format changes, a new client can automatically be provided for the remote viewer without notice. However, playing back a lecture with overlaid instructor without dropping frames is still a problem for current home computers.

Thus, replay using traditional video formats makes sense for special applications such as supporting handheld devices. When an instructor is semi-transparently overlaid, video replay provides a workaround to allow higher replay frame rates. The self-developed E-Chalk client is the only way to cope with user demands (compare Section 5.1) for distance teaching and will remain the best option until MPEG-4 becomes mature enough to provide a practical alternative.


Chapter 6

Audio Storage and Transmission

This chapter briefly describes E-Chalk's audio part. It presents the evolution and the problems that led to the design decisions for the current realization. It provides an overview of the implemented components before Chapter 7 presents a detailed explanation of the recording methods.

6.1 Evolution of E-Chalk’s Audio System

The ancestor of E-Chalk’s audio system is the World Wide Radio system thatdates back to early 1997. During almost 10 years, the system transformed itselfto adapt to many technical changes and user demands. This section summarizesthe evolution of E-Chalk’s audio system by presenting the concepts and maindesign decisions of the different versions. The history of the system helps tounderstand some of E-Chalk’s main design decisions as presented in Chapter 3.

The history of the audio system is primarily driven by the specialization from a generic Internet audio broadcasting system to a solution for transmission and archiving of the voice of an instructor lecturing with an electronic chalkboard. Thus, looking at the history of the system also provides some hints about the differences between generic Internet audio broadcasting systems and the solutions needed for remote lecturing.

6.1.1 The World Wide Radio System

The World Wide Radio System (WWR) [Friedland and Lasser, 1998], developed in 1997, aimed at providing a software solution for using the Internet as a broadcasting medium for traditional radio stations. The idea behind the project was that the Internet would make it easier for more people to create their own radio station without the costs of leasing a radio frequency slot. In particular, transmissions that were of interest only to a small group of people would now be possible. The main scenario was that of a small community radio station (for example for a school or a university) broadcasting audio content produced by a set of volunteers. In fact, the main users of the WWR system have been the “Offener Kanal Berlin (OKB)”1, the “Berliner Volksbühne”, as well as several groups and organizations of high-school students.

The dominating technical problem was the bandwidth restriction of the Internet in those days. Analog modems constituted the number one way to connect to the Internet. CPU power was limited, especially for real-time signal processing, so that most elaborate audio compression algorithms were not able to run in real time. In the given scenario, a modem connection could not only be found on the client side, but also on the server side. The WWR system therefore consisted of three parts: the server part, the client part, and a so-called broadcaster. The broadcaster was a kind of proxy server that received the compressed audio stream from a server and forwarded it to different clients (or other broadcasters).

Server

The WWR server was a small program (implemented for Linux and Sun Solaris) that was able to record audio directly from the sound card and compress it using a self-developed DCT codec (see [Friedland and Lasser, 1998] for details). The DCT codec was able to compress an 8-bit 8 kHz µ-law [ITU-T, 1988] mono audio signal (64 kbit/s) down to about 13 kbit/s, which was enough for a 14.4 kbit/s modem dialup connection to receive the audio stream without interruptions. The server was able to stream from the input lines of the sound card or from prerecorded files.

Broadcaster

Every connected client requires its own stream. The actual bandwidth requirements for the server therefore increase proportionally to the number of listeners. Since the World Wide Radio system also assumed a modem connection for the server, the solution was to send the compressed feed to an Internet provider with greater bandwidth and then distribute the signal from there to the individual clients. The program responsible for this distribution was called WWR broadcaster. Broadcasters get the compressed audio stream from a regular WWR server or from another broadcaster and send it to clients or other broadcasters. The result is a tree structure similar to the one that exists in the MBONE [85]2.

The multicasting system of WWR was transparent, i. e., the user did not know that he or she actually listened to a stream which came from a broadcaster server even though he or she originally connected to another server. A server automatically forwarded clients if the number of simultaneous connections exceeded a threshold. In order to facilitate transparent forwarding of clients to receive the stream from different broadcasters, even while the client was running, a simple technique was used. Instead of the clients actively connecting to the server, the clients accepted incoming connections from any WWR server or broadcaster. So a server could close the connection to a client at any time and trigger a broadcaster or another server to connect to the client again.

1 Literally translated: “open channel Berlin”. A TV and radio station where everybody can broadcast his or her own TV or radio productions.

2 The reason for not directly using MBONE was that MBONE needs special router configurations and was not very popular outside the academic domain.


WWR differentiated between two types of broadcasters: active broadcasters and passive broadcasters. Active broadcaster servers are always connected to their parent server and get the audio stream even if no client is currently connected. Passive broadcaster servers only connect to their parent server if at least one client requests audio data. The second method results in a short delay for the first user and may lead to even further delays if the parent connection cannot be established (if this happens, the broadcaster redirects the client again).

Client

The WWR client was a small program compiled for Linux, Sun OS, Solaris, FreeBSD, Windows (16 bit), Windows NT, Mac OS 9, Next Step, Irix, and other operating systems. The clients consisted of a single executable (about 30-60 kB, depending on the platform) that could be started right off the download, without the need for any installation. The workflow for receiving a radio program for the first time consisted of viewing the web site of the radio station, downloading the client software, then starting the client software, and pressing a button on the web site to make the server connect to the running client software, a process that many users considered too difficult.

The client received the compressed signal via TCP from a server, decompressed it, converted it to the format capabilities of the underlying sound card, and played it back. Buffering strategies were used to guarantee uninterrupted playback. A buffer length of about seven seconds was needed in order to achieve satisfying results.

6.1.2 World Wide Radio 2

The World Wide Radio 2 (WWR2) project (development began in 1999, together with Bernhard Frotschl) was initially meant to be a Java rebuild of the WWR system. There were several reasons for rebuilding the WWR system in Java. Of course, maintaining server, broadcaster, and client implementations for so many different platforms was very cumbersome. When using pure Java, however, programs are automatically small and portable. Most importantly, Java offered the possibility to embed the client directly into a website without the user needing to download and install client software. Embedding the client as an Applet in a web page also revealed new possibilities. Applets on the same web page are able to communicate. This made it possible to synchronize the audio stream with different applications, a feature that made WWR2 a perfect candidate to take over the audio streaming part of E-Chalk (see Chapter 5 for details). However, WWR2 was also used as a stand-alone program by different institutions, among them Uniradio Berlin [58], the project GIOVE [22], and the “Berliner Gruselkabinett” [10]. An independent evaluation of the system was performed by the Funkschau magazine [Manhart, 1999].

Using Java as a client platform also introduced several problems, which were mainly due to the restricted access rights of Applets and Java's initial performance. For years (until the introduction of Java 1.3), Java Applets running in state-of-the-art web browsers could not play back sound with a sampling frequency higher than 8999 Hz [53]. Fortunately, this was sufficient for recording speech. The basic input format for WWR2 was again 8 kHz, 8 bit µ-law. The first idea to compress these data was to use the low-bandwidth perceptual CODEC developed in [Friedland and Lasser, 1998]. However, the adaptive compression that was used there could not be implemented in Java for efficiency reasons. We found that it was not possible to implement even a subset of MPEG-audio encoding in those days, since any fast DCT-based approach was too slow to run in real time under Java. In the end, we implemented several codecs for different bandwidths, CPU speeds, and audio quality levels into WWR2 by modifying older compression standards. WWR2 contained a simple and fast codec, which used no compression except the Java built-in GZIP algorithm [P. Deutsch, 1996]. The resulting stream needed about 50 kbit/s. To achieve better compression, WWR2 also contained the 4-bit version of the µ-law codec, adapted from [ITU-T, 1988]. This codec, together with GZIP, compresses down to about 20 kbit/s. The sound quality was adequate. To achieve a good trade-off between sound quality, compression, and execution speed, we modified the ITU ADPCM standard [ITU-T, 1990]. The results were 4-bit, 3-bit, and 2-bit modified-ADPCM codecs that, combined with GZIP, gave an effective average compression of 30 kbit/s, 22 kbit/s, and 15 kbit/s.

The WWR2 server was also able to replay files. Programming of sequences and loops could either be done off-line via a configuration file or on-line via a telnet command-line interface. For this purpose, the WWR2 server had a built-in macro language. While the World Wide Radio system had only been able to transmit live, the WWR2 system also supported on-demand listening. In on-demand mode, no server is used, and a client reads a WWR2-encoded audio signal directly over an HTTP stream (see Chapter 5). While live transmission had been considered the most important feature in the original WWR system, in the E-Chalk system it lost more and more importance. Although E-Chalk still supports live transmissions, the main use of the system is on-demand. Along with other technical improvements, such as using a 16 kHz sampling rate, the WWR2 system specialized more and more in on-demand replay. In the end, the resulting system was called E-Chalk Audio, which is described in the next section.

6.1.3 The E-Chalk Audio System

In 2002, a complete rebuild of the audio system was performed (internally called World Wide Radio 3). The challenges for the system were quite different from those of a few years before. In the meantime, analog modems had become faster and were no longer the only possibility to connect to the Internet from home. The Java Virtual Machine as well as processors had improved, and real-time decoding of more complicated codecs did not constitute a problem anymore. Now, a wide range of commercial Internet broadcasting systems were available, and codecs had long surpassed high fidelity3. The design decisions for the new system were therefore dominated by new aims, which in turn were a result of user demands from the universities and other schools that had been using E-Chalk.

Mainly three problems seemed to demand a solution in the new E-Chalk audio system: the integratability and combinability of E-Chalk audio with other software, as well as the extensibility of E-Chalk Audio itself; an infrastructure that is able to handle E-Chalk's audio format, e. g., to allow editing, automatic checking and repairing, as well as converting E-Chalk lectures and in particular E-Chalk's audio format; and the improvement of the subjective quality of E-Chalk's audio recordings.

3 Originally specified in DIN 45500 in the early seventies, now revised into DIN EN 61035.

The interfacing possibilities of E-Chalk audio, as well as the possibility to replace E-Chalk's audio core with completely different software, are facilitated by using SOPA as the underlying architectural layer of E-Chalk. The E-Chalk Audio system consists of a set of independent nodes that are managed by SOPA. Developers can easily substitute nodes with their own versions or add more functionality by implementing additional nodes. The life cycle and deployment of the nodes are automatically managed as described in Chapter 4.

E-Chalk’s audio system features a set of tools for converting between E-ChalkAudio and different other formats. An Exymen plug-in developed by Mary AnnBrennan allows the conversion between E-Chalk Audio and the audio formatssupported by the Java Media Framework [84], which include “wav” and “aiff”containers with different codecs as well as “mp3” [54]. Because Exymen plug-ins can also be used as SOPA nodes, E-Chalk lectures can also be recordedusing these formats, as long as encoding works in real time and a player forsynchronized playback with the chalkboard data is available. E-Chalk Audiorecordings can be edited using an Exymen plug-in (see [Friedland, 2002a]).

Although many codecs could now be integrated into E-Chalk and the audio replay capabilities of the browsers had improved dramatically, many E-Chalk lectures were not recorded with audio. Because the audio track is a vital part of any E-Chalk lecture, we looked at the recording archives of other lecture recording systems. Interestingly enough, the archives of other projects that do not require technical operators for camera and audio recording control also contain many lectures without an audio track (compare for example the lecture archives of the e-class project [74] or the archives of the Classroom Presenter [64]). Several regular E-Chalk users were interviewed informally to analyze the reasons for this problem. Often, audio was omitted because instructors did not want their voice to be archived. Their fear was that their reputation as professors could be harmed when errors, comments, or small jokes from the classroom lecture appeared in the replay; people could recite their utterances word by word. Of course, this problem cannot be solved by software other than by providing the user with the choice not to record audio.

Many of the lectures, however, do not have audio because of usage errors during the recording. Often, the lecturers just forgot to switch on the microphone or the operating system's mixer settings were not correct. Sometimes, audio recording was omitted or subsequently removed because professors and students complained about clipping or noise in the recordings. Mostly, the causes of these problems were again operational errors that resulted from the fact that the instructor had to concentrate on giving the lecture.

As described in Chapter 3, E-Chalk’s philosophy does not allow overheadresulting from the use of the system. Consequently, the system must help theinstructor in handling the task of recording audio properly, i. e., E-Chalk Audiohad to do more than simple recording, encoding, sending, and archiving theincoming sound signal. Hence the notion of Active Recording was introducedin E-Chalk Audio. Chapter 7 is devoted to this approach.


6.2 E-Chalk’s Default Audio System

As described in the previous section, the E-Chalk audio system consists of many components and is highly adaptable and configurable thanks to the SOPA system. The audio system therefore runs in different configurations at different locations. This section describes the components of the E-Chalk Audio system as they are shipped with the default installation. E-Chalk's default audio system consists of a set of about 20 SOPA nodes. Half of them are devoted to the classical recording, compression, and transmission tasks, and the other half constitutes the active recording system. The recording, encoding, and transmission core is described briefly here; the next chapter describes Active Audio Recording.

6.2.1 Encoding

The incoming audio data from the active recording system is a 16-bit mono signal with a sampling rate of 16 kHz. It is compressed down to one of the following four bandwidths: 40 kbit/s, 32 kbit/s, 24 kbit/s, or 16 kbit/s (upper limits), using a variant of an ADPCM codec inherited from WWR2. ADPCM is a very simple and computationally efficient lossy compression method that gives usable results for speech [ITU-T, 1990]. ADPCM stands for Adaptive Differential Pulse Code Modulation and is a waveform-quantization method. Given a few samples of the waveform from the past, the next sample is predicted by a heuristic. The same prediction heuristic works in both the encoder and the decoder, so that only the difference between the prediction and the measured sample has to be transmitted from the encoder to the decoder to reconstruct the original signal. If the prediction is good, the variance of the prediction error should be much smaller than the signal variance, reducing the number of bits that have to be transmitted. In practice, the transmitted difference values are quantized by defining a step-size table of a certain size. The table contains predefined difference values. The difference is then encoded by a reference into the step-size table. Because the table is set to a fixed size, the number of bits that have to be transmitted per sample is also fixed. By changing the size of the table it is easily possible to control the bandwidth-quality tradeoff from lossless to unbearably lossy. To achieve further compression, especially in regions with no signal, the signal is packetized and compressed with the Java built-in ZIP algorithm [P. Deutsch and J-L. Gailly, 1996]. The E-Chalk Audio format is described in detail in Appendix D.
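The following Java fragment sketches the ADPCM principle just described. It is an illustration only: the step-size table, the 4-bit code layout, and the adaptation rule are simplified assumptions and do not reproduce the actual E-Chalk codec or its packet format.

    // Illustrative ADPCM-style encoder: predict, quantize the difference via a
    // step-size table, and adapt the step size. The decoder mirrors the same state.
    public class AdpcmSketch {
        // Abbreviated step-size table (a real table is much longer).
        private static final int[] STEPS = {7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31};
        private int predicted = 0;   // last reconstructed sample (shared predictor state)
        private int stepIndex = 0;

        /** Encodes one 16-bit sample into a 4-bit code (sign bit + 3-bit magnitude). */
        public int encode(int sample) {
            int step = STEPS[stepIndex];
            int diff = sample - predicted;                      // prediction error
            int magnitude = Math.min(Math.abs(diff) / step, 7); // quantized difference
            int code = (diff < 0) ? (8 | magnitude) : magnitude;
            // Reconstruct exactly as the decoder would, so both stay in sync.
            predicted += (diff < 0 ? -1 : 1) * magnitude * step;
            // Adapt: large differences enlarge the step size, small ones shrink it.
            stepIndex = Math.max(0, Math.min(STEPS.length - 1, stepIndex + (magnitude > 3 ? 1 : -1)));
            return code;
        }
    }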

6.2.2 Live Transmission and Archiving

The encoded and packetized audio signal is then ready to be streamed over the Internet. E-Chalk's audio server component is a straightforward implementation that waits for a connection of a client and then streams the encoded data packet-by-packet over TCP/IP. Each connection is managed in a separate thread with a large buffer (by default 256 kB) to compensate for connection bandwidth variations or even temporary stalls. A self-developed benchmark program that simulates incoming connections in a short time period was used to test the robustness of the approach and to determine the maximum number of manageable connections. With state-of-the-art computers, however, memory and CPU power are sufficient to handle several hundreds of connections, even if the video server and the chalkboard server are running on the same machine. The server node can be used to stream any audio or video format.

Figure 6.1: A screenshot of the E-Chalk lecture checker and repair tool. Audio recordings are scanned for broken packets and the file is repaired if possible.

During the transmission, the encoded version of the audio stream is also saved to a file, along with an index file that contains a list of offsets pointing to the beginning of each packet. During on-demand replay, both files are streamed over HTTP, and the index file is used by the client to accelerate random seek. For a description of the audio client, refer to Section 5.2.2.

In order to simulate bandwidth variations, self-developed traffic shapers were used that randomly limit the connection speed. This is useful to determine the required buffer size at the receiving end. A large receiving buffer results in a large time shift between sender and receiver, while a small buffer might be unable to compensate for normal bandwidth variations. I found that using a client buffer size of about 3.2 seconds suffices to allow a 90-minute uninterrupted transmission of a typical chalkboard lecture combined with 40 kbit/s audio over ISDN (64 kbit/s). In the typical scenario, where a local student follows a university lecture at home, the buffer is almost only needed when an image that has been put on the board has to be transmitted. The delay between recording and playback, however, is unacceptable for bidirectional transmissions. It is therefore necessary to adjust the size of the buffer when synchronizing E-Chalk with a video conferencing system.
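As an illustration of the thread-per-connection design described above, the following sketch shows the overall structure of such a server in Java. The packet queue, the port number, and the buffer handling are simplified assumptions; the real server node additionally fans every packet out to all connected clients, writes the archive and index files, and follows the E-Chalk packet format.

    import java.io.BufferedOutputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class AudioStreamServerSketch {
        // Encoded audio packets handed over by the recording pipeline (stub).
        static final BlockingQueue<byte[]> packets = new LinkedBlockingQueue<>();

        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(4711)) {      // port chosen arbitrarily
                while (true) {
                    Socket client = server.accept();
                    new Thread(() -> serve(client)).start();          // one thread per client
                }
            }
        }

        static void serve(Socket client) {
            try (OutputStream out = new BufferedOutputStream(
                    client.getOutputStream(), 256 * 1024)) {          // large per-client buffer
                while (true) {
                    byte[] packet = packets.take();                    // blocks until data arrives
                    out.write(packet);                                 // stream packet by packet
                    out.flush();
                }
            } catch (Exception e) {
                // Client disconnected or stalled longer than the buffer can absorb.
            }
        }
    }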

6.3 Tools

Numerous small helper applications are part of E-Chalk's infrastructure and allow the handling of the audio format both stand-alone and as part of a lecture with board content and video (compare also [Friedland, 2002a] and [Knipping, 2005], Chapter 6).


6.3.1 Lecture Repair Tool

Among the most important tools is the E-Chalk lecture checker (compare Figure 6.1). Initially, the tool was created as a reaction to problems reported by Technische Universitat Berlin: One of their computers running E-Chalk in the classroom crashed regularly due to a hardware problem. The tool, which is now part of the E-Chalk distribution, gives users more security not to lose their work, for example due to a crash of the computer or a power shortage during the lecture (this typically happens in a laptop presentation without stationary power supply). When a computer crashes during a lecture, the last few audio packets are sometimes missing and/or the last packet has only been written partly. Because the lecture recording has not been terminated properly, no HTML files for archived lecture replay have been saved into the output directory, so that the lecture cannot be replayed in the browser. In most cases, the classroom lecture is continued after a reboot of the machine. Often instructors then just continue the crashed E-Chalk recording. E-Chalk then appends the continued lecture to the files already written without checking if the already saved parts of the lecture have been terminated properly (such a check would take too long). So in this case, the broken packet is in the middle of the recording (and in the case of multiple computer crashes there might even exist several broken packets). The E-Chalk lecture checker scans E-Chalk lecture recordings for corruption and optionally repairs them. The board event file is scanned line by line, and corrupt events are removed. Corrupt audio packets are found by unpacking each packet in the archive. If a packet cannot be unzipped entirely and/or the entries given in the index file do not correspond to the actual offsets of the packet headers, the archive needs repair. Corrupt packets are replaced by correct ones containing silence, trying to maintain the time synchronization of the tracks. Of course, the lecture checker is also able to generate a new index file. Video files are repaired similarly by replacing corrupt packets with packets that contain idle frames (compare Chapter 8). However, if several entire packets have been lost during recording, the archived chalkboard, audio, and video tracks may need to be re-synchronized manually using Exymen. In the end, the lecture checker generates new HTML pages and puts a current Java replay client into the lecture directory.
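The following Java fragment sketches the audio-packet check just described. The index layout assumed here (one offset and length per packet) and the use of java.util.zip's Inflater are simplifications for illustration; the actual file layout is defined in Appendix D.

    import java.io.RandomAccessFile;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class AudioArchiveCheckSketch {
        /** Returns true if every packet listed in the index can be unpacked completely. */
        public static boolean check(RandomAccessFile index, RandomAccessFile audio) throws Exception {
            byte[] out = new byte[64 * 1024];
            while (index.getFilePointer() < index.length()) {
                long offset = index.readLong();                        // assumed: packet offset
                int length = index.readInt();                          // assumed: packet length
                if (offset + length > audio.length()) return false;    // truncated archive
                byte[] packed = new byte[length];
                audio.seek(offset);
                audio.readFully(packed);
                Inflater inflater = new Inflater();
                inflater.setInput(packed);
                try {
                    while (!inflater.finished()) {
                        if (inflater.inflate(out) == 0 && inflater.needsInput()) {
                            return false;                              // packet cannot be unzipped entirely
                        }
                    }
                } catch (DataFormatException e) {
                    return false;                                      // corrupt packet found
                } finally {
                    inflater.end();
                }
            }
            return true;
        }
    }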

6.3.2 Audio Format Converter

The WWR2-to-WWR3 converter works both as a wizard and as a command-line tool. The command-line tool is used internally by the E-Chalk Startup Wizard if a lecture in the old format is to be continued using a newer E-Chalk version. In the beginning, the purpose of the converter consisted exclusively of this syntax transformation from one format to the other. However, when active recording (Chapter 7 explains Active Recording in detail) was introduced in E-Chalk, users asked for the possibility to improve old E-Chalk recordings using the noise and humming fingerprints and/or the equalizer settings created with the newer E-Chalk versions. Of course, this is only possible when the old lectures and the new fingerprints have been produced with the same hardware setup, i. e., if only the E-Chalk version has changed. By using the GUI wizard version of the WWR2-to-WWR3 converter, it is possible to upgrade the syntax while applying a recently recorded fingerprint.


6.3.3 E-Chalk Broadcaster

E-Chalk audio also features a transparent multicasting mechanism similar to the one explained in Section 6.1.1. The mechanism also uses a simple broadcaster tool that works similarly to the one already introduced in WWR(1). The difference is that the transparent forwarding of clients works by an event mechanism that is integrated into the E-Chalk audio format (compare Appendix D). When a server or broadcaster wants to forward a connected client to another server or broadcaster (for example, if the maximum number of connections is exceeded), it sends the URL pointing to the web page of the new server or broadcaster, encoded in the WWR stream. The client Applet then replaces its own host web page by opening the URL. Because the board client expects to receive the entire event file at the start of transmission and the rest of the events line by line, the E-Chalk events are broadcasted by a small Unix shell script4. Since most of the live transmissions are performed from within the network of a university, the broadcaster application is used very rarely5. For this reason, a broadcaster for the video stream has not been implemented yet, although it would easily be possible.6

Several other tools, for example a recording monitor, an automatic mixer control, and a noise fingerprint recorder, have been built as part of the active recording component to enhance the recording quality. These tools are described in detail in the following chapter.

4 A Java version of a broadcaster for E-Chalk board content was created by Christian Burger from elfzehn GbR.

5 The broadcaster application was only used for special events at locations where only a modem connection was available, for example an arts event where an electronic chalkboard was used as a drawing device in September 2004.

6 The audio broadcaster application cannot be used because the video broadcaster would have to generate the initial header for each connecting client (compare Appendix F).


Chapter 7

Active Audio Recording

As described in the previous chapter, several components have been added to E-Chalk's audio system because, in contrast to the scenario assumed for the older versions of World Wide Radio, one cannot just assume that the audio signal fed into the encoding computer is of broadcast quality. This chapter analyzes some of the problems causing the quality distortion and presents techniques to improve the situation. These techniques were integrated into a wizard application that is used before the first lecture and several SOPA nodes that run during the recording. The entire system was integrated into E-Chalk under the name Active Audio Recording, which is also described in [Friedland et al., 2004b, Friedland et al., 2005c].

7.1 Audio Recording in Classrooms

Looking at the different usage scenarios of E-Chalk (see Chapter 3) and those of similar lecture-recording systems, several practical problems that deteriorate audio quality can be observed. The two key factors that most often degrade the audio quality are usage errors with the sound equipment, including a wrong assessment of the quality to be expected with given equipment, and distortion sources that are typical for the situation in a classroom or a lecture hall.

7.1.1 Usability Problems

According to my experience as well as user feedback, the usability problems that directly concern audio quality can be classified into three categories: wrong assessment of the results to be expected with certain equipment and workload, usage errors that concern configuration and setup, and issues during the lecture.

Both teachers and students are accustomed to high-fidelity, almost noise-free, broadcast-quality voice recordings that they are able to receive every day through radio and television or are able to buy on commercial compact discs. These recordings are mostly produced in a studio with very costly equipment and with (several) sound engineers present, monitoring every recorded sample both by listening to it and by observing different measurement instruments. Just plugging a microphone into a notebook computer's sound input jack does not deliver the same results.


Of course, the experienced audio quality of a lecture recording also depends on the speakers and the equipment at the receiving end. In fact, most sound cards focus on sound playback, not on sound recording; gaming and music replay are their most important applications. Many sound cards cannot generate studio-quality audio recordings. On-board sound cards, especially those in laptop computers, often have very restricted recording capabilities. Even when a decent sound card is installed, noise may be introduced into the sound equipment by hard disk motors and fans because of the very compact construction [Mack, 2002].

Many people also have problems setting up the computer for sound recording, especially if it is an unfamiliar notebook computer and the setup is made ad hoc, e. g., directly before a presentation. The improper handling of the operating system's mixer often causes malfunctions. The mixer differs from sound card to sound card and from operating system to operating system and usually has many knobs and sliders with potentially wrong settings. Selecting the right port and adjusting mixer settings can take even experienced users minutes. Often the microphone is put into the wrong input jack, an error that is mostly noticed only after the presentation or lecture. Even if the right input jack is chosen, the input level is very often not adjusted well because many people just do not know how to adjust the input level correctly. The results are either overflows or a recording that is not set to maximum gain.

During a lecture, the instructor's attention is entirely focused on the presentation, and technical problems can easily be overlooked. Often, lecturers just forget to switch the microphone on. In many lectures, weak batteries in the microphone caused a drastic reduction of the signal-to-noise ratio.

However, as described in Chapter 3 and discussed in Section 6.1.3, E-Chalk's philosophy requires that using the system must ideally not cause additional overhead. The lecturer has to do his or her teaching job, and controlling a battery lamp during the whole lecture or setting up the mixer to the right level is overhead. So the problems that result from not doing the above-mentioned things right, or from not doing them at all, cannot be considered the lecturer's fault.

7.1.2 Distortion Sources

Besides the above-mentioned usability issues, there are also many direct audio distortion sources that can affect the audio quality of a lecture recording. This paragraph only concentrates on those that in my experience have the greatest impact on recording quality. For a more detailed discussion of these problems refer for example to [Dickreiter, 1997a, Katz, 2002].

In contrast to a recording studio, a classroom or a lecture hall is filled with multiple sources of noise: students are murmuring, cellular phones ring, doors slam, etc. Several independent studies provide measurement methods and quantitative data, see for example [American National Standards Institute, 2002], [Hodgson et al., 1999], or [Bradley et al., 1999]. In lecture halls there may also be reverberation that depends on the geometry of the room and on the number of people in the seats [Knecht et al., 2002]. As a consequence, the speaker's voice does not always have the same loudness, as it adapts to the noise level of the audience. The loudness and the volume of the recording depend on the distance between the microphone and the speaker's head, which is usually changing all the time. Coughs and sneezes or even movements of the lecturer result in irritating noise. Feedback loops can also be a problem if the voice is amplified for the audience. Long audio wires can cause electromagnetic interference that results in noise or humming.

7.1.3 Other Issues

Some issues are also directly caused by low-quality sound cards. Some analog-to-digital converters introduce a fairly high DC offset. A DC component is inaudible during replay. However, when a lecture is appended to a lecture recorded with a sound card that has a different offset, or when Exymen is used to edit the sound, a changing DC offset causes a “click” sound at the merging point.

Yet another problem is the incorrect timing of sound cards. E-Chalk's board events are timed using the clock of the computer, while the audio system relies on getting a certain number of samples per second. However, comparing the computer clock with the number of sound samples received sometimes results in noticeable differences. Using a small measurement application distributed to several E-Chalk users, I found that these timing differences vary from sound card to sound card1. Usually, better sound cards have more accurate timing, but timing errors of up to 0.1 % appear in many sound cards. In a 90-minute lecture, an error of this magnitude results in a desynchronization between board and audio track of about five seconds at the end of the lecture.
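The five-second figure is simply the stated error rate applied to the lecture length:

    0.1 % of 90 min = 0.001 × 5400 s = 5.4 s ≈ 5 s.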

7.1.4 Ideal Audio-Recording Conditions

Even though many audio distortion sources exist in a classroom or lecture hall, this does not mean that broadcast-quality recording is impossible. Many commercial music productions, for example, have been produced at live concerts of rock bands. Usually, the noise level at these events exceeds the noise in classrooms or lecture halls by several orders of magnitude. However, the equipment and personnel required for such a recording also exceed any effort expendable for regular lecture recording. It is impossible to describe a generic solution for all kinds of rooms, speakers, and situations in one section. Nevertheless, looking at what would be ideal helps to identify the problems of the status quo.

In contrast to a studio recording, the voice of the instructor is also important for an audience in the room. Either the instructor speaks aloud or additional speakers are used to amplify his or her voice. In the latter case, care must be taken that the speakers for the audience are both loud enough and cause no feedback with the recording microphone. Reverberation effects of the room should be avoided using proper absorption material; one of many typical methods is to pad the backs of the audience seats.

Since noise is inevitably present in lecture halls or seminar rooms, a directed microphone should be used for recording. This also eliminates other influences of the room acoustics, like reverberation. Directed microphones, however, are usually very sensitive to direct contact or movements of the instructor, which result in scratching sounds. The microphone should have some kind of pop screen to avoid popping effects during the recording of plosive consonants. The distance to the microphone should be adjusted according to the specification of the microphone (usually about 30 to 60 cm from the mouth of the speaker). As the volume of the speaker's voice changes dramatically with the distance, wireless microphone headsets are a perfect technical choice. If the distance between sound source and microphone is too small, the proximity effect boosts the lower frequencies and makes the voice sound unnatural. However, some lecturers feel constrained by having to wear a headset. The alternative is a wireless lavalier microphone. However, the distance has to be properly adjusted, and scratching sounds due to movements or direct contact, for example by clothes that brush against it, must be avoided. The signal level can drop or rise dramatically if the speaker turns away from or towards the microphone.

1 Thanks to Stefan Flor and Thomas Klein of the Physics Department of the University of Rostock for their initial reporting of this problem.

Before the recording, a sound check should be performed. The recording should then be monitored by a technician to ensure the equipment works as expected. The gain must be controlled continuously because the volume of the instructor can change rapidly. If the gain is too low, the signal-to-noise ratio becomes a problem; if the gain is too high, the signal is clipped. Both problems result in distortions of the audio signal, especially when lossy encoding techniques are used, because they assume a perfect input signal. Furthermore, noise raises the entropy of the signal, decreasing the compression ratio. Cable length should be minimized and only shielded cables should be used. Although this seems to go without saying, we have experienced many problems with university lecture halls using long cables that introduced humming. Equalizers may be used to balance out certain frequencies and to tune the signal for a more pleasant listening experience.

The factor that should be optimized for a lecture recording is speech intelligibility. The two factors that mainly determine speech intelligibility are the upper-bound frequency and the signal-to-noise ratio [Dickreiter, 1997a, Allen, 1994]. The upper-bound frequency is cut off mainly by the sampling rate of the sound card. Given state-of-the-art codecs, however, the distortion introduced by noise and the upper-bound frequency is virtually imperceptible, especially for speech (see for example [Hansen, 2002]). The most important factor is the sound equipment and, of that, mainly the microphone and the sound card [Mack, 2002].

7.2 Improving Audio-Recording Quality

Given the problems discussed in the previous section, together with the fact that in most cases no further personnel is available for consultation during or even before lecture recording, the question arises how E-Chalk could assist a lecturer in improving the audio quality. Providing state-of-the-art codecs for audio compression is important; however, if the signal is distorted even before the compression, satisfying results will not be achieved anyway. Therefore, improving audio recording for lectures means first and foremost improving the quality of the raw signal before it is processed by the codec. Ideally, it should be possible to produce satisfactory audio quality with standard hardware and without a technician necessarily present. In order to do this, the software must be able to help the lecturer in assessing the expected results, monitor the signal during the recording, and provide basic tools that automatically reduce the influence of distortion sources. Yet sound recording is a profession and a research area of its own, and, of course, the work presented in this chapter does not aim to replace them. However, the work in this chapter indicates that there are possibilities to assist an audio-unsavvy user in producing higher-quality speech recordings.

None of the state-of-the-art Internet broadcasting systems such as Windows Media Encoder, RealProducer, or QuickTime (see Chapter 2) provide automatic monitoring or signal-enhancing mechanisms. Like the old World Wide Radio system, they assume the typical streaming usage scenario where a high-quality audio signal is fed in by a radio broadcasting station with audio technicians being present. As already discussed in Section 2.3 as well as explained in [Mack, 2002], these software systems integrate into a recording work flow that requires both adequate equipment and trained personnel. Microsoft's Real-Time Communications API at least provides an audio-tuning wizard which offers a manual input-device selection and a dialog that helps a user to adjust the microphone distance. Most video-conferencing tools, such as Polycom ViaVideo (see Chapter 2), do at least have basic filters for echo canceling or feedback suppression. Octiv Inc. applies real-time filters to increase speech intelligibility in telephone calls and conferences. They provide hardware to be plugged into the telephone line. Cellular telephones also apply filters and speech enhancement algorithms, but these rely on working with special audio equipment and knowing the properties of the underlying devices. Among other products, Octiv also sells a product called Volume Pro which acts like a kind of enhanced audio compressor [38].

In academic research, many projects seek to solve the Cocktail Party Problem [Haykin, 2003]. Most approaches try to solve the problem with blind source separation using extra hardware, such as multiple microphones. Itoh and Mizushima [Itoh and Mizushima, 1997] published an algorithm that identifies speech and non-speech parts of the signal before it uses noise reduction to eliminate the non-speech parts. The approach is meant to be implemented on a DSP, and although aimed at hearing aids it could also be useful for sound recording in lecture rooms.

In practice, most recording and sound-editing tasks are still solved manually. Just as in many image manipulation programs, the tools provided by generic audio-editing applications are very powerful, yet they require a user who knows what he or she wants to do. Even consumer-level remastering software such as Steinberg “WaveLab” [51] usually requires the user to visit an introductory seminar to be able to handle the software properly, not to mention more complex remastering software such as “Magix Samplitude” [30] or “Logic Pro” [7]. Automation is still at the beginning and only available for special purposes. For example, Steinberg also provides software like “My MP3 Pro” [51] that facilitates the creation of MPEG-encoded files from a given audio source.

PEAQ [Thiede et al., 2000] is a sound-quality-assessment algorithm based on psychoacoustic models. It aims to emulate a sound quality test using audio experts at a degree of quality where distortions are not easily noticeable anymore. The algorithm is intrusive, i. e., it requires a reference signal. Additionally, the quality measurement is computationally expensive, so an online measurement is impossible at the moment2. However, the algorithm is actually a combination of several methods, some of which can be used to assess audio quality during and before lecture recording, as described below.

2 I thank Opticom GmbH for providing me with a test version of their PEAQ and PQSM implementation, distributed under the product name Opera.


Figure 7.1: A conceptual overview of the steps of the audio diagnosis wizard that is to be run before the first lecture recording.

Of course, not all of the audio-distortion problems discussed in Section 7.1 can be solved by software. E-Chalk's active audio recording component focuses on the special case of lecture recording and mainly concentrates on the automation of equipment configuration and sound hardware monitoring. The system assists in the assessment of the sound equipment and provides several basic methods for the suppression of audio distortions. The system relies on the lecturer using some kind of directional microphone, so that the influence of room geometry and of cocktail-party noise is already eliminated. A lecture-recording system has the advantage that information about speaker and equipment is already accessible before the recording. The approach is therefore divided into two parts:

1. An expert system analyzes sound card, equipment, and the speaker's voice and keeps this information for recording. It assists in assessing the quality of the audio equipment and makes the user aware of its influence on the recording.

2. During recording, a hardware monitor and some basic filters use the information collected by the expert system to improve the quality of the incoming audio signal.

7.3 Before the First Lecture

Before the first lecture in a new room or with new equipment is recorded, the lecturer creates a so-called audio profile. It represents a fingerprint of the interplay of sound card, equipment, and speaker. The profile is recorded using a GUI wizard that guides the user through several steps, see Figure 7.1. This setup takes about three minutes and has to be done once per speaker and sound equipment. Each speaker uses his or her audio profile for all subsequent recordings as long as the recording equipment remains unchanged. The GUI wizard does not only record the audio profile, it also simulates several tasks that a recording technician would do before a lecture recording. The program identifies and configures the sound hardware setup (once it has been installed into the operating system) and takes sample measurements of the sound card and the sound equipment. It then simulates E-Chalk Audio’s processing chain, allowing a user to listen to the recording exactly as it will be broadcast. A final report gives a hint of how the sound equipment compares to an ideal one, based on the measurements.


The results are saved in the audio profile to be used by E-Chalk during recording. Because users asked for the diagnostic features of the wizard independently of E-Chalk, the wizard also exists in a diagnostic-only version that runs as a stand-alone application and does not create an audio profile. In the following paragraphs, the diagnostic steps of the expert system underlying the wizard are presented. Further technical details on the underlying methods are given in Appendix E.

7.3.1 Detection of Sound Equipment

The first step of the setup consists of detecting the audio equipment. The user is asked to disconnect every input device from the sound card but the one he or she wants to record with, and to turn this one on. Using the operating system’s mixer API, the sound card’s input ports are scanned to find out which recording devices are plugged in. This is done by briefly reading from each port with its gain at maximum while all other input lines are muted. The line with the maximum noise level is assumed to be the input source. For the result to be certain, this maximum must differ from the other noise levels by at least 3 dB, otherwise the user is required to select the line manually. With a single source plugged in, this occurs only with digital input lines because they produce no background noise. At this stage several hardware errors can also be detected; for example, if noise is constantly at zero decibels there is a short circuit. After the sound card and input line have been chosen, E-Chalk’s audio system takes full control over the sound card mixer, both during the remaining steps of the wizard and during recording. There is no need for the user to deal with the operating system’s mixer.
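
The detection logic boils down to a small search over the available input lines. The following sketch illustrates it; the measureNoiseDb helper is hypothetical (the real implementation reads the levels through the operating system’s mixer API), and only the 3 dB margin is taken from the description above.

import java.util.List;

/** Illustrative sketch of the input-line detection step; not the E-Chalk code. */
public class InputLineDetector {

    /** Hypothetical helper: mutes all other lines, raises the gain of 'line'
     *  to maximum, records briefly, and returns the measured level in dB. */
    static double measureNoiseDb(String line) {
        return -60.0; // placeholder value
    }

    /** Returns the detected input line, or null if the user must select manually. */
    static String detectInputLine(List<String> inputLines) {
        String best = null;
        double bestLevel = Double.NEGATIVE_INFINITY;
        double secondBest = Double.NEGATIVE_INFINITY;
        for (String line : inputLines) {
            double level = measureNoiseDb(line);
            if (level > bestLevel) {
                secondBest = bestLevel;
                bestLevel = level;
                best = line;
            } else if (level > secondBest) {
                secondBest = level;
            }
        }
        // The loudest line is only accepted if it exceeds all others by at least 3 dB;
        // otherwise the result is ambiguous (typical for silent digital inputs).
        return (bestLevel - secondBest >= 3.0) ? best : null;
    }
}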

7.3.2 Recording of Floor Noise

In theory, silence should not contain any noise, or the noise level should at least be below the hearing threshold. The second step in the wizard is therefore the recording of the sound card’s floor noise. The user is asked to unplug all input devices from the sound card3. The mixer control raises the input gain to maximum and a few seconds of “silence” are recorded. The signal is analyzed to detect possible hardware problems or handling errors. For example, if the floor noise is too high, noise may be introduced by electromagnetic interference. Hard disks or fans are typical sources of such problems. Overflows may be caused by short circuits at the sound card ports and wires. These critical noise levels result in descriptive warnings.

After recording the sound card’s noise level, the user is asked to re-plug and switch on the sound equipment. Then the equipment floor noise is measured. The equipment floor noise is influenced by several factors and may consist of noise or humming introduced by the equipment or its direct environment. Before the analysis, the user is asked to play back the recording in order to verify that no accidental sounds have been recorded. Again, warnings will be shown if the floor noise level is too high. Furthermore, comparing this signal to the previous recording makes it possible to detect handling errors. If the equipment noise level is lower than the sound card noise level, the measurement was performed wrongly.

3On notebook computers this is not always possible because built-in microphones cannot always be switched off. The wizard then skips the recording of sound card background noise.


Figure 7.2: A report gives a rough estimation of the quality of the equipment.

If both are the same, the equipment is likely to be switched off or connected to a wrong port. If overflows occur during the recording of floor noise, hardware malfunctions are very probable. A constant maximum level indicates a possible short circuit. Both floor noise recordings are stored for later use.
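
These plausibility checks amount to a few comparisons of the two noise measurements. The sketch below restates them; the tolerance value is an illustrative assumption, not a value taken from the implementation.

/** Sketch of the floor-noise plausibility checks; the tolerance is illustrative. */
public class FloorNoiseCheck {

    static void check(double soundCardNoiseDb, double equipmentNoiseDb,
                      boolean overflowsDuringSilence) {
        if (equipmentNoiseDb < soundCardNoiseDb) {
            // The equipment cannot be quieter than the sound card alone.
            System.out.println("Warning: measurement was performed wrongly.");
        } else if (equipmentNoiseDb - soundCardNoiseDb < 0.5) {
            System.out.println("Warning: equipment is probably switched off "
                    + "or connected to a wrong port.");
        }
        if (overflowsDuringSilence) {
            System.out.println("Warning: probable hardware malfunction, "
                    + "possibly a short circuit.");
        }
    }
}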

7.3.3 Dynamic-Range Adjustment

In order to equalize the voice level, adapt to loudness variations, and level out distortions easily detectable by signal peaks, Active Recording includes an automatic gain control (see Section 7.4). Since the dynamic range adjustable by the sound card mixer is mostly only a sub-interval of the overall dynamic range of the equipment connected to the sound card, the equipment must be adjusted so that the automatic gain control is able to control the signal level optimally. For a perfect gain control, the recorded voice varies only inside the dynamic range of the mixer. The dynamic-range adjustment is the next step of the GUI wizard. The user is asked to record a phrase containing many plosives. In English, repeating the phrase “coffee pot” seems to provide good results. Plosives form the peaks in voice recordings and are therefore able to provide the upper bound of the dynamic range. During the test recording, the sound card mixer’s input gain at the current port is adjusted to control the gain of the signal. During recording, the average signal level should be maximized but overflows must be avoided. If too many overflows are detected or if the average signal gain is too low, the user is informed about possible improvements.
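
One way to realize this adjustment is a simple feedback loop over short analysis blocks of the test recording: raise the mixer gain while the signal stays well below full scale and back off as soon as the plosive peaks start clipping. The step size and thresholds below are illustrative assumptions, not the values used by the wizard.

/** Sketch of the input-gain adjustment during the plosive test recording.
 *  The gain is a mixer value in [0, 1]; step size and thresholds are illustrative. */
public class DynamicRangeAdjustment {

    /** Called once per short analysis block of the test recording. */
    static double adjustGain(double gain, int overflowsInBlock, double peakLevelDb) {
        final double step = 0.02;
        if (overflowsInBlock > 0) {
            gain -= step;               // plosive peaks are clipping: back off
        } else if (peakLevelDb < -6.0) {
            gain += step;               // headroom left: raise the input gain
        }
        return Math.min(1.0, Math.max(0.0, gain));
    }
}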

7.3.4 Measuring Signal-to-Noise Ratio

As discussed in the previous sections, noise is one of the predominant factors in speech intelligibility and also an important quality indicator for sound equipment. For an accurate measurement, a frequency generator has to be plugged into a sound card port and both the harmonic distortion and the noise added by the sound card have to be determined by measuring the distance between the input and the output at different frequencies and waveforms. Of course, this requires equipment and, more importantly, an educated technician.


Figure 7.3: E-Chalk Audio’s processing chain during lecture recording. The encoded signal is then transmitted or stored to a file.

Therefore E-Chalk Audio relies on an approximation that is measured more easily. In the next step of the wizard, the user is asked to record a predefined sentence. The gain is measured, without counting speech pauses between words.

The signal-to-noise ratio is then estimated by comparing the A-weighted gain [DIN EN, 2003] against the A-weighted floor noise. Although only an estimation, this is a very practical method that has often been used – for example, to measure the signal-to-noise ratio of tape recorders [Dickreiter, 1997b].
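
Expressed on recorded samples, the estimate is simply the level difference between the two recordings. The sketch below assumes both signals have already been A-weighted and are available as normalized float samples; it is an illustration, not the actual measurement code.

/** Sketch: estimate the signal-to-noise ratio from an A-weighted speech
 *  recording (pauses removed) and an A-weighted floor-noise recording. */
public class SnrEstimate {

    /** Root mean square of normalized samples in [-1, 1]. */
    static double rms(float[] samples) {
        double sum = 0.0;
        for (float s : samples) {
            sum += (double) s * s;
        }
        return Math.sqrt(sum / samples.length);
    }

    /** Estimated SNR in dB. */
    static double snrDb(float[] weightedSpeech, float[] weightedFloorNoise) {
        return 20.0 * Math.log10(rms(weightedSpeech) / rms(weightedFloorNoise));
    }
}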

7.3.5 Fine-Tuning and Simulation

When the tests and adjustments have been finished, the user is asked to record a typical start of a lecture. This final recording serves as the basis for a simulation and allows for some fine-tuning. The recording is filtered (as described in Section 7.4), compressed, and uncompressed again using E-Chalk’s default codec. The user is able to listen to his or her voice as it will sound after having been transmitted over the Internet. This step does not only provide a demonstration of the expected results, it is also useful for debugging certain problems before the actual recording of the lecture. If necessary, an equalizer (according to [ISO, 1997]) allows experienced users to further fine-tune the frequency spectrum of the recording. The time for filtering and compressing is measured. If this process takes as long as or even longer than the recording itself, audio packets will be lost during the real recording because the computer is too slow.

7.3.6 Summary and Report

At the end of the simulation process a report is displayed, as shown in Figure 7.2. The report summarizes the most important measurements and grades sound card, equipment, and signal quality into the categories “excellent”, “good”, “sufficient”, “scant”, and “inapplicable”. The sound card and the equipment are graded using the background noise and the estimated signal-to-noise ratio calculated from the recordings. A sixth grade, “improperly measured”, is given for contradictory results, for example when the equipment noise level is lower than the sound card noise level alone. Of course, this is only a very rough grading. Sound quality is determined by many more factors, for example frequency response, harmonic distortion, or phase distortion. Other factors can be ignored; for example, crosstalk does not play a role since E-Chalk Audio only records mono sound. Many aspects that were problematic in the past, such as the dynamic range (which is actually determined by the bit depth) and jitter, are believed to be handled well enough by modern sound cards.


Figure 7.4: The microphone’s floor noise level has dropped – the batteries have to be changed. The warning dialog appears directly in front of the chalkboard and disappears without any need for interaction once the problem has been fixed.

The advantage of using these simple heuristics is that the measurement of silence noise can easily be performed by users without prior knowledge and still assists in identifying quality bottlenecks. Further testing would require the user to work with loop-back cables, frequency generators, and/or measurement instruments. In order to create a grading scale, practical experience reports were collected from the Internet and consumer computer magazines, for example from [56]; floor noise level seems to be an important indicator for the overall quality of sound cards. Further details on the grading heuristics can be found in Appendix E.

In the last step of the wizard the user is asked to enter an identification string for the audio profile. Among other information, the created profile contains all mixer settings, the equalizer settings, the recordings, and the sound card’s name. The identification string then appears in E-Chalk’s Startup Wizard and enables the selection of a certain profile for the Active Recording chain.

7.4 During Lecture Recording

This section describes the steps that are performed during the lecture; they are also illustrated in Figure 7.3.

7.4.1 Mixer Monitor

During the lecture, the system relies on the profile of the equipment. If changes are detected, for example a different sound card, the system complains at startup. The mixer settings saved in the profile are used to initialize the sound card mixer. The mixer monitor complains if it detects a change in the hardware configuration, such as the use of a different input jack. It supervises the input gain in combination with the mixer control.


Figure 7.5: Without (above) and with (below) mixer control: The speech signal is expanded and the cough is leveled out.

A warning is displayed if too many overflows occur or if the gain is too low, for example when the microphone batteries are running out of power. The warning disappears when the problem has been solved or if the lecturer decides to ignore the problem for the rest of the session. Figure 7.4 shows a warning dialog presented during the recording.

7.4.2 Mixer Control

The mixer control levels out the input gain using the sound card’s mixer. The analog preamplifiers of the mixer channels thus work like the analog expander-compressor-limiter components used in recording studios. This makes it possible to level out voice intensity variations. Coughs and sneezes, for example, are leveled out (compare Figure 7.5), as well as gain variations caused by changes in microphone distance (compare Figure 7.6). The success of this method depends on the quality of the sound card’s analog mixer channels. Sound cards with high-quality analog front panels, however, are becoming cheaper and more popular. Automatic gain control also reduces the risk of feedback loops: whenever a feedback loop starts to grow, the gain is lowered. As in analog compressors used in recording studios, the signal-to-noise ratio is lowered. For this reason noise filters, as described in the next paragraph, are required. In order to be able to react quickly to gain changes, Steinberg’s “ASIO” low-latency sound-recording interface is used if the sound card provides it.
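
A minimal sketch of such a control loop is given below. It lowers the mixer gain quickly on overflows, high levels, or a rapidly growing level (a crude proxy for a starting feedback loop) and raises it slowly when the signal is too quiet; all constants are illustrative assumptions, and the call that writes the gain back to the sound card mixer is only hinted at.

/** Sketch of the runtime mixer control acting like an expander-compressor-limiter.
 *  All thresholds and step sizes are illustrative; the mixer call is hypothetical. */
public class MixerControl {

    private double gain = 0.5;              // current input gain in [0, 1]
    private double previousLevelDb = -60.0;

    /** Called for every short block of recorded audio. */
    void onBlock(double levelDb, int overflows) {
        boolean rapidGrowth = levelDb > previousLevelDb + 6.0; // possible feedback loop
        if (overflows > 0 || levelDb > -3.0 || rapidGrowth) {
            gain = Math.max(0.0, gain - 0.05);   // limit: back off quickly
        } else if (levelDb < -30.0) {
            gain = Math.min(1.0, gain + 0.01);   // expand: raise the gain slowly
        }
        previousLevelDb = levelDb;
        // setMixerInputGain(gain);  // hypothetical call to the sound card mixer API
    }
}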


Figure 7.6: Another example of the mixer control in action. When the lecturer turns away from the microphone, the audio gain goes down, and as the instructor’s mouth approaches the microphone the gain rises again (darker signal). Using the mixer control, the overall gain is higher and the differences in microphone distance are leveled out more effectively (lighter signal).

7.4.3 Filtering

According to the measurements, the audio wizard creates a filter chain using a SOPA graph description (see Section 4.7.2) that is used during the recording to eliminate common problems. Of course, the chain can also be edited manually. Mostly, however, the following chain is used. First, the signal’s DC offset is removed. Then, the sound card’s background noise level is used as the threshold for a noise gate and the equipment noise as a noise fingerprint. This is an important step since the mixer control tends to raise the silence noise level. The fingerprint’s phase is aligned with the recorded signal and subtracted in frequency space. This removes any humming caused by electrical interference. Because the frequency and shape of the humming might change during a lecture, multiple noise fingerprints can be specified. A typical situation that changes the humming is when the light is turned on or off. The best match is spectrally subtracted as described in [Boll, 1979]. See Figure 7.7 for an example. It is not always possible to pre-record humming, but if it is, this method is superior to using electrical filters. Electrical filters have to be fine-tuned for a specific frequency range and often remove more than wanted. Any equalizer settings specified by the user are applied before the normalized signal is processed by the codec.
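
The central operation, spectral subtraction of the best-matching fingerprint, can be illustrated on magnitude spectra. The sketch below assumes that the transforms of the current frame and of the fingerprint have already been computed; it simply subtracts the fingerprint magnitudes per frequency bin and clamps at zero, a simplification of the method in [Boll, 1979] rather than the actual filter node.

/** Sketch of magnitude spectral subtraction used for hum removal.
 *  Both arrays hold magnitude spectra of the same length. */
public class SpectralSubtraction {

    static double[] subtract(double[] signalMagnitudes, double[] noiseFingerprint) {
        double[] cleaned = new double[signalMagnitudes.length];
        for (int bin = 0; bin < signalMagnitudes.length; bin++) {
            // Remove the estimated noise energy per bin; clamp to avoid negative
            // magnitudes. The original phase is kept for the inverse transform
            // (not shown here).
            cleaned[bin] = Math.max(0.0, signalMagnitudes[bin] - noiseFingerprint[bin]);
        }
        return cleaned;
    }
}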

7.4.4 Final Processing

After the recording has finished, the recorded samples are counted and compared to the time stamps of the board event file (if present). The mismatch due to any timing difference between the sound card and the real-time clock of the computer is calculated and logged. During later replay, the timing of the audio playback can then be adapted. The Java audio replay client, for example, adjusts the MASI timing information according to a parameter defined in the invoking web page.
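
The drift computation itself is a simple comparison of the audio duration implied by the sample count with the duration of the board event timeline; a sketch with illustrative method names follows.

/** Sketch of the clock-drift calculation performed after the recording. */
public class TimingDrift {

    /** Returns the factor by which audio playback timing should be scaled
     *  so that it matches the board event timeline. */
    static double driftFactor(long recordedSamples, float sampleRate,
                              double boardDurationSeconds) {
        double audioDurationSeconds = recordedSamples / (double) sampleRate;
        return boardDurationSeconds / audioDurationSeconds;
    }
}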


Figure 7.7: Three seconds of a speech signal with a 100 Hz sine-like humming before (dark) and after spectral subtraction (light). Humming is a frequent audio distortion in lecture halls.


7.5 Practical Experiences

As explained above, Active Recording tries to eliminate common recording distortions using well-known filtering methods. The filters themselves are already established as standards and thus do not need to be evaluated. The performance of the entire system is difficult to evaluate empirically because if E-Chalk users are aware of audio recording problems and decent audio equipment is used, the filters are actually not needed. All in all, filtering results in a more efficient compression. Because noise and clipping are reduced, entropy also decreases and the codec is able to achieve better results. By compressing several experimental recordings before and after the filter chain using the default WWR3 format (see Appendix D), we have measured that the bandwidth reduction due to Active Recording is about 10 %.

The system has been integrated into E-Chalk since 2004. Both instructors and technicians who were regularly using E-Chalk considered a setup time of several minutes complicated at first. This opinion, however, usually changed when a recording was saved because a small usage error had been prevented, for example picking up the wrong microphone. Common recording distortions were eliminated and the listeners of the courses reported a more pleasant audio experience. The assessment of the sound card and equipment is meant to be rather strict, and people often complained that their sound cards were assessed “too badly”. Having to go through a wizard before the first recording raises awareness of potential recording issues. Using download logs and a registration procedure in the installer, we estimate about 800 E-Chalk installations at the time of writing this text. Before the introduction of the Active Recording system, we regularly had to answer support emails concerning audio quality. With the system being part of E-Chalk, not a single question concerning audio-recording quality issues has been emailed to us.



7.6 Limits of the Approach

Of course, the software system does not replace a generic recording studio, nor does it make audio technicians jobless. In the special case of on-the-fly lecture recording, however, it raises awareness of audio recording issues and filters out several standard distortions. Sound quality enhancement provided by software is only one element of the signal chain. A heavily distorted audio signal can only seldom be restored by software. More problems can be solved if the sound equipment of the entire signal chain is known in advance. E-Chalk’s Active Audio Recording works mainly by reacting to the results of a steady comparison of the sound intensity of the incoming signal with the previously recorded fingerprint of the equipment and the speaker. In order to get closer to what an audio technician would do when monitoring the incoming audio signal to prevent quality deviations or malfunctions, it is necessary to interpret the sound information at a higher level than by basic operations on a set of samples in amplitude and frequency space. Speech recognition methods, for example, could provide the necessary operations to come closer to how a human being listens to the incoming signal.

7.7 Conclusion

None of the software systems presented in Chapter 2 takes into account any classroom-specific recording problems; they simply use commercial Internet audio-broadcasting systems. This does not yield satisfying results. Because audio encoding strategies have become better and transmission bandwidth has increased during the last years, quality bottlenecks are now primarily caused by human factors. However, freeing users from performing technical setups through automation, as recently observable in digital photography, is still a challenge for audio recording. The Active Audio Recording component of E-Chalk is a first small step towards making the creation of high-quality recordings easily possible for the layperson.


Chapter 8

Video Storage and Transmission

This chapter introduces the architecture and realization of E-Chalk’s video subsystem, which has been part of the project almost since its inception. It discusses the general purpose of video transmission for electronic chalkboard lectures before a more enhanced method is presented in the subsequent chapter.

8.1 Preliminary Considerations

When instructors do not want to adapt to a new technology or educational institutions are not able to invest in electronic chalkboards, lectures held with a blackboard can still be captured by recording a video of the lecturer acting in front of the board. As discussed in Chapter 2, several educators use standard Internet video-broadcasting systems for the transmission of all kinds of lectures. The primary advantage of recording a lecture with a camera is that the approach is rather straightforward. Well-known techniques can be used for recording, and off-the-shelf Internet broadcasting software can be used for digitizing, encoding, transmission, and playback. The lecturer’s work flow is not disturbed; he or she does not have to get used to a new teaching medium. Even though some projects have tried to automate the process [Gleicher and Masanz, 2000, Rui et al., 2001], recording a lecture the “conservative way” requires additional manpower for operating the camera and audio devices. Yet the video compression techniques used by traditional video codecs are not suitable for chalkboard lectures, for the same reasons as discussed in Section 5.5: Video codecs mostly assume that higher-frequency features of images are less relevant, which produces either an unreadable blurring of the board handwriting or a bad compression rate. In addition to artifacts, non-electronic chalkboard drawings are sometimes also difficult to read because of low contrast. Figure 8.1 shows an example of a traditional chalkboard lecture compressed with a typical video codec.1

Using an electronic chalkboard (see Chapter 3) is a better alternative: It captures strokes and allows saving them in a vector-based format.

1For further reading: A very different approach that provides a partial workaround for the problem presented here can be found in [Wallick et al., 2005].


Figure 8.1: Two chalkboard lectures captured and played back with commercial Internet broadcasting systems. Due to lossy compression and low contrast, the chalkboard content is difficult to read. The lectures were given and recorded at Freie Universitat Berlin.

Vector-based information requires less bandwidth, can be transmitted without loss of semantics, and is easily rendered as a crisp image on a remote computer (compare Chapter 5). Still, a disadvantage that was reported to us by many students is that during distance replay the objects on the board appear out of nowhere. The lecture appears impersonal because there is no one acting in front of the board. The replay lacks important information because the so-called “chalk and talk” lecture actually consists of more than the content of the board and the voice of the instructor. Often, the facial expression of the lecturer adds to the verbal communication, or the instructor uses gestures to point to certain facts drawn on the board. Sometimes it is also interesting to get an impression of the classroom or lecture hall. Psychology suggests (see for example [Krauss et al., 1995]) that gestures and facial expressions are part of a person’s semantic encoding of ideas. The understanding of words partly depends on gestures, as they are also used to interpret and disambiguate the spoken word [Riseborough, 1981]. All these shortcomings are aggravated when the creation of board content is temporarily abandoned for purely spoken phases or even non-verbal communication. In order to transport this additional information to a remote computer, the E-Chalk system provides an additional video server. As shown in Figures 5.3 and 8.2, the video pops up as a small additional window during lecture replay. The importance of the additional video is also supported by the fact that several other lecture-recording systems (compare Chapter 2) have implemented this feature, and the use of an additional instructor or classroom video is also widely discussed in empirical studies. Not only does an additional video provide non-verbal information as to the confidence of the speaker at certain critical points, like irony [Dufour et al., 2005]. Several experimental studies (for an overview refer to [Kelly and Goldsmith, 2004]) have also provided evidence that showing the lecturer’s gestures has a positive effect on learning. For example, [Fey, 2002] has reported that students are better motivated when watching lecture recordings with slides and video in contrast to watching a replay that only contains slides and audio. [Glowalla, 2004] also shows in a comparative study that students usually prefer lecture recordings with video images over those without.


Figure 8.2: Two examples of the use of an additional video client to convey an impression of the classroom context to the remote viewer.

Yet this transmission of non-verbal information requires several additional resources. A camera is needed for capturing, handling the video stream consumes CPU time on both ends, and the additional video data requires a huge amount of additional storage capacity and bandwidth for transmission. The E-Chalk video system is therefore an optional component. Both the classroom side and the student side can choose to turn it off. The video system compresses the video data down to a manageable size and deals economically with memory and CPU resources on both sides.

8.2 Overview

In the beginning of the E-Chalk project, the development of the video subsystem was guided by the same idea as the World Wide Radio 2 audio system (see Chapter 6). Initially, the video system was called World Wide Video (WWV) and aimed at building “a fully featured Internet video streaming system that runs on any hardware or platform” [Friedland et al., 2002]. At the beginning of its development, the video subsystem did not only inherit the idea of WWR2, it also inherited its problems. The processing of video data is usually more expensive since there is more information that has to be handled. For this reason, the system was built as an asymmetric system based on the assumption that the server side has rather unlimited resources while the client provides only low computational performance. Since E-Chalk has mainly been built to support one-to-many communication, this paradigm still governs the architectural approach of the system.

The E-Chalk video system is mainly divided into three parts: a system configuration and hardware detection part, the actual server, and the receiving client. The system configuration and hardware detection directly interacts with the E-Chalk Startup Wizard. The Java-based replay client is described in Section 5.2.3. The video server is similar to the audio server described in Chapter 6 and consists of a set of SOPA nodes that allow grabbing, processing, encoding, transmitting, and converting video content. In the default configuration without instructor segmentation, only four nodes are used: a video capturing node that delegates to JMF, an encoder node that compresses the incoming frames, a node that writes the compressed data to a file for archived replay, and optionally a server node that streams the encoded data live to any connecting client.


Figure 8.3: The video configuration panel of the E-Chalk Startup Wizard.


8.3 Configuring the Video Server

As described in Chapter 4, the Java Media Framework provides a platform-independent means to access video hardware. E-Chalk uses the Java Media Framework for capturing and auto-detecting video devices. Auto-detection is encapsulated in a generic media node. It detects a wide range of hardware devices on different platforms. In addition to grabbing from hardware plugged into the computer, the E-Chalk video system also provides two virtual capturing devices for testing purposes: a screen grabber records consecutive screenshots and a test pattern device creates a video from a user-defined set of test images. Figure 8.3 shows the GUI panel of the video configuration. The user can choose an input grabber, the frame rate, and the video resolution. The settings defined in the E-Chalk Startup Wizard are written to property files that are read by the SOPA framework.

8.4 Video Encoding

The video codec in E-Chalk is a very simple, lossy motion compensation codec. The design goals of the codec prioritize simplicity and computational efficiency rather than trying to achieve a maximum compression ratio. Appendix F shows a detailed syntax description of the format.

The encoder creates three types of frames, called I-Frames, T-Frames, and 0-Frames. I-Frames are JPEG images2. I-Frames can be used to improve the quality of the video replay, for example at the beginning of a new scene, at the cost of bandwidth. The first frame recorded or sent when a client connects is an I-Frame.

2Strictly speaking, the term “JPEG format” is incorrect. The right term is JFIF (JPEG file interchange format); the specification can be found in [ISO/IEC JTC1, 1994]. However, the term “JPEG format” is used more commonly.


Figure 8.4: Visualization of the encoding of two consecutive frames from a TV broadcast (source: ARD, “Tagesschau”, November 10th, 1989): Blocks that have not changed significantly from the left frame to the right frame are colored black.


T-Frames (transparency frames) are generated by a simple motion compensation mechanism that exploits the redundancy of static parts of a scene over several frames. T-Frames consist only of those blocks in the picture that have changed significantly or have aged. All other blocks are tagged as transparent and are encoded as zeros. The player substitutes uncoded blocks with older ones from previous frames. A block, as in JPEG, is an 8 × 8-pixel matrix. The difference is calculated in the YUV color space with the components weighted Y:U:V as 4:1:1.3 The pixelwise Euclidean distances between the current frame and the previous frame are summed up for each block. If the sum of the pixel changes in a block is n times bigger than the average sum over all blocks, the block is defined to have changed significantly. Figure 8.4 demonstrates the approach. If a block has not changed significantly for a period of t seconds, it is defined to have aged. Because of this ageing strategy, the video is self-repairing and I-Frames are not required. The variables n and t are set to n = 4 and t = 2 by default, but may be adjusted by the user in order to control the degree of compression.
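
The block-significance test can be written down compactly. The sketch below sums the weighted YUV pixel distances per 8 × 8 block and marks a block as changed when its sum exceeds n times the average over all blocks; the exact way the 4:1:1 weighting enters the distance and the plane-based data layout are assumptions made for illustration, and the ageing rule is omitted.

/** Sketch of the T-Frame block change test. Frames are given as three YUV
 *  planes of equal size; width and height are assumed to be multiples of 8. */
public class BlockChangeTest {

    static boolean[] significantBlocks(float[][] cur, float[][] prev,
                                       int width, int height, double n) {
        int blocksX = width / 8;
        double[] blockDiff = new double[blocksX * (height / 8)];
        double total = 0.0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int i = y * width + x;
                // Weight Y:U:V as 4:1:1 (luma differences dominate).
                double dy = 4.0 * (cur[0][i] - prev[0][i]);
                double du = cur[1][i] - prev[1][i];
                double dv = cur[2][i] - prev[2][i];
                double d = Math.sqrt(dy * dy + du * du + dv * dv);
                blockDiff[(y / 8) * blocksX + (x / 8)] += d;
                total += d;
            }
        }
        double average = total / blockDiff.length;
        boolean[] changed = new boolean[blockDiff.length];
        for (int b = 0; b < blockDiff.length; b++) {
            changed[b] = blockDiff[b] > n * average;   // n = 4 by default
        }
        return changed;
    }
}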

For random seek, the video is played back beginning from the specified T-Frame. This individual T-Frame does not contain all image blocks, which results in several parts of the image containing black 8 × 8-pixel holes. However, after two seconds of playback, all blocks must have been filled because the ageing rule forces an update of each block after two seconds. Encoding a 640 × 480-pixel video is easily possible in real time. On a 3-GHz Pentium 4, the algorithm encodes such a video with more than 50 frames per second. In practice, much lower frame rates are used. This leaves enough CPU resources for the other tasks running during lecture recording and transmission.

0-Frames contain no image data at all. They are used as a placeholder to skip one frame and are transmitted when not a single block is marked opaque. 0-Frames are encoded using a single byte and thus need less bandwidth than a T-Frame with all blocks marked transparent.


3The reason for this is that the human eye is more sensitive to contrast changes than to color changes. This is a standard heuristic used in several image and video encoding standards.



The incoming images are encoded consecutively. A sequence of 20 frames forms a packet. Each packet is then compressed using the GZIP format [P. Deutsch, 1996]. The compression ratio obtained is roughly 40:1. In the E-Chalk system, mainly a quarter picture of NTSC (that is, 192 × 144 pixels) at four frames per second is used to obtain a bandwidth of about 64 kbit/s. Experience shows that this frame rate is acceptable when the video is transmitted in a separate window and only used as an additional source of information. The actual bandwidth required for a certain transmission, however, ultimately depends on the video content.
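
Since the server is Java-based, a packet can be compressed with the GZIP support in the standard class library; the following minimal sketch treats the already encoded frame bytes as opaque data and is an illustration rather than the codec’s actual writer.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Sketch: compress a packet of 20 encoded frames with GZIP. */
public class PacketCompressor {

    static byte[] compressPacket(List<byte[]> encodedFrames) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            for (byte[] frame : encodedFrames) {
                gzip.write(frame);   // frames are already I-, T-, or 0-encoded
            }
        }
        return buffer.toByteArray();
    }
}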

This encoding strategy is used for both the regular video encoding and the overlaid instructor video (compare Chapter 5). In the latter case, the client interprets the color black as transparent. The next chapter will explain the idea behind the instructor extraction approach.


Chapter 9

Merging Video and Blackboard

This chapter presents an idea on how to improve the remote presentation of electronic chalkboard lectures by better utilizing the potential that computers have for multimedia processing. It discusses reasons and initial considerations that led to an enhanced approach for transmitting the non-verbal communication of the instructor in relation to the electronic chalkboard lecture, basic technical considerations, and an algorithm that implements the presented idea. Further details on the ideas and algorithms presented here can be found in [Jantz et al., 2004], [Friedland, 2004], [Friedland et al., 2005a], [Friedland et al., 2005d], and [Friedland and Rojas, 2006].

9.1 Split Attention

As discussed in Chapter 8, transmitting an additional video of the classroom context is desirable when the required connection bandwidth and storage needs can be met. For this reason, E-Chalk as well as several other lecture recording systems (see discussion in Chapter 2) transmit a supplementary video. Especially a video of the instructor conveys non-verbal information that several empirical studies have shown to be of value for the student. There are, however, several reasons against showing a video of the lecturer next to the slides or the chalkboard visualization. The video shows the instructor together with the board content; in other words, the transmitted board content is actually redundant. On low-resolution devices, the main concern is that the instructor video takes up a significant amount of space. The bigger the video, the better non-verbal information can be transmitted. Ultimately, the video must have the size of the board to convey every bit of information. There are also layout constraints: as the board resolution increases because electronic chalkboards become better, it gets ever more impractical to transmit the video side by side with the chalkboard content. Some window managers even arrange the board and video window in a way that the board occludes the video window. In the end, the video transmission takes up resources without the user even noticing that there is a video transmission. Even though there still might be solutions for these layout issues, a more heavily discussed topic is the issue of split attention.


Attention is still a topic of research, and it is still an open question whether it can be split in order to attend to two or more different sources of information simultaneously (see for example [Hahn and Kramer, 1998] or [Narcisse P. Bichot and Kyle R. Cave and Harold Pashler, 1999]). The topic has been discussed by psychologists and neuroscientists for decades. Most researchers now accept that attention can be split, but this usually causes cognitive overhead. One of the most important publications on the limits of human mental performance that has strong implications for the field of man-machine interface design is still [Baars, 1988]. His Global Workspace Theory states that the brain has only a single locus of attention. Human beings only become conscious of information if it is selected by a central executive part of the brain. Several practical experiments that are related to the work presented here have been described in [Sweller et al., 1990] and [Chandler and Sweller, 1992].

In a typical E-Chalk lecture with instructor video (as shown in Figures 5.3 and 8.2) there are two areas of the screen competing for the viewer’s attention: the video window showing the instructor and the board or slides window. [Glowalla, 2004] tracked the eye movements of students while they watched a lecture recording that contains slides and an instructor video. His measurements show that a student spends about 70 percent of the time watching the instructor video and only about 20 percent of the time watching the slides. The remaining 10 percent of the eye focus was lost to activities unrelated to lecture content. When the lecture replay only consists of slides and audio, students spend about 60 percent of the time looking at the slides. Of course, there is no other spot to focus attention on in the lecture recording. The remaining 40 percent, however, were lost in distraction. The results may not be directly transferable to electronic chalkboard-based lecture replays because the slides consist of static images whereas the chalkboard window shows a dynamic replay [Mertens et al., 2006]. Motion is known to attract human attention more than static data (see for example [Kellman, 1995]); it is therefore likely that the eyes of the viewer will focus more often on the chalkboard, even when a video is presented. Nevertheless, the example shows that on a typical computer screen two areas may well be competing for attention. Furthermore, it makes sense to assume that alternating between different visual attractors causes cognitive overhead. The issue has already been discussed in [Cooper, 1990]. He provides evidence that “students presented a split source of information will need to expend a portion of their cognitive resources mentally integrating the different sources of information. This reduces the cognitive resources available for learning.”

Given what has been said in Chapter 5, in the introduction of the previous chapter (Section 8.1), and in this section, the following statements seem to hold:

• Replaying a traditional video of the (electronic) chalkboard lecture instead of using a vector-based representation is bandwidth inefficient, visually disadvantageous, and results in a loss of semantics.

• If bandwidth is not a bottleneck, showing a video of the instructor conveys valuable non-verbal content that has a positive effect on the learner.

• Replaying such a video in a separate window next to the chalkboard content is suboptimal because of layout constraints and cognitive issues.


Figure 9.1: The remote viewer gets the segmented lecturer overlaid semi-transparently onto the dynamic board strokes stored as vector graphics. Upper row: original video; second row: segmented lecturer; third row: board data as vector graphics. In the final fourth row, the lecturer is pasted semi-transparently on the chalkboard and played back as MPEG-4 video. The segmentation algorithm used for this picture is presented in Chapter 9.

In the following sections, an enhanced approach for transmitting the non-verbal communication of the instructor in relation to the electronic chalkboard lecture is presented. The instructor is filmed as he or she acts in front of the board using a standard video camera and is then separated by a novel video segmentation approach that is discussed in the following chapters. The image of the instructor can then be overlaid on the board, creating the impression that the lecturer is working directly on the screen of the remote student. Figure 9.1 shows the approach. Facial expressions and gestures of the instructor then appear in direct correspondence to the board events. The superimposed lecturer helps the student to better associate the lecturer’s gestures with the board content. Pasting the instructor on the board also reduces bandwidth and resolution requirements. Moreover, the image of the lecturer can be transparent. This enables the student to look through the lecturer. In the digital world, the instructor does not occlude any board content, even if he or she is standing right in front of it. In other words, the digitization of the lecture scenario solves another “layout” problem that occurs in the real world (where it is actually impossible to solve).


9.2 Related Approaches

9.2.1 Transmission of Gestures and Facial Expressions

The importance of transmitting gestures and facial expressions is not specific to remote chalkboard lecturing. In a computer-supported collaborative-work scenario, people first work together on a drawing and then want to discuss it by pointing to specific details of the sketch. For this reason, several projects have begun to develop means to present gestures in their corresponding context. Two early projects of this kind were called VideoDraw [Tang and Minneman, 1990] and VideoWhiteboard [Tang and Minneman, 1991]. On both ends of the transmission a person can draw atop a monitor using whiteboard pens. The drawings together with the arms of the drawer were captured using an analog camera, so that each side sees the picture of the remote monitor overlaid on their own drawings. Polarizing filters were used to avoid video feedback. VideoWhiteboard uses the same idea, but people are able to work on a large upright frosted-glass screen and a projector is used to display the remote view. Both projects are based on analog technology without any involvement of a computer. Modern approaches include a solution presented in [Roussel, 2001] that uses chroma keying for segmenting the hands of the acting person and then overlaying them on a shared drawing workspace. In order to use chroma keying, people have to gesture in front of a solid blue surface and not in front of their drawing. This has been reported to produce confusion in several situations. LIDS [Apperley et al., 2003] captures the image of a person working in front of a shared display with digital cameras. The image is then transformed via background subtraction into a frame containing the whiteboard strokes and a digital shadow of the person. The VideoArms project by [Tang et al., 2006] works with touch-sensitive surfaces and a web camera. After a short calibration, the software extracts skin colors and overlays the extracted pixels semi-transparently over the image of the display. This combined picture is then transmitted live to remote locations. The system allows multi-party communication. [Tang et al., 2004] present an evaluation of the VideoArms project. They argue that the key problem is still a technical one: “VideoArms’ images were not clear and crisp enough for participants. [...] The colour segmentation technique used was not perfect, producing on-screen artifacts or holes and sometimes confusing users”. In summary, the presented approaches either tried to work around object extraction, or the technical requirements for the segmentation made the systems suboptimal. It is therefore important that the lecturer extraction approach is both easily usable in the classroom and/or after a session, and that its technical requirements do not disturb the classroom lecture.

9.2.2 Segmentation

Image and video segmentation is an object of current research. Although human beings are easily able to distinguish different objects, for example on a photograph, there is no generic solution that would make computers perform this task. The main obstacle here is that human vision is not well understood yet. Human visual perception takes into account not only patterns of illumination on the retina but also other senses and past experiences. Human beings are able to use context information and to fill in missing information by associating parts of objects with already learned ones (for a detailed discussion of human vision see for example [Graham, 2001]).



All approaches that perform vision tasks on the computer – including those presented in this dissertation – are based on heuristics or on special assumptions belonging to a certain problem domain. Since many tasks that are automated with the aid of computers involve image or video processing and computer vision problems, researchers have found many specialized tricks to achieve their goals. Current books that provide overviews of the topic include [Forsyth and Ponce, 2003] and [Bovik, 2005].

The standard technologies for overlaying foreground objects onto a given background are chroma keying (see for example [Gibbs et al., 1998]) and background subtraction (see for example [Gonzalez and Woods, 2002]). For chroma keying, an actor is filmed in front of a blue or green screen. The image is then processed by analog devices or a computer so that all blue or green pixels are set to transparent. Background subtraction works similarly: A static scene is filmed without actors once for calibration. Then the actors play normally in front of the static scene. The filmed images are then subtracted pixel by pixel from the initially calibrated scene. In the output image, regions with pixel differences near zero are defined as transparent. In order to suppress noise, illumination changes, reflections of shadows, and other unwanted artifacts, several techniques have been proposed that extend the basic background subtraction approach. Mainly, abstractions are used that substitute the pixelwise subtraction with a classifier (see for example [Li and Leung, 2002]). Although non-parametric approaches exist, such as [Elgammal et al., 1999], per-pixel Gaussian Mixture Models (GMM) are the standard tools for modeling a relatively static background, see for example [Friedmann and Russel, 1997]. These techniques are not applicable to the given lecturer segmentation problem because the background of the scene is neither monochromatic nor fixed. During a lecture, the instructor works on the electronic chalkboard and thus causes a steady change of the “background”.
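
For reference, the basic pixelwise subtraction that these extensions build on can be written in a few lines; the color-distance threshold below is an illustrative assumption.

/** Sketch of basic pixelwise background subtraction on RGB frames stored as
 *  packed int pixels (0xRRGGBB). The threshold is an illustrative value. */
public class BackgroundSubtraction {

    static boolean[] foregroundMask(int[] frame, int[] background, double threshold) {
        boolean[] foreground = new boolean[frame.length];
        for (int i = 0; i < frame.length; i++) {
            int dr = ((frame[i] >> 16) & 0xFF) - ((background[i] >> 16) & 0xFF);
            int dg = ((frame[i] >> 8) & 0xFF) - ((background[i] >> 8) & 0xFF);
            int db = (frame[i] & 0xFF) - (background[i] & 0xFF);
            double distance = Math.sqrt(dr * dr + dg * dg + db * db);
            // Pixels close to the calibrated background are treated as transparent;
            // everything else is foreground.
            foreground[i] = distance > threshold;
        }
        return foreground;
    }
}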

Much work has been done on tracking (i. e., localization) of objects for computer vision, for example in robotic soccer [Simon et al., 2001], surveillance tasks [Haritaoglu et al., 2000], or traffic applications [Beymer et al., 1997]. Most of these approaches concentrate on special features of the foreground, and in these domains real-time performance is more relevant than segmentation accuracy as long as the important features can be extracted from each video frame. Separating the foreground from more or less dynamic background is an object of current research. Many systems use complex statistical methods that require intensive calculations not possible in real time (for example [Li et al., 2003]) or use domain-specific assumptions (a typical example is [Jiang et al., 2004]). Numerous computationally intensive segmentation algorithms have also been developed in the MPEG-4 research community, for example [Chien et al., 2001]. For the task investigated here, the segmentation should be as accurate as possible, and a real-time solution is needed for live transmission of lectures. [Wang and Adelson, 1994] present a video segmentation approach that uses the optical flow to discriminate between layers of moving pixels on the basis of their direction of movement. In order to be able to track an object, the algorithm has to classify it as one layer. However, a set of pixels is grouped into a layer only if they perform the same correlating movement. This makes it a useful approach for motion-based video compression but it is not perfectly suited for object extraction.


Figure 9.2: Using a stereo camera for segmentation. Top: Left and right views of the camera. Bottom left: Depth-range image, darker means farther away (white means unknown). Bottom right: Segmentation result by thresholding a certain depth, i. e., showing only pixels with depth coordinates in a certain interval. The experiment is described in the text.

[Wang et al., 2003] combine motion estimation and intensity-based segmentation through a Bayesian belief network into a spatio-temporal segmentation. The result is modeled in a Markov Random Field, which is iteratively optimized to maximize a conditional probability function. The approach relies purely on intensity and movement and is therefore capable of segmenting grayscale material. Since the approach also groups the objects by the similarity of their movement, the same limitations as in [Wang and Adelson, 1994] apply. No details on the real-time capability were given.

Research is also investigating segmentation approaches using specialized hardware. The Thermo-Key project [Yasuda et al., 2004] investigated the segmentation of persons from the background using thermal cameras. The system uses the fact that the temperature of the human body is usually higher than the temperature of the surroundings, well known, and rather constant. The presented system achieves a good illumination- and texture-independent segmentation in real time. However, thermal cameras are still very expensive; the camera used in the Thermo-Key project cost about $ 40,000. Because the lecturer stands in front of a plane (the board), segmentation is also possible by using a 3D model of the scene. Everything closer to the camera than the board surface is considered to be the instructor. Several technologies for 3D scene analysis currently exist. 3D laser scanners, for instance, are being used with increasing frequency, for example for the conservation of historical heritage [Ogleby, 2001], special effects [18], and autonomous robots [Nuchter et al., 2003]. Usually triangulation is used to reconstruct depth information. However, the process of reconstructing the 3D model of a scene or object is computationally expensive [Bernadini et al., 2001] and far from being computable in real time.


The use of stereo cameras for the reconstruction of depth information has been thoroughly investigated (see for example [Bradski and Boult, 2001]). The different perspectives of the two human eyes lead to slight relative displacements of objects (disparities) in the two monocular views of a scene (in contrast to several animals that have two non-overlapping views, for example horses). The human visual system is not only able to merge both monocular views into a fused view of the scene, it also uses the disparities for depth estimation. However, the correct and fast estimation of disparities is a difficult problem for computers. It is a calculation-intensive task and real-time processing requires additional hardware [Zitnick and Kanade, 2000]. Moreover, because it involves texture matching, it is affected by the same problems as texture classification methods. For example, similar or homogeneous areas are very difficult to distinguish.

Figure 9.2 shows the result of a quick segmentation test using the stereo camera “STH-MDCS2-C” by Videre Design, Inc. [65]1. The frame rate of the camera was acceptable for low resolutions (25 frames per second at 320 × 240 pixels). However, at higher resolutions such as 640 × 480, frame rates dropped below three frames per second on a 3-GHz Pentium 4 processor, even with a highly optimized version of the provided disparity estimation software with MMX support. The camera needs to be calibrated extensively before first use and the results are not suitable for the given problem. As can be seen in the picture, disparity estimation fails in similar or homogeneous regions.

Time-of-flight 3D cameras avoid the practical issues resulting from 3D imaging techniques based on triangulation or interferometry. They are currently becoming available on the market (see for example [1, 11, 13, 39]) and allow capturing 3D data at usual video frame rates. As [Gordon et al., 1999] has already shown, range information can be used to get a better sample of the background faster. [Gokturk and Tomasi, 2004] investigate the use of 3D time-of-flight sensors for head tracking. They use the output of the 3D camera as input for various clustering techniques in order to obtain a robust head tracker. Since these initial results appear promising, Chapter 11 will investigate lecturer extraction using time-of-flight 3D cameras.

9.3 Setup

In E-Chalk, the principal scenario is that of an instructor using an electronic chalkboard at the front of the classroom. The camera records the instructor acting in front of the board so that just the screen showing the board content is recorded. With a zoom camera this is easily done from a non-disturbing distance (for example from the back of the classroom), and lens distortion is negligible. In the remaining chapters it will be assumed that the instructor operates an electronic chalkboard with rear projection, such as the interactive datawall described in Section 3.2. The reason for this is that when a person acts in front of the board and a front projector is used, the board content is also projected onto the person, which makes segmentation very difficult. Furthermore, given a segmentation, the projected board artifacts disturb the appearance of the lecturer. Once set up, the camera does not require operation by a camera person. The E-Chalk Startup Wizard takes care of starting and terminating the video recording.

1I thank Hans-Ulrich Kobialka for providing me with this camera.


Figure 9.3: A sketch of the setup for lecturer segmentation. An electronic chalkboard is used to capture the board content and a camera records the instructor acting in front of the board.

In order to facilitate segmentation, changes in lighting and (automatic) camera adjustments should be avoided as far as possible.

Figure 9.3 shows a sketch of the setup. The E-Chalk system is used to record chalkboard content, audio, and video. Additional SOPA nodes (see Chapter 4) take care of the segmentation. The replay of such a lecture has already been discussed in Chapter 5.

9.4 Initial Experiments

This section summarizes some of the initial experiments conducted for instructor segmentation. Even though they were never used productively for any lecture, they teach us a few facts about the nature of the given segmentation problem.

9.4.1 Simple Approaches

Very simple approaches like subtracting the board background color do not work. Figure 9.4 shows two sample camera views. Even though in both cases the background color of the board is black (RGB value (0, 0, 0)), the camera sees a quite different picture. Noise and reflections in particular make it impossible to threshold a certain color. Furthermore, while the instructor is working on the board, strokes and other objects appear in a different color than the board background color, so that several colors would have to be subtracted.

Another experiment consisted of matching the blackboard image on the screen with the picture seen by the camera and subtracting them. During lecture recording, an additional program regularly takes screenshots. The screenshots contain the board content as well as any window borders and dialogs shown on the screen. However, subtracting the screenshots from the camera view was impractical. In order to match the screen picture and the camera view, lens distortion and other geometric displacements have to be removed.


Figure 9.4: Two examples of frames captured by the video camera in the setup described in Figure 9.3. Segmentation problems include reflections of the classroom on the board, interlacing artifacts, and noise induced in the camera by the backlight of the rear projectors.

distortion and other geometric displacements have to be removed. This requires a calibration of the camera before each lecture. Taking screenshots with a resolution of 1024 × 768 pixels or higher is not possible at high frame rates. In my experiments, I was able to capture about one screenshot every second, and this took almost a hundred percent of the CPU time. Furthermore, it is almost impossible to synchronize screen grabbing with the camera pictures. In a regular lecture, many things may happen during a second. In addition, a matching between the colors in the camera view and the screenshots has to be found.

Instead of using screenshots, one could also require that the instructor enters the camera scene a few seconds after the start of the lecture. During these first few seconds the computer captures frames without the instructor, which can later be subtracted from the images containing the instructor. The approach works very well until the instructor writes something on the board. Then, not only do board strokes appear along with the instructor, the board strokes also cause a slight illumination change, so that the background subtraction fails for large regions of the image as more and more content is put onto the board. Figure 9.5 illustrates a worst-case example of changing illumination. Fortunately, this worst-case scenario is rare in practice and can be dramatically reduced when using a video camera that is able to adjust to changing lighting conditions.

Geometric Assumptions

In most cases, one can assume that a human being consists of two legs, a body, two arms, and a head. Theoretically, these geometric features could be exploited to construct a model of the instructor (see for example [Remondino and Roditakis, 2003]). However, relying on many geometric assumptions or trying to extract the instructor image using geometric features is impractical. These techniques may work well for tracking, i. e., it may be possible to localize a person or an object; boundary-accurate segmentation, however, is not solved by these approaches. The instructor is seldom seen in his or her entirety because, while giving a lecture in the real world, he or she will try to occlude the chalkboard as little as possible. In most parts of the recorded instructor video, only an arm or a part of the instructor is visible. It is possible that the lecturer disappears completely or that a second person comes up to the board.


Figure 9.5: Worst-case example of a change of lighting conditions during a lecture. Fortunately, such bad cases are very rare and can be reduced by using a proper configuration of the video camera.

Furthermore, any feature tracked might actually be part of the board content, for example a drawing or an inserted image. Finding borders or shapes is hard since video recordings contain motion blur, interlace effects, and noise. Even simple geometric assumptions like “the instructor always begins at the bottom of the image” (i. e., the lecturer cannot fly), “a person’s surface cannot grow or shrink more than a threshold from one frame to another”, or “a sudden disappearance of the teacher is impossible” cannot be used practically for improving the robustness of the segmentation approach. For example, when only an arm is visible because the rest of the instructor is out of the camera’s view, it is very unlikely that his or her image begins at the bottom of the frame.

9.4.2 Motion-Based Segmentation

Instead of thresholding colors or building a model of the background, an initial approach tried to extract the instructor using motion statistics. The approach uses an observation from E-Chalk’s video-encoding approach. The lecturer is captured acting in front of the board and encoded using the strategy described in Chapter 8. Most of the blocks marked as opaque in the T-Frames contain the lecturer’s image because, from one frame to another, changes in the board content only affect a few individual blocks. Notable exceptions are the insertion of images, the appearance and disappearance of dialog boxes, and the scrolling of the board content. However, the set of opaque blocks almost never renders the complete instructor, and briefly obscured parts of the background have to be updated using opaque blocks, too. So the idea of the algorithm is to collect the blocks showing the instructor over several frames by comparing the non-transparent blocks appearing in each frame with a block history table. The history table contains a set of blocks, each associated with a counter. In the beginning the table is empty and new blocks are inserted with each counter set to 1. During the following frames, each non-transparent block is compared to the existing blocks in the history table. If a block is sufficiently similar to an already existing block in the history table, the counter is increased; otherwise the block is also inserted into the history table. After several seconds of video, the blocks in the history table with the highest counter values are assumed to belong to the instructor image. Blocks that have a low usage count are thrown


Figure 9.6: An initial instructor segmentation experiment using motion statistics. The idea is discussed in [Jantz et al., 2004].

out of the block history table. After a few seconds, the instructor is segmented by displaying those blocks in each frame that are similar to those with the highest count in the block history table. Figure 9.6 shows some results of the approach. Since long-term motion statistics are used, the approach copes fairly well with rapid changes of the board content, such as scrolling. However, when the lecturer stands still for a while, he or she disappears. Another problem is the artifacts that result from working with 8 × 8-pixel blocks. The biggest downside, however, is that the similarity search for each block in each frame takes too long: the experimental approach needed about 10-15 seconds per frame. The overall result is not yet robust. The details can be found in [Jantz et al., 2004].
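
To make the block-history bookkeeping concrete, the following minimal Java sketch illustrates the idea. It is an illustrative reconstruction, not the original E-Chalk code: the class name, the block distance measure, and the similarity threshold are assumptions, and the aging of rarely used entries is omitted. The linear similarity search in update() also makes the performance problem mentioned above visible.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the block history statistics (hypothetical names).
// Blocks are 8 x 8 pixel tiles given as int[] arrays of packed RGB values.
class BlockHistory {
    static class Entry { int[] block; int count = 1; Entry(int[] b) { block = b; } }

    private final List<Entry> entries = new ArrayList<Entry>();
    private final double threshold; // maximum distance for "sufficiently similar"

    BlockHistory(double threshold) { this.threshold = threshold; }

    // Update the history with one non-transparent block of the current frame.
    void update(int[] block) {
        for (Entry e : entries) {
            if (distance(e.block, block) < threshold) { e.count++; return; }
        }
        entries.add(new Entry(block));
    }

    // A block is displayed as part of the instructor if it resembles one of the
    // most frequently seen blocks.
    boolean belongsToInstructor(int[] block, int minCount) {
        for (Entry e : entries) {
            if (e.count >= minCount && distance(e.block, block) < threshold) return true;
        }
        return false;
    }

    // Mean Euclidean RGB distance between corresponding pixels of two blocks.
    private static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            int dr = ((a[i] >> 16) & 0xFF) - ((b[i] >> 16) & 0xFF);
            int dg = ((a[i] >> 8) & 0xFF) - ((b[i] >> 8) & 0xFF);
            int db = (a[i] & 0xFF) - (b[i] & 0xFF);
            sum += Math.sqrt(dr * dr + dg * dg + db * db);
        }
        return sum / a.length;
    }
}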

9.4.3 A Combined Approach

Extracting the instructor using motion statistics alone is rather difficult. For this reason, a combined approach was targeted with the following experiment. The idea was to create a coarse-grained cut of the foreground objects by exploiting the temporal differences between several frames and then to exploit color and color distribution information to improve the segmentation. The method is also described in [Friedland, 2004] and [Friedland et al., 2005a].

Temporal Foreground and Background Classification

The input for the classifier is a sequence of digitized YUV video frames. Each frame is subdivided into 8 × 8-pixel blocks. The classifier uses two main data structures:

• A foreground block buffer that is filled with any blocks that have a high chance of being part of the foreground, and


• a background buffer that contains those blocks classified as being definitely part of the background.

A block is moved into the foreground buffer if, during a sequence of n frames, the block has changed more than twice. The underlying assumption is that a background block usually changes twice: once when it is occluded by a foreground object and again when the view is freed. So a block that changes more than twice is an indication of a moving object (of course, background blocks close to the moving object are changed just as often as the foreground blocks replacing them). My experiments have shown that a good value for n is half the frame rate. A block is considered to have changed when it differs significantly from the block at the same position in the previous frame according to the Euclidean distance. The background buffer contains all blocks that have not changed during the sequence being processed and that were never classified as foreground during later operations. Both the foreground buffer and the background buffer are organized as ageing FIFO queues.
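
The change-counting rule can be sketched in a few lines of Java. This is an illustration under the stated assumptions (n is about half the frame rate, a change is detected via a distance threshold); the class and method names are hypothetical:

// Sketch of the temporal classifier: a block that changes more than twice
// within a sequence of n frames is a candidate for the foreground buffer.
class TemporalClassifier {
    private final int[] changeCount;       // one counter per block position
    private final double changeThreshold;  // distance above which a block "changed"

    TemporalClassifier(int numBlocks, double changeThreshold) {
        this.changeCount = new int[numBlocks];
        this.changeThreshold = changeThreshold;
    }

    // Called once per block and frame with the Euclidean distance to the block
    // at the same position in the previous frame.
    void observe(int blockIndex, double distanceToPreviousFrame) {
        if (distanceToPreviousFrame > changeThreshold) changeCount[blockIndex]++;
    }

    // Evaluated after n frames, then the counters are reset for the next sequence.
    boolean isForegroundCandidate(int blockIndex) { return changeCount[blockIndex] > 2; }
    boolean isBackgroundCandidate(int blockIndex) { return changeCount[blockIndex] == 0; }

    void reset() { java.util.Arrays.fill(changeCount, 0); }
}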

Color Distribution Classification

All frames are color quantized from YUV to a fixed 256-color palette (using 4 bits for Y and 2 bits each for U and V) and are divided into 8 × 8-pixel blocks. For each quantized block, a color histogram is calculated. The block histograms are now classified into foreground and background by comparing each of them with block histograms of the foreground and background buffer. Because the Euclidean distance does not work very well for histograms, the Earth Mover’s Distance (EMD) [Cohen, 1999] (and an approximation of it described in [Schindler, 2006]) was used instead.
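
The fixed quantization and the per-block histograms can be written down compactly. The sketch below assumes YUV pixels packed as 0x00YYUUVV with components in the range 0–255; the class and helper names are illustrative:

// Sketch of the 256-color quantization (4 bits Y, 2 bits U, 2 bits V) and the
// histogram of an 8 x 8 block.
class BlockHistogram {
    // Map a YUV pixel to one of the 256 palette indices.
    static int quantize(int y, int u, int v) {
        return ((y >> 4) << 4) | ((u >> 6) << 2) | (v >> 6);
    }

    // Histogram over the 64 pixels of one block.
    static int[] histogram(int[] yuvBlock) {
        int[] h = new int[256];
        for (int p : yuvBlock) {
            int y = (p >> 16) & 0xFF, u = (p >> 8) & 0xFF, v = p & 0xFF;
            h[quantize(y, u, v)]++;
        }
        return h;
    }
}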

Combining the Classifiers

The temporal classifier tends to find the borders of moving objects, while the color distribution classifier is better for surfaces. Given a frame and the results of the two classifications, any block considered foreground by at least one of the classifiers is considered foreground. The remaining blocks are a subset of the real background. For the foreground blocks, a connected component analysis is performed. The biggest blob is considered to be the instructor, and all other blocks (mostly noise or other moving objects) are put into the background buffer. Edge detection, using the Sobel operator [Gonzalez and Woods, 1992], helps to smooth the edges of the blob, which appear ragged because of the resolution reduction to 8 × 8-pixel blocks. Smaller holes are filled and the corresponding block pixels are taken out of the background list. The resulting segmented video is scaled to fit the board resolution and is pasted over the board content at the receiving end of the transmission or lecture replay. Figure 9.7 shows the result already overlaid onto the board.

Although the idea seemed promising, the realization was far from real-time performance. At present, only one frame per second can be processed this way. Skin colors are difficult to extract using the approach (and also in general, compare [Zhu et al., 2004]), and block-wise operation tends to produce artifacts that are sometimes hard to retouch [Schindler, 2006]. The implementation also did not yet handle the reinitialization needed when the lecturer moves out of the video frame. Still another problem is that if the instructor


Figure 9.7: Another initial instructor segmentation experiment. For segmentation, both temporal and color distribution statistics were used.

points at a rapidly changing object (for example, an animation on the board screen), the two corresponding blobs could become merged.

9.4.4 Conclusion

The facts learned from the initial experiments can be summarized as follows. Tracking the instructor is insufficient; a boundary-accurate segmentation is needed. Simple color thresholding or static background subtraction models do not work because of noise and illumination changes. While the background is constantly changing, the lecturer sometimes stands still. This makes a clear distinction between foreground and background difficult using motion statistics alone. Modeling the instructor using geometric assumptions is impractical because it requires modeling all possible instructor appearances that occur in reality (cases include that only parts of the instructor are visible or that several persons work on the board).

Obviously, combined approaches work better in terms of robustness. The initial experiments presented here operated on 8 × 8-pixel blocks. The techniques are closely connected to well-known video encoding techniques and to E-Chalk’s video codec. Blocks are a good mechanism for dealing with camera noise because they average away outlier colors. However, this resolution reduction produces artifacts, and the calculation of block similarity is computationally expensive, so another abstraction mechanism would be desirable.

Moreover, the study of the literature and the experiments raise several questions. For example, it is not clear how to measure color similarity, let alone pixel-block similarity. All the presented algorithms constitute heuristics that tend to include many “magic constants”. Consequently, the question arises of how to measure robustness in order to make sure that a given approach actually works with many videos and not only with a few tested samples. The next section presents a segmentation approach that provides answers to many of the questions that have been raised during the initial experiments.


9.5 Robust Real-Time Instructor Extraction

As discussed in the previous section, a robust segmentation between instructor and background is hard to find using motion statistics alone. However, getting a subset of the background by looking at a short series of frames is possible. Given a subset of the background, the problem reduces to classifying the rest of the pixels into either belonging to the background or not.

The core idea behind the approach presented here is based on the notion of a color signature. A color signature models an image or part of an image by its representative colors. This abstraction technique is frequently used in different variants in image retrieval applications, where color signatures are used to compare patterns representing images, see for example [Nascimento and Chitkara, 2002, Ooi et al., 1998]. A variation of the notion of a color signature is able to solve the lecturer extraction problem and is useful for a variety of other image and video segmentation tasks. Further details on the following algorithm are also available in [Friedland et al., 2005d, Friedland et al., 2005b]. The approach presented here is based on the following assumptions: the hardware is set up as described in Section 9.3, the colors of the instructor image are overall different from those in the rest of the image, and during the first few seconds after the start of the recording, there is only one instructor and he or she moves in front of the camera. The input is a sequence of digitized YUV or RGB video frames, either from a recorded video or directly from a camera. The following steps are performed:

1. Convert the pixels of each video frame to the CIELAB color space.

2. Gather samples of the background colors using motion statistics.

3. Find the representative colors of the background (i. e., build a color signature of the background).

4. Classify each pixel of a frame by measuring the distance to the color signature.

5. Apply some post-processing steps, e. g., noise reduction and biggest component search.

6. Suppress recently drawn board strokes.

The segmented instructor is then saved into the E-Chalk video format. As discussed in Chapter 5, the client scales the video up to board size and replays it semi-transparently.

9.5.1 Conversion to CIELAB

The first step of the algorithm is to convert each frame to the CIELAB color space [CIE, 1978]. Based on a huge amount of measurements (see [Wyszecki and Stiles, 1982]), this color space was explicitly designed as a perceptually uniform color space. It is based on the opponent-colors theory of color vision [Hering, 1872, Hurvich and Jameson, 1957]². The theory assumes that two colors cannot be both green and red or blue and yellow at the same time. As a result,

² Some literature even refers to Leonardo da Vinci as being the first to propose this theory [da Vinci, 1492].


Granularity    RMS Error
10             35.26
100            3.24
1000           0.32
10000          0.03

Table 9.1: CIELAB conversion approximation accuracy versus classification error. The details of the experiment are described in the text.

single values can be used to describe the red/green and the yellow/blue attributes. When a color is expressed in CIELAB, L defines lightness, a denotes the red/green value, and b the yellow/blue value. In the algorithm described here, the standard observer and the D65 reference white [CIE, 1971] are used as an approximation to all possible color and lighting conditions that might appear in an image. CIELAB’s perceptual color metric is still not optimal (see for example [Hill et al., 1997]) and the aforementioned assumption sometimes leads to problems. But in practice, the Euclidean distance between two colors in this space approximates a perceptually uniform measure for color differences better than in any other color space, like YUV, HSI, or RGB. Section 10.7 presents some experiments on this. Section 10.8 presents a short discussion of the limits and issues of using this color space for my purpose.

A major disadvantage of using CIELAB is the computational cost involved in the conversion from the usual color spaces (RGB or YUV). To reduce the computational cost, I experimented with two approaches: using a hash table to look up already converted colors and using a lookup table filled with pre-calculated values to approximate the cubic roots appearing in the conversion formula, which turned out to be the efficiency bottleneck. Table 9.1 shows the classification error for different approximation granularities. The granularity defines the number of lookup values within a domain interval of size 1 for the cubic root table. The table’s domain is [0, 100], i. e., the number of entries is 100 × granularity. The approximation error was measured as the root mean square error over one million random pixels. The measured speed-up gained for the conversion is about a factor of 10 to 30, depending on the machine³. For the purpose of error reduction versus table size, a non-linear distribution of interpolation points would yield better results, but then the lookup itself would be more complicated and thus slower.

Approximating CIELAB using the described approach, however, only makes sense when memory is a bottleneck. Building a hash table, or even an entire lookup table for all 16777216 colors, is not a problem on modern PC hardware.
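
As an illustration, a cube-root lookup table of the kind described above can be sketched as follows. The class and variable names are hypothetical, and the nearest-entry lookup is only one of several possible strategies; with a granularity of 1000, Table 9.1 reports an RMS classification error of 0.32:

// Sketch of the cube-root lookup table used to speed up the CIELAB conversion.
// The table covers the domain [0, 100]; the granularity is the number of
// entries per unit interval.
class CubeRootTable {
    private final double[] table;
    private final int granularity;

    CubeRootTable(int granularity) {
        this.granularity = granularity;
        this.table = new double[100 * granularity + 1];
        for (int i = 0; i < table.length; i++) {
            table[i] = Math.pow((double) i / granularity, 1.0 / 3.0);
        }
    }

    // Approximate x^(1/3) for x in [0, 100] by nearest-entry lookup.
    double cbrt(double x) {
        return table[(int) Math.round(x * granularity)];
    }
}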

9.5.2 Gathering Background Samples

As discussed in Section 9.4.1, it is hard to get a background image for direct subtraction. The instructor can paste images or even animations onto the board, and when the instructor scrolls a page of board content upwards, the entire screen is updated. However, the instructor may also stand still, sometimes

³ Tested on a few Windows and Linux PCs with Java Runtime Environment version 1.4 (using defaults).


Figure 9.8: Using motion statistics, a sample of the background is gathered. The images show the original video (left) and known background that was reconstructed over several frames (right). The white regions constitute the unknown region.

producing fewer pixel changes than the background noise. The idea is thus to extract only a representative subset of the background that does not contain any foreground for further processing. The following approach assumes a non-static instructor over an initial period of a few seconds.

To distinguish noise from real movements, I use the following simple but general model. Given two measurements m1 and m2 of the same object, with each measurement having a maximum deviation e from the real world due to noise or other factors, it is clear that the maximum possible deviation between m1 and m2 is 2e. Given several consecutive frames, e is estimated to find out which pixels changed due to noise and which pixels changed due to real movement. To achieve this, the color changes of each pixel (x, y) are recorded over a certain number of frames t(x, y), called the recording period. It is assumed that in this interval, the minimal change should be caused only by noise. The image data is continuously evaluated. The frame is divided into 16 equally-sized regions, and changes are accumulated in each region. Under the assumption that at least one of these regions was not touched by any foreground object (the instructor is unlikely to cover the entire camera region), 2e is estimated to be the maximum variation of the region with the minimal sum. I then join all pixels of the current frame with the background sample that during the recording period t(x, y) did not change more than the estimated 2e. The recording period t(x, y) is initialized within one second and is continuously increased for pixels that are seldom classified as background, to avoid adding a still-standing foreground object to the background buffer. In my experiments, it took a few seconds until enough pixels could be collected to form a representative subset of the


Figure 9.9: Original picture (above) and a corresponding color signature representing the entire image (below). For visualization purposes, the color signature was generated using very rough limits so that it contains only a few representative colors.

background. I call this time period the initialization phase. The background sample buffer is organized as an aging FIFO queue. Figure 9.8 shows typical background samples after the initialization phase.
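
The estimation of the noise bound 2e can be illustrated with the following sketch. It assumes that per-pixel change statistics over the recording period have already been accumulated; the array layout and all names are illustrative, not the actual E-Chalk code:

// Sketch of the noise estimate: the frame is divided into a 4 x 4 grid of regions,
// the region with the smallest accumulated change is assumed to be untouched by
// the instructor, and its maximum variation yields the bound 2e.
class NoiseEstimator {
    // delta[x][y]: accumulated change of pixel (x, y) over the recording period,
    // maxDelta[x][y]: its maximum single-frame change.
    static double estimateNoiseBound(double[][] delta, double[][] maxDelta) {
        int w = delta.length, h = delta[0].length;
        int rw = w / 4, rh = h / 4;
        double bestSum = Double.MAX_VALUE, twoE = 0;
        for (int rx = 0; rx < 4; rx++) {
            for (int ry = 0; ry < 4; ry++) {
                double sum = 0, max = 0;
                for (int x = rx * rw; x < (rx + 1) * rw; x++) {
                    for (int y = ry * rh; y < (ry + 1) * rh; y++) {
                        sum += delta[x][y];
                        max = Math.max(max, maxDelta[x][y]);
                    }
                }
                if (sum < bestSum) { bestSum = sum; twoE = max; }
            }
        }
        return twoE;
    }
}

Pixels whose change during their recording period stays below the returned bound are then joined with the background sample buffer.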

The background sample is fed into the clustering method described in the following section. Once built up, the clustering is only updated when more than a quarter of the underlying background sample has changed. However, constant updating is still needed in order to be able to react to changing lighting conditions. The algorithm needs about one to two seconds to recover from the example illustrated in Figure 9.5.

9.5.3 Building a Model of the Background

The idea behind color signatures is to provide a means of abstraction that sorts out individual outliers caused by noise and small errors. A color signature is a set of representative colors, not necessarily a subset of the input colors. While the set of background samples from Section 9.5.2 typically consists of a few hundred thousand colors, the following clustering reduces the background sample to its representative colors, usually a few hundred. The known background sample is clustered into equally-sized clusters because in CIELAB space specifying a cluster size means specifying a certain perceptual accuracy. To do this efficiently, I use the modified two-stage k-d tree [Bentley, 1975] algorithm described in [Rubner et al., 2000], where the splitting rule is to simply divide the given interval into two equally-sized subintervals (instead of splitting the sample set at its median). In the first phase, approximate clusters are found by building up the tree and stopping when an interval at a node has become smaller than the allowed cluster diameter. At this point, clusters may be split into several nodes. In the second stage of the algorithm, nodes that belong to several clusters are recombined. To do this, another k-d tree clustering is performed using just the


Figure 9.10: Two examples of color-segmented instructor videos. Original frames are shown on the left, segmented frames are shown on the right. The frame below shows an instructor scrolling the board, which requires an update of many background samples.

cluster centroids from the first phase. I use different cluster sizes for the L, a, and b axes. The values can be set by the user according to the perceived color diversity in each of the axes. The default is 0.64 for L, 1.28 for a, and 2.56 for the b axis. For further abstraction, clusters that contain less than 0.1 % of the pixels of the entire background sample are removed. Section 10.7 explains the determination of the constants.

The k-d tree is explicitly built and the interval boundaries are stored in the nodes. Given a certain pixel, all that has to be done is to traverse the tree to find out whether it belongs to one of the known background clusters or not. Figure 9.9 shows a sample color signature.
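
The membership test can be sketched as a simple descent through the explicitly stored tree. The node structure below is an illustration of the idea, not the actual E-Chalk data structure:

// Sketch of a k-d tree node storing the interval boundary; leaves are flagged
// as belonging (or not) to a known background cluster.
class SignatureNode {
    int dim;                  // 0 = L, 1 = a, 2 = b
    double split;             // boundary between the two child intervals
    SignatureNode low, high;  // children; null at a leaf
    boolean background;       // leaf flag

    // Classify a CIELAB pixel by descending to the leaf cell that contains it.
    boolean isBackground(float[] lab) {
        SignatureNode n = this;
        while (n.low != null) {
            n = (lab[n.dim] < n.split) ? n.low : n.high;
        }
        return n.background;
    }
}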

9.5.4 Postprocessing

The pure foreground/background classification based on the color signature will usually select some individual pixels in the background with a foreground color and vice versa, resulting in tiny holes in the foreground object. The wrongly classified background pixels are eliminated by a standard “erode” filter operation, while the tiny holes are filled by a standard “dilate” operation. A standard Gaussian noise filter reduces the amount of jagged edges. A biggest-connected-component search is then performed. The biggest connected component is considered to be the instructor, and all other connected components (mostly noise and other moving or newly introduced objects) are eliminated from the output image. Figure 9.10 shows two sample frames of a video where the instructor has been extracted as described here.


Figure 9.11: Board drawings that are connected to the instructor are often considered foreground by the classification. An additional board stroke suppression eliminates these artifacts. Left picture: the result of the color signature classification. Right picture: after applying a postprocessing step to eliminate board strokes.

9.5.5 Board Stroke Suppression

As described in Section 9.5.2, the background model is built using statistics over several frames. Recently inserted board content is therefore not part of it. For example, when a macro is used on the board (as described in Chapter 3), a huge amount of new board content is shown on the board in a short time. With the connected component analysis performed for the pixels classified as foreground, most of the unconnected strokes and other blackboard content has already been eliminated. In order to suppress strokes just drawn by the lecturer, all colors from the board system’s color palette are inserted as cluster centroids into the k-d tree. However, as the real appearance of the writing varies with the projection screen, camera settings, and illumination, not all of the board activities can be suppressed. Additionally, strokes are surrounded by regions of noise that make them appear to be foreground. In order to suppress most of those, thinner objects, i. e., objects that only span a few pixels in the X and/or the Y dimension, are eliminated using an erode operation. Fortunately, a few remaining board strokes are not very disturbing because the segmented video is later overlaid onto the board drawings anyway. Figure 9.11 compares two segmented frames with and without board stroke suppression.

9.6 Example Results

The resulting segmented video is scaled to fit the board resolution (usually 1024 × 768) and is pasted over the board content at the receiving end. Several examples of lectures that contain an extracted and overlaid instructor can be seen in different chapters of this document, including Figures 9.1, 9.12, and 9.13.

Reflections on the board display are mostly classified as background, and small moving objects rarely make up the biggest connected component. Thresholding the minimum size of the biggest component improves the stability when the instructor leaves the camera’s field of view. For the background reconstruction process to collect representative background pixels, it is not necessary to record a few seconds without the instructor. The only requirement is that for the first few seconds of initialization, the lecturer keeps moving and does


Figure 9.12: The instructor is extracted from the original video (left) and pasted semi-transparently over the vector-based board content (right).

not occlude background objects that differ significantly from those in the other background regions.

The performance of the algorithm depends on the complexity of the background and on how often it has to be updated. Usually the current Java-based prototype implementation processes a 640 × 480 video at 25 frames per second after the initialization phase. This includes a preview window and the E-Chalk video compression (on a PC with a 3-GHz Intel Pentium 4).

As the algorithm focuses on the background, it provides rotation- and scaling-invariant tracking of the biggest moving object. The tracking still works when the instructor turns around or when he or she leaves the scene and a student comes up to work on the board. Once initialized, the instructor does not disappear, even if he or she stands absolutely still for several seconds (which is actually very unusual).

9.7 Limits of the Approach

The most critical drawback of the presented approach is the requirement that the instructor moves at least during the initialization phase. During our experimental recordings, we did not find this to be impractical. However, it requires some knowledge and is therefore prone to usage errors. The quality of the segmentation is suboptimal if the instructor does not appear in the picture during the first few frames or does not move at all. Too much camera noise during the initialization phase is most often the cause of a bad segmentation result. Another problem is that if the instructor points at a rapidly changing object (for example, an animation on the board screen) of a similar color structure, the


Figure 9.13: A 90-minute lecture containing dynamic board content, audio, and the superimposed lecturer can be played back on mobile phones. The lecture requires about 40 MB of storage space.

instructor and the animation might both be classified as foreground. If they are connected somehow, the two corresponding components could become merged and displayed as the biggest single component.

Although the instructor videos are mostly well-separable by color, the approach fails when parts of the instructor are very similar to the background. When the instructor wears a white shirt, for example, the segmentation sometimes fails because E-Chalk’s toolbox often also appears white to the camera (compare Figure 9.4). One of the ideas was to improve the situation by combining the color segmentation approach with edge detection algorithms. However, the experiments failed because the videos taken in front of a rear projection are often very noisy, which results in many wrongly detected edges. Interlace effects and the internal color quantization of the camera can produce further false edges. On the other hand, when the instructor moves, the edges between the person and the board are blurred. At the edges and in highly textured regions, spill colors can sometimes be observed. Spill colors arise when pixels contain a mix of the colors of foreground and background. This happens especially at borders of objects or in highly structured regions, when a pixel contains parts of an object as well as parts of the background. Removing spill colors requires sub-pixel-accurate segmentation.

9.8 Conclusion

Using the presented segmentation approach, a solution to the split-attention problem has been implemented. This improves the quality of the lecture replay at the receiving end. The lecturer is cut out of the video stream and pasted


onto the vector-based dynamic board image. The superimposed lecturer helps the student to better associate the lecturer’s gestures with the board contents and conveys facial expressions. Pasting the instructor onto the board also reduces space and resolution requirements. This makes it possible to replay an E-Chalk lecture on a mobile device even if it includes a video of the lecturer (see Figure 9.13). A lecture containing board, overlaid instructor, and audio can be played back on a handheld device at 64 kbit/s.

The next chapter generalizes the instructor video segmentation into a framework that can be applied to various segmentation tasks. The generalization is able to solve several of the problems discussed in the previous section. The next chapter also presents a more detailed discussion of the limits and drawbacks of the color signature segmentation approach and an evaluation of its performance and robustness.


Chapter 10

Generalizing the Instructor Extraction

A variation of the instructor video segmentation approach presented in the previous chapter can also be applied to a variety of other problems where a foreground object has to be extracted from an image or video. This generalization of the instructor extraction approach solves many of the open problems described in the last chapter. I have released it as an open-source segmentation framework under the name Simple Interactive Object Extraction (SIOX) [48]. SIOX has been integrated into several open-source image and video manipulation applications. This chapter describes SIOX, compares it with related approaches, and presents some details of the integration of the method as a cut-out tool in different image and video manipulation applications. Further information can be found in [Friedland et al., 2005b], [Friedland et al., 2005e], [Friedland et al., 2006a], and [Friedland et al., 2006b].

10.1 The State of the Art

Many popular image manipulation programs contain semi-automatic object extraction tools. The most popular tool for extracting foreground semi-automatically in image manipulation programs is Magic Wand. Magic Wand starts with a small user-specified region. The algorithm then performs a region growing by absorbing connected pixels such that all selected pixels fall within some user-adjustable tolerance of the color statistics of the specified region. For natural images, finding the correct tolerance threshold is often problematic. The method works well for images that contain few colors, such as drawings. For “natural” images that contain many colors, such as photographs, the results are unusable or the interaction required is far from feasible. In practice, it is better to use non-automatic tools, for example a path tool, to extract an object from a photograph by hand rather than to use Magic Wand.

Intelligent Scissors [Mortensen and Barrett, 1999] can be used to select contiguous areas of similar color in a fashion similar to Magic Wand. Intelligent Scissors creates a selection boundary by assisting the user in creating a set of connected line segments around the object. Clicking with the mouse creates nodes that are joined using curve shapes that attempt to follow color weights.


Figure 10.1: A sample comparison (horse image from [14]) between Corel’s Knockout 2 and SIOX as implemented in GIMP (see Section 10.3). Upper row: Knockout 2 requires the user to specify an outer region (red) and an inner region (green). The tool then tries to classify the unknown pixels between the two strokes. Knockout’s output is shown in the right picture. Bottom row: SIOX requires the selection of a region of interest and optionally a very coarse-grained specification of known foreground. The segmentation result is shown in the right picture.

Although the method works even with sub-pixel accuracy, a satisfactory segmentation is only achieved with very simple photographs that have clear edges.

Bayes Matting [Chuang Y.-Y. and R., 2001] gets a shrunken shape of the object and a subset of the background as input. The user uses a brush to coarsely redraw the shape of the input, with the brush stroke having to contain both foreground and background. The algorithm then tries to compute opacity values for the pixels marked with the brush. The main disadvantage is that for complicated objects the user must specify quite detailed shape information for the algorithm to work properly. Knockout 2 is a proprietary plug-in for Photoshop [Corel Corporation, 2002]. According to [Chuang Y.-Y. and R., 2001], the results are sometimes similar, sometimes of lower quality than Bayes Matting. Adobe’s Photoshop contains a tool called Extract. It requires a little less user interaction. Instead of two strokes, only one thick brush stroke has to be drawn by the user, which has to cover the edge of the object. Extract gives results similar to Knockout 2 [Trinkwalder, 2006]. Figure 10.1 shows a comparison of interaction and results between SIOX and Knockout 2.

GrowCut [Vezhnevets and Konouchine, 2005] is a very recent algorithm based on a cellular automaton. The classification of a pixel is partly determined by the classification of its neighbors. Doing this over many iterations, the selection becomes more and more stable. Due to the large number of iterations required, this process takes more than a minute even for moderately complex screen-resolution images, which are far from the high resolutions of modern digital cameras.


Grabcut [Rother et al., 2004] is a two-step approach. The first step is an automatic segmentation that relies on the work of Graph Cut [Boykov and Jolly, 2001]. The second step is manual post-editing. The idea of automatic classification is reduced to building a graph where each pixel is a node with outgoing edges to each of the 8 neighboring pixels. The edges are weighted such that a max-flow/min-cut computation yields the segmentation. The user only selects the region of interest. Grabcut’s manual post-processing tools include a so-called background brush, a foreground brush, and a matting brush to smooth borders or re-edit classification errors manually. In terms of robustness, Grabcut surpasses all the algorithms mentioned earlier but can only select one object at a time. The algorithm minimizes a global cost function which cannot distinguish between fine local details and noise. It therefore fails for highly detailed regions and noisy pictures (compare Figure 10.14).

GrabCut was extended by [Li et al., 2005] and [Wang et al., 2005], who present semi-automatic video cut-out tools that are far from being real-time capable. To accelerate the interactive refinement, [Wang et al., 2005] cluster the pixels by a hierarchical mean shift into 2D regions, which, in turn, are combined by motion estimation into 3D regions. After a manual specification of known background, the GrabCut algorithm generates a contour, which is further refined by reconstructing a 3D contour mesh. The whole process is reported to be computationally quite expensive: more than 22 seconds per frame.

10.2 Algorithm Description

Like the instructor extraction approach in Chapter 9, SIOX separates foreground objects from the background based on their color characteristics. Consequently, it requires color images and the assumption that the foreground objects are sufficiently perceptually different from the background. Fortunately, digital cameras typically try to optimize color variance, resulting in perceptual dissimilarity of different objects [Adams et al., 1998]. Of course, there is no unique definition of “foreground” or “object” because the semantics ultimately depends on the understanding of the individual who is perceiving the image. Inside the scope of the algorithm, SIOX defines foreground to be a set of spatially connected pixels that are “of interest to the user”. The rest of the image is considered background. The user has to specify at least a superset of the foreground.

The input for the SIOX algorithm is a color image or a video frame in CIELAB space and an initial confidence matrix Mi. A confidence matrix is a matrix of the same dimensions as the image. Each element of the matrix contains a floating point number that lies in the interval [0, 1] and corresponds to one pixel in the image. A value of 0 means the corresponding image pixel belongs to the background, a value of 1 means the corresponding image pixel belongs to the foreground. Any value between 0 and 1 describes a certain tendency that the corresponding pixel belongs to either foreground or background, with 0.5 expressing no tendency. In the following, confidence values of 1 mean known foreground, values of 0 known background, and values of 0.5 unknown. This notion of a confidence matrix has a few advantages. The confidence matrix along with the original picture can easily be passed between different processing steps or be serialized. Its elements can easily be interpreted as probabilities or as values of a gray-scale image. The latter interpretation allows applying


Figure 10.2: The original image (source [Martin et al., 2001]), a user-provided rectangular selection (red: region of interest, green: known foreground), and the corresponding confidence matrix (black: known background; gray: unknown; white: known foreground). Figure 10.4 shows the segmentation result.

standard image operations, such as convolutions or morphological operators, without having to touch the original picture. The confidence values can directly be mapped to transparency values. The input confidence matrix for SIOX may contain known foreground, known background, and unknown elements (defined by the confidence values 1.0, 0.0, and 0.5). It must, however, at least contain known background. Mi is either specified by the user or generated by an automatic classifier. The steps of the algorithm are as follows:

1. Create color signatures SB and SF . SB represents the specified known background and SF the known foreground (either it has been specified or the signature is calculated as a difference signature between the signature of the entire image and SB).

2. Classify each unknown pixel of the image as foreground or background using a nearest-neighbor search in SF and SB . This produces a new confidence matrix Mo.

3. Filter out noise by applying erode/dilate and blur on the matrix Mo to remove artifacts and optionally close holes up to a specific size.

4. Find the connected components with high confidence in Mo which are large enough or correspond to user markings. Set the values of all other connected components to 0.

5. Apply the confidence matrix Mo to the image. This is usually done by mapping the elements of Mo directly to the transparency values of the pixels contained in the image.

As defined above, the algorithm computes a separation of one (possibly disconnected) foreground object from the background. However, a straightforward extension of the algorithm also allows for multi-labeling, that is, the separation of the image into several different objects and background. An example of such an approach is described in Section 10.5. Figure 10.2 shows a sample input for the algorithm and the corresponding confidence matrix.
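
As an illustration of the data structure, a confidence matrix and its application to the image (step 5) can be sketched as follows; the class and method names are hypothetical and do not correspond to the SIOX API:

// Sketch of a confidence matrix: one float per pixel, using the conventions 0.0
// (known background), 0.5 (unknown), and 1.0 (known foreground).
class ConfidenceMatrix {
    static final float BACKGROUND = 0.0f, UNKNOWN = 0.5f, FOREGROUND = 1.0f;

    final int width, height;
    final float[] values;  // row-major, one entry per pixel

    ConfidenceMatrix(int width, int height) {
        this.width = width;
        this.height = height;
        this.values = new float[width * height];
        java.util.Arrays.fill(values, UNKNOWN);
    }

    // Step 5: map the confidence values directly to the alpha channel of an
    // ARGB pixel array of the same dimensions.
    void applyTo(int[] argbPixels) {
        for (int i = 0; i < values.length; i++) {
            int alpha = Math.round(values[i] * 255);
            argbPixels[i] = (alpha << 24) | (argbPixels[i] & 0x00FFFFFF);
        }
    }
}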

10.2.1 Construction of Color Signatures

As already defined in the previous chapter, a color signature is a set of representative colors, not necessarily a subset of the input colors. A color signature


Figure 10.3: Panels: (a) original image, (b) all colors, (c) k-d tree signature, (d) array signature. In (b), all colors from (a) are visualized as points in CIELAB space, (c) shows the color signature resulting from the tree clustering algorithm, and (d) shows the signature from the faster array-based algorithm described in [Friedland et al., 2006b]. The array-based clustering is faster, but the segmentation result is worse.

is constructed by clustering a set of pixels into equally-sized clusters. The centroids of the clusters are defined to be the representative colors.

The algorithm for creating a color signature from a set of pixels has already been described in Section 9.5.3: Given a set of color pixels, all colors are regarded as points in a d-dimensional color space. This color space is subdivided recursively, starting with the whole space. In step i, the points in the current box B of the subdivision are projected onto the axis a along dimension i mod d. The two extreme projections p, q are determined, and if ‖p − q‖ is larger than the threshold given for dimension i mod d, B is split in two by a plane orthogonal to a at (p + q)/2. This is done until all boxes have at least one dimension that is smaller

than the threshold for that dimension. As described in Section 10.7, the triple (0.64, 1.28, 2.56) for the box width in dimension i was found to be a good set of threshold values using genetic algorithms.

In a second pass, all center points of the boxes resulting from pass one are taken and used as input points for the same algorithm. To improve noise robustness, only the center points of such boxes B are considered that contain at least t points, for a fixed threshold t. These points are representative points and therefore become part of the signature. A good value for t is the number of specified pixels divided by 1000 (again, see Section 10.7). As already observed by [Rubner et al., 2000], this clustering method produces a good distribution and a representative signature with few points.
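
The clustering can be sketched as follows. The sketch is a simplification under stated assumptions: it collects box centers recursively instead of building an explicit tree, it uses the mean of the contained points as the representative color, and it stops splitting a box as soon as its extent in the current dimension falls below the limit. The names are illustrative, not SIOX's API:

import java.util.ArrayList;
import java.util.List;

// Sketch of the signature clustering with the thresholds (0.64, 1.28, 2.56)
// for the L, a, and b axes. Points are float[3] CIELAB colors.
class SignatureBuilder {
    static final double[] LIMITS = {0.64, 1.28, 2.56};

    static List<float[]> cluster(List<float[]> points, int minPoints) {
        List<float[]> centers = new ArrayList<float[]>();
        split(points, 0, minPoints, centers);
        return centers;
    }

    private static void split(List<float[]> pts, int depth, int minPoints, List<float[]> out) {
        if (pts.isEmpty()) return;
        int dim = depth % 3;
        float min = Float.MAX_VALUE, max = -Float.MAX_VALUE;
        for (float[] p : pts) { min = Math.min(min, p[dim]); max = Math.max(max, p[dim]); }
        if (max - min <= LIMITS[dim]) {
            if (pts.size() >= minPoints) out.add(mean(pts));  // representative color
            return;
        }
        float mid = (min + max) / 2;                          // split at (p + q) / 2
        List<float[]> low = new ArrayList<float[]>(), high = new ArrayList<float[]>();
        for (float[] p : pts) (p[dim] < mid ? low : high).add(p);
        split(low, depth + 1, minPoints, out);
        split(high, depth + 1, minPoints, out);
    }

    private static float[] mean(List<float[]> pts) {
        float[] c = new float[3];
        for (float[] p : pts) { c[0] += p[0]; c[1] += p[1]; c[2] += p[2]; }
        c[0] /= pts.size(); c[1] /= pts.size(); c[2] /= pts.size();
        return c;
    }
}

The second pass then simply runs the same procedure again on the representative colors of the first pass in order to recombine clusters that were split across neighboring boxes.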

For applications where speed is a major issue and quality a minor concern, such as high-resolution video segmentation with high frame rates, an alternative clustering algorithm can be used that is much simpler and almost an order of magnitude faster [Friedland et al., 2006b]. The dynamic splitting rule of the k-d trees is then replaced by a fixed discretization of the color space. This can be realized as a simple three-dimensional array. This not only allows very fast access to every cell but also allows incremental updates when


Figure 10.4: The result of the color classification before (left) and after post-processing (right). The original image and the user selection are shown in Figure 10.2.

new foreground or background is selected, without the need to rebuild the entire signature.

The signatures resulting from clustering typically contain only a few hundred points or less, which makes the subsequent steps very fast. To compare different clustering techniques, one can look at the clusters they create, as shown in Figure 10.3. The discretized CIELAB space yields a very regular signature compared to the k-d tree approach due to its array implementation, which allows further geometric optimizations. However, this clustering gives slightly worse segmentation results (see Section 10.7).

A color signature is built for the set of pixels having confidence 0 and another one is built for the pixels of confidence 1. If the confidence matrix does not contain any pixels with confidence 1, the foreground signature is found by color signature subtraction, which is defined as follows. Two color signatures S1

and S2 are subtracted into a resulting signature R = S1 \ S2 by comparing the representative colors contained in S1 and S2 using the Euclidean distance. For each element in S2, the element in S1 with minimum distance is marked. R is a subset of S1 that contains only those representative colors of S1 that have not been marked. S2 must not contain more elements than S1. In order to build a foreground signature when only known background is given, the background signature is subtracted from the signature of the entire image (which always has the same or a higher cardinality).
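
A minimal sketch of the subtraction, assuming signatures are stored as lists of CIELAB colors (the class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;

// Color signature subtraction R = S1 \ S2: every element of S2 marks its nearest
// neighbor in S1; the unmarked elements of S1 form the result.
class SignatureSubtraction {
    static List<float[]> subtract(List<float[]> s1, List<float[]> s2) {
        boolean[] marked = new boolean[s1.size()];
        for (float[] c2 : s2) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < s1.size(); i++) {
                double d = distance(s1.get(i), c2);
                if (d < bestDist) { bestDist = d; best = i; }
            }
            marked[best] = true;
        }
        List<float[]> r = new ArrayList<float[]>();
        for (int i = 0; i < s1.size(); i++) if (!marked[i]) r.add(s1.get(i));
        return r;
    }

    static double distance(float[] a, float[] b) {
        double dl = a[0] - b[0], da = a[1] - b[1], db = a[2] - b[2];
        return Math.sqrt(dl * dl + da * da + db * db);
    }
}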

10.2.2 Classification of Unknown Pixels

The pixels with confidence value 0.5 are classified using a nearest-neighbor search. If the Euclidean distance of a pixel’s color is closer to an element of the foreground signature than to all elements of the background signature, it is classified as foreground; otherwise it is classified as background. If a color has equal minimal distances to both signatures, the pixel is considered foreground. The reason for this is a practical one: in image editing tools it is usually easier to erase wrongly classified foreground than to reconstruct wrongly classified background. However, in natural images, such as photographs, this case has a very low probability.
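
The decision rule itself is a plain nearest-neighbor comparison; the following fragment is a sketch with illustrative names (squared distances suffice because only the comparison matters):

// Sketch of the classification of an unknown pixel; ties count as foreground.
class PixelClassifier {
    static boolean isForeground(float[] lab, float[][] fgSignature, float[][] bgSignature) {
        return minSquaredDistance(lab, fgSignature) <= minSquaredDistance(lab, bgSignature);
    }

    static double minSquaredDistance(float[] c, float[][] signature) {
        double best = Double.MAX_VALUE;
        for (float[] s : signature) {
            double dl = c[0] - s[0], da = c[1] - s[1], db = c[2] - s[2];
            best = Math.min(best, dl * dl + da * da + db * db);
        }
        return best;
    }
}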


10.2.3 Post-processing

As already explained in Section 9.5.4, the pure foreground/background classification based on the distances to the color signatures will usually select some individual pixels in the background with a foreground color and vice versa, resulting in tiny holes in the foreground object. Again, the wrongly classified background pixels are eliminated by a standard “erode” filter operation, while the tiny holes are filled by a standard “dilate” operation [Gonzalez and Woods, 2002] performed directly on the confidence matrix. A breadth-first search on the confidence matrix is performed to identify all spatially connected regions that were classified as foreground. Either the biggest region or all regions with an area greater than a threshold are considered the final foreground object(s). The user can specify a smoothness factor to define how much smoothing should be applied to the confidence matrix. More smoothing reduces small classification errors. Less smoothing is appropriate for high-frequency object boundaries, such as clouds or drawings. The values of the confidence matrix are directly used as transparency values (also known as α values) for each corresponding pixel. Figure 10.4 shows a sample result before and after post-processing.
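
As an illustration, a 3 × 3 “erode” on the confidence matrix can be sketched as below (a “dilate” is the same operation with the maximum instead of the minimum); the method name and the fixed 3 × 3 neighborhood are assumptions of this sketch:

// Sketch of an erode pass over the confidence matrix; the image itself is untouched.
class Morphology {
    static float[] erode(float[] conf, int width, int height) {
        float[] out = conf.clone();
        for (int y = 1; y < height - 1; y++) {
            for (int x = 1; x < width - 1; x++) {
                float m = Float.MAX_VALUE;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        m = Math.min(m, conf[(y + dy) * width + (x + dx)]);
                out[y * width + x] = m;  // isolated foreground pixels lose their confidence
            }
        }
        return out;
    }
}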

10.3 Segmentation of Still Images

For still image object extraction, the user specifies the known background and known foreground regions manually. In the following, the user-specified regions are called a trimap. As explained above, the known foreground is optional, but it improves the robustness of the segmentation. To provide this information, the user makes several selections with the mouse. The outer region of the first selected area specifies the known background, while the inner region defines a superset of the foreground, i. e., the unknown region. Using additional selections, the user may specify one or more known foreground regions or additional background regions to refine the region of interest. Internally, the trimap is mapped into a confidence matrix.

Using this interaction style, SIOX has been integrated into the core of the open-source project GIMP (GNU Image Manipulation Program) [21]. Figure 10.5 shows the user interaction necessary to create the initial confidence matrix as implemented in GIMP version 2.3.9. A freehand selection tool is used to specify the region of interest (Figure 10.5 a and b). It contains all foreground objects to be extracted and as little background as possible. The pixels outside the region of interest form the known background, while the inner region defines a superset of the foreground, i. e., the unknown region. The known background is visualized as a dark area.

The user then uses a foreground brush to mark representative foreground regions (Figure 10.5 c). Internally, this input is mapped into a confidence matrix, where each element of the matrix corresponds to a pixel in the image. The values of the elements lie in the interval [0, 1], where a value of 0 specifies known background, a value of 0.5 specifies unknown, and a value of 1 specifies known foreground. Once the mouse button has been released, the selection is shown to the user. The selection can be refined by either adding further foreground markings or by adding background markings using the background brush (Figure 10.5 d). Pressing the “Enter” key results in the creation of the final selection


(a) User loads an image and chooses the foreground extraction tool...

(b) selects region of interest...

(c) specifies representative foreground regions...

(d) optionally refines the result...

(e) and is provided with a tight selection. (f) The object can now be handled independently.

Figure 10.5: User interaction to provide the initial confidence matrix in the image-editing program GIMP. SIOX has been integrated as a core feature in GIMP since version 2.3.3 (seagull image from [14]).


Figure 10.6: An illustration of the basic idea of the Detail Refinement Brush: spill colors can be detected by the ratio of the distances to the closest representative foreground color f and the closest representative background color b.

mask (Figure 10.5 e). The object can then be manipulated independently (Figure 10.5 f).

10.4 Sub-pixel Refinement

In most cases, a pixel-accurate object extraction gives satisfying results. Sometimes, however, a single pixel contains parts of the foreground as well as parts of the background. The resulting color of the pixel is a mixture of the foreground and the background. For this reason, images containing highly structured textures, such as hair or fine tree branches, look sloppy if they are classified only with pixel resolution. Sub-pixel accuracy is also needed to remove spill colors that result from motion blur or image filters that smooth borders.

Fortunately, color signatures provide an adequate model, and a simple extension of the algorithm makes it possible to cope with this issue. Figure 10.6 illustrates the idea. Let c be a certain CIELAB color in the picture. Let f be the closest representative color to c in the foreground signature and b the closest representative color to c in the background signature. Let c′ be the orthogonal projection of c onto the segment fb. The point c′ splits fb into the two segments fc′ and c′b. Checking whether the ratio ‖fc′‖ / ‖c′b‖ comes closer to 1 than a threshold t allows detecting whether a color c is likely to be a mixture between foreground and background. In other words, if the Euclidean distances between c and f and between c and b are very similar or equal, then c is assumed to be a mixture between foreground and background. Of course, for sensible results, the angle spanned by f, c, b must not be too small, or a more suitable pair of points f, b has to be found.

However, this method fails for colors that are inherently a mixture of many colors, for example white. Although their nearest neighbor clearly classifies them as part of the background or foreground, these colors are very often detected as mixture colors because there are also many close representatives in the opposite signature. Another question is how to set the threshold t, i. e., how to define “close enough to 1”. These two questions make it hard to implement a fully automatic solution that would allow for the detection of mixture colors. In practice, a semi-automatic solution was favored, called the Detail Refinement Brush (DRB). The Detail Refinement Brush is offered to the user as a simple interactive drawing tool. Using coarse strokes, the brush is used to refine regions where the results achieved by the automatic, pixel-accurate segmentation are not satisfying. With the user specifying the regions


(a) Antialiased border with spill colors

(b) Brush interaction (c) Result

(d) Highly detailed texture with spill colors

(e) Brush interaction (f) Result

(g) Highly detailed texture where too many pixels were classified as background

(h) Brush interaction (i) Result

Figure 10.7: Sample results for the Detail Refinement Brush. The images in the left column show the result after SIOX, the middle column shows the user interaction with the DRB, and the right column the results. Spill colors can be removed in subtract mode (red), pixel omissions in highly textured regions can be repaired using add mode (blue).

to search for mixed colors, the risk that a wrong detection destroys already approved segmentation results is lowered. In addition to the brush, the user is provided with a slider that enables the adjustment of the threshold t. The brush has two different modes: add and subtract. Add re-adds foreground that was wrongly classified as background. Subtract is used to remove spill colors at borders or from highly detailed textures.

The brush affects the confidence conf(p) of a pixel p with color c in thefollowing way (f and b being the closest representiative color in foreground andbackground signature, respectively):

conf(p) = 1 − min(‖c − f‖ / ‖c − b‖, 1)   in subtract mode

conf(p) = min(‖c − b‖ / ‖c − f‖, 1)   in add mode


Figure 10.8: Simultaneously extracting multiple objects of the same color structure using SIOX can save time. From left to right: Original image (source: [14]), user selection (blue: known background, green: known foreground), and final result.

If conf(p) is smaller than or equal to the user-defined threshold, the pixel confidence value is directly mapped to the opacity value of the given pixel. Figure 10.7 a, d, g show some sample results after automatic object extraction. Figure 10.7 b, e, h illustrate the manual brush interaction. The refined results are shown in Figure 10.7 c, f, i.

I also experimented with complete removal of the background tone from the pixel’s color, but this turned out to be too aggressive. The perceptual result of mixing two or more colors is non-linear, and an “un-mixing” would require a more accurate color model.
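
For illustration, the brush logic described above can be sketched as follows (hypothetical names; distF and distB denote the CIELAB distances ‖c − f‖ and ‖c − b‖, computed elsewhere):

final class DetailRefinementBrushSketch {

    enum Mode { ADD, SUBTRACT }

    /** Confidence assigned to a brushed pixel according to the formula above. */
    static float brushConfidence(float distF, float distB, Mode mode) {
        if (mode == Mode.SUBTRACT) {
            float ratio = (distB == 0f) ? 1f : distF / distB;
            return 1f - Math.min(ratio, 1f);
        } else { // ADD
            float ratio = (distF == 0f) ? 1f : distB / distF;
            return Math.min(ratio, 1f);
        }
    }

    /** If the confidence does not exceed the user threshold, it is directly
     *  mapped to the opacity of the pixel (0..255); otherwise the old alpha stays. */
    static int refinedAlpha(float conf, float threshold, int oldAlpha) {
        return conf <= threshold ? Math.round(conf * 255f) : oldAlpha;
    }
}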

10.5 Extraction of Multiple Objects

The extraction of multiple objects (also often referred to as multi-labeling) of uniform color structure and size has already been described in Section 10.2: Instead of defining the biggest connected component as the final result in the post-processing step, one allows for multiple objects that have at least a certain size. In practice, this can easily be implemented by providing the user with a checkbox that disables or enables multi-object extraction and a slider that allows adjusting the minimum allowed object size. In order to facilitate usage, the SIOX implementation in GIMP provides only a checkbox: If the extraction of several objects is enabled, all connected foreground components that have at least one quarter of the size of the biggest connected component are considered objects of interest. Figure 10.8 shows some examples of the extraction of multiple objects with similar color structures.
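
This post-processing rule can be sketched as follows (a simple flood-fill labeling; hypothetical names, not the GIMP code):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Keep every connected foreground component that has at least a quarter
 *  of the size of the largest component (sketch, hypothetical names). */
final class MultiObjectFilter {

    static void keepLargeComponents(boolean[][] fg) {
        int h = fg.length, w = fg[0].length;
        int[][] label = new int[h][w];                     // 0 = unlabeled
        List<List<int[]>> comps = new ArrayList<>();

        // Label connected components with a simple 4-neighborhood flood fill.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (fg[y][x] && label[y][x] == 0) {
                    List<int[]> comp = new ArrayList<>();
                    ArrayDeque<int[]> stack = new ArrayDeque<>();
                    stack.push(new int[] { x, y });
                    label[y][x] = comps.size() + 1;
                    while (!stack.isEmpty()) {
                        int[] p = stack.pop();
                        comp.add(p);
                        int[][] nbs = { { p[0] + 1, p[1] }, { p[0] - 1, p[1] },
                                        { p[0], p[1] + 1 }, { p[0], p[1] - 1 } };
                        for (int[] n : nbs) {
                            if (n[0] >= 0 && n[0] < w && n[1] >= 0 && n[1] < h
                                    && fg[n[1]][n[0]] && label[n[1]][n[0]] == 0) {
                                label[n[1]][n[0]] = comps.size() + 1;
                                stack.push(n);
                            }
                        }
                    }
                    comps.add(comp);
                }
            }
        }

        int largest = 0;
        for (List<int[]> c : comps) {
            largest = Math.max(largest, c.size());
        }

        // Erase every component below a quarter of the largest one.
        for (List<int[]> c : comps) {
            if (4 * c.size() < largest) {
                for (int[] p : c) {
                    fg[p[1]][p[0]] = false;
                }
            }
        }
    }
}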

Of course, it is also possible to extract multiple objects of different color structure using repeated extractions of single objects. Graph-based segmentation approaches, such as GrabCut (see Section 10.1), have to rely on repeated extractions in order to implement multi-labeling because they seek a minimal cut. Using SIOX, the extraction of multiple objects of differing color structure in a single step only requires the creation of a color signature for each object.


An example where the extraction of several objects in a single step is desirable is given in the next section.

10.6 Video Segmentation

Figure 10.9: The user specifies known foreground and known background for the first frame in a scene (above). SIOX segments it and reuses the color signatures for automatic segmentation of subsequent frames (below).

For object extraction in videos, the confidence matrix can either be specified by the user or be learned from motion statistics. If the matrix is specified by the user, the approach is similar to the one described in Section 10.3, with the exception that color signatures can be reused in consecutive frames. Since many colors are identical in consecutive frames, a hash table allows for very efficient classification of the non-background pixels in each frame. However, when the displayed scene changes too much, the segmentation has to be stopped and a new manual selection has to be performed.


Figure 10.10: Original video images (left), color classification using the traditional strategy by [Gunnarsson et al., 2005] (middle), and result using SIOX (right). The gray regions define known background, the white regions are unclassified, all other colors mark a certain object. No post-processing step is applied because the result is already sufficient for the subsequent robot-control processing steps.

Fortunately, many heuristics exist for scene change detection (see for example [Zabih et al., 1995, Aoki et al., 1996, Feng et al., 2005]). I experimented with observing the hit/miss rate of the hash table for each frame, which results in a robust detection of most scene changes.
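
A sketch of this per-frame loop, with the hash-table cache and the hit-rate check, is given below (hypothetical names; the signature-based nearest-neighbor test is only indicated):

import java.util.HashMap;
import java.util.Map;

/** Sketch of per-frame color classification with a hash-table cache and a
 *  hit-rate based scene-change check (hypothetical names, not the SIOX API). */
final class VideoClassifier {
    private final Map<Integer, Boolean> cache = new HashMap<>();
    private double lastHitRate = 1.0;

    boolean[] classifyFrame(int[] rgbPixels) {
        boolean[] fg = new boolean[rgbPixels.length];
        long hits = 0;
        for (int i = 0; i < rgbPixels.length; i++) {
            Boolean cached = cache.get(rgbPixels[i]);
            if (cached != null) {
                fg[i] = cached;                  // same color as in earlier frames
                hits++;
            } else {
                boolean isFg = nearestNeighborIsForeground(rgbPixels[i]);
                cache.put(rgbPixels[i], isFg);
                fg[i] = isFg;
            }
        }
        lastHitRate = (double) hits / rgbPixels.length;
        return fg;
    }

    /** A low hit rate means many previously unseen colors: most likely a scene
     *  change, so segmentation should stop and ask for a new selection. */
    boolean probableSceneChange(double minHitRate) {
        return lastHitRate < minHitRate;
    }

    private boolean nearestNeighborIsForeground(int rgb) {
        // Placeholder for the CIELAB nearest-neighbor test against the
        // foreground and background color signatures.
        return false;
    }
}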

Once background and foreground signatures have been built, the current Java reference implementation easily processes a 640 × 480 video at 30 frames per second. Figure 10.9 shows a sample object extraction in a video. The videos were extracted using a manual specification of known background and foreground in the first frame. For fully automatic object extraction, any known method may be used that is able to provide at least a subset of the background and preferably also a subset of the foreground so that color signatures can be computed without manual interaction. A specialized approach used for instructor segmentation is described in Section 9.5.2. A hardware-based approach is presented in Chapter 11. Of course, the instructor can also be extracted using a manual specification of foreground and background samples. In fact, this provides a more robust and motion-independent lecturer extraction.

Of course, multi-object extraction as described in Section 10.5 is also possible in videos. The following experiment presents a practical video segmentation application where the extraction of several objects in a single step is desirable: object tracking in robotic soccer. RoboCup [57] is a competition of autonomous robots playing soccer in a color-coded environment. Each class of objects seen by the robots is associated with a unique color. A robot’s vision relies on the classification to identify and discriminate various objects on the field, which in turn is very important for its behavior and finally for the success of the entire soccer team. The canonical approach used by many RoboCup participants is to perform a color calibration by either manually [Gunnarsson et al., 2005] or automatically [Mayer et al., 2002] selecting representative regions of each object in several video frames and feeding them into a classifier.


Figure 10.11: From left to right: The original image from the benchmark, the original lasso selection as provided by [35], a user-specified trimap used for benchmarking SIOX, and the ground truth provided in the benchmark dataset. SIOX has been benchmarked with the user-specified trimaps because they are more realistic and the lasso trimaps are not suitable for SIOX because they do not select representative colors.

A look-up table is then built in which each color is associated with a class. [Gunnarsson et al., 2005] proposes to fill the color table by a computationally intensive process that automatically identifies regions of the various objects by shape and marks their colors correspondingly. This is done in non-time-critical moments so that during the actual game, a simple look-up suffices to classify each object.

The perceived color, however, depends on several factors such as lighting conditions (which may change over time), camera settings, and shadows or reflections cast by surrounding objects. Filling the look-up table exclusively with marked colors only yields satisfactory results when several dozen frames have been processed this way. However, the abstraction mechanism provided by color signatures allows for a satisfying classification even with a small initial region classification. Using either the user-selected regions or the automatic output of a geometric pre-classifier, a color signature is generated for each object class: goal1, goal2, ball, playing field, robots, and residual objects.
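
A sketch of this multi-class variant is shown below (hypothetical names, not the implementation used in the experiment; the signatures are assumed to be lists of representative CIELAB colors created beforehand):

import java.util.HashMap;
import java.util.Map;

/** One color signature per object class plus a look-up table filled on demand. */
final class MultiClassClassifier {
    enum ObjectClass { GOAL1, GOAL2, BALL, FIELD, ROBOT, RESIDUAL, UNKNOWN }

    private final Map<ObjectClass, float[][]> signatures = new HashMap<>();
    private final Map<Integer, ObjectClass> lookup = new HashMap<>();

    void addSignature(ObjectClass cls, float[][] representativeLabColors) {
        signatures.put(cls, representativeLabColors);
    }

    ObjectClass classify(int rgb, float[] lab) {
        ObjectClass cached = lookup.get(rgb);
        if (cached != null) {
            return cached;                       // fast path during the game
        }
        ObjectClass best = ObjectClass.UNKNOWN;
        float bestDist = Float.MAX_VALUE;
        for (Map.Entry<ObjectClass, float[][]> e : signatures.entrySet()) {
            for (float[] rep : e.getValue()) {
                float d = squaredDistance(lab, rep);
                if (d < bestDist) { bestDist = d; best = e.getKey(); }
            }
        }
        lookup.put(rgb, best);                   // remember the decision for this color
        return best;
    }

    private static float squaredDistance(float[] a, float[] b) {
        float dl = a[0] - b[0], da = a[1] - b[1], db = a[2] - b[2];
        return dl * dl + da * da + db * db;
    }
}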

Figure 10.10 illustrates the difference in the results between the method described by [Gunnarsson et al., 2005] and SIOX. In the first frame, sample regions were assigned to their respective classes manually. The experiment indicates that SIOX is also useful for real-time tracking of multiple objects, at least in an environment where the colors are deliberately chosen to be easily distinguishable.

10.7 Evaluation

Unfortunately, showing that a certain image or video-processing method works is often reduced to publishing results achieved on a small pre-selected set of sample pictures. Until now, this dissertation has done likewise. Even though this often suffices to demonstrate that a certain idea may be applicable to a special problem domain, this kind of “proof by example” does not guarantee that the image or video-processing approach yields satisfying results in general. On the other hand, proving mathematically that a certain image-processing method works is often impossible because both the problem and the output are not mathematically well defined.


Figure 10.12: Per-image error measurement from applying SIOX on the benchmark dataset provided by [35]. Please refer to the text for a detailed description.

In practice, there is almost no image or video-processing method that does not fail in certain special cases. As already discussed in Section 9.2.2, this is especially a problem for object extraction methods. Because the processing in the human brain that enables vision is not yet understood, researchers are forced to rely on heuristic approaches.

This section presents my experiments to provide evidence that SIOX gives satisfying results for a large set of images and therefore generalizes the instructor extraction approach. However, it is impossible to provide an undeniable proof because object extraction is not yet mathematically definable.

10.7.1 Benchmarking and Tuning of Thresholds

[Blake et al., 2004] presents a database of 50 images plus the corresponding ground truth to be used for benchmarking foreground extraction approaches. The benchmark data set is available on the Internet [35] and also includes 20 images from the Berkeley Image Segmentation Benchmark Database [Martin et al., 2001]. The data set contains color images, a pixel-accurate ground truth, and user-specified trimaps. I chose comparison with this database because the solutions presented in [Blake et al., 2004] are commonly considered to be very successful methods for foreground extraction.

The trimaps, however, are not optimal inputs for the algorithm presented here because the specified known foreground is not always a representative color sample of the entire foreground. Furthermore, creating such a trimap would be too cumbersome for the user, as it already contains quite detailed shape information. The benchmark trimaps therefore do not represent the input a typical user would provide. For this reason, I created an additional set of trimaps better suited for testing the approach.


Color Space    Worst Image Error    Total Error
CIELAB         15.4%                3.6%
RGB            97.0%                11.8%
HSI            54.4%                5.2%
YUV            34.7%                4.74%

Table 10.1: Average and worst-case classification results for different color spaces. Details of the experiment are explained in the text.

I asked several independent users to draw appropriate rectangles for the region of interest and known foreground in each of the images. These trimaps may still be suboptimal, but it is assumed here that they represent the typical input of a user. Using a rough freehand selection instead of a rectangular area, for example, would improve the segmentation result of those images where the smallest possible rectangle already covers almost the entire picture. Figure 10.11 shows an example image together with both types of trimaps and the ground truth.

Unfortunately, it is difficult, maybe impossible, to create a generally valid error measure. If such a perceptually accurate error measure for foreground extraction approaches existed, the entire task would be reduced to minimizing this error function. Because I want to create comparable results, I adhere to the error measure used in [Blake et al., 2004], which is defined as:

ε = (no. of misclassified pixels) / (no. of pixels in unclassified region)

In low-contrast regions, a true boundary cannot be observed using pixel-accurate segmentation (see Section 10.4). This results in the ground truth database containing unclassified pixels. For comparability, these pixels are excluded from the number of misclassified pixels as in [Blake et al., 2004].
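
For reference, the error measure can be computed as in the following sketch (hypothetical names and label constants; ground-truth pixels without a label are skipped as described):

/** Sketch of the benchmark error of [Blake et al., 2004]: misclassified pixels
 *  divided by the number of pixels in the unclassified (trimap) region. */
final class BenchmarkError {
    static final int GT_FOREGROUND = 1, GT_BACKGROUND = 0, GT_UNLABELED = -1;

    static double error(int[] groundTruth, boolean[] result, boolean[] unclassifiedRegion) {
        long misclassified = 0, regionSize = 0;
        for (int i = 0; i < groundTruth.length; i++) {
            if (!unclassifiedRegion[i]) continue;           // only trimap "unknown" pixels
            regionSize++;
            if (groundTruth[i] == GT_UNLABELED) continue;   // excluded from the misclassified count
            boolean gtForeground = groundTruth[i] == GT_FOREGROUND;
            if (gtForeground != result[i]) misclassified++;
        }
        return regionSize == 0 ? 0.0 : (double) misclassified / regionSize;
    }
}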

As discussed in Sections 9.5.3 and 10.2.1, the results of the SIOX algorithm depend on the setting of the thresholds for the box width dimensions for each cluster in CIELAB space as well as the abstraction threshold for the removal of clusters that contain too few pixels. Therefore, the purpose of running SIOX on the benchmark data set is two-fold: Besides comparing the segmentation results with other algorithms, the benchmark also helps to find the optimal values for the four parameters. I tuned the parameters manually and used a genetic algorithm [47] to verify the result. As already mentioned in Sections 9.5.3 and 10.2.1, the triple (0.64, 1.28, 2.56) for the box width dimensions seems to be optimal for the cluster size. The best value for the abstraction threshold seems to be 0.01, i. e., the number of specified pixels divided by 1000. The results presented in the following were generated using these values for the parameters.
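
As an illustration of how such a tuning run can be set up, the following sketch uses a simple random perturbation search instead of the genetic algorithm (hypothetical names; averageBenchmarkError is assumed to run SIOX over the whole benchmark with the given parameters and return the overall error):

import java.util.Random;

/** Simple random-search sketch for tuning the three cluster box widths and the
 *  abstraction threshold against the benchmark. */
final class ParameterSearch {
    interface Fitness {
        double averageBenchmarkError(float w1, float w2, float w3, float abstraction);
    }

    static float[] search(Fitness f, int iterations, long seed) {
        Random rnd = new Random(seed);
        float[] best = { 0.64f, 1.28f, 2.56f, 0.01f };      // values reported above
        double bestErr = f.averageBenchmarkError(best[0], best[1], best[2], best[3]);
        for (int i = 0; i < iterations; i++) {
            // Perturb the current best candidate slightly.
            float[] cand = new float[4];
            for (int j = 0; j < 4; j++) {
                cand[j] = Math.max(0.001f, best[j] * (0.8f + 0.4f * rnd.nextFloat()));
            }
            double err = f.averageBenchmarkError(cand[0], cand[1], cand[2], cand[3]);
            if (err < bestErr) {                             // keep the improvement
                bestErr = err;
                best = cand;
            }
        }
        return best;
    }
}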

If only the background signature is given and the foreground signature has to be calculated by color signature subtraction, the overall error is 11.32 %. The overall error when applying the lasso trimaps provided by the database is 8.75 %. As already mentioned, the lasso selections are not optimal for the segmentation algorithm presented here. Figure 10.12 shows the result for the additional set of trimaps based on rectangular user selections1. The overall error is 3.59 %, and the segmentations subjectively appear much better.

1Throughout this dissertation, the benchmark images are always listed in the same order.


Abstraction               Worst Image Error    Total Error
Algorithm as proposed     15.4%                3.6%
No abstraction at all     40.2%                8.8%
No relevance threshold    44.3%                9.9%

Table 10.2: Using color signatures as an abstraction mechanism not only improves the speed of the segmentation, it also improves the results. The details of the experiment are explained in the text.

This indicates that the robustness of the algorithm is significantly improved with the user providing a foreground sample. Appendix G presents the benchmark images along with their segmentation results. Using the alternative clustering method described in Section 10.2.1, the error is 4.21 %.

The best-case average error rate on the database for the algorithm underlying GrabCut is reported as 7.9 % [Blake et al., 2004].2 Using different trimaps for classification results in a higher number of pixels to classify. One could object that a higher number of pixels to classify contains more pixels that are easier to classify and thus may flatter the error rate because there is no focus on the critical boundary pixels. This may be true for algorithms that seek an accurate boundary by growing from some center of the picture, or by shrinking a lasso. The algorithm proposed here makes no distinction between critical and non-critical pixels: In the color classification step, every pixel has an equal chance of being misclassified no matter where in the image it is located. Having more pixels to classify therefore makes the test even harder.

10.7.2 Testing Assumptions

The benchmark offers the possibility to check some of the basic assumptions underlying SIOX presented in the preceding chapters. The following experiments have been conducted to provide evidence that the keystones of the theoretical derivation of the SIOX algorithm hold.

CIELAB vs RGB vs HSI vs YUV

In order to test the impact of using CIELAB as the underlying color space, the algorithm was also applied to the benchmark images using YUV, HSI, and RGB. The parameters were again tuned using genetic algorithms. Otherwise the algorithm remained completely unchanged. Although there is no guarantee that the genetic algorithm found the optimal constants in each case (and did not get stuck in a local minimum of the fitness function), looking at both the average error and also the worst-case error reveals clear evidence: CIELAB proves to be better than all other color spaces. Although YUV comes close on average, CIELAB shows a significantly smaller worst-case error. Table 10.1 summarizes the average and worst-case results. Of course, a small worst-case error is very important for a generic image manipulation tool.

2At the time of writing this dissertation, a per-image error measurement has not been published.


Figure 10.13: Cutting out an object with a fairly complex shape from this high-resolution image (2592 × 1944 pixels) takes about 6 seconds in GIMP v2.3.9 (Pentium 4 3 GHz, 2 GB RAM). Further refinement steps would take about 2-3 seconds per interaction.

Need for Abstraction

Section 9.5.3 discusses that color signatures provide an important means of abstraction. Without this abstraction, noise and outliers make segmentation difficult. On the other hand, too much abstraction makes segmentation impossible. The right trade-off between abstraction and accuracy is tuned with the constants that have been found as described in the previous section. Table 10.2 shows the results of two benchmark experiments that provide evidence that the abstraction mechanism provided by color signatures not only improves speed but also improves accuracy.

Without the clustering performed to create the color signatures, the segmentation is not only several orders of magnitude slower, because more comparisons have to be made; the result is also worse. If the unknown pixels are directly compared with each pixel of the background and foreground sample, the resulting classification error more than doubles to 8.8 % (worst result: 40.2 %).

As described in Section 9.5.3, an abstraction threshold removes clusters that represent only very few pixels in the picture. This is especially useful for removing a few wrongly specified known foreground or known background pixels. These appear frequently in human-generated trimaps. If the clustering is performed for creating the representative colors but clusters that contain only a few pixels are not removed, the classification error increases to 9.9 % (worst result: 44.3 %).

10.7.3 Other Means of Evaluation

The benchmark provides some evidence of the robustness of the SIOX algorithm, especially because the pictures were not selected by me. However, the results are only partly significant because the images in the data set did not contain any images with highly detailed textures where sub-pixel accuracy would be a requirement. Furthermore, these regions were excluded from the test. The benchmark did not test interactive refinement by the user and it did not contain any noisy or blurry images. Finally, videos were not part of the data set. SIOX has been implemented in the core of the open-source image manipulation program GIMP.


(a) Original

(b) GrabCut (c) SIOX

Figure 10.14: Extracting multiple objects from a noisy image (source: [14]). Graph-cut-based approaches like [Rother et al., 2004] are usually only capable of extracting one object at a time and have difficulty segmenting noisy images.

In February 2006, an early GIMP implementation of SIOX was tested by the editorial staff of c’t magazine [Trinkwalder, 2006]. The magazine compared SIOX to GrabCut and KnockOut. Although the tested implementation did not yet include the Detail Refinement Brush and multiple object extraction, the magazine positively mentioned SIOX’s capability to extract objects with highly complex shapes. The main concern of the magazine was that, compared to the two other commercial tools, the implementation had not yet been optimized for speed and the extraction of an object from a 5 megapixel image took too long to process. Since the publication of the article, many technical optimizations have been made by the open-source community (see Appendix A for a list of particular names) and the extraction of the object from the 5 megapixel photograph shown in Figure 10.13 takes about 6 seconds on a 3-GHz Pentium 4. Further refinement steps, which can partially reuse already calculated data, take about 2-3 seconds per interaction.

The inclusion of the algorithm into GIMP also resulted in a huge amount of user feedback, mainly in newsgroups, blogs, and mailing lists. This feedback allows a few rules of thumb to be extracted as to which image properties increase the chance of an instantly perfect segmentation result:

• The better the foreground object is distinguishable from the background, the easier the segmentation.


Figure 10.15: If color signatures overlap heavily, the result is bad segmentation. The original (taken from [14]) has very smooth transitions making it hard even for a human to find the exact boundaries (left). The automatic segmentation result has clearly visible dents and holes which have to be eliminated by user interaction (right).

• The better the foreground and background selections, the better the segmentation result. The user must make sure the entire object is inside the region of interest and the foreground samples are representative and do not contain any background. Ideally, the foreground samples should contain all the colors that the foreground object contains. Of course, finding such a set of samples is often cumbersome or even impossible. For animals and persons, the following rules of thumb seem to hold:

– Animals: The user should select at least the face, a large part of the body, and every special feature that a specific animal might have.

– Human beings: The user should select a part of the face, the hair, and different parts of the clothes he or she wears. Skin is difficult to extract because of reflections, so as much skin as possible should be selected.

• High color variance: With the color spectrum being wide, the chance that background and foreground share the same colors is decreased. If the color spectrum of an image is very narrow, colors are shared by different objects.

• Good contrast: If the object boundaries are unclear, segmentation is often not accurate and manual refinement is required. The higher the number of mixture colors, the lower the chance for fast and accurate segmentation.

A summarizing rule of thumb could be deduced stating that if a picture was taken to show a particular object – like a portrait photo of a human being, an animal, or any other particular item – SIOX will most probably be able to extract it.

10.8 Limits of the Approach

The benchmark as well as the experiences of many users indicate that the presented algorithm performs well on a number of difficult pictures for which it is even difficult to construct an accurate ground truth. The classification copes well with noise, although the computation needs considerably more time for noisy input.


Figure 10.16: The SIOX algorithm is currently being adopted for several open-source applications. This screenshot was provided by Brecht van Lommel and shows a SIOX implementation in Blender.

Figure 10.14 shows the result of classifying a noisy image with SIOX and using a graph-cut-based algorithm. However, looking at the resulting pictures also discloses some weaknesses. The segmentation depends heavily on the user-provided trimap. The user has to select a region of interest that contains the whole foreground object. Failing to do so will give unsatisfying results. Difficult images require a wise selection of representative foreground. Therefore, the user must have at least a little knowledge of what could be representative. If two very similar objects exist in the picture, of which only one is to be considered foreground, the segmentation mostly gives bad results. The reason is that most of the colors of the foreground are then considered background because many similar colors exist on the second object. The only workaround is to include both objects in the region of interest and to provide good foreground samples. Still, this method may fail when the unwanted similar object is bigger than the wanted one. When SIOX is integrated into an image manipulation application, this problem can be avoided easily by combining SIOX with other operations, for example by cropping the image prior to using SIOX. Foreground objects that are connected with objects of the same color structure (for example, two people embracing each other) are almost impossible to extract using SIOX. Most of the misclassified pixels in the benchmark result from objects that are close to the foreground object, both in color structure and in location. The same applies to shadows and reflections.

Still another problem is the use of the standard observer and the D65 reference white. Pictures taken under different illumination conditions are segmented poorly. Especially underwater scenes are awkward to segment, because of the natural color quantization underwater [Richardson, 2000]. For these pictures, a different model would have to be used3. As already explained in Chapter 9, the most critical drawback of the approach is color dependence.

3In the case of underwater photography, this model would have to depend on the depth at which the picture was taken.


Figure 10.17: This image is a screenshot of a running Inkscape v0.44pre3 provided by Bob Jamison. Inkscape provides SIOX in combination with bitmap tracing. This allows users to vectorize only certain objects in an image.

Although many photos are well separable by color, the algorithm cannot deal well with camouflage. If the foreground and background share many identical shades of similar colors, the algorithm might give a result with parts missing or incorrectly classified foreground, as can be seen in Figure 10.15. Gray-scale pictures or pictures that have already been color quantized give bad results (for example GIF images or videos encoded with a codec that performs color reduction). Although SIOX also works for drawings, the post-processing steps blur their edges. Computer-created drawings with a few colors are better segmented by using the Magic Wand.

Future enhancements may include an automatic adaptation of the clustering strategy according to the color distribution of the image and a further improvement of the algorithm taking into account the first derivative of the picture. The implementation of CIELAB’s different observers and illumination models may improve segmentation of underwater scenes, space images, or pictures taken at night. I also experimented with the integration of color-distribution-based methods and with the SCIELAB space [Zhang et al., 1997]. Automatically applying the DRB to detected spill-color regions on the boundary is a matter of further research, but in the end there will always be cases where a computer cannot distinguish between detail and noise without additional user interaction or information about the content.



10.9 Conclusion

This chapter illustrated that a generalization of E-Chalk’s instructor extraction approach solves problems that currently cannot be addressed by state-of-the-art solutions, which mostly rely on graph-cut algorithms. Using graph-cut-based approaches, recovering the boundary of an object can be impractical and sometimes even impossible. Highly detailed textures, such as hair or trees with branches, prevent object extraction from being reduced to finding a simple cut. Especially in the case of videos, there may be fuzzy edges, for example due to interlace effects, noise, or motion blur. By changing the way the confidence matrix is generated, the core of the algorithm can be used for a variety of applications. The generated color signatures can further be used to cope with highly detailed textures even with sub-pixel accuracy. The presented approach can be applied to a variety of other problems where a foreground object should be tracked, extracted, and/or identified. SIOX has already been put into practical use in GIMP, an open-source image manipulation program.

Since the release of an open-source reference implementation in Java [48], implementations of SIOX are currently also being integrated into Krita (part of KOffice) [29], Inkscape [27], and Blender [71]. Figures 10.16 and 10.17 show preliminary screenshots from Blender and Inkscape. The next chapter will present yet another application of the algorithm, namely the use of SIOX in combination with a 3D camera.


Chapter 11

Hardware-Supported Instructor Extraction

Although the previous chapters present a rather robust and usable approach, image segmentation heuristics are always at risk of failing in certain situations. Since the logic behind human vision is not yet understood, the only alternative is provided by specialized hardware. Although more expensive to acquire, range-sensing devices might provide a fail-safe alternative for E-Chalk’s lecturer segmentation task. 3D laser scanners use triangulation, which is computationally expensive and not real-time capable. Stereo cameras obey the same rules as texture-based segmentation approaches. A rather new kind of device is the time-of-flight 3D camera. These cameras promise to make real-time segmentation of objects easier, avoiding the practical issues resulting from 3D imaging techniques based on triangulation or interferometry. This chapter presents investigations on using a time-of-flight 3D camera for extracting the instructor acting in front of an electronic chalkboard.

11.1 The Time-of-Flight Principle

Time-of-flight 3D cameras are now becoming available (see for example the offers by 3DV Systems, Inc. [1], Canesta, Inc. [11], CSEM [13], or PMD Technologies, Inc. [39]). I tested a miniature camera called “SwissRanger SR-2” [13] built by the Swiss company CSEM and a prototype camera called “Observer 1K” built by the German company PMD Technologies [39].

A schematic view of the design of time-of-flight 3D cameras is shown in Figure 11.1. A time-of-flight camera works very similarly to radar. The camera consists of an amplitude-modulated infrared light source and a sensor field that measures the intensity of backscattered infrared light. The infrared source constantly emits light whose intensity varies sinusoidally. Object A reflects almost the maximum intensity while object B, having a greater distance to the camera, reflects less light. This is because at any specific moment, objects that have different camera distances are reached by different parts of the sine wave. As shown in Figure 11.1, the incoming light is then compared to the sinusoidal reference signal which triggers the outgoing infrared light. The phase shift of the outgoing versus the incoming sine wave is then proportional to the time of flight of the light reflected by a distant object.


Figure 11.1: Left image: Two objects reflect amplitude-modulated infrared light. Object A reflects more light than object B because at the point in time when the photons hit object A, they were emitted with maximum light intensity. The photons that hit object B at the same time were emitted before that, with lower intensity. Right image: The actual distance can be calculated by measuring the phase shift between the emitted and the reflected light. If the distance of the reflecting object were zero, the two curves would have no phase shift. The farther away the object, the greater the phase shift.

This means that by measuring the intensity of the incoming light, the phase shift can be calculated and the cameras are able to determine the distance of a remote object that reflects infrared light. The output of the cameras consists of depth images and, as a byproduct, a conventional low-resolution gray-scale video. A detailed description of the time-of-flight principle can be found in [Luan et al., 2001, Oggier et al., 2004, Gokturk et al., 2004].

The depth resolution depends on the modulation frequency. For the experiments, a frequency of 20 MHz was used, which gives a depth range between 0.5 m and 7.5 m, with a theoretical accuracy of about 1 cm. Usually, the cameras allow configuring frame rate, integration time, and a user-defined region of interest (ROI) by writing to internal registers of the camera. The cameras then calculate the distances in an internal processor. The resolution of the SwissRanger camera is 160 × 124 non-square pixels. The resolution of the Observer 1K is 64 × 16 non-square pixels. Both cameras behaved similarly in my experiments, but the low resolution of the Observer 1K made it unusable for instructor segmentation purposes. The following descriptions therefore refer to the SwissRanger camera.
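
The relation between phase shift and distance can be written down directly; the following small sketch (hypothetical names) also reproduces the 7.5 m unambiguous range for the 20 MHz modulation used here:

/** Worked example of the time-of-flight relation: the phase shift between the
 *  emitted and the received modulation signal is proportional to the distance. */
final class TimeOfFlight {
    static final double C = 299_792_458.0;                // speed of light in m/s

    /** Distance for a measured phase shift (radians) at modulation frequency f (Hz).
     *  The light travels to the object and back, hence the factor 2 in the
     *  denominator: d = c * phi / (4 * pi * f). */
    static double distance(double phaseShift, double modulationFrequencyHz) {
        return C * phaseShift / (4.0 * Math.PI * modulationFrequencyHz);
    }

    /** Largest distance that can be measured without phase wrap-around. */
    static double unambiguousRange(double modulationFrequencyHz) {
        return C / (2.0 * modulationFrequencyHz);         // ~7.49 m for 20 MHz
    }
}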

11.2 Setup

The setup is essentially the one described in Section 9.3. The SwissRanger 3D camera is mounted on top of a video camera, and both cameras capture data synchronously. Figure 11.2 shows the 3D camera as well as the camera stand for two cameras.

In the experiments, the achieved frame rate varied around eleven frames per second. When an object is too close to the camera lens, overflows occur due to the large amount of light that is reflected. When an object is too far away from the lens, the distance measurement becomes imprecise. Ideally, the camera should be located at the end of the room to minimize the optical disparity between the video and the 3D camera without requiring a mirror setup.


Figure 11.2: Left image: The SwissRanger time-of-flight 3D camera by CSEM. One can see the infrared light emitting diodes on both sides of the receiving lens. Right image: The camera stand that allows a 2D camera and the SwissRanger to capture the same scene.

Due to the restricted range of 7.5 m, the disparity is noticeable. However, a practical range for segmentation is between 2 m and 4 m in front of the electronic chalkboard. This range is optimal in terms of minimal visual noise and low overflow probability.

11.3 Technical Issues

Theoretically, the entire segmentation problem is reduced to a simple depth-range check. In the image captured by the 2D camera, only those pixels are considered to belong to the foreground that correspond to pixels with 3D camera depth coordinates smaller than the distance of the electronic chalkboard. In practice, however, several issues have to be solved first. They are currently being investigated by Neven Santrac as part of a diploma thesis. A summary of his current results can also be found in [Santrac et al., 2006].

Camera Synchronisation

Unfortunately, the tested time-of-flight cameras do not support hardware-triggered synchronisation. It is therefore impossible to guarantee that the 2D camera and the 3D camera capture the frames at exactly the same time. While this is a technical issue and may be solved by the manufacturers in the future, it is hard to match the images of the two devices in software. Measuring the time difference between the two cameras manually is cumbersome, especially when the frame rates of the two cameras differ.

Resolution Inequality

The two cameras have different resolutions. In order to provide for a smooth viewing experience, the resolution for the lecturer extraction video should be at least 640 × 480. The maximum resolution of the SwissRanger SR-2 is 160 × 124. The resolution of the camera can be raised using different approaches. Cuneyt Goktekin and Frank Darius have found out1 that the resolution of the 3D camera can be increased by pixel shifting, a technique that is commonly used for 2D photo cameras and recently also for video cameras [Ben-Ezra et al., 2005].

1Unfortunately, they have not published the results of their experiments at the time of writing this text.


Figure 11.3: Two examples of raw depth-range segmentation (i. e., using only a camera calibration and a depth-range threshold). Noise, motion blur, and camera desynchronization are the three most disturbing factors.

The resolution is increased for each dimension at the cost of a lower frame rate by combining several pictures from slightly shifted camera positions. Pixel shifting, however, requires additional hardware to shift the sensor device for each frame by a few micrometers. In [Diebel and Thrun, 2005], Markov Random Fields are used to increase the resolution. They, too, use a combination of a 2D camera and a 3D camera, exploiting the fact that discontinuities in range (i. e., in the 3D image) and coloring (i. e., in the 2D image) tend to co-align. Experiments with their approach indicated that it works reasonably well. Figure 11.4 shows an example. However, the method is computationally expensive and far from real-time performance. A downside is that the method sometimes blurs the edges between instructor and chalkboard. Sometimes the influence of the 2D camera image gets too strong and reflections from the board surface appear in the final segmentation result.

Lens distortion

As the cameras are positioned close to the board in order to avoid noise and overflows, the cameras’ radial distortions have to be eliminated. Having fixed both cameras in the right position, lens distortion can be eliminated using well-known calibration methods, for example [Tsai, 1987]. However, the calibration has to be repeated whenever the position of a camera changes. This makes the solution very impractical for a mobile setup.

Light Scattering and Motion Blur

Objects reflect light in different angles and in different intensities. The properties of light reflection depend on the form and on the material of a given object. As a result, the depth measurement is not texture- and material-independent. Since darker objects reflect less light, the output of the camera is noisier than in the measurement of brighter objects. Light scattering is a primary cause of edge blurring. 3D time-of-flight cameras tend to blur motions even more than regular 2D cameras. Motion interferes with the measurement of the reflected light. A quickly moving arm, for example, reflects light from two positions in one frame. Motion blur is particularly observed at the borders of moving objects.


Figure 11.4: Left: Depth image obtained by the camera. Right: Depth image enhanced using the method proposed by [Diebel and Thrun, 2005]. The white spot in the center is an artifact produced by reflection.

Noise

Noise is one of the biggest issues. The signal-to-noise ratio shrinks radially from the center to the corners. The highest signal-to-noise ratio is found in the center of the camera view and the smallest in the corners. The measurement error induced by noise is up to 4 % (this means 30 cm over the total range of 7.50 m). When an electronic chalkboard with back-projection is used, noise becomes even worse because residual light of the electronic chalkboard is caught by the lens, which interferes with the measurement of the reflected infrared light. The hands of the instructor can usually not be distinguished from the background by the camera since they are too close to the board surface and moving too fast. Figure 11.3 shows two sample frames with typical noise distortion and other issues discussed here.

11.4 Segmentation Approach

While the above-mentioned problems make a direct segmentation based on the output of the camera virtually impossible, an advantage of the time-of-flight principle is that the method does not rely on motion- or texture-based segmentation techniques. It is therefore possible to use the 3D camera in combination with complementary texture-based methods. For this reason, a combination of the output of the 3D camera with the method presented in Chapter 9 provides a working solution. The idea is to find a subset of the instructor as well as a subset of the background using the depth information provided by the camera and then to use the color signature-based segmentation approach.

The depth image is mapped onto the 2D image by using a grid calibration. The resolution of the depth image is increased using pixel duplication. In order to get a subset of the background, a connected component search is performed on the depth picture of the camera. The biggest connected set of pixels that is in a predefined depth range is considered to be a mixture of background and instructor. A generously chosen bounding box of the corresponding area in the 2D image is considered to be a superset of the instructor. The regions outside the bounding box are considered background. This heuristic sometimes fails for the hands of the instructor when these are close to the corners of the image where the camera’s signal-to-noise ratio is very small. In most cases, however, the method gives good results.


Figure 11.5: Top left: The depth data of the 3D camera is used to construct both a superset and a subset of the foreground. Other images: A few sample frames showing the segmentation result using the color segmentation approach described in the previous chapter. The instructor is robustly segmented while working with any content on the electronic chalkboard even if he or she is moving quickly. Results taken from [Santrac et al., 2006].

A small subset of the instructor is chosen with the following strategy: The biggest connected component is shrunk radially from the edges until the variance of the corresponding depth information in the area is below a threshold. This way, an input is generated that is similar to the input described in Section 10.6. Figure 11.5 shows a sample input image with the constructed bounding box and the subset of the instructor along with a few sample results of the following color segmentation. The color signatures are regularly updated in order to adapt to changing background and foreground.

The approach works in real time with 25 frames per second on a video with 640 × 480 pixels and is rather resistant to blurred edges and desynchronization effects. Only a subset of the noisy 3D camera image is used. The fine details are segmented by the color segmentation.
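
A compressed sketch of this hybrid strategy is given below (hypothetical names and thresholds; the connected-component search is simplified to a bounding box of all in-range depth pixels, and the variance-based shrinking is replaced by a fixed shrink factor):

/** Builds a trimap for the color-signature segmentation from the
 *  low-resolution depth image (pixel duplication, bounding box, core). */
final class DepthTrimapSketch {

    static final byte BACKGROUND = 0, UNKNOWN = 1, FOREGROUND = 2;

    static byte[][] build(float[][] depth, int videoW, int videoH,
                          float minDepth, float maxDepth, int margin) {
        byte[][] trimap = new byte[videoH][videoW];          // all BACKGROUND by default
        int srcH = depth.length, srcW = depth[0].length;

        // 1. Upscale by pixel duplication and find the bounding box of all
        //    pixels whose depth lies in the instructor range.
        int minX = videoW, minY = videoH, maxX = -1, maxY = -1;
        for (int y = 0; y < videoH; y++) {
            for (int x = 0; x < videoW; x++) {
                float d = depth[y * srcH / videoH][x * srcW / videoW];
                if (d >= minDepth && d <= maxDepth) {
                    minX = Math.min(minX, x); maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y); maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return trimap;                                    // nothing within range
        }

        // 2. A generously enlarged bounding box is treated as a superset of the
        //    instructor; everything outside stays known background.
        for (int y = Math.max(0, minY - margin); y <= Math.min(videoH - 1, maxY + margin); y++) {
            for (int x = Math.max(0, minX - margin); x <= Math.min(videoW - 1, maxX + margin); x++) {
                trimap[y][x] = UNKNOWN;
            }
        }

        // 3. A shrunken core of the box serves as the known-foreground sample.
        int cx = (minX + maxX) / 2, cy = (minY + maxY) / 2;
        int coreW = (maxX - minX) / 4, coreH = (maxY - minY) / 4;
        for (int y = Math.max(0, cy - coreH); y <= Math.min(videoH - 1, cy + coreH); y++) {
            for (int x = Math.max(0, cx - coreW); x <= Math.min(videoW - 1, cx + coreW); x++) {
                trimap[y][x] = FOREGROUND;
            }
        }
        return trimap;
    }
}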

11.5 Conclusion

3D time-of-flight cameras initially promise an efficient way to solve the instructor segmentation problem. In practice, however, the exact calibration and synchronization of the two cameras is tricky. The 3D cameras do not yet provide any explicit synchronization capability, such as that provided by many FireWire cameras. The low x- and y-resolution of the 3D camera results in coarse edges. The z-resolution is just not high enough, since the instructor usually stands very close to the board and the range of interest usually is about 50 cm. The low signal-to-noise ratio does not allow for direct segmentation. Adequate segmentation is only possible in combination with other techniques. Besides overflows, there are other artifacts caused by quickly moving objects, light scattering, background illumination, or the non-linearity of the measurement.


Last but not least, using a time-of-flight camera requires a large budget.

For instructor segmentation, the ideal time-of-flight camera should offer a higher depth range (for example 15 m) and a z-axis resolution of a few millimeters. The image resolution should be at least PAL. It would be ideal if a color video chip could be combined with the depth-measurement chip in a single unit. Given such a camera at a low price, the computational costs of image segmentation could be dramatically reduced. Because a time-of-flight 3D camera captures depth and intensity information at acceptable frame rates and in a rather texture-independent way, the segmentation problem is theoretically reduced to camera calibration and a simple depth-range check. For the time being, the software segmentation approach presented in Chapter 9 and its generalization in Chapter 10 proved to be an adequate tool for combination with 3D time-of-flight cameras in order to facilitate segmentation.


Chapter 12

Conclusion

12.1 Summary

This dissertation presents an audio and video recording and transmission system for lectures held with an electronic chalkboard that was developed in conjunction with the E-Chalk project. Related work builds upon standard transmission and archiving methods that were implemented for traditional radio and TV stations. E-Chalk’s audio and video systems were created under the assumption that computers can be better utilized to facilitate the creation of multimedia content in a more automated and yet integrated fashion. The result permits distance teaching to be a side effect of the use of the electronic chalkboard in the classroom. Moreover, most of the presented research is general and can also be used in different application areas. The contributions can be summarized as follows:

• The dissertation starts with discussing the underlying architectural ideas of the E-Chalk system for both server and client side. A novel component-oriented software framework is introduced under the name SOPA. SOPA builds on a component-management framework based on the OSGi standard and on a component-deployment framework. On top of these two frameworks, SOPA provides a component-assembly mechanism that makes the creation of typical multimedia streaming and processing applications easier. It supports collaborative extension and updating of the system while reducing administrative overhead. SOPA makes it possible to rapidly combine E-Chalk with other streaming applications and to integrate new formats and content types easily.

• The second part presents E-Chalk’s audio system and its evolution from a system that aimed to provide a solution for broadcasting traditional radio programs over the Internet to a fully integrated component inside E-Chalk. The system was developed while in use at different educational institutions, in response to feedback and requests from users. One of the major issues concerns the resulting audio quality when automatically recording in classrooms or lecture halls. A system called Active Recording was created that simulates the typical work of an audio technician to help instructors create better speech recordings. The system measures certain critical factors in the sound equipment, monitors possible malfunctions during recording, and filters out typical audio distortions.



• Then, E-Chalk’s video system is described. While traditional video codecs are not suitable for the transmission of chalkboard content, the exclusive replay of vector-based board strokes is suboptimal, too. According to our experience, which is also backed by several studies, the remote student needs to see the image of the instructor because it conveys important contextual information. A mere side-by-side replay of the video and the board content, however, results in an ergonomic issue called the split attention problem. This dissertation presents a solution where the instructor is filmed in front of the electronic board and his or her image is automatically cut out in order to be pasted semi-transparently over the vector-based board image during remote replay. This method solves the split attention problem as well as several layout issues and also eases replay on small devices, such as mobile phones.

• Finally, a generalization of the instructor-extraction method, called SIOX, is presented. It enables semi-automatic segmentation in image and video manipulation programs as well as the improvement of 3D time-of-flight camera segmentation results. A thorough evaluation of the solution and its implementation in several common open-source applications is presented.

12.2 Future Work

Although the presented system already uses methods that go beyond standard computer-based audio and video recording techniques, many problems remain, and the following paragraphs present a small selection of them.

Chapter 5 comes to the conclusion that no matter which current format is used, the transmission of the overlaid instructor requires at least a DSL or cable connection. However, if only gestures without facial expressions are transmitted, the overall bandwidth (including audio) can be brought down to less than 64 kbit/s. This is achieved by transmitting only the outline of the instructor as shown in Figure 12.1. The idea was inspired by the Italian cartoon figure, “La Linea”, by Osvaldo Cavandoli. Of course, the usefulness of such a transmission is still to be evaluated. However, encoding the shape of the instructor as a polyline allows direct application of standard online-character-recognition methods. This would allow for a recognition of some of the instructor’s gestures by the computer. Certain shape patterns, for example, could be identified as wiping of the board, and a marker could be set to provide navigational hints for the replay.

Another problem concerns the scalability of the board data. Most standard scaling methods are easily able to scale down images from one screen resolution to another (for example from 1024 × 768 pixels to 640 × 480 pixels). The resolution of electronic chalkboards, however, is becoming higher and higher, as it is desirable to have a writing resolution that comes very close to that of a real chalkboard. Handheld devices, on the other hand, are naturally constrained in their display size and resolution. Replaying electronic chalkboard lectures on handheld devices, such as mobile phones or PDAs, therefore requires more and more new scaling strategies. One possibility for proper replay of electronic chalkboard content on small devices would be to show only a smaller region of interest.


Figure 12.1: In order to allow transmission of the overlaid instructor even for low bandwidth connections, the instructor could be transmitted as an outline only. The image shows a preliminary test implementation of the idea in MPEG-4.

In order to determine what region is currently of interest, both the board data and the image of the overlaid instructor could be used. For example, a region of interest could be defined around newly appearing board content or around the instructor’s acting hand.

Concerning audio recording, the goal should be to get speech recording quality enhancement without the constraint of having to use a directed microphone and without the need for creating an audio profile. Speech recognition and speaker segmentation methods, as well as physiological models of the human auditory system, could be included to simulate the work of an audio technician during a recording session. The system should be able to decide when to apply a special filter technique out of a set of several possibilities. The system should be capable of handling multiple microphones and inputs to enable switching between classroom questions and the lecturer’s voice. One would also like to interface with external studio hardware, such as mixing desks, to enable automatic operation. The system should be able to enhance the quality as far as possible given the sound equipment at hand.

12.3 Final Note

The word “multimedia” triggers different associations, depending on whose ears are actually perceiving it. A shop assistant in an electronics store will surely point you to the department where television sets, all kinds of audio receivers and players, and lately also digital photo cameras are sold. Artists and designers often use this word to refer to activities connected to digital content creation, such as building web sites. When school teachers use “multimedia”, they usually mean that their classroom instruction is supported by some kind of audio-visual content.



Computer scientists primarily see the huge amount of information that has to be handled in contrast to regular text files or binary programs. As a result, most of the research is focused on building hardware and creating algorithms that are able to analyze, process, and retrieve the data masses. With the exception of digital photography, content creation is seldom discussed in scientific articles because multimedia content is mostly assumed to be already given. Moreover, multimedia is hardly ever discussed in the sense of “integrated media” unless, perhaps, in papers on pedagogy or psychology. Mostly, multimedia simply refers to the combination and simultaneous use of different media.

In my opinion, multimedia as a research field must investigate the creation of methods to generate new combinations of different sensory input and output using appropriate devices while optimizing the human perceptibility and the interaction styles. The term “multi” implies more than the addition of an audio track to a sequence of images, or the combination of digitized pictures with a set of text paragraphs to form an electronic book. My personal definition of multimedia is holistic: Content that uses different “media” (types of content) can easily be created, edited, and played back in an integrated fashion, so that the resulting combination forms more than the sum of its parts. I have focused my work in this dissertation bearing in mind this understanding of the double plural “multimedia” and researched solutions that promote this idea.

It is my hope that the work described in this dissertation will be seen as an example and have an impact on the future development of multimedia content creation systems.


Appendix A

E-Chalk: Project Overview

This appendix provides an overview of the components of the E-Chalk system and their contributors. Figure A.1 provides a conceptual diagram.

E-Chalk Server

Board System E-Chalk was conceived as an update of the traditional chalkboard by Raul Rojas in 1999. Wolf-Ulrich Raffel created a first prototype of the board software as a diploma thesis [Raffel, 2000]. Further development was done by Lars Knipping [Knipping, 2005].

SOPA SOPA has been developed by Gerald Friedland and is described in this dissertation. SOPA contains the component framework Oscar, which has been developed by Richard Hall [Hall and Cervantes, 2004], and the component-discovery engine Eureka, which was developed by Karl Pauls [Pauls, 2003]. The visual node editor was implemented by Bastian Voigt. A frame-by-frame testing environment for SOPA video nodes was developed by Kristian Jantz.

Audio System A first version of the audio system, called World Wide Radio (WWR) [Friedland and Lasser, 1998], was conceived in 1997 by Gerald Friedland and Tobias Lasser. E-Chalk’s first prototype included a Java rebuild of the audio system, called World Wide Radio 2 (WWR2), created by Gerald Friedland in cooperation with Bernhard Frotschl [Manhart, 1999]. Since version 1.2, E-Chalk uses the adaptive World Wide Radio 3 system by Gerald Friedland, which is described in this thesis.

Video System An initial prototype video codec was created by Gerald Friedland in cooperation with Sven Behnke. An automatic video hardware detection was developed by Kristian Jantz. The rest of the video system has been developed by Gerald Friedland and is described in this thesis. Neven Santrac has contributed several experiments on time-of-flight 3D camera segmentation [Santrac et al., 2006].

SIOX SIOX is a spin-off of the instructor video segmentation approach. It is described in this thesis. The SIOX Java Reference API was written by Gerald Friedland in cooperation with Kristian Jantz and Lars Knipping. The GIMP implementation of SIOX was developed by Gerald Friedland, Kristian Jantz, Tobias Lenz, and Sven Neumann. The Inkscape implementation was derived by Bob Jamison, the Krita version by Michael Thaler, and the Blender version by Brecht Van Lommel. Experiments on the usage of SIOX for robotic soccer were performed by Fabian Wiesel.

E-Chalk Startup Wizard An initial version of the E-Chalk Startup Wizard was developed by Gerald Friedland. Since E-Chalk 1.1, a new Startup Wizard has been used that was developed by Lars Knipping [Knipping, 2005].

Macro Recorder The macro recorder was developed by Lars Knipping [Knipping, 2005].

Audio Wizard The Audio Profile Wizard, described in this thesis, was developed by Gerald Friedland and Kristian Jantz.

Tools

Lecture Repair Tool The Lecture Repair Tool was developed by Kristian Jantz.

E-Chalk to Video Converter The E-Chalk-to-video converter was developed by Kristian Jantz. It was extended for MPEG-4 support by Benjamin Jankovic.

LMS Connectivity The Blackboard connectivity was developed by Thomas Reimann.

DBMS Connectivity An initial version of the database connectivity for E-Chalk was developed by Peter Siniakov, Sebastian Frielitz, Robert Gunzler, and Gerald Friedland. It was later refined by Sebastian Frielitz and Robert Gunzler.

Keyword Generator Automatic indexing of E-Chalk lectures was investigated by Michael Theimer [Theimer, 2004].

PowerPoint to E-Chalk The PowerPoint converter was done by Shirzad Kamawall and Alexandar Rakovski.

Lecture Replay

Java Client The board client was created by Wolf-Ulrich Raffel [Raffel, 2000]. The development was taken over by Lars Knipping [Knipping, 2005]. An initial version of the console was developed by Karsten Flugge before Lars Knipping rebuilt it. The audio, video, and slide-show clients were developed by Gerald Friedland.

WMV Client An experiment on a Windows Media Video format client for E-Chalk was performed by Stephan Lehmann.

PDA and Mobile Phone Replay Experiments on E-Chalk lecture replay for handheld devices were performed by Gerald Friedland.

MPEG-4 Replay MPEG-4 replay was developed by Benjamin Jankovic in cooperation with Gerald Friedland [Jankovic et al., 2006]. The “La Linea” prototype was implemented by Benjamin Jankovic after a suggestion by Gerald Friedland.

Chalklets

Logic Simulator The Logic Simulator Chalklet was developed by Marcus Liwicki [Liwicki, 2004].

Python Chalklet Henrik Steffien and Brendan O’Connor developed the Python-interpreting Chalklet [Steffien, 2004].

Geometry Work on the geometry Chalklet was performed by Andreas Stoffl and Ittay Eyal.

Algorithm Visualizations Chalklets on algorithmic animations were developed by Margarita Esponda [Arguero, 2004]. They are now maintained by Kristian Jantz.

NeuroSim Chalklet Olga Krupina authored a Chalklet on Neural Network simulations as part of her dissertation [Krupina, 2005].

Chess Chalklet Marco Block built a Chalklet that enables playing chess [Block et al., 2004b].

Deployment

i18n Tools Tools that automate the internationalization were built by Lars Knipping [Knipping, 2005].

Installer The installer was built by Gerald Friedland using InstallAnywhere by ZeroG, Inc. Some custom code was developed by Abid Hussain.

Version Converters Converters were developed to upgrade recordings across format changes in subsequent releases of E-Chalk. The WWR2-to-WWR3 converter was developed by Gerald Friedland, the board format version converter by Lars Knipping.

Lecture Editing

Exymen Exymen and all but three plug-ins were developed by Gerald Friedland [Friedland, 2002a, Friedland, 2002b]. Mary Ann Brennan developed a JMF audio plug-in. Kristian Jantz developed a SID-file plug-in [Jantz et al., 2003, Friedland et al., 2004a]. Thomas Schakert developed a scripting language that is able to control the entire functionality of the editor in batch mode.

Hardware

Multi-Screen Board The data wall was designed by Raul Rojas and built by Christian Zick in cooperation with various staff members of the Freie Universitat Berlin. The laser-pen tracking used by the data wall builds on the work of Michael Diener [Diener, 2003], continued by Gerald Friedland, and completed by Kristian Jantz [Jantz, 2006]. He also built Windows and Linux versions of a minimalistic LED-ring tracking.

Bluetooth Keypad The Bluetooth extension possibilities for the data wall were tested by Jorg Rebenstorf [Rebenstorf, 2004].

Other Contributions and Ongoing Projects

Joachim Schulte conducted an extensive user evaluation on the system in university teaching [Schulte, 2003]. Stefanie Eule evaluated the deployment in K-12 schools [Eule, 2004]. Damian Schmidt built the poor man’s slide-show recorder. Oleksiy Varchyn planned the combined stand for the 3D camera and the 2D camera. Michael Beckmann is currently developing a Haskell interpreter Chalklet as a bachelor’s thesis. Alexander Luning is currently investigating, as a diploma thesis, scaling methods to enable playback of high-resolution board content on handheld devices.

Figure A.1: Conceptual Overview of the E-Chalk system.

Appendix B

SOPA: Technical Details

This appendix presents technical insights into the grammars and data structures of the SOPA framework, in addition to the description in Chapter 4.

B.1 DTD of SOPA Graph Serialization

The following listing presents the DTD of SOPA’s default media graph description.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE SOPA:script [
<!ENTITY source "de.echalk.sopa.SourceNode">
<!ENTITY target "de.echalk.sopa.TargetNode">
<!ENTITY pipe "de.echalk.sopa.PipeNode">
<!ENTITY mixer "de.echalk.sopa.MixerNode">
<!ENTITY fork "de.echalk.sopa.ForkNode">
<!ENTITY generic "de.echalk.sopa.MediaNode">
<!ELEMENT SOPA:script ((synchronize*)|(on*)|(include*)|(loadprops*)
                       |(setproperty*)|(service*))>
<!ELEMENT on ((synchronize*)|(on*)|(include*)|(loadprops*)
              |(setproperty*)|(service*))>
<!ATTLIST on match CDATA #REQUIRED>
<!ELEMENT setproperty EMPTY>
<!ATTLIST setproperty key CDATA #REQUIRED>
<!ATTLIST setproperty value CDATA "">
<!ATTLIST setproperty persistent (true|false) "false">
<!ELEMENT error ((PCDATA)|(property))>
<!ATTLIST error level (message|debug|warning|error|fatalerror) "error">
<!ELEMENT include ((PCDATA)|(property))>
<!ELEMENT loadprops ((PCDATA)|(property))>
<!ELEMENT property EMPTY>
<!ATTLIST property key CDATA #REQUIRED>
<!ATTLIST property default CDATA "">
<!ELEMENT service (((PCDATA)|(property*))*)>
<!ATTLIST service label CDATA #REQUIRED>
<!ATTLIST service type CDATA #REQUIRED>
<!ATTLIST service match CDATA #REQUIRED>
<!ATTLIST service target CDATA "">
<!ELEMENT synchronize (((PCDATA)|(property*))*)>
]>
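For illustration, a minimal graph description conforming to this DTD could look as follows. The property key, service labels, and match filters are made-up examples and not an actual E-Chalk configuration; the fully qualified type names are the ones defined as entities above:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<SOPA:script>
  <setproperty key="audio.samplerate" value="16000" persistent="false"/>
  <service label="capture" type="de.echalk.sopa.SourceNode"
           match="(nodename=AudioCapture)"/>
  <service label="encode" type="de.echalk.sopa.PipeNode"
           match="(nodename=ADPCMEncoder)" target="capture"/>
</SOPA:script>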

B.2 LDAP Query Syntax

SOPA’s main mechanism to locate nodes and to evaluate conditions is LDAP queries as standardized by RFC 1960. An impression of the power and the limits of this language can be obtained from its syntax. The following BNF grammar has been extracted verbatim from RFC 1960 [Howes, 1996].

<filter>     ::= '(' <filtercomp> ')'
<filtercomp> ::= <and> | <or> | <not> | <item>
<and>        ::= '&' <filterlist>
<or>         ::= '|' <filterlist>
<not>        ::= '!' <filter>
<filterlist> ::= <filter> | <filter> <filterlist>
<item>       ::= <simple> | <present> | <substring>
<simple>     ::= <attr> <filtertype> <value>
<filtertype> ::= <equal> | <approx> | <greater> | <less>
<equal>      ::= '='
<approx>     ::= '~='
<greater>    ::= '>='
<less>       ::= '<='
<present>    ::= <attr> '=*'
<substring>  ::= <attr> '=' <initial> <any> <final>
<initial>    ::= NULL | <value>
<any>        ::= '*' <starval>
<starval>    ::= NULL | <value> '*' <starval>
<final>      ::= NULL | <value>

〈attr〉 is a string representing an AttributeType. 〈value〉 is a string representing an AttributeValue. If a 〈value〉 must contain one of the characters ‘*’ or ‘(’ or ‘)’, these characters should be escaped by preceding them with the backslash character.
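To give an impression of how such filters read in practice, here are two examples in this syntax. Note that the attribute names (nodename, mediatype, version) are hypothetical and not necessarily the attributes SOPA registers for its nodes:

(nodename=IdentityPipe)
(&(mediatype=audio)(!(version<=99)))

The first filter matches any node whose nodename attribute equals IdentityPipe; the second matches audio nodes whose version attribute is greater than 99.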

B.3 SOPA Command Line Commands

SOPA’s command-line console allows direct access to some framework methods and provides debugging functionality. The following is a list of all available commands as it is printed on the console.

Basic commands:
help - Show this help
hide - Hide console
clear - Clear console
ver - Show application version
gc - Request garbage collection
finalize - Request finalization of pending objects
free - Show memory statistics
gsp - Get list of system properties
gp - Get list of all SOPA properties
spp <key> <value> - Set persistent property <key> to <value>
swp <key> <value> - Set work property <key> to <value>
rp <key> - Remove property named <key>
ts - List all threads
suspend <no> - Suspend thread <no> (Caution: Read Java doc!)
resume <no> - Resume thread <no> (Caution: Read Java doc!)
interrupt <no> - Interrupt thread <no>
tstop <no> - Stop thread <no> (Caution: Read Java doc!)
destroy <no> - Destroy thread <no>
setPriority <no> <pri> - Set priority of thread <no> to <pri>
shutdown - Request regular shutdown
seppuko - Immediate exit w/o saving (DANGEROUS: ALL UNSAVED DATA LOST)
save <filename> - Dump console content to <filename>
publish <classname> <url> - Publish node into Eureka network (if service available).
unpublish <classname> <url> - Remove node from Eureka network (if service available).
oscar <command> - Pass <command> to oscar. (In case of ambiguity)

Commands for the Oscar subsystem:
bundlelevel <level> <id> ... | <id> - set or get bundle start level.
cd [<base-URL>] - change or display base URL.
headers [<id> ...] - display bundle header properties.
help - display shell commands.
install <URL> [<URL> ...] - install bundle(s).
obr help - Oscar bundle repository.
packages [<id> ...] - list exported packages.
ps [-l] - list installed bundles.
refresh - refresh packages.
services [-u] [-a] [<id> ...] - list registered or used services.
shutdown - shutdown Oscar.
start <id> [<id> <URL> ...] - start bundle(s).
startlevel [<level>] - get or set framework start level.
stop <id> [<id> ...] - stop bundle(s).
uninstall <id> [<id> ...] - uninstall bundle(s).
update <id> [<URL>] - update bundle.
version - display version of Oscar.

B.4 A Minimal PipeNode

The following code sample is a demonstration implementation of a minimal PipeNode that implements the identity function. The example shows that very few methods have to be implemented and that only a minimal set of proprietary concepts has to be learned by any developer wanting to create a custom filter in the SOPA framework. Even though a PipeNode requires the methods of both a SourceNode and a TargetNode, no more than ten methods have to be implemented.

import java.util.*;
import de.echalk.sopa.*;

public class IdentityPipe extends PipeNode {

    private FormatDescriptor outfd;

    // Metadata for graph resolution
    public int getVersion() {
        return 100; // is divided by 100 = 1.00
    }

    public String getNodeName() {
        return "IdentityPipe";
    }

    public String getCopyright() {
        return getNodeName() + " (c) 2003-2006 by G. Friedland";
    }

    public Properties getProperties() {
        Properties p = new Properties();
        p.put("author", "Friedland");
        // Put additional, optional properties here
        return p;
    }

    public FormatDescriptor[] needsAsInput() {
        return null; // null means: handles any format
    }

    public FormatDescriptor[] givesAsOutput(FormatDescriptor inputformat) {
        return new FormatDescriptor[] { inputformat };
    }

    // Operational code
    public void startWork(FormatDescriptor formattoproduce, TargetNode t) {
    }

    public int syncputdata(SourceNode sn, Object o, int numframes) {
        outfd = sn.getDataFormat(); // sn, o are guaranteed to be non-null
        return getTarget().putdata(this, o, numframes);
    }

    public void stopWork() {
    }

    public FormatDescriptor getDataFormat() {
        return outfd;
    }
}

Appendix C

Board-Event Encoding

This appendix provides a summary of the E-Chalk board event format and its mapping to MPEG-4 BIFS.

C.1 The E-Chalk Board Format

The E-Chalk board-event format is thoroughly described in [Knipping, 2005], Section 4.11. This section only provides a short overview. The events are described by a simple line-based, human-editable ASCII format. After a few header lines that provide version information, the resolution of the board, the lecture title, and the background color, every event is stored on a separate line with the following syntax:

<timestamp>"$"<event>["$"<arguments>*]

The timestamp is the hexadecimal-coded number of milliseconds that have passed since the start of the lecture. The dollar character serves as token delimiter. After the mandatory name of the event, a number of parameters can be passed to the event, again delimited by the dollar sign. E-Chalk knows the following events (a short sample excerpt is shown after the list):

• Nop This event can be used to add comments. The event as well as any string passed as argument is ignored.

• RemoveAll This event clears the entire board and sets the board position back to the beginning. It takes no arguments.

• Undo This event triggers the undo manager to undo the last set of strokes. The event is inserted when the user pushes the respective button in the board toolbox.

• Redo This event is the inverse of undo.

• Terminate The command is only used in live transmissions to signal termination of the transmission.

• Scrollbar This event takes one integer parameter that specifies the new vertical offset of the board’s top position.

• Form This event marks that something is actually drawn on the board. The next parameter specifies what exactly:

– Line This event takes the arguments $x0$y0$x1$y1$r$c and triggers the drawing of a line segment from point (x0, y0) to point (x1, y1) with stroke radius r and color c.

– Image This event gets an id and two coordinates and inserts image number id at the specified position. A fourth argument specifies whether the inserted image may be updated. Images are tagged as updatable if they actually show screenshots from a Java Applet inserted into the board.

– Text This event takes a MIME-encoded [Freed and Borenstein, 1996a] string, two coordinates, a color, and a font size. After this event, one of several possible events can follow:

1. Text$End This event signals that the text can be put directly on the board. The event follows directly after a Form$Text event if the text was printed out automatically, for example the response of a CGI script.

2. Text$Char The event takes a character as first argument. A set of these events is used between a Form$Text event and a Text$End event if the text is typed by the user. When the user presses a certain key on the keyboard, the event is inserted. Special characters, like backspace or delete, have their own sub-events.

3. Text$SetTxt and Text$Str Both events take strings as first arguments and are used when the user sets an entire text line at once using the text history (cursor up and down) or pastes text at the current cursor position.

4. Text$Cursor Takes a position as first argument and is used for cursor movement during string input.

• Image$Update This event can happen any time after Form$Image. It takes two arguments: id1 and id2. Applets that are inserted into the board are played back as consecutive screenshots. The command triggers the replacement of the image with id id1 by the image with id id2.
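As a short, partly hypothetical example, the following excerpt shows how such events appear in a recording (header lines omitted). The two Form$Line events and the Scrollbar event are the ones used in the examples of Section C.2; the final RemoveAll line is made up:

3e8f$Form$Line$17c$aa$17c$ab$ff00ff00$3
3e9f$Form$Line$17c$ab$17c$ad$ff00ff00$3
2717f$Scrollbar$b
30000$RemoveAll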

C.2 Mapping E-Chalk Events to MPEG-4 BIFS

This section shows two examples of mapping the E-Chalk board format to the BIFS text format as described in Section 5.4. A more detailed description can be found in [Jankovic et al., 2006].

Scene Definition

After the obligatory InitialObjectDescriptor (not shown here), the initial scene containing the board, audio, and video is defined as follows:

# Root of the scene tree
DEF Root OrderedGroup {
  children [
    Background2D { backColor 0.0 0.0 0.0 }

    # Define hook for audio node
    Sound2D {
      source AudioSource {
        # Object descriptor with ID 3 is defined below
        url [od:3]
      }
    }

    # empty board
    DEF BOARD Transform2D {
      translation 0 0
      children []
    }

    # Define hook for video node (for overlaid replay)
    Transform2D {
      translation 0 20   # Video offset to better match instructor
      children [
        Shape {
          appearance Appearance {
            # Transparency settings for the video
            material DEF M1 MaterialKey {
              keyColor 0 0 0
              lowThreshold 0.1
              transparency 0.1
            }
            texture MovieTexture {
              # Object descriptor with ID 4 is defined below
              url [od:4]
            }
          }
          # Video isn’t scaled now but could be.
          geometry Bitmap { scale 1.0 1.0 }
        }
      ]
    }
  ]
}

Stroke Encoding

The first line segment of a stroke (i.e., when a line segment’s starting point is not connected to the ending point of the last line segment)

3e8f$Form$Line$17c$aa$17c$ab$ff00ff00$3

is encoded as BIFS Text as follows (the coordinate systems of E-Chalk and MPEG-4 differ, so the coordinates have to be translated):

AT 16015 { # at timestamp 16015 (after 16 seconds)
  APPEND TO BOARD.children DEF STR1 OrderedGroup {
    children [
      # Beginning circle
      # (optical correction to simulate pen down)
      Transform2D {
        translation -81 129
        children [
          Shape {
            appearance DEF APP1 Appearance {
              material Material2D { emissiveColor 0.0 1.0 0.0 filled TRUE }
            }
            geometry Circle { radius 1 }
          }
        ]
      }
      # Line segment
      # Beginning of stroke starts with first line segment
      Shape {
        appearance Appearance {
          material Material2D {
            lineProps LineProperties { lineColor 0.0 1.0 0.0 width 3 }
          }
        }
        geometry DEF STROKE1 IndexedLineSet2D {
          coord DEF STROKEP1 Coordinate2D { point [-81 129 -81 129] }
        }
      }
      # (provisional) ending circle (simulate pen up)
      DEF ENDCIRCLE1 Transform2D {
        translation -81 129
        children [
          Shape {
            appearance USE APP1
            geometry Circle { radius 1 }
          }
        ]
      }
    ]
  }
}

Further line segments that belong to the stroke are appended consecutively. For example, the line segment

3e9f$Form$Line$17c$ab$17c$ad$ff00ff00$3

is appended to the last line segment as follows

AT 16031 {
  REPLACE ENDCIRCLE1.translation BY -81 127
  APPEND TO STROKEP1.point -81 127
}

until the next line segment does not belong to the same stroke (i.e., it is not connected).

Scroll Events

Scroll events are easily implemented by translating the board scene. For example, the event:

2717f$Scrollbar$b

is encoded as:

AT 160127 REPLACE BOARD.translation BY 0 11

Appendix D

E-Chalk’s Audio Format

This appendix describes the syntax of E-Chalk’s audio format, internally called WWR3. A conceptual explanation can be found in Chapter 6. All integers defined herein are unsigned 32-bit big-endian, all shorts are unsigned 16-bit big-endian, and all bytes contain 8-bit unsigned data. The audio data is sent and stored in the same format. In the case of file storage, the data is written to a file called content.wwr. The overall syntax of any E-Chalk Audio stream is:

(<event>|(<packetlength><zippacket>))*

Archived audio data is accompanied by an index file, called index.wwr, residing at the same location as the content.wwr file. The index file is a list of offsets pointing to the beginning of each compressed packet. It is used by the client for accelerated random seeking. The syntax is:

(<offset>)*

An 〈offset〉 is an integer defining the absolute position of a compressed packet. The offsets are ordered and thus monotonically increasing. Audio files larger than 4 GB cannot have an index file.
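As an illustration of how a client could use the index file, the following Java fragment reads all offsets into memory. It is a sketch based on the format description above, not E-Chalk’s actual implementation:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Reads the offsets of an index.wwr file. Each offset is an unsigned
// 32-bit big-endian integer, so it is stored in a Java long.
public class WwrIndexReader {
    public static List<Long> readOffsets(String filename) throws IOException {
        List<Long> offsets = new ArrayList<Long>();
        DataInputStream in = new DataInputStream(new FileInputStream(filename));
        try {
            while (in.available() >= 4) {
                offsets.add(in.readInt() & 0xFFFFFFFFL); // unsigned, big-endian
            }
        } finally {
            in.close();
        }
        return offsets;
    }
}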

D.1 Events

Events were introduced in WWR2 to enable transparent broadcaster forwarding and for showing arbitrary URLs. An event has the following syntax:

<event>::=<type><url>
<type>::="0"|"1"

The entry 〈url〉 denotes a 1024-byte field containing a valid URL and 〈type〉 a short value containing either ‘0’ or ‘1’. If 〈type〉 is ‘0’, the URL is to be displayed in a new browser window (or as specified by the user). If 〈type〉 is ‘1’, the URL is to replace the web page that is currently used by the replay Applet. In other words, the page replaces itself and the replay Applet is closed. This is used to enable transparent broadcaster forwarding. Events of type ‘0’ are considered obsolete and are replaced by the slide-show client (see Chapter 5). Events of type ‘1’ are only allowed during live transmissions.

D.2 Zipped Packets

〈packetlength〉 is a short (greater than or equal to 2) specifying the length in bytes of the following 〈zippacket〉. 〈zippacket〉 contains ZIP-encoded compressed or uncompressed audio data. Each packet begins with a ZIP file header [P. Deutsch and J-L. Gailly, 1996] which also contains information about its length. The two length specifications should correspond; otherwise, the packet is to be considered corrupt and needs repair. WWR3 content is always recorded as 16 kHz mono 16-bit signed linear. The ZIP filename of the packet specifies the content as follows.

String     Content of unzipped packet
rawzip32   16 kHz mono 16-bit signed linear audio data
hr40       40 kbit/s ADPCM-encoded audio data
hr32       32 kbit/s ADPCM-encoded audio data
hr24       24 kbit/s ADPCM-encoded audio data

Please note that the names rawzip, mu4, fbus, fbus2, and fbus3 denote legacy WWR2 codecs that should not be used any more.

D.3 Codecs

Packets labeled rawzip32, as the name implies, contain uncompressed audio data. After unzipping, they contain 50,800 bytes of audio data in source format. Packets labeled hr40, hr32, and hr24 contain 40 kbit/s, 32 kbit/s, and 24 kbit/s ADPCM-encoded audio data, respectively. Each packet is generated using 50,800 bytes of source data. The encoding follows the ADPCM standard as defined by [ITU-T, 1990]. Please refer to this document for details.
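The following Java fragment sketches how a single compressed packet could be read and its codec identified from the ZIP entry name. It illustrates the format described above and is not the WWR3 player code; in particular, it ignores interleaved events, which would be recognizable by a leading short value of 0 or 1 instead of a packet length:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Reads one <packetlength><zippacket> pair and returns the codec name
// stored as the ZIP entry name (e.g. "rawzip32", "hr40", "hr32", "hr24").
public class WwrPacketReader {
    public static String readPacketCodec(DataInputStream in) throws IOException {
        int packetLength = in.readUnsignedShort();   // unsigned 16-bit big-endian
        byte[] packet = new byte[packetLength];
        in.readFully(packet);
        ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packet));
        ZipEntry entry = zip.getNextEntry();         // entry holds the audio data
        String codec = (entry != null) ? entry.getName() : null;
        zip.close();
        return codec;
    }
}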

Appendix E

Audio Recording Tools

The following sections present some technical details of the tools and methods provided by the Active Recording components described in Chapter 7.

E.1 VU Meter

The VU meter is still the most basic tool for measuring the input gain and was originally standardized by IEC 268-10:1974. E-Chalk’s VU meter is implemented as an independent SOPA node and is integrated by default into the audio wizard and the default audio processing graph. However, it is only provided as a debugging tool. During a real lecture, the movement of the meter bars would be too distracting for the students. E-Chalk’s VU meter displays both the peak signal and the average signal level. It also counts overruns (an overrun is defined to occur when the signal reaches more than 98 % of the maximum allowed range). The average gain level is measured by calculating the root-mean-square value of a time window of 250 ms. The value ages with the last three measurements. The ideal recording maximizes the average signal without causing overruns. Figure E.1 shows a screenshot of E-Chalk’s VU meter.

Figure E.1: A screenshot of E-Chalk Audio’s VU meter. This view shows the VU meter in stereo mode with a mono signal fed in. The inner bars show the average gain while the outer bars show the peak gain. Overflows are counted and displayed in red next to the peak gain meter.
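The following fragment sketches the measurements described above (peak level, RMS over a 250 ms window, and overrun counting) for 16-bit linear samples. It is a simplified illustration rather than E-Chalk’s VU-meter node and omits, for instance, the aging over the last three measurements:

// Measures peak and RMS gain of one 250 ms window of 16 kHz 16-bit samples
// and counts overruns (samples above 98% of the maximum allowed range).
public class LevelMeterSketch {
    private static final int SAMPLE_RATE = 16000;
    private static final int WINDOW = SAMPLE_RATE / 4;            // 250 ms
    private static final double OVERRUN = 0.98 * Short.MAX_VALUE;

    private int overruns = 0;

    /** Returns {peak, rms} in dB relative to full scale for one non-empty window. */
    public double[] measure(short[] samples) {
        double peak = 0, sumSquares = 0;
        int n = Math.min(samples.length, WINDOW);
        for (int i = 0; i < n; i++) {
            double s = Math.abs((double) samples[i]);
            if (s > peak) peak = s;
            if (s > OVERRUN) overruns++;
            sumSquares += s * s;
        }
        double rms = Math.sqrt(sumSquares / n);
        double peakDb = 20.0 * Math.log10(Math.max(peak, 1.0) / Short.MAX_VALUE);
        double rmsDb = 20.0 * Math.log10(Math.max(rms, 1.0) / Short.MAX_VALUE);
        return new double[] { peakDb, rmsDb };
    }

    public int getOverruns() { return overruns; }
}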

Figure E.2: A screenshot of E-Chalk Audio’s equalizer in a typical setting used for enhancing the intelligibility of speech.

E.2 Graphic Equalizer

A graphical equalizer can be used to fine-tune the frequency spectrum of an audio signal. Certain frequency bands can be suppressed or amplified. The equalizer is implemented as a SOPA node and is shown only on request since audio quality can easily be degraded if it is used without prior knowledge. By default, the equalizer settings can be adjusted during the simulation step of the audio wizard. E-Chalk’s graphical equalizer simulates a 10-band octave filter array conforming to ISO R.266. Equalizer settings can be saved and loaded separately. Figure E.2 shows a screenshot.

E.3 Assessment of the Audibility of Noise

Measuring the floor noise by calculating the root mean square of a few seconds of the signal is only a very rough estimate because the minimum audible sound level is frequency-dependent. The exact frequency/loudness curves have been measured and standardized often, for example by [ISO, 2003]. Ultimately, these curves depend on the listener’s individual anatomy and health status. In order to provide comparable results, the signal-to-noise ratio is measured using the A-weighted curve [DIN EN, 2003]. However, for finding out whether a given floor noise is above the hearing threshold, E-Chalk’s audio diagnosis wizard uses the model proposed by PEAQ [ITU, 2001], which provides a better approximation to the human auditory system. In order to provide a model for the lower auditory threshold, a Discrete Fourier Transform (DFT) of the signal is calculated over 50 % overlapping blocks of 2048 samples, sliced by a Hann window. The spectrum is then weighted by the following function (outer and middle ear):

\[
W[k]/\mathrm{dB} = -0.6 \cdot 3.64 \cdot \left(\frac{f[k]}{\mathrm{kHz}}\right)^{-0.8}
+ 6.5 \cdot e^{-0.6 \cdot \left(\frac{f[k]}{\mathrm{kHz}} - 3.3\right)^{2}}
- 10^{-3} \cdot \left(\frac{f[k]}{\mathrm{kHz}}\right)^{3.6}
\]

with

\[
f[k]/\mathrm{Hz} = k \cdot 23.4375
\]

being the frequency representation at line k that is applied to the DFT output. The minimum auditory threshold is then modelled by adding a basic noise of $0.4 \cdot 3.65 \cdot (f/\mathrm{kHz})^{-0.8}$ (inner ear). Given this modelling of the lower auditory threshold, any signal can be compared to it using spectral subtraction.
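As an illustration, the weighting curve can be evaluated per DFT line as in the following sketch; this is not the code used by the audio wizard:

// Evaluates the outer/middle-ear weighting W[k] in dB for each DFT line k,
// using f[k] = k * 23.4375 Hz as defined above.
public class EarWeighting {
    public static double[] weightingDb(int lines) {
        double[] w = new double[lines];
        for (int k = 1; k < lines; k++) {           // skip k = 0 (DC)
            double fKHz = (k * 23.4375) / 1000.0;   // frequency of line k in kHz
            w[k] = -0.6 * 3.64 * Math.pow(fKHz, -0.8)
                 + 6.5 * Math.exp(-0.6 * Math.pow(fKHz - 3.3, 2.0))
                 - 1e-3 * Math.pow(fKHz, 3.6);
        }
        return w;
    }
}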

E.4 Equipment Grading

The audio system is graded based on the silence noise levels measured for sound card and equipment and on the speech-level-to-noise ratio. As already discussed in Chapter 7, judging sound card or equipment quality based only on the measurement of floor noise is a very rough approximation. But it is often used because it can easily be measured. The grading scale has been constructed by collecting test results from the Internet. The zero noise level of sound cards is graded as follows (when equipment is connected, the levels are shifted up by 5 dB).

\[
\mathrm{category}(\mathit{noiselevel}) =
\begin{cases}
\text{excellent}, & \text{if } \mathit{noiselevel} < -90\ \mathrm{dB}\\
\text{good}, & \text{if } \mathit{noiselevel} < -70\ \mathrm{dB}\\
\text{sufficient}, & \text{if } \mathit{noiselevel} < -65\ \mathrm{dB}\\
\text{scant}, & \text{if } \mathit{noiselevel} < -40\ \mathrm{dB}\\
\text{inapplicable}, & \text{if } \mathit{noiselevel} \geq -40\ \mathrm{dB}
\end{cases}
\]

The signal-to-noise ratio (SNR) is rated according to the following mapping. As a reference: a modern sound card has a typical SNR of over 100 dB (in 24-bit recording mode), a compact-disc player has a typical SNR of about 80 dB, an analog studio tape recorder has an SNR of about 70 dB, a vinyl disc player of about 60 dB, and shellac discs used to deliver sound with a typical SNR of about 40 dB [Fries and Fries, 2005].

\[
\mathrm{category}(\mathit{snr}) =
\begin{cases}
\text{excellent}, & \text{if } \mathit{snr} \geq 90\ \mathrm{dB}\\
\text{good}, & \text{if } 90\ \mathrm{dB} > \mathit{snr} \geq 80\ \mathrm{dB}\\
\text{sufficient}, & \text{if } 80\ \mathrm{dB} > \mathit{snr} \geq 50\ \mathrm{dB}\\
\text{scant}, & \text{if } 50\ \mathrm{dB} > \mathit{snr} \geq 40\ \mathrm{dB}\\
\text{inapplicable}, & \text{if } \mathit{snr} < 40\ \mathrm{dB}\\
\text{improperly measured}, & \text{if } \mathit{snr} < 10\ \mathrm{dB}
\end{cases}
\]
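A direct translation of the two mappings into code might look as follows. This is a sketch only: it omits the 5 dB shift for connected equipment mentioned above and assumes that the "improperly measured" case takes precedence over the overlapping "inapplicable" case:

// Grades the zero noise level and the signal-to-noise ratio (both in dB)
// according to the two mappings above.
public class EquipmentGrading {

    public static String gradeNoiseLevel(double noiseDb) {
        if (noiseDb < -90) return "excellent";
        if (noiseDb < -70) return "good";
        if (noiseDb < -65) return "sufficient";
        if (noiseDb < -40) return "scant";
        return "inapplicable";
    }

    public static String gradeSnr(double snrDb) {
        if (snrDb < 10) return "improperly measured";
        if (snrDb < 40) return "inapplicable";
        if (snrDb < 50) return "scant";
        if (snrDb < 80) return "sufficient";
        if (snrDb < 90) return "good";
        return "excellent";
    }
}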

Appendix F

E-Chalk’s Video Format

This appendix describes the syntax of E-Chalk’s video format. A conceptual explanation can be found in Chapter 8. All integers defined herein are unsigned 32-bit big-endian, all shorts are unsigned 16-bit big-endian, and all bytes contain 8-bit unsigned data. The overall syntax of any E-Chalk video stream is:

<header>(<packetlength><packet>)*

The 〈header〉 is a 10-byte sequence and is described below. The 〈packetlength〉 is a short specifying the length in bytes of the following 〈packet〉. A 〈packet〉 is a gzipped [P. Deutsch, 1996] sequence of frames. A video stream can contain any number of packets. Archived video files may be accompanied by an index file, called index.wwv, residing at the same location as the video file. The index file associates a timestamp with a packet number and an offset inside the packet for faster random seeking. The syntax is:

(<timestamp><packetno><offset>)*

A 〈timestamp〉 is an integer counting the milliseconds from the beginning of the recording. The 〈packetno〉 is an integer counting the packets from the beginning of the file, and 〈offset〉 specifies the position of the frame in the uncompressed packet (also a 4-byte big-endian integer). Video files larger than 4 GB cannot have an index file.

F.1 Header

The header is stored at the beginning of each file. If a lecture is appended to an older E-Chalk lecture, the header is substituted at the beginning of the old file, i.e., it is guaranteed that there are no headers in the middle of a video file. During a live transmission, the header is sent to any client that connects to the server. The header is thus the first sequence of bytes that every player receives. The syntax is explained in the following table.

Offset  Size (bytes)  Content  Description
0       2             “FU”     magic bytes
2       1             byte     ’0’=window mode, ’1’=board overlay
3       2             short    initial x-resolution in pixels
5       2             short    initial y-resolution in pixels
7       1             byte     initial framerate (frames per second)
8       1             byte     reserved for future use
9       1             byte     reserved for future use

In board overlay mode, the resolution information is ignored and the video is scaled to fit the board resolution.
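A reader for this header could look like the following sketch (illustrative only, not the E-Chalk player’s code):

import java.io.DataInputStream;
import java.io.IOException;

// Parses the 10-byte header described in the table above.
public class WwvHeader {
    public boolean boardOverlay;
    public int width, height, framesPerSecond;

    public static WwvHeader read(DataInputStream in) throws IOException {
        if (in.readByte() != 'F' || in.readByte() != 'U') {
            throw new IOException("not an E-Chalk video stream (magic bytes missing)");
        }
        WwvHeader h = new WwvHeader();
        h.boardOverlay = (in.readByte() == '1');    // '0'=window mode, '1'=board overlay
        h.width = in.readUnsignedShort();           // initial x-resolution (big-endian)
        h.height = in.readUnsignedShort();          // initial y-resolution
        h.framesPerSecond = in.readUnsignedByte();  // initial frame rate
        in.skipBytes(2);                            // two bytes reserved for future use
        return h;
    }
}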

F.2 Packet

The syntax of an uncompressed packet is defined as follows:

<packet>::=(<size><type><frame>)*
<type>::=’0’|’1’|’2’|’3’
<frame>::=<i-frame>|<t-frame>|<0-frame>

The entry 〈size〉 is a three-byte descriptor specifying the size of the compressed image (high byte first, low byte last). The 〈type〉 is one byte describing the frame type: ‘0’ stands for an I-Frame, ‘1’ is obsolete and not supported any more, ‘2’ stands for a T-Frame, and ‘3’ for a 0-Frame. Higher numbers are reserved for future use and may be ignored at this time. The three frame types are described in the following.

I-Frames

I-Frames are optional. I-Frames are able to change the frame rate and the player resolution. This enables merging several video streams that were recorded with different resolutions and/or frame rates. In board overlay mode, resolution information is ignored and the video is scaled to fit the board resolution. Additionally, the color black is defined as transparent. I-Frames are encoded as follows:

Offset  Size (bytes)  Content  Description
0       2             short    new x-resolution in pixels
2       2             short    new y-resolution in pixels
4       1             byte     new framerate (frames per second)
5       〈size〉        bytes    JFIF-encoded image data

T-Frames

T-Frames contain a transparency table that has one bit associated with each 8×8-pixel block in the image. The index in the table is canonically organized. It starts at the upper left corner and ends in the bottom right corner. If the bit is set, the corresponding block is to be drawn; if the bit is not set, the block is transparent. The size of the descriptor $d_s$ in bytes is $\lceil \frac{x \cdot y}{512} \rceil$, with x and y being the x and y resolution of the image, respectively. In board overlay mode, the color black is defined as transparent. T-Frames are encoded as follows:

Offset  Size (bytes)  Content  Description
0       ds            bytes    block transparency descriptor
ds      〈size〉        bytes    JFIF-encoded image data
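The following sketch shows how the block transparency descriptor could be evaluated. Note that the bit order inside a descriptor byte (most significant bit first) is an assumption made for this illustration, since it is not specified above:

// Tests whether the 8x8 block at block coordinates (bx, by) is to be drawn
// according to a T-Frame's block transparency descriptor. Blocks are indexed
// row by row from the upper left to the bottom right corner, one bit per block.
public class TFrameTransparency {
    public static boolean isDrawn(byte[] descriptor, int imageWidth, int bx, int by) {
        int blocksPerRow = (imageWidth + 7) / 8;       // 8x8 blocks per image row
        int blockIndex = by * blocksPerRow + bx;       // canonical block index
        int b = descriptor[blockIndex / 8] & 0xFF;
        int bit = (b >> (7 - (blockIndex % 8))) & 1;   // assumption: MSB-first bit order
        return bit == 1;                               // set bit means: draw the block
    }
}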

0-Frames

0-Frames are only defined by their type. They cause no drawing for the duration of one frame (i.e., 1/framerate of a second).

Appendix G

SIOX Benchmark Results

The following table presents detailed best-case results of applying the SIOX algorithm to the image data sets of [Blake et al., 2004, Martin et al., 2001]. For a detailed explanation, please refer to Section 10.7. The results appear in the same order as referenced in all tables and diagrams throughout this document.

Image Name Pixels to classify Wrong Pixels Error

banana1.bmp 217336 5104 2.348 %

banana2.bmp 181541 1363 0.751 %

banana3.bmp 177310 6780 3.824 %

book.bmp 149236 5753 3.855 %

bool.jpg 89436 1355 1.515 %

bush.jpg 80140 8830 11.018%

ceramic.bmp 141541 9020 6.373 %

cross.jpg 131367 2784 2.119 %

doll.bmp 84100 261 0.310 %

elefant.bmp 138369 1485 1.073 %

flower.jpg 84638 714 0.844 %

fullmoon.bmp 30609 8 0.026 %

grave.jpg 133324 996 0.747 %

llama.bmp 39243 1860 4.740 %

memorial.jpg 63443 5419 8.542 %

music.JPG 123759 3701 2.990 %

person1.jpg 178285 1371 0.769 %

person2.bmp 52214 471 0.902 %

person3.jpg 48819 934 1.913 %

person4.jpg 65989 2058 3.119 %

person5.jpg 27659 1887 6.822 %

person6.jpg 57223 4063 7.100 %

person7.jpg 33783 590 1.746 %

person8.bmp 63632 3456 5.431 %

scissors.JPG 183373 3159 1.723 %

sheep.jpg 17477 210 1.202 %

stone1.JPG 63949 638 0.998 %

stone2.JPG 113080 184 0.163 %

teddy.jpg 47677 956 2.005 %

tennis.jpg 36907 3106 8.416 %

106024.jpg 30888 2541 8.226%

124084.jpg 94731 2024 2.137%

153077.jpg 85225 6210 7.287%

153093.jpg 71508 1524 2.131%

181079.jpg 74573 6979 9.359%

189080.jpg 81215 3791 4.668%

208001.jpg 54619 1493 2.733%

209070.jpg 43280 5051 11.671%

21077.jpg 17425 1882 10.801%

227092.jpg 64448 2403 3.729%

24077.jpg 66354 3200 4.823%

271008.jpg 58207 4234 7.274%

304074.jpg 19732 3037 15.391%

326038.jpg 43581 1799 4.128%

37073.jpg 44911 3667 8.165%

376043.jpg 56738 3003 5.293%

388016.jpg 61026 1320 2.163%

65019.jpg 34973 3431 9.810%

69020.jpg 69878 4413 6.315%

86016.jpg 30495 1510 4.952%

Total: 3959266 142028 3.587%

Bibliography

[Abowd, 1999] Abowd, G. D. (1999). Classroom 2000: An Experiment with theInstrumentation of a Living Educational Environment. IBM Systems Journal,38(4):508–530.

[Adams et al., 1998] Adams, J., Parulski, K., and Spaulding, K. (1998). ColorProcessing in Digital Cameras. IEEE Micro, 18(6):20–30.

[Allen, 1994] Allen, J. (1994). How do humans process and recognize speech?IEEE Transactions on Speech and Audio Processing, 2(4):567–577.

[American National Standards Institute, 2002] American National StandardsInstitute (2002). American National Standard Acoustical Performance Cri-teria, Design Requirements, and Guidelines for Schools. ANSI S12.60-2002.

[Anderson et al., 2006] Anderson, R., Anderson, R., Chung, O., Davis, K. M.,Davis, P., Prince, C., Razmov, V., and Simon, B. (2006). Classroom Presenter– A Classroom Interaction System for Active and Collaborative Learning. InFirst Workshop on the Impact of Pen-based Technology on Education, WestLafayette, Indiana, USA.

[Aoki et al., 1996] Aoki, H., Shimotsuji, S., and Hori, O. (1996). A shot classi-fication method of selecting effective key-frames for video browsing. In Pro-ceedings of the fourth ACM International Conference on Multimedia, pages1–10, New York, New York, USA. ACM Press.

[Apperley et al., 2003] Apperley, M., McLeod, L., Masoodian, M., Paine, L.,Phillips, M., Rogers, B., and Thomson, K. (2003). Use of video shadow forsmall group interaction awareness on a large interactive display surface. InCRPITS ’03: Proceedings of the Fourth Australian user interface conferenceon User interfaces 2003, pages 81–90, Darlinghurst, Australia. AustralianComputer Society Inc.

[Apple Inc, 2001] Apple Inc (2001). Audio and MIDI on Mac OS X. AppleComputer Inc, California, USA.

[Arguero, 2004] Arguero, M. E. (2004). A New Algorithmic Animation Frame-work for the Classroom and the Internet. Ph.D. thesis, Freie UniversitatBerlin, Institut fur Informatik, Berlin, Germany.

[Baars, 1988] Baars, B. J. (1988). A Cognitive Theory of Consciousness. Cam-bridge University Press, Cambridge, UK.

[Bacher et al., 1997] Bacher, C., Muller, R., Ottmann, T., and Will, M. (1997).Authoring on the Fly. A new way of integrating telepresentation and course-ware production. In Proceedings of International Conference on Computer inEducation (ICCE) 1997, Sarawak, Malaysia.

[Ben-Ezra et al., 2005] Ben-Ezra, M., Zomet, A., and Nayar, S. K. (2005).Video Super-Resolution Using Controlled Subpixel Detector Shifts. IEEETransactions on Pattern Analysis and Machine Intelligence, 27(6):977–987.

[Bentley, 1975] Bentley, J. L. (1975). Multidimensional binary search trees usedfor associative searching. Communications of the ACM, 18:509–517.

[Bernadini et al., 2001] Bernadini, F., Martin, I. M., and Rushmeier, H. (2001).High-Quality Texture Reconstruction from Multiple Scans. IEEE Transac-tions on Visualization and Computer Graphics, 7(4):318–332.

[Beymer et al., 1997] Beymer, D., McLauchlan, P., Coifman, B., and Malik, J.(1997). A Real-time Computer Vision System for Measuring Traffic Param-eters. In Proceedings of the IEEE International Conference on ComputerVision and Pattern Recognition (CVPR).

[Blake et al., 2004] Blake, A., Rother, C., Brown, M., Perez, P., and Torr, P.(2004). Interactive Image Segmentation using an adaptive GMMRF model.In Proceedings of the European Conference on Computer Vision (ECCV).Springer Verlag, Heidelberg, Germany.

[Block et al., 2004a] Block, M., Friedland, G., Knipping, L., and Rojas, R.(2004a). Schach spielen auf einer elektronischen Tafel. Technical ReportB-04-20, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Block et al., 2004b] Block, M., Friedland, G., Knipping, L., and Rojas, R.(2004b). Schach spielen auf einer elektronischen Tafel. Technical ReportB-04-20, Freie Universitat Berlin, Institut fur Informatik.

[Boll, 1979] Boll, S. (1979). Suppression of acoustic noise in speech by spectralsubstraction. IEEE Transactions on Acoustics, Speech, and Signal Processing,27(2):113–120.

[Bovik, 2005] Bovik, A. (2005). Handboook of Image and Video Processing.Elsevier Academic Press, San Diego, California, USA.

[Box, 1998] Box, D. (1998). Essential COM. Addison Wesley, New York, NewYork, USA.

[Boykov and Jolly, 2001] Boykov, Y. and Jolly, M.-P. (2001). Interactive GraphCuts for Optimal Boundary and Region Segmentation of Objects in N-DImages. In Proceedings of the International Conference on Computer Vision,pages 105–112, Vancouver, Canada.

[Bradley et al., 1999] Bradley, J., Reich, R., and Norcross, S. (1999). On thecombined effects of signal-to-noise ratio and room acoustics on speech intel-ligibility. The Journal of the Acoustical Society of America, 106:1820–1828.

[Bradski and Boult, 2001] Bradski, G. R. and Boult, T. E. (2001). IEEE Work-shop on Stereo and Multi-Baseline Vision (SMBV’01), volume 00. IEEEComputer Society, Los Alamitos, California, USA.

[Brotherton, 2001] Brotherton, J. A. (2001). Enriching Everyday Experiencesthrough the Automated Capture and Access of Live Experiences: eClass:Building, Observing and Understanding the Impact of Capture and Accessin an Educational Domain. Ph.D. thesis, Georgia Institute of Technology,College of Computing, Atlanta, Georgia, USA.

[Burcham, 2003] Burcham, T. M. (2003). Making Your Blackboard CoursesTalk! In Proceedings of Eighth Annual Mid-South Instructional TechnologyConference on Teaching, Learning, and Technology, Murfreesboro, Tennessee,USA.

[C. Szyperski, 1998] C. Szyperski (1998). Component Software: Beyond Object-Oriented Programming. ACM Press/Addison-Wesley Publishing Co., NewYork, New York, USA.

[Case et al., 2002] Case, J., Mundy, R., Partain, D., and Stewart, B. (2002). In-troduction and Applicability Statements for Internet-Standard ManagementFramework. RFC 3410.

[Cervantes and Hall, 2004] Cervantes, H. and Hall, R. S. (2004). AutonomousAdaptation to Dynamic Availability Using a Service-Oriented ComponentModel. In Proceedings of the IEEE International Conference on SoftwareEngineering (ICSE), pages 614–623, Edinburgh, Scotland, GB.

[Chandler and Sweller., 1992] Chandler, P. and Sweller., J. (1992). The SplitAttention Effect as a Factor in the Design of Instruction. British Journal ofEducation Psychology, 62:233–246.

[Chien et al., 2001] Chien, S.-Y., Huang, Y.-W., Ma, S.-Y., and Chen, L.-G.(2001). Automatic Video Segmentation for MPEG-4 using Predictive Water-sheds. In Proceedings of IEEE International Conference on Multimedia andExpo, pages 239–243, Tokyo, Japan.

[Chuang Y.-Y. and R., 2001] Chuang, Y.-Y., Curless, B., Salesin, D. H., and Szeliski, R. (2001). A Bayesian Approach to Digital Matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 264–272, Los Alamitos, CA, USA. IEEE Computer Society.

[CIE, 1971] CIE (1971). Colorimetry (Official Recommendations of the Inter-national Commision on Illumination). CIE Publication No. 15 (E-1.3.1).

[CIE, 1978] CIE (1978). Recommendations on Uniform Color Spaces, Color-Difference Equations, Psychometric Color Terms. Supplement No. 2 of CIEPublication No. 15 (E-1.3.1) 1971.

[Cohen, 1999] Cohen, S. (1999). Finding Color and Shape Patterns in Images.Ph.D. thesis, Stanford University, Department of Computer Science, PaloAlto, California, USA.

[Cooper, 1990] Cooper, G. (1990). Cognitive load theory as an aid for instruc-tional design. Australian Journal of Educational Technology, 6(2):108–113.

[Corel Corporation, 2002] Corel Corporation (2002). Knockout User Guide.

[da Vinci, 1492] da Vinci, L. (1492). Trattata della pittura. Re-Published asTreatise on Painting by Princeton University Press, 1956.

[Davis, 2003a] Davis, M. (2003a). Active capture: automatic direction for au-tomatic movies. In Proceedings of the eleventh ACM international conferenceon Multimedia, pages 602–603, New York, New York, USA.

[Davis, 2003b] Davis, M. (2003b). Editing out video editing. IEEE Multimedia,10(2):54–64.

[Dickreiter, 1997a] Dickreiter, M. (1997a). Handbuch der Tonstudiotechnik, vol-ume 1. K.G. Saur, Munich, Germany, 6th edition.

[Dickreiter, 1997b] Dickreiter, M. (1997b). Handbuch der Tonstudiotechnik, vol-ume 2. K.G. Saur, Munich, Germany, 6th edition.

[Diebel and Thrun, 2005] Diebel, J. and Thrun, S. (2005). An Application ofMarkov Random Fields to Range Sensing. In Proceedings of Conference onNeural Information Processing Systems (NIPS), Cambridge, Massachusetts,USA. MIT Press.

[Diener, 2003] Diener, M. (2003). Lichtpunkterkennung per Kamera an einerRuckprojektionswand: Ein Lichtgriffel fur E- Chalk. Bachelor’s thesis, FreieUniversitat Berlin, Institut fur Informatik, Berlin, Germany.

[DIN EN, 2003] DIN EN (2003). Elektroakustik – Schallpegelmesser – Teil 1:Anforderungen. DIN EN 61672-1:2003-10 (DIN-IEC 651).

[Dufour et al., 2005] Dufour, C., Toms, E. G., Lewis, J., and Baecker, R. (2005).User strategies for handling information tasks in webcasts. In CHI ’05 ex-tended abstracts on human factors in computing systems, pages 1343–1346,New York, New York, USA. ACM Press.

[Elgammal et al., 1999] Elgammal, A., Harwood, D., and Davis, L. (1999).Non-parametric Model for Background Substraction. In Proceedings of the 7thIEEE International Conference on Computer Vision, IEEE ICCV99 FrameRate Workshop, Kerkyra, Greece.

[Eule, 2004] Eule, S. (2004). Interaktive Whiteboards in Berliner Schulen –Chancen und Probleme eines neuen Mediums. Magisterarbeit, Freie Uni-versitat Berlin, Fachbereich Erziehungswissenschaft und Psychologie, Berlin,Germany.

[Feng et al., 2005] Feng, H., Fang, W., Liu, S., and Fang, Y. (2005). A newgeneral framework for shot boundary detection and key-frame extraction.In MIR ’05: Proceedings of the 7th ACM SIGMM international workshop onMultimedia information retrieval, pages 121–126, New York, New York, USA.ACM Press.

[Fey, 2002] Fey, A. (2002). Hilft Sehen beim Lernen: Vergleich zwischeneiner audiovisuellen und auditiven Informationsdarstellung in virtuellen Ler-numgebungen. Unterrichtswissenschaften, Zeitschrift fur Lernforschung,4/2002:331–338.

[Fielding et al., 1999] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter,L., Leach, P., and Berners-Lee, T. (1999). Hypertext Transfer Protocol –HTTP/1.1. RFC 2616.

[Forsyth and Ponce, 2003] Forsyth, D. A. and Ponce, J. (2003). Computer Vi-sion – A Modern Approach. Prentice Hall, Upper Saddle River, New Jersey,USA.

[Freed and Borenstein, 1996a] Freed, N. and Borenstein, N. (1996a). Multipur-pose Internet Mail Extensions (MIME) Part One: Format of Internet MessageBodies. RFC 2046.

[Freed and Borenstein, 1996b] Freed, N. and Borenstein, N. (1996b). Multi-purpose Internet Mail Extensions (MIME) Part Two: Media Types. RFC2046.

[Friedland, 2002a] Friedland, G. (2002a). Towards a Generic Cross PlatformMedia Editor: An Editing Tool for E-Chalk. Diplomarbeit, Institut fur In-formatik, Freie Universitat Berlin, Berlin, Germany.

[Friedland, 2002b] Friedland, G. (2002b). Towards a Generic Cross PlatformMedia Editor: An Editing Tool for E-Chalk (Abstract). In Proceedings of thefourth Informatiktage 2002, Bad Schussenried, Bad Schussenried, Germany.Gesellschaft fur Informatik e.V.

[Friedland, 2004] Friedland, G. (2004). Solving the Divided Attention Problemin Lecture Recordings. Technical Report B-04-15, Freie Universitat Berlin,Institut fur Informatik, Berlin, Germany.

[Friedland et al., 2004a] Friedland, G., Jantz, K., and Knipping, L. (2004a).Conserving an Ancient Art of Music: Making SID Tunes Editable (revisedversion). In Lecture Notes in Computer Science, volume 2771, pages 290–296.Springer Verlag, Heidelberg.

[Friedland et al., 2004b] Friedland, G., Jantz, K., and Knipping, L. (2004b). To-wards Automatized Studioless Audio Recording: A Smart Lecture Recorder.Technical Report B-04-14, Institut fur Informatik, Freie Universitat Berlin,Berlin, Germany.

[Friedland et al., 2005a] Friedland, G., Jantz, K., Knipping, L., and Rojas, R.(2005a). Experiments on Lecturer Segmentation using Texture Classificationand a 3D Camera. Technical Report B-05-04, Freie Universitat Berlin, Institutfur Informatik, Berlin, Germany.

[Friedland et al., 2005b] Friedland, G., Jantz, K., Knipping, L., and Rojas, R.(2005b). Image Segmentation by Uniform Color Clustering – Approach andBenchmark Results. Technical Report B-05-07, Freie Universitat Berlin, In-stitut fur Informatik, Berlin, Germany.

[Friedland et al., 2005c] Friedland, G., Jantz, K., Knipping, L., and Rojas, R.(2005c). The Virtual Technician: An Automatic Software Enhancer for AudioRecording in Lecture Halls. In Lecture Notes in Computer Science Volume3681, Knowledge-Based Intelligent Information and Engineering Systems: 9thInternational Conference, KES 2005, Melbourne, Australia. Springer Verlag,Heidelberg.

[Friedland et al., 2006a] Friedland, G., Jantz, K., Lenz, T., Wiesel, F., andRojas, R. (2006a). A Practical Approach to Boundary Accurate Multi-ObjectExtraction from Still Images and Videos. In Proceedings of the Seventh IEEESymposium on Multimedia (to appear), San Diego, California, USA. IEEEComputer Society.

[Friedland et al., 2005d] Friedland, G., Jantz, K., and Rojas, R. (2005d). Cut &Paste: Merging the Video with the Whiteboard Stream for Remote Lectures.Technical Report B-05-19, Freie Universitat Berlin, Institut fur Informatik,Berlin, Germany.

[Friedland et al., 2005e] Friedland, G., Jantz, K., and Rojas, R. (2005e). SIOX:Simple Interactive Object Extraction in Still Images. In Proceedings of theSixth IEEE Symposium on Multimedia (ISM2005), pages 253–259, Irvine,California, USA. IEEE Computer Society.

[Friedland et al., 2002] Friedland, G., Knipping, L., and Rojas, R. (2002). E-Chalk Technical Description. Technical Report B-02-11, Fachbereich Mathe-matik und Informatik, Freie Universitat Berlin, Berlin, Germany.

[Friedland et al., 2003] Friedland, G., Knipping, L., and Rojas, R. (2003). Map-ping the Classroom into the Web: Case Studies from several Institutions. InAndras Szuks, Erwin Wagner, C. T., editor, The Quality Dialogue: Inte-grating Cultures in Flexible, Distance and eLearning, pages 480–485, Rhodes,Greece. 12th EDEN Annual Conference, European Distance Education Net-work.

[Friedland et al., 2005f] Friedland, G., Knipping, L., Rojas, R., Schulte, J.,and Zick, C. (2005f). Die E-Chalk Software: Einsatz und Evaluation inPrasenzunterrichts- und E-Learning-Szenarien, pages 243–255. Peter LangVerlag.

[Friedland et al., 2004c] Friedland, G., Knipping, L., Schulte, J., and Tapia,E. (2004c). E-Chalk: A Lecture Recording System using the ChalkboardMetaphor. International Journal of Interactive Technology and Smart Edu-cation, 1(1):9–20.

[Friedland et al., 2004d] Friedland, G., Knipping, L., and Tapia, E. (2004d).Web-Based Lectures Produced by AI Supported Classroom Teaching. Inter-national Journal on Artificial Intelligence Tools (IJAIT), 13(2):367–382.

[Friedland et al., 2004e] Friedland, G., Knipping, L., Tapia, E., and Rojas, R.(2004e). Teaching With an Intelligent Electronic Chalkboard. In Proceedingsof ACM Multimedia 2004, Workshop on Effective Telepresence, pages 16–23,New York, New York, USA.

[Friedland and Lasser, 1998] Friedland, G. and Lasser, T. (1998). World WideRadio – Audio Live Ubertragung durch das Internet. Projektbeschreibung,Bundeswettbewerb Jugend forscht e.V., Hamburg, Germany.

[Friedland et al., 2006b] Friedland, G., Lenz, T., Jantz, K., and Rojas, R.(2006b). Extending the SIOX Algorithm: Alternative Clustering Methods,Sub-pixel Accurate Object Extraction from Still Images, and Generic VideoSegmentation. Technical Report B-06-06, Freie Universitat Berlin, Institutfur Informatik, Berlin, Germany.

[Friedland and Pauls, 2004] Friedland, G. and Pauls, K. (2004). SOPA – ASelf Organizing Processing and Streaming Architecture. Technical ReportB-04-13, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Friedland and Pauls, 2005a] Friedland, G. and Pauls, K. (2005a). ArchitectingMultimedia Environments for Teaching. IEEE Computer, 38(6):57–64.

[Friedland and Pauls, 2005b] Friedland, G. and Pauls, K. (2005b). Towardsa Demand Driven, Autonomous Processing and Streaming Architecture.In Proceedings of Workshop on Engineering of Autonomic Systems 2005(EASe’05) at the 12th Annual IEEE International Conference on the En-gineering of Computer Based Systems (ECBS 2005), page 473, Greenbelt,Maryland, USA.

[Friedland and Rojas, 2006] Friedland, G. and Rojas, R. (2006). Human-Centered Webcasting of Interactive-Whiteboard Lectures. In Proceedings ofthe First IEEE International Workshop on Multimedia Technologies for E-Learning (to appear), San Diego, California, USA. IEEE Computer Society.

[Friedland et al., 2005g] Friedland, G., Zick, C., Jantz, K., Knipping, L., andRojas, R. (2005g). An Interactive Datawall for an Intelligent Classroom.In Proceedings of the E-Lectures Workshop, Delfi Conference 2005, Rostock,Germany.

[Friedmann and Russel, 1997] Friedmann, N. and Russel, S. (1997). Image Seg-mentation in Video Sequences: A Probablistic Approach. In Proceedings ofthe 13th Conference on Uncertainty in Artificial Intelligence (UAI97), Prov-idence, Rhode Island, USA.

[Fries and Fries, 2005] Fries, B. and Fries, M. (2005). Digital Audio Essentials.O’Reilly Media Inc, Cambrige, Massachusetts, USA.

[Gibbs et al., 1998] Gibbs, S., Arapis, C., Breiteneder, C., Lalioti, V.,Mostafawy, S., and Speier, J. (1998). Virtual Studios: An Overview. IEEEMultimedia, 5(1):18–35.

[Gleicher and Masanz, 2000] Gleicher, M. and Masanz, J. (2000). Towards vir-tual videography (poster session). In Proceedings of the eighth ACM Interna-tional Conference on Multimedia, pages 375–378, New York, New York, USA.ACM Press.

[Glowalla, 2004] Glowalla, U. (2004). Utility und Usability von E-Learning amBeispiel von Lecture-on-demand Anwendungen. Fortschritt-Berichte VDI,22(16):603–621.

[Gokturk and Tomasi, 2004] Gokturk, S. B. and Tomasi, C. (2004). 3D HeadTracking Based on Recognition and Interpolation Using a Time-Of-FlightDepth Sensor. In Proceedings of IEEE Conference on Computer Vision andPattern Recognition, Washington D.C., USA.

[Gokturk et al., 2004] Gokturk, S. B., Yalcin, H., and Bamji, C. (2004). ATime-Of-Flight Depth Sensor – System Description, Issues and Solutions. InProceedings of IEEE Conference on Computer Vision and Pattern Recogni-tion, Washington D.C., USA.

[Gonzalez and Woods, 1992] Gonzalez, R. and Woods, R. (1992). Digital ImageProcessing. Addison-Wesley, Boston, Massachusetts, USA.

[Gonzalez and Woods, 2002] Gonzalez, R. and Woods, R. (2002). Digital ImageProcessing. Prentice Hall, Upper Saddle River, New Jerey, USA, 2nd edition.

[Gordon et al., 1999] Gordon, G., Darrel, T., Harville, M., and Woodfill, J.(1999). Background estimation and removal based on range and color. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition, Fort Collins, CO, USA.

[Graham, 2001] Graham, N. V. S. (2001). Visual Pattern Analyzers. OxfordUniversity Press, Oxford, UK, 2nd edition.

[Gunnarsson et al., 2005] Gunnarsson, K., Wiesel, F., and Rojas, R. (2005).The Color and the Shape: Automatic On-Line Color Calibration for Au-tonomous Robots. In Proceedings of The 9th RoboCup International Sympo-sium, Osaka, Japan.

[Hahn and Kramer, 1998] Hahn, S. and Kramer, A. F. (1998). Further evi-dence for the division of attention among non-contiguous locations. VisualCognition, 5(1-2):217–256.

[Hall and Cervantes, 2004] Hall, R. and Cervantes, H. (2004). An OSGi Imple-mentation and Experience Report. In Proceedings of the First IEEE Con-sumer Communications and Networking Conference, Las Vegas, NV (USA).IEEE Press.

[Hansen, 2002] Hansen, S. (2002). Unerhort gut – MP3-Nachfolger im c’t-Hortest. c’t – Magazin fur Computer Technik, 2002(19):94–95.

[Haritaoglu et al., 2000] Haritaoglu, I., Harwood, D., and Davis, L. (2000). W4:Real-Time Surveillance of People and Their Activities. IEEE Transactionson Pattern Analysis and Machine Intelligence, 22(8):809–831.

[Haykin, 2003] Haykin, S. (2003). Cocktail Party Phenomenon: What is it, andHow do we solve it? In European Summer School on ICA, Berlin, Germany.

[Hering, 1872] Hering, E. (orginally published 1872). Outlines of a Theory ofthe Light Sense. Re-published 1964 by Harvard University Press, Cambridge,Massachusetts, USA.

[Hill et al., 1997] Hill, B., Roger, T., and Vorhagen, F. W. (1997). Comparativeanalysis of the quantization of color spaces on the basis of the CIELAB color-difference formula. ACM Transactions on Graphics, 16(2):109–154.

[Hodgson et al., 1999] Hodgson, M., Rempel, R., and Kennedy, S. (1999). Measurement and prediction of typical speech and background-noise levels in university classrooms during lectures. The Journal of the Acoustical Society of America, 105:226.

[Holmes, 2004] Holmes, N. (2004). In Defense of PowerPoint. Computer, 37(7):98–100.

[Howes, 1996] Howes, T. (1996). A String Representation of LDAP Search Filters. RFC 1960.

[Hurst and Muller, 2001] Hurst, W. and Muller, R. (2001). The AOF (Authoring on the Fly) system as an example for efficient and comfortable browsing and access of multimedia data. In Proceedings of the 9th International Conference Human-Computer Interaction Education (HCI), New Orleans, USA.

[Hurvich and Jameson, 1957] Hurvich, L. and Jameson, D. (1957). An opponent-process theory of color vision. Psychological Reviews, 64:384–404.

[ISO, 1997] ISO (1997). Acoustics – Preferred frequencies. Recommendation R.266:1997.

[ISO, 2003] ISO (2003). Acoustics – Normal equal-loudness-level contours. Recommendation R.226:2003.

[ISO/IEC JTC1, 1993] ISO/IEC JTC1 (1993). Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s (aka MPEG1). ISO/IEC 11172-2.

[ISO/IEC JTC1, 1994] ISO/IEC JTC1 (1994). Digital compression and coding of continuous-tone still images: Requirements and guidelines (aka JPEG). ISO/IEC 10918-1.

[ISO/IEC JTC1, 1997] ISO/IEC JTC1 (1997). Virtual Reality Modeling Language (VRML). ISO/IEC 14772-1.

[ISO/IEC JTC1 and ITU-T, 1996] ISO/IEC JTC1 and ITU-T (1996). Generic coding of moving pictures and associated audio information (aka MPEG2). ISO/IEC 13818-2.

[ISO/IEC JTC1 and ITU-T, 1999] ISO/IEC JTC1 and ITU-T (1999). Coding of audio-visual objects: Part 2 Visual (MPEG-4). ISO/IEC 14496-2.

[ISO/IEC JTC1 and ITU-T, 2005] ISO/IEC JTC1 and ITU-T (2005). Coding of audio-visual objects – Part 11: Scene description and application engine. ISO/IEC 14496-11.

[Itoh and Mizushima, 1997] Itoh, K. and Mizushima, M. (1997). Environmental Noise Reduction Based on Speech/Non-Speech Identification for Hearing Aids. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany.

[ITU, 2000] ITU (2000). ITU T H.263 Profile 0 Level 10 (aka H.263-2000). ITU H.263.

[ITU, 2001] ITU (2001). Method for objective measurements of perceived audio quality. ITU-R BS.1387-1.

[ITU-T, 1988] ITU-T (1988). Pulse Code Modulation (PCM) of Voice Frequencies. Recommendation G.711.

[ITU-T, 1990] ITU-T (1990). 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM). Recommendation G.726.

[Jankovic et al., 2006] Jankovic, B., Friedland, G., and Rojas, R. (2006). Experiments on Using MPEG-4 for Broadcasting Electronic Chalkboard Lectures. Technical Report B-06-05, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Jantz, 2006] Jantz, K. (2006). Ein Stift-Treiber fur eine interaktive Multiprojektionswand. Diplomarbeit, Institut fur Informatik, Freie Universitat Berlin, Berlin, Germany.

[Jantz et al., 2003] Jantz, K., Friedland, G., and Knipping, L. (2003). Conserving an Ancient Art of Music: Making SID Tunes Editable. In Computer Music Modeling and Retrieval 2003, pages 76–84, Montpellier, France.

[Jantz et al., 2004] Jantz, K., Friedland, G., Knipping, L., and Rojas, R. (2004). Trennung von Dozenten und Tafel in einem E-Kreide Video. Technical Report B-04-07, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Jantz et al., 2006] Jantz, K., Friedland, G., Zick, C., and Rojas, R. (2006). The Next Generation Classroom – Combining a Laser-Based Display System with an Intelligent Teaching Tool. In New Media in Education and Research (to appear), volume 5, Berlin, Germany. Technische Universitat Berlin.

[Jiang et al., 2004] Jiang, S., Ye, Q., Gao, W., and Huang, T. (2004). A New Method to Segment Playfield and its Applications in Match Analysis in Sports Video. In Proceedings of ACM Multimedia 2004, pages 292–295, New York, New York, USA. ACM Press.

[Katz, 2002] Katz, B. (2002). Mastering Audio: The Art and the Science. Focal Press (Elsevier), Oxford, UK.

[Kellman, 1995] Kellman, P. (1995). Ontogenesis of space and motion perception, pages 327–364. Academic Press.

[Kelly and Goldsmith, 2004] Kelly, S. D. and Goldsmith, L. (2004). Gesture and right hemisphere involvement in evaluating lecture material. Gesture, 4:25–42.

[Knecht et al., 2002] Knecht, H., Nelson, P., Whitelaw, G., and Feth, L. (2002). Background Noise Levels and Reverberation Times in Unoccupied Classrooms: Predictions and Measurements. American Journal of Audiology, 11(2):65–71.

[Knipping, 2005] Knipping, L. (2005). An Electronic Chalkboard for Classroom and Distance Teaching. Ph.D. thesis, Institut fur Informatik, Freie Universitat Berlin, Berlin, Germany.

[Krauss et al., 1995] Krauss, R., Dushay, R., Chen, Y., and Rauscher, F. (1995). The Communicative Value of Conversational Hand Gestures. Journal of Experimental Social Psychology, 31:533–552.

[Krupina, 2005] Krupina, O. (2005). NeuroSim: Neural Simulation System with a Client-Server Architecture. Ph.D. thesis, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Li et al., 2003] Li, L., Huang, W., Gu, I. Y. H., and Tian, Q. (2003). Foreground Object Detection from Videos Containing Complex Background. In Proceedings of ACM Multimedia 2003, Berkeley, California, USA.

[Li and Leung, 2002] Li, L. and Leung, M. K. H. (2002). Integrating intensity and texture differences for robust change detection. IEEE Transactions on Image Processing, 11(2):105–112.

[Li et al., 2005] Li, Y., Sun, J., and Shum, H.-Y. (2005). Video Object Cut and Paste. ACM Transactions on Graphics, 24(3):595–600.

[Liwicki, 2004] Liwicki, M. (2004). Erkennung und Simulation von logischen Schaltungen fur E-Chalk. Diplomarbeit, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Liwicki and Knipping, 2005] Liwicki, M. and Knipping, L. (2005). Recognizing and simulating sketched logical circuits. In Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, part 3, LNAI 3683, pages 588–594, Melbourne, Australia. Springer.

[Luan et al., 2001] Luan, X., Schwarte, R., Zhang, Z., Xu, Z., Heinol, H.-G., Buxbaum, B., Ringbeck, T., and Hess, H. (2001). Three-dimensional intelligent sensing based on the PMD technology. Sensors, Systems, and Next-Generation Satellites V. Proceedings of the SPIE., 4540:482–487.

[Ma et al., 2003] Ma, M., Schillings, V., Chen, T., and Meinel, C. (2003). T-Cube: A Multimedia Authoring System for eLearning. In Proceedings of the AACE E-Learn – World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, pages 2289–2296, Phoenix, Arizona, USA.

[Machnicki and Rowe, 2002] Machnicki, E. and Rowe, L. (2002). Virtual Director: Automating a Webcast. SPIE Multimedia Computing and Networking.

[Mack, 2002] Mack, S. (2002). Streaming Media Bible. Hungry Minds Inc, New York, New York, USA.

[Manhart, 1999] Manhart, K. (1999). Horfunk im Internet. Funkschau, 25/99.

[Martin et al., 2001] Martin, D., Fowlkes, C., Tal, D., and Malik, J. (2001). A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proceedings of the 8th International Conference on Computer Vision (ICCV2001), volume 2, pages 416–423, Vancouver, Canada.

[Mastoropoulou et al., 2005] Mastoropoulou, G., Debattista, K., Chalmers, A., and Troscianko, T. (2005). The influence of sound effects on the perceived smoothness of rendered animations. In APGV '05: Proceedings of the 2nd symposium on applied perception in graphics and visualization, pages 9–15, New York, New York, USA. ACM Press.

[Mathew et al., 1999] Mathew, J., Coddington, P., and Hawick, K. (1999). Analysis and Development of Java Grande Benchmarks. Technical Report DHCP-063, University of Adelaide, Department of Computer Science, Adelaide, Australia.

[Mayer et al., 2002] Mayer, G., Utz, H., and Kraetzschmar, G. K. (2002). Towards Autonomous Vision Self-Calibration for Soccer Robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-2002), volume 1, pages 214–219.

[Meinel et al., 2005] Meinel, C., Schillings, V., and Kutzner, M. (2005). tele-TASK – Ein praktikables, Standardkomponenten-basiertes, mobil einsetzbares Teleteaching-System. In Proceedings of the E-Lectures Workshop, Delfi Conference 2005, Rostock, Germany.

[Mertens et al., 2006] Mertens, R., Friedland, G., and Kruger, M. (2006). To See or Not To See: Layout Constraints, the Split Attention Problem and their Implications for the Design of Web Lecture Interfaces. In Proceedings of the AACE E-Learn – World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, Honolulu, Hawaii, USA.

[Milutinovic, 2002] Milutinovic, V. E. (2002). E-Business and E-Challenges. IOS Press, Amsterdam, The Netherlands.

[Mortensen and Barrett, 1999] Mortensen, E. and Barrett, W. (1999). Toboggan-based Intelligent Scissors with a Four Parameter Edge Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 452–458, Los Alamitos, CA, USA. IEEE Computer Society.

[Nahrstedt and Balke, 2004] Nahrstedt, K. and Balke, W.-T. (2004). A taxonomy for multimedia service composition. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 88–95, New York, New York, USA. ACM Press.

[Nahrstedt and Balke, 2005] Nahrstedt, K. and Balke, W.-T. (2005). Towards building large scale multimedia systems and applications: challenges and status. In Proceedings of the first ACM international workshop on Multimedia service composition, pages 3–10, New York, New York, USA. ACM Press.

[Narcisse P. Bichot and Kyle R. Cave and Harold Pashler, 1999] Narcisse P. Bichot and Kyle R. Cave and Harold Pashler (1999). Visual selection mediated by location: Feature-based selection of non-contiguous locations. Perception & Psychophysics, 61(3):403–423.

[Nascimento and Chitkara, 2002] Nascimento, M. A. and Chitkara, V. (2002). Color-based image retrieval using binary signatures. In SAC '02: Proceedings of the 2002 ACM symposium on Applied computing, pages 687–692, New York, New York, USA. ACM Press.

[Nielsen, 1999] Nielsen, J. (1999). Designing Web Usability, The Practice of Simplicity. New Rider Publishing, Indianapolis, Indiana, USA.

[Nuchter et al., 2003] Nuchter, A., Surmann, H., Lingemann, K., and Hertzberg, J. (2003). Consistent 3D Model Construction with Autonomous Mobile Robots. In Lecture Notes in Artificial Intelligence, volume 2821, pages 550–564, Heidelberg, Germany. Springer Verlag.

[Object Management Group (OMG), 1999] Object Management Group (OMG) (1999). CORBA 3.0 New Components Chapters, TC Document ptc/99-10-04. Needham, Massachusetts, USA.

[Oggier et al., 2004] Oggier, T., Lehmann, M., Kaufmann, R., Schweizer, M., Richter, M., Metzler, P., Lang, G., Lustenberger, F., and Blanc, N. (2004). An all-solid-state optical range camera for 3D real-time imaging with sub-centimeter depth resolution (SwissRanger). Optical Design and Engineering. Proceedings of the SPIE., 5249:534–545.

[Ogleby, 2001] Ogleby, C. (2001). Laser Scanning and Visualisation of an Australian Icon: Ned Kelly's Armour. In Proceedings of 7th International Conference on Virtual Systems and Multimedia, pages 201–208, California, USA. IEEE.

[Ooi et al., 1998] Ooi, B. C., Tan, K.-L., Chua, T. S., and Hsu, W. (1998). Fast image retrieval using color-spatial information. The VLDB Journal, 7(2):115–128.

[Ooi et al., 2000] Ooi, W. T., Pletcher, P., and Rowe, L. A. (2000). INDIVA: Middleware for Managing a Distributed Media Environment. Technical Report 166, Berkeley Media Research Center, Berkeley, California, USA.

[P. Deutsch, 1996] P. Deutsch (1996). GZIP file format specification version 4.3. RFC 1952.

[P. Deutsch and J-L. Gailly, 1996] P. Deutsch and J-L. Gailly (1996). ZLIB Compressed Data Format Specification version 3.3. RFC 1950.

[Pauls, 2003] Pauls, K. (2003). Eureka – an OSGi Resource Discovery Service. Diplomarbeit, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Pauls and Hall, 2004] Pauls, K. and Hall, R. S. (2004). Eureka – A Resource Discovery Service for Component Deployment. In Proceedings of the 2nd International Working Conference on Component Deployment (CD 2004).

[Prechelt, 2000] Prechelt, L. (2000). An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl search/string-processing program. Technical Report 5-2000, Universitat Karlsruhe, Fakultat fur Informatik, Karlsruhe, Germany.

[Raffel, 2000] Raffel, W.-U. (2000). E-Kreide, eine elektronische Tafel fur die multimediale Lehre. Diplomarbeit, Institut fur Informatik, Freie Universitat Berlin, Berlin, Germany.

[Rebenstorf, 2004] Rebenstorf, J. (2004). Entwicklung eines Bluetooth Stifts fur E-Kreide. Diplomarbeit, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Remondino and Roditakis, 2003] Remondino, F. and Roditakis, A. (2003). 3D Reconstruction of Human Skeleton from Single Images or Monocular Video Sequences. In Lecture Notes in Computer Science, volume 2781, pages 100–107. Springer Verlag, Heidelberg.

[Richardson, 2000] Richardson, D. (2000). Adventures in Diving Manual. International PADI Inc, Rancho Santa Margarita, California, USA.

[Riseborough, 1981] Riseborough, M. (1981). Physiographic Gestures as Decoding Facilitators: Three Experiments exploring a Neglected Facet of Communication. Journal of Nonverbal Behaviour, 5:172–183.

[Rojas et al., 2001a] Rojas, R., Knipping, L., Friedland, G., and Frotschl, B. (2001a). Ende der Kreidezeit – Die Zukunft des Mathematikunterrichts. DMV Mitteilungen, 2001(2):32–37.

[Rojas et al., 2001b] Rojas, R., Knipping, L., Raffel, W.-U., and Friedland, G. (2001b). Elektronische Kreide: Eine Java-Multimedia Tafel fur den Prasenz- und Fernunterricht. Informatik: Forschung und Entwicklung, 16(2):159–168.

[Rother et al., 2004] Rother, C., Kolmogorov, V., and Blake, A. (2004). GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts. ACM Trans. Graph., 23(3):309–314.

[Roussel, 2001] Roussel, N. (2001). Exploring New Uses of Video with VideoSpace. In EHCI '01: Proceedings of the 8th IFIP International Conference on Engineering for Human-Computer Interaction, pages 73–90, London, UK. Springer-Verlag.

[R.S. Hall and H. Cervantes, 2003] R.S. Hall and H. Cervantes (2003). Gravity: Supporting Dynamically Available Services in Client-Side Applications. In Poster paper in Proceedings of ESEC/FSE 2003.

[Rubner et al., 2000] Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40(2):99–121.

[Rui et al., 2001] Rui, Y., He, L., Gupta, A., and Liu, Q. (2001). Building an intelligent camera management system. In Proceedings of the ninth ACM International Conference on Multimedia, pages 2–11, New York, New York, USA. ACM Press.

[Santrac et al., 2006] Santrac, N., Friedland, G., and Rojas, R. (2006). High Resolution Segmentation with a Time-of-Flight 3D-Camera using the Example of a Lecture Scene. Technical Report B-06-09, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Schindler, 2006] Schindler, Y. (2006). Realisierung und Vergleich von Algorithmen zur Berechnung des Earth Mover's Abstands. Diplomarbeit, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Schulte, 2003] Schulte, J. (2003). Evaluation des Einsatzes der Software E-Kreide in der universitaren Lehre. Magisterarbeit, Technische Universitat Berlin, Institut fur Sprache und Kommunikation, Berlin, Germany.

[Schulzrinne et al., 2003] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. (2003). RTP: A Transport Protocol for Real-Time Applications. RFC 3550.

[Sheng et al., 2005] Sheng, M., Celler, B., Ambikairajah, E., and Epps, J. (2005). Development of a virtual classroom player for self-directed learning. In Proceedings of the 3rd International Conference on Multimedia and ICTs in Education (m-ICTE), Caceres, Extremadura, Spain.

[Shirazi, 2003] Shirazi, J. (2003). Java Performance Tuning. O'Reilly & Associates, Cambridge, Massachusetts, USA, 2nd edition.

[Simon et al., 2001] Simon, M., Behnke, S., and Rojas, R. (2001). Robust Real Time Color Tracking. In RoboCup 2000: Robot Soccer World Cup IV, pages 239–248, Heidelberg, Germany. Springer.

[Steffien, 2004] Steffien, H. (2004). Handschriftliche Erstellung und Ausfuhrung von Python-Skripten auf der E-Kreide Tafel. Bachelor's thesis, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Sun Microsystems Inc, 1997] Sun Microsystems Inc (1997). JavaBeans Specification. Version 1.0.1, Santa Clara, California, USA.

[Sun Microsystems Inc, 2000] Sun Microsystems Inc (2000). Enterprise JavaBeans Specification, Version 2.0, Final Draft. Santa Clara, California, USA.

[Sweller et al., 1990] Sweller, J., Chandler, P., Tierney, P., and Cooper, G. (1990). Cognitive Load as a Factor in the Structuring of Technical Material. Journal of Experimental Psychology: General, 119:176–192.

[Tanenbaum and van Steen, 2002] Tanenbaum, A. S. and van Steen, M. (2002). Distributed Systems, Principles and Paradigms. Prentice Hall, Upper Saddle River, New Jersey, USA.

[Tang et al., 2004] Tang, A., Neustaedter, C., and Greenberg, S. (2004). Embodiments for Mixed Presence Groupware. Technical Report 2004-769-34, University of Calgary, Department of Computer Science, Calgary, Canada.

[Tang et al., 2006] Tang, A., Neustaedter, C., and Greenberg, S. (2006). VideoArms: Embodiments for Mixed Presence Groupware. In Proceedings of the 20th British HCI Group Annual Conference (HCI 2006).

[Tang and Minneman, 1991] Tang, J. C. and Minneman, S. (1991). VideoWhiteboard: video shadows to support remote collaboration. In Proceedings of the SIGCHI conference on Human factors in computing systems (CHI '91), pages 315–322, New York, New York, USA. ACM Press.

[Tang and Minneman, 1990] Tang, J. C. and Minneman, S. L. (1990). VideoDraw: a video interface for collaborative drawing. In Proceedings of the SIGCHI conference on Human factors in computing systems (CHI '90), pages 313–320, New York, New York, USA. ACM Press.

[Tapia, 2005] Tapia, E. (2005). Understanding Mathematics: A System for the Recognition of On-Line Handwritten Mathematical Expressions. Ph.D. thesis, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Tellinghuisen and Nowak, 2003] Tellinghuisen, D. J. and Nowak, E. J. (2003). The inability to ignore auditory distractors as a function of visual task perceptual load. Perception & Psychophysics, 65:817–828.

[The Eclipse Foundation, 2003] The Eclipse Foundation (2003). Eclipse Platform – Technical Overview. Technical report, Object Technology International Inc.

[The Open Services Gateway Initiative, 2003] The Open Services Gateway Initiative (2003). OSGi Service Platform. IOS Press, Amsterdam, The Netherlands. Release 3.

[Theimer, 2004] Theimer, F. (2004). Automatische Handschrifterkennung in E-Kreide Dokumenten. Bachelor's thesis, Freie Universitat Berlin, Institut fur Informatik, Berlin, Germany.

[Thiede et al., 2000] Thiede, T., Treurniet, W. C., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J. G., Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K., and Feiten, B. (2000). PEAQ – The ITU standard for Objective Measurement of Perceived Audio Quality. Journal of the Audio Engineering Society, 48(1/2):3–29.

[Trinkwalder, 2006] Trinkwalder, A. (2006). Bitte Freimachen – Halbautomatische Verfahren zum Freistellen von Bildern. c't – Magazin fur Computer Technik, 2006(3):168–171.

[Tsai, 1987] Tsai, R. Y. (1987). A versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation, RA-3(4):323–344.

[Tufte, 2003] Tufte, E. R. (2003). The Cognitive Style of PowerPoint. Graphics Press LLC, Cheshire, Connecticut, USA.

[Vezhnevets and Konouchine, 2005] Vezhnevets, V. and Konouchine, V. (2005). GrowCut – Interactive Multi-Label N-D Image Segmentation By Cellular Automata. In Proceedings of GraphiCon 2005 – Fifteenth International Conference on Computer Graphics and Vision, Novosibirsk Akademgorodok, Russia.

[Wallick et al., 2005] Wallick, M., Heck, R., and Gleicher, M. (2005). Marker and Chalkboard Regions. In Proceedings of Mirage 2005, pages 223–228.

[Wang et al., 2005] Wang, J., Bhat, P., Colburn, R. A., Agrawala, M., and Cohen, M. F. (2005). Interactive Video Cutout. ACM Transactions on Graphics, 24(3):585–594.

[Wang and Adelson, 1994] Wang, J. Y. A. and Adelson, E. H. (1994). Representing moving images with layers. IEEE Transactions on Image Processing, 3:625–637.

[Wang et al., 2003] Wang, Y., Tan, T., and Loe, K.-F. (2003). Video Segmentation Based on Graphical Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 335–342, Los Alamitos, CA, USA. IEEE Computer Society.

[William and Fardon, 2005] William, J. and Fardon, M. (2005). On-demand Internet-transmitted Lecture Recordings: Attempting to Enhance and Support the Lecture Experience. In Proceedings of 12th International Conference of the Association for Learning Technology (ALT-C), Manchester, GB.

[Wyszecki and Stiles, 1982] Wyszecki, G. and Stiles, W. S. (1982). Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley and Sons, New York, New York, USA.

[Yasuda et al., 2004] Yasuda, K., Naemura, T., and Harashima, H. (2004). Thermo-Key: Human Region Segmentation from Video. IEEE Computer Graphics and Applications, 24(1):26–30.

[Zabih et al., 1995] Zabih, R., Miller, J., and Mai, K. (1995). A feature-based algorithm for detecting and classifying scene breaks. In Proceedings of the third ACM International Conference on Multimedia, pages 189–200, New York, New York, USA. ACM Press.

[Zhang et al., 1997] Zhang, X., Farrell, J. E., and Wandell, B. A. (1997). Applications of a Spatial Extension to CIELAB. In SPIE Electronic Imaging, New York, New York, USA. ACM Press.

[Zhu et al., 2004] Zhu, Q., Wu, C.-T., Cheng, K.-T., and Wu, Y.-L. (2004). An Adaptive Skin Model and Its Application to Objectionable Image Filtering. In Proceedings of ACM Multimedia 2004, pages 56–63, New York, New York, USA.

[Ziewer and Seidl, 2004] Ziewer, P. and Seidl, H. (2004). Annotiertes Lecture Recording. In Delfi Conference 2004, Paderborn, Germany.

[Zitnick and Kanade, 2000] Zitnick, C. L. and Kanade, T. (2000). A Cooperative Algorithm for Stereo Matching and Occlusion Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):675–684.

Web References

[1] 3DV Systems Inc. DMC 100 Depth Machine Camera (last visited: 2005-07-01). http://www.3dvsystems.com, 2004.

[2] AICC and IMS and IEEE and Ariadne. Shareable Content Object Reference Model (SCORM) (last visited: 2005-07-01). http://www.imsglobal.org, 2003.

[3] OSGi Alliance. OSGi Official Web Site (last visited: 2005-07-01). http://www.osgi.org, 2004.

[4] Apache Organization. The Avalon Framework (last visited: 2005-07-01). http://jakarta.apache.org/avalon.

[5] Apple Computer Inc. Bonjour – Official Web Site (last visited: 2005-07-01). http://developer.apple.com/networking/bonjour/.

[6] Apple Inc. Apple iTunes Podcasts (last visited: 2005-07-01). http://www.apple.com/itunes/podcasts/, 2006.

[7] Apple Inc. Apple Logic Pro (last visited: 2006-07-01). http://www.apple.com/logicpro/, 2006.

[8] Apple Inc. Apple Quicktime (last visited: 2005-07-01). http://www.apple.com/quicktime/, 2006.

[9] Cliff Atkinson. Five Experts Dispute Edward Tufte on PowerPoint (last visited: 2005-07-01). http://www.sociablemedia.com/articles dispute.htm, 2006.

[10] Berliner Gruselkabinett Entertainment GmbH. Berliner Gruselkabinett (last visited: 2006-07-01). http://www.gruselkabinett.de/, 2000.

[11] Canesta Inc. CanestaVision EP Development Kit (last visited: 2005-07-01). http://www.canesta.com/devkit.htm, 2004.

[12] TechSmith Corporation. Camtasia Studio Screen Recorder (last visited: 2005-06-20). http://www.CamtasiaStudio.com.

[13] CSEM Sa. SwissRanger 3D Vision Camera (last visited: 2005-07-01). http://www.swissranger.ch, 2004.

[14] Dudo Erny. Free Pictures Download (last visited: 2006-07-01). http://www.bigfoto.com, 2005.

[15] DyKnow Inc. The DyKnow System (last visited: 2005-07-01). http://www.dyknow.com, 2006.

[16] E-Chalk Team. MASI: Media Applet Synchronization Interface (last visited: 2005-07-01). http://kazan.inf.fu-berlin.de/echalk/docs/MASI/.

[17] E-Chalk Team. SOPA: Self Organizing Streaming and Processing Architecture (last visited: 2005-07-01). http://www.sopa.inf.fu-berlin.de.

[18] Eyetronics Inc. Eyetronics 3D Laser Scanner (last visited: 2005-07-01). http://www.eyetronics.com, 2005.

[19] Tim Ferguson. Cinepak (CVID) stream format for AVI and QT (last visited: 2005-07-01). http://www.csse.monash.edu.au/%7etimf/videocodec/cinepak.txt, 2001.

[20] Tom Fine. Designing an Electronic Classroom (last visited: 2005-07-01). http://hea-www.harvard.edu/∼fine/opinions/classroom.html, 2003.

[21] GIMP Developers. GIMP: GNU Image Manipulation Program (last visited: 2006-07-01). http://www.gimp.org, 2006.

[22] GIOVE partners (EU-project). The Giustiniani Collection in a Virtual Environment (last visited: 2006-07-01). http://www.giustiniani.org/, 1999.

[23] Richard S. Hall. Oscar – An OSGi framework implementation (last visited: 2005-07-01). http://oscar.objectweb.org/.

[24] Hitachi Software Engineering America, Ltd. (last visited: 2005-07-01). Hitachi Starboard. http://www.hitachi-soft.com/, 2004.

[25] imc information multimedia communication AG. Lecturnity – explain everything (last visited: 2005-06-20). http://www.lecturnity.de.

[26] CollabWorx Inc. Overview of LecCorder system (last visited: 2005-06-20). http://www.leccorder.com.

[27] Inkscape Team. Inkscape. Draw Freely (last visited: 2006-07-01). http://www.inkscape.org, 2006.

[28] Interactive Whiteboards, Wireless Pads, and Digitizers (last visited: 2005-07-01). GTCo CalComp Peripherals. http://www.gtco.com/, 2004.

[29] Krita Team. The KOffice Project – Krita (last visited: 2006-07-01). http://www.koffice.org/krita/, 2006.

[30] Magix AG. Official Magix Samplitude Website (last visited: 2006-07-01). http://www.samplitude.com/, 2006.

[31] Microsoft Corporation. Microsoft DirectShow 9.0 (last visited: 2005-07-01). http://msdn.microsoft.com/en-us/directshow/htm/directshow.asp, 2006.

[32] Microsoft Corporation. Microsoft Office Online Homepage (last visited: 2005-07-01). http://office.microsoft.com/, 2006.

[33] Microsoft Corporation. Microsoft Research Conference XP Project (last visited: 2005-07-01). http://www.conferencexp.net, 2006.

[34] Microsoft Corporation. Microsoft Windows Media – Your Digital Entertainment Resource (last visited: 2005-07-01). http://www.microsoft.com/windows/windowsmedia/, 2006.

[35] Microsoft Research. Microsoft Foreground Extraction Benchmark Dataset (last visited: 2005-07-01). http://www.research.microsoft.com/vision/cambridge/segmentation/, 2004.

[36] Numonics Corporation (last visited: 2005-07-01). The Interactive Whiteboard People. http://www.numonics.com/, 2004.

[37] Omnipilot Inc. Lasso Professional Server (last visited: 2005-07-01). http://www.omnipilot.com, 2005.

[38] Plantronics Inc. Plantronics Volume Logic (last visited: 2006-07-01). http://www.octiv.com/, 2006.

[39] PMD Technologies GmbH. PMDTec 3D Vision Camera (last visited: 2005-07-01). http://www.pmdtec.com, 2004.

[40] Polycom Inc. Polycom Worldwide (last visited: 2005-07-01). http://www.polycom.com, 2006.

[41] Hasso-Plattner-Institute Potsdam. tele-TASK – Tele-Teaching Anywhere Solution Kit (last visited: 2005-06-20). http://www.tele-task.de.

[42] RealNetworks Inc. RealNetworks Homepage (last visited: 2005-07-01). http://www.realnetworks.com/, 2005.

[43] RealNetworks Inc. Helix Universal Server Administration Guide – Online Version (last visited: 2005-07-01). http://service.real.com/help/library/guides/HelixServerWireline/wwhelp/wwhimpl/js/html/wwhelp.htm, 2006.

[44] Richard Anderson. Classroom Presenter: A Tablet PC Based System to Support Active Presentation (last visited: 2005-07-01). http://www.cs.virginia.edu/colloquia/event436.html, 2004.

[45] R. Rojas. E-Chalk Lecture on Statistical Classification (last visited: 2005-07-01). http://www.inf.fu-berlin.de/lehre/WS05/Mustererkennung/gaussians.

[46] Marc Roulo. JavaWorld: Accelerate your Java Apps (last visited: 2005-07-01). http://www.javaworld.com/javaworld/jw-09-1998/jw-09-speed-p4.html.

[47] Scientific and Parallel Computing Lab, Computer Science Department, University of Geneva. About n-Genes (last visited: 2006-07-01). http://cui.unige.ch/spc/tools/n-genes/, 2006.

[48] SIOX Team. SIOX: Simple Interactive Object Extraction (last visited: 2006-07-01). http://www.siox.org, 2006.

[49] Smart Technologies Inc (last visited: 2005-07-01). Interactive Whiteboard Technology. http://www.smarttech.com/, 2004.

[50] Stanford University. Stanford Video (last visited: 2005-07-01). http://stanfordvideo.stanford.edu/, 2006.

[51] Steinberg Media Technologies GmbH. Steinberg Media Technologies GmbH (last visited: 2006-07-01). http://www.steinberg.net/, 2006.

[52] Ulrich Stern. Java vs. C++ (last visited: 2005-07-01). http://verify.stanford.edu/uli/java cpp.html.

[53] Sun Microsystems Inc. Sun Java 1.1 API Documentation (last visited: 2006-07-01). http://java.sun.com/products/jdk/1.1/docs/api/packages.html.

[54] Sun Microsystems Inc. List of Formats supported by the Java Media Framework (last visited: 2006-07-01). http://java.sun.com/products/java-media/jmf/2.1.1/formats.html, 2002.

[55] Tegrity Inc. Tegrity web Learner (last visited: 2005-07-01). http://www.tegrity.com, 2006.

[56] TG Publishing AG. Tom's Hardware Guide (last visited: 2006-07-01). http://www.tomshardware.com/, 2006.

[57] The Robocup Federation. Robocup Official Site (last visited: 2006-07-01). http://www.robocup.org, 2006.

[58] UniRadio Berlin-Brandenburg e.V. Uniradio (last visited: 2006-07-01). http://www.uniradio.de, 2006.

[59] University of California, Berkeley. UC Berkeley Courses and Events Live and On-demand (last visited: 2005-07-01). http://webcast.berkeley.edu, 2006.

[60] University of California, Los Angeles – Office of Instructional Development. UCLA Webcasts (last visited: 2005-07-01). http://www.oid.ucla.edu/webcasts, 2006.

[61] University of Indiana, Telecommunication Division. iPod Lecture Recording Project (last visited: 2005-07-01). http://www.indiana.edu/∼video/stream/is ipod.php, 2006.

[62] Computer Science University of Iowa. Introduction to XML (last visited: 2005-06-20). http://weblog.cs.uiowa.edu/lectures/xml-intro/XML-intro.html.

[63] University of Washington, Computer Science. Classroom Presenter: Getting Started (last visited: 2005-07-01). http://www.cs.washington.edu/homes/rea/Presenter howto.htm, 2005.

[64] University of Washington, Computer Science. Classroom Use of Classroom Presenter (last visited: 2005-07-01). http://www.cs.washington.edu/research/edtech/presenter/classroom.html, 2005.

[65] Videre Design Inc. Videre Design (last visited: 2005-07-01). http://www.viderediesign.com, 2006.

[66] W3C. SOAP – Simple Object Access Protocol v1.2 (last visited: 2005-07-01). http://www.w3.org/TR/soap, 2003.

[67] W3C. Synchronized Multimedia Integration Language (SMIL) (last visited: 2005-07-01). http://www.w3.org/AudioVideo, 2005.

[68] W3C. Common Gateway Interface (last visited: 2005-07-01). http://www.w3.org/CGI/, 2006.

[69] WebCT Inc. WebCT Learning without Limits (last visited: 2005-07-01). http://www.webct.com, 2005.

[70] Authoring on the Fly (AOF) (last visited: 2005-07-01). http://ad.informatik.uni-freiburg.de/mmgroup.aof/.

[71] Blender Homepage (last visited: 2006-07-01). http://www.blender.org, 2006.

[72] Digital Video Broadcasting (last visited: 2005-07-01). http://www.dvb.org.

[73] E-Chalk Homepage (last visited: 2005-07-01). http://www.echalk.de.

[74] eClass – the project formerly known as Classroom 2000 (last visited: 2005-07-01). http://www.cc.gatech.edu/fce/eclass/.

[75] Envivio Inc (last visited: 2005-07-01). http://www.envivio.com/.

[76] Exymen Homepage (last visited: 2005-07-01). http://www.exymen.org.

[77] Google Image Search (last visited: 2005-06-20). http://images.google.com.

[78] GPAC Project on Advanced Content (last visited: 2005-07-01). http://gpac.sourceforge.net/.

[79] Fernuniversitat Hagen (last visited: 2005-07-01). http://www.fernuni-hagen.de/.

[80] Home Audio Video Interoperability (last visited: 2005-07-01). http://www.havi.org.

[81] IBM Toolkit for MPEG-4 (last visited: 2005-07-01). http://www.alphaworks.ibm.com/tech/tk4mpeg4.

[82] iLecture system – also known as Lectopia (last visited: 2005-07-01). http://ilectures.uwa.edu.au.

[83] The Community Resource for Jini Technology (last visited: 2005-07-01). http://www.jini.org.

[84] Java Media Framework (last visited: 2005-07-01). http://java.sun.com/products/java-media/jmf/index.jsp.

[85] MBone-DE (last visited: 2005-07-01). http://www.mbone.de.

[86] LBNL's Network Research Group MBONE tools (last visited: 2005-07-01). http://www-nrg.ee.lbl.gov.

[87] MPEG-4 Industry Forum (last visited: 2005-07-01). http://www.m4if.com.

[88] MPEG-4 LA (last visited: 2005-07-01). http://www.mpegla.com.

[89] The Open University (last visited: 2005-07-01). http://www.open.ac.uk/.

[90] Universidad Nacional de Educacion a Distancia (last visited: 2005-07-01). http://www.uned.es/.

[91] Wacom Inc (last visited: 2005-07-01). http://www.wacom.com.

List of Figures

1.1 Organization of this document . . . . . . . . . . . . . . . . . . . 3

2.1 Internet broadcasting work flow suggested by RealNetworks, Inc. . . . 7
2.2 Typical software architecture of Internet broadcasting software . . . 8
2.3 Prototype lecture room of eClass project . . . 9
2.4 eClass lecture replay . . . 10
2.5 A presentation with LecCorder . . . 11
2.6 Lecture replay with AOF . . . 12
2.7 Lecture replay with Lecturnity . . . 13
2.8 Lecture replay with iLecture . . . 14
2.9 Lecture replay with Camtasia . . . 15
2.10 A lecture recorded with tele-TASK . . . 16
2.11 The Classroom Presenter in action . . . 17
2.12 FU PowerPoint Recorder . . . 18

3.1 E-Chalk's idea sketched with E-Chalk . . . 22
3.2 E-Chalk setup for larger lecture halls . . . 24
3.3 Datawall at FU Berlin . . . 25
3.4 E-Chalk as part of a videoconference . . . 26
3.5 Exymen . . . 27

4.1 E-Chalk's old server architecture . . . 30
4.2 E-Chalk server architecture . . . 33
4.3 OSGi Bundle life cycle . . . 36
4.4 SOPA's node editor . . . 40
4.5 E-Chalk Startup Wizard: audio panel . . . 47
4.6 Testing environment for SOPA video nodes . . . 49

5.1 Live replay with E-Chalk's Java client . . . 52
5.2 On-demand replay with E-Chalk's Java client . . . 53
5.3 E-Chalk lecture replay in a browser . . . 54
5.4 E-Chalk lecture replay on a PDA . . . 55
5.5 Instructor overlay using a Java client . . . 56
5.6 E-Chalk's slide-show component . . . 57
5.7 JPEG artifacts on E-Chalk board images . . . 58
5.8 E-Chalk replay in Windows Media Player . . . 59
5.9 E-Chalk replay on mobile phone and iPod . . . 60
5.10 E-Chalk replay using MPEG-4 . . . 61
5.11 Antialiasing in MPEG-4 . . . 62
5.12 E-Chalk lectures scaled in MPEG-4 player . . . 65

6.1 The E-Chalk lecture repair tool . . . 73

7.1 The steps of the audio diagnose wizard . . . 82
7.2 Audio wizard: report panel . . . 84
7.3 Active Recording processing chain . . . 85
7.4 Warning from level monitor . . . 86
7.5 With and without mixer control (short term) . . . 87
7.6 With and without mixer control (long term) . . . 88
7.7 Spectral subtraction . . . 89

8.1 Chalkboard lecture replay with RealVideo . . . 92
8.2 E-Chalk replay with additional video . . . 93
8.3 E-Chalk Startup Wizard: video panel . . . 94
8.4 Visualization of E-Chalk Video's motion compensation . . . 95

9.1 The idea of the instructor extraction . . . 99
9.2 Segmentation results using a stereo camera . . . 102
9.3 Instructor extraction: hardware setup . . . 104
9.4 Instructor extraction: input signal . . . 105
9.5 Instructor extraction: light problems . . . 106
9.6 Initial instructor extraction using motion statistics . . . 107
9.7 Initial instructor extraction using histograms . . . 109
9.8 Gathering of a subset of the background . . . 112
9.9 Sample image and corresponding color signature . . . 113
9.10 Two examples of extracted instructors . . . 114
9.11 With and without board stroke suppression . . . 115
9.12 Instructor extraction: final results . . . 116
9.13 E-Chalk replay on mobile phone with overlaid instructor . . . 117

10.1 SIOX vs Knockout 2 . . . 120
10.2 Mapping from user selection to confidence matrix . . . 122
10.3 Visualization of different color clustering strategies . . . 123
10.4 SIOX with and without post-processing . . . 124
10.5 SIOX as a tool in GIMP . . . 126
10.6 Idea of the Detail Refinement Brush . . . 127
10.7 Detail Refinement Brush: sample results . . . 128
10.8 Multiobject extraction . . . 129
10.9 SIOX for videos . . . 130
10.10 SIOX for Robocup . . . 131
10.11 SIOX benchmark input . . . 132
10.12 SIOX benchmark results . . . 133
10.13 SIOX speed in GIMP . . . 136
10.14 SIOX vs Grabcut . . . 137
10.15 Limits of SIOX . . . 138
10.16 SIOX in Blender . . . 139
10.17 SIOX in Inkscape . . . 140

11.1 The time-of-flight principle . . . 144
11.2 The SwissRanger camera . . . 145
11.3 Raw depth-image segmentation . . . 146
11.4 Enhancement with method by Diebel and Thrun . . . 147
11.5 SIOX and 3D cameras . . . 148

12.1 The instructor presented as one line . . . 153

A.1 Conceptual overview of the E-Chalk system . . . 159

E.1 E-Chalk's VU meter . . . 171
E.2 E-Chalk's equalizer . . . 172

