Penn Discourse Treebank: Complexity of Dependencies at the Discourse Level and at the Sentence Level

Aravind K. Joshi
Department of Computer and Information Science and
Institute for Research in Cognitive Science
University of Pennsylvania
Philadelphia, PA


First, I will describe the Penn Discourse Treebank (PDTB)*, a corpus in which we annotate the discourse connectives (explicit and implicit) and their arguments, together with the "attributions" of the arguments and of the relations denoted by the connectives, as well as the senses of the connectives. I will then discuss some issues concerning the complexity of dependencies: the elements that bear the dependency relations; the graph-theoretic properties of these dependencies, such as nested and crossed dependencies; dependencies with shared arguments; and the attributions and their relationship to the dependencies. I will compare these dependencies with those at the sentence level, then discuss some aspects of the transition from the sentence level to the level of "immediate discourse", and propose some conjectures.

*This 1-million-word corpus is the same as the WSJ corpus used by the Penn Treebank (PTB) for syntactic annotation and by PropBank for predicate-argument annotation. PDTB 2.0 will be released by the Linguistic Data Consortium (LDC) in early February 2008.

Members of the PDTB project: Nikhil Dinesh, Aravind K. Joshi, Alan Lee, Eleni Miltsakaki, Rashmi Prasad, and Bonnie Webber (University of Edinburgh).

The 21st Sejong Project: With a Focus on Building the SELK (Sejong Electronic Lexicon of Korean) and the Sejong Korean Corpora

Hyopil Shin
Associate Professor, Dept. of Linguistics, Seoul National University
Affiliated Professor, School of Computer Engineering, Seoul National University


The 21st Sejong Project started in 1998 with a 10-year plan. The project was funded by the Ministry of Culture and Tourism of the Korean government. Its goal was to promote expertise in Korean language research and technology. The project consists of eight subprojects, ranging from the construction of Korean language resources to the management and distribution of the project's outputs. The core of the project is to compile an electronic lexical dictionary and to build a large-scale Korean corpus.

The SELK focuses on an exhaustive representation of Korean linguistic knowledge by harmonizing linguistic validity, psychological reality, and computational efficiency. The SELK is composed of various sub-dictionaries corresponding to part-of-speech-based word categories such as nouns, verbs, and adverbs. The lexicon differs considerably from other printed or machine-readable Korean dictionaries in its precise and comprehensive representation.

The Sejong Korean Corpora project has two subdivisions, one for a general corpus and the other for a special corpus. The general corpus division collected a wide range of unconstrained materials and endeavored to annotate the data with part-of-speech, syntactic, and semantic tags. The special corpus division, on the other hand, constructed a spoken Korean corpus, Korean-English and Korean-Japanese corpora, a historical corpus, and a corpus of the language used by North Koreans and overseas Koreans.

The SELK and the Sejong Korean Corpora are beginning to serve as important research tools for investigators in natural language processing as well as in theoretical linguistics. Annotated corpora and well-established electronic dictionaries promise to be valuable for enterprises such as the construction of statistical models for the grammar of written and spoken Korean, the development of software for Korean language processing, and even the publication of printed Korean dictionaries.

In this talk, I will introduce the 21st Sejong Project and review my experience with constructing two such large language resources: the SELK, consisting of about 600,000 lexical entries, and the Sejong Korean Corpora, consisting of collections totaling about 150 million words. Considering its size and the time needed to develop it, this project deserves great attention. However, we also experienced many difficulties through trial and error, which inevitably arose from such a long work period and the large scale of the work. We hope that sharing these experiences will help researchers with the same interests to break through the obstacles and to avoid the mistakes we have made over the past decade.

Language Processing for the Evolving Web

Srinivasan Sengamedu
Yahoo!, Bangalore


The World Wide Web brings several new dimensions to language processing: social, multimodal, structural, and others. The social dimension arises from the tagging phenomenon, the multimodal from the coexistence of images and videos with text in web documents, and the structural from the rich formatting of web pages. While the massive amounts of data available have made new approaches to translation, summarization, and extraction possible, next-generation applications like semantic search require radically new theoretical ideas. The talk will outline the phenomena, summarize recent achievements, and pose the new challenges.

