The Second International Joint Conference on Natural Language Processing (IJCNLP-05)

Tutorial 1:

Statistical Machine Translation Part I: Hands-On Introduction

Time: 9:00 ~ 12:00
Place: Grand Ballroom 2

Stephan VOGEL
Carnegie Mellon University
407 South Craig Street, Pittsburgh, PA 15213

Statistical machine translation (SMT) is currently one of the hot spots in natural language processing. Over the last few years dramatic improvements have been made, and a number of comparative evaluations have shown that SMT gives results competitive with those of rule-based translation systems while requiring significantly less development time. This is particularly important when building translation systems for new language pairs or new domains.

This tutorial is intended to give an introduction to statistical machine translation with a focus on practical considerations. After attending, participants should be able to start building an SMT system themselves and achieve good baseline results in a short time.

The tutorial will cover the basics of SMT:

  • architecture of an SMT system
  • word alignment models, esp. IBM1 and HMM models
  • phrase alignment, from Viterbi path and direct phrase alignment models
  • decoder, including recombination, pruning, n-best list generation
  • integrating output from other MT engines (multi-engine translation)
  • data processing: checking, cleaning, normalizing the data
  • evaluation, especially automatic evaluation (BLEU, NIST, ...), including significance analysis
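
As an illustration of the word alignment topic listed above, the following is a minimal sketch of EM training for IBM Model 1. It is not taken from STTK; the function name `train_ibm1`, the toy sentence pairs, and the omission of the NULL source word are all assumptions made to keep the example short.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """EM training of IBM Model 1 lexical probabilities.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict t with t[(f, e)] approximating P(f | e), the probability
    of target word f given source word e.
    Simplification (assumption): no NULL source word is added.
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Uniform initialisation over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked

        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c

        # M-step: renormalise expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]

    return t


# Hypothetical toy corpus (German source, English target).
toy_pairs = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm1(toy_pairs, iterations=10)
```

On this toy corpus the probability mass for "das" moves toward "the" because the two words co-occur in two sentence pairs, which in turn pushes "Haus" toward "house". A real system would add the NULL word, handle much larger vocabularies, and typically continue with HMM alignment models.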

Theory will be put into practice. STTK, a statistical machine translation tool kit, will be introduced and used to build a working translation system. STTK has been developed by the presenter and co-workers over a number of years and is currently used as the basis of CMU's SMT system. It has also been successfully coupled with rule-based and example-based machine translation modules to build a multi-engine machine translation system. The source code of the tool kit will be made available.


Statistical Machine Translation Part II: Tree-Based SMT (Cancelled)

Dekai WU
Human Language Technology Center
Hong Kong University of Science and Technology (HKUST)
Clear Water Bay, Hong Kong

One of the most active and promising areas of statistical machine translation (SMT) research is tree-based SMT. Tree-based SMT has the potential to overcome the weaknesses of early SMT architectures, which (a) do not handle long-distance dependencies well, and (b) are underconstrained in that they allow too much flexibility in word reordering.

In this tutorial, we will review the various possible approaches to tree-based SMT, ranging from the original Inversion Transduction Grammar (ITG) models to later models such as alignment templates, dependency models, tree-to-string models, tree-to-tree models, and also probabilistic EBMT models. We will discuss the theoretical relationships between approaches, with critical analysis of their strengths and weaknesses. Within this framework we will survey the emerging comparative results from intriguing new large-scale empirical studies across various language pairs. We will consider what kind of constraints and biases can or should be imposed by models on the variation between unrelated human languages, and how this can facilitate efficient algorithms for a wide range of tasks in machine learning and processing of language. We will consider both scientific and engineering implications, and investigate the potential relationships to cross-language universals.

Biography: Prof. Wu received his PhD in Computer Science from the University of California at Berkeley, and was a postdoctoral fellow at the University of Toronto (Ontario, Canada) prior to joining HKUST in 1992. He received a BS in Computer Engineering from the University of California at San Diego (Revelle College departmental award, cum laude, Phi Beta Kappa) in 1984 and an Executive MBA from Kellogg and HKUST in 2002. He was a visiting researcher at Columbia University in 1995-96, Bell Laboratories in 1995, and the Technische Universität München (Munich, Germany) during 1986-87. Prof. Wu serves as Associate Editor of ACM Transactions on Speech and Language Processing, Machine Translation, Journal of Natural Language Engineering, and Communications of COLIPS. He has also served as Co-Chair for EMNLP-2004, and on the Editorial Board of Computational Linguistics, the Organizing Committee of ACL-2000 and WVLC-5 (SIGDAT 1997), and the Executive Committee of the Association for Computational Linguistics (ACL).


Tutorial 2:

Automated Text Summarization

Time: 13:30 ~ 16:30
Place: Grand Ballroom 4

Chin-Yew LIN
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292-6695

After lying dormant for over two decades, automated text summarization has experienced a tremendous resurgence of interest in the past few years. Research is being conducted in China, Europe, Japan, and North America, and industry has brought more than 30 summarization systems to market. Most recently, large-scale text summarization evaluations, the Document Understanding Conference (DUC) and the Text Summarization Challenge (TSC), have been held yearly in the United States and Japan.

In this tutorial, we will review the state of the art in automatic summarization, and will discuss and critically evaluate current approaches to the problem. We will first outline the major types of summary: indicative vs. informative; abstract vs. extract; generic vs. query-oriented; background vs. just-the-news; single-document vs. multi-document; and so on. We will describe the typical decomposition of summarization into three stages, and explain in detail the major approaches to each stage. For topic identification, we will outline techniques based on stereotypical text structure, cue words, high-frequency indicator phrases, intratext connectivity, and discourse structure centrality. For topic fusion, we will outline some ideas that have been proposed, including concept generalization and semantic association. For summary generation, we will describe the problems of sentence planning to achieve information compaction.

How good is a summary? Evaluation is a difficult issue. We will describe various suggested measures and discuss the adequacy of current evaluation methods, including the manual evaluation procedures used in DUC, the factoid and pyramid methods for creating reference summaries, and fully automatic evaluation methods such as ROUGE. The recently developed automatic evaluation method based on basic elements (BE) will also be covered.
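
To give a flavor of the automatic evaluation mentioned above, here is a minimal sketch of an n-gram recall score in the style of ROUGE-N. It is not the ROUGE package itself: the function names, the plain whitespace tokenization, and the absence of stemming or stopword handling are assumptions made for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """N-gram recall of a candidate summary against reference summaries.

    candidate: list of tokens; references: list of token lists.
    Matches are clipped: an n-gram counts at most as often as it
    occurs in the candidate.
    """
    cand_counts = ngrams(candidate, n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0


# Hypothetical example sentences.
candidate = "the cat sat on the mat".split()
reference = "the cat was on the mat".split()
unigram_recall = rouge_n_recall(candidate, [reference], n=1)  # 5 of 6 reference unigrams
bigram_recall = rouge_n_recall(candidate, [reference], n=2)   # 3 of 5 reference bigrams
```

Because the score is recall-oriented, a long candidate is not penalized for extra material; practical evaluations therefore usually fix the summary length or report precision and F-measure alongside recall.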

Throughout, we will highlight the strengths and weaknesses of statistical and symbolic/linguistic techniques in implementing efficient summarization systems. We will discuss ways in which summarization systems can interact with and/or complement natural language generation, discourse parsing, information extraction, and information retrieval systems.

Finally, we will present a set of open problems that we perceive as being crucial for immediate progress in automatic summarization.

Biography: Chin-Yew Lin is a senior research scientist at the Information Sciences Institute of the University of Southern California. He was the chief architect of SUMMARIST and NeATS. He also developed the automatic summarization evaluation package ROUGE, which has been used in the DUC evaluations. He has co-chaired several text summarization and question answering workshops at ACL, NAACL, and COLING.


Semi-automatic corpus annotation based on machine learning models (Cancelled)

Kiyotaka UCHIMOTO
Computational Linguistics Group
Keihanna Human Info-Communication Research Center
National Institute of Information and Communications Technology
3-5 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan

Annotated corpora of high quality are required for linguistic analyses as well as for many NLP applications such as machine translation, especially for less-studied languages. Freely available machine learning software can reduce the burden of annotation, but significant work is still required to construct large and sophisticated corpora. In this tutorial, I will introduce an efficient framework for human-aided corpus annotation based on machine learning models, and give examples of how to design and construct real corpora, for example, corpora tagged with morphological or error information. The framework consists mainly of three parts: (1) the maintenance of a training corpus, (2) the problem reduction for machine learning models, and (3) the error reduction in a target corpus. I hope that this tutorial will help users reduce labor costs and construct annotated corpora of good quality. The expected audience of this tutorial is students and researchers involved in constructing annotated corpora.

Biography: Kiyotaka Uchimoto, Ph.D., is a Senior Researcher at the National Institute of Information and Communications Technology, Japan. He received the B.E. and M.E. in Electrical Engineering, and the Ph.D. in Informatics from Kyoto University in 1994, 1996, and 2004, respectively. He is a member of the Association for Natural Language Processing, the Information Processing Society of Japan, and the Association for Computational Linguistics. His main research area is corpus-based natural language processing, specializing in Japanese sentence analysis and generation, information extraction, corpus construction and annotation, and machine translation. He took part in the five-year project "Spontaneous Speech: Corpus and Processing Technology", and was involved in the construction of a large spontaneous Japanese speech corpus, the "Corpus of Spontaneous Japanese", which is the largest spontaneous speech corpus in the world. He also took part in a three-year project constructing a corpus of Japanese learner English tagged with learners' errors.


Tutorial 1:
- Statistical Machine Translation Part I: Hands-On Introduction
- Statistical Machine Translation Part II: Tree-Based SMT (Cancelled)

Tutorial 2:
- Automated Text Summarization
- Semi-automatic corpus annotation based on machine learning models (Cancelled)