Active Learning for NLP (ALNLP)

Workshop at NAACL-HLT 2010

When: Sunday June 6, 2010
Location: Millennium Biltmore Hotel, Los Angeles



The ALNLP 2010 Proceedings are now available from ACL Press.


Labeled training data is required to achieve state-of-the-art performance for many machine learning solutions to NLP tasks. While traditional supervised methods rely exclusively on existing labeled data to induce a model, active learning allows the learner to select unlabeled data for labeling in an effort to reduce annotation costs without sacrificing performance. Thus, active learning appears promising for NLP applications where unlabeled data is readily available (e.g., web pages, audio recordings, minority language data), but obtaining labels is cost-prohibitive.
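The selection step described above is most often realized as pool-based uncertainty sampling: score each unlabeled instance by how unsure the current model is, and query the most uncertain ones. The sketch below is purely illustrative; the logistic toy model and the `least_confidence` scoring function are assumptions for demonstration, not taken from any workshop paper.

```python
import math

def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)

def select_queries(pool, predict_proba, batch_size=2):
    """Rank unlabeled instances by uncertainty; return the top batch for labeling."""
    ranked = sorted(pool, key=lambda x: least_confidence(predict_proba(x)),
                    reverse=True)
    return ranked[:batch_size]

# Toy "classifier": a logistic model whose confidence drops near the
# decision boundary at x = 0 (a stand-in for any probabilistic learner).
def predict_proba(x):
    p = 1.0 / (1.0 + math.exp(-x))
    return [p, 1.0 - p]

pool = [-3.0, -0.2, 0.1, 2.5, 0.05]
queries = select_queries(pool, predict_proba, batch_size=2)
# The instances nearest the boundary (0.05 and 0.1) are queried first.
```

A real system would retrain the model after each labeled batch and repeat the loop until the annotation budget is exhausted.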

Ample recent work has demonstrated the effectiveness of active learning across a diverse range of applications. Despite these findings, active learning has not yet been widely adopted in many ongoing large-scale corpus annotation efforts -- resulting in a dearth of real-world case studies and many open research questions. The machine learning literature has primarily focused on active learning in the context of classification, devoting less attention to issues specific to NLP, including annotation user studies, the incorporation of semantic information, and more complex prediction tasks (e.g., parsing, machine translation). The aim of this workshop is to foster innovation and discussion that advance our understanding of these and other practical issues for active learning in NLP.

Also see the 2009 ALNLP workshop: program / proceedings.


Sunday, June 6, 2010

1:00–1:15  Introduction by Burr Settles and Kevin Small

Invited Talk
1:15–2:10  Active and Proactive Machine Learning: From Fundamentals to Applications in Language Technologies and Beyond
Jaime Carbonell

Research Papers I
2:10–2:35  Using Variance as a Stopping Criterion for Active Learning of Frame Assignment
Masood Ghayoomi
2:35–3:00  Active Semi-Supervised Learning for Improving Word Alignment
Vamshi Ambati, Stephan Vogel and Jaime Carbonell

Research Papers II
3:30–3:55  D-Confidence: An Active Learning Strategy which Efficiently Identifies Small Classes
Nuno Escudeiro and Alipio Jorge
3:55–4:20  Domain Adaptation meets Active Learning
Piyush Rai, Avishek Saha, Hal Daume and Suresh Venkatasubramanian
4:20–4:55  Parallel Active Learning: Eliminating Wait Time with Minimal Staleness
Robbie Haertel, Paul Felt, Eric Ringger and Kevin Seppi

Invited Talk

Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University

Active and Proactive Machine Learning:
From Fundamentals to Applications in Language Technologies and Beyond

Active learning is a well-recognized paradigm for addressing the paucity of labeled data (e.g., topic labels on web pages, parallel text for rare languages) and the relative abundance of unlabeled data (e.g., web crawls, monolingual text). Recent advances include robust ensemble approaches and methods for active rank learning. Although active learning has gained prominence, we advocate a major extension that relaxes restrictive assumptions such as the existence of a single omniscient oracle. Instead, we investigate more realistic settings with multiple potentially fallible or reluctant external information sources of variable cost and unknown reliability; crowdsourcing and games-with-a-purpose are examples of settings with multiple fallible sources. Proactive learning reaches out to external sources, jointly learning source properties (e.g., accuracy, area of expertise, the minimal cost at which they are willing to label) and selecting the most informative sources and instances for the learning task at hand. The proactive sampling methods trade off cost vs. information value vs. reliability, and amortized benefit vs. immediate reward, while remaining largely agnostic to the base-level learning algorithms. We have applied these methods to synthetic data and benchmark test data, and most recently are applying them to new challenges such as low-resource machine translation, coreference resolution, and inferring the human interactome (and host-pathogen interactomes). The presentation will focus on the underlying active and proactive learning methods and touch on these applications.
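One way to picture the cost/reliability trade-off the abstract describes is a utility score over (oracle, instance) pairs. The scoring rule and the oracle parameters below are illustrative assumptions for exposition only, not Carbonell's actual formulation.

```python
def proactive_utility(info_value, oracle_accuracy, oracle_cost):
    """Expected benefit per unit cost: discount an instance's information
    value by the oracle's reliability, then amortize over labeling cost."""
    return (info_value * oracle_accuracy) / oracle_cost

# Hypothetical oracles: (name, estimated accuracy, cost per label)
oracles = [("expert", 0.98, 5.0), ("crowd", 0.75, 1.0)]
info_value = 0.6  # e.g., an uncertainty-based value for one candidate instance

best = max(oracles, key=lambda o: proactive_utility(info_value, o[1], o[2]))
# crowd: 0.6 * 0.75 / 1.0 = 0.45  vs.  expert: 0.6 * 0.98 / 5.0 ≈ 0.12,
# so the cheap-but-fallible oracle wins for this instance.
```

Under this kind of rule, a proactive learner routes easy instances to cheap, noisy sources and reserves expensive experts for instances whose value justifies the cost; in practice, oracle accuracy is itself unknown and must be estimated during learning.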

Bio: Dr. Jaime Carbonell is the Director of the Language Technologies Institute and Allen Newell Professor of Computer Science at Carnegie Mellon University. He received SB degrees in Physics and Mathematics from MIT, and MS and PhD degrees in Computer Science from Yale University. His current research spans virtually all aspects of language technologies: text mining, natural language processing, machine translation, automated summarization (where he invented the MMR search-diversity method), question answering, and more. He is also an expert in machine learning, having edited three books and served as editor-in-chief of the Machine Learning Journal for four years, and he recently invented Proactive Machine Learning, including its underlying decision-theoretic framework. Overall, he has published over 250 articles and books. His research includes text and data mining, corpus-based approaches to machine translation such as context-based MT, reasoning under uncertainty, and computational proteomics, where he applies machine learning and language technologies to predict proteomic 3D structure (a.k.a. "the folding problem") and function from protein primary sequences and biophysical constraints. He also contributes to autonomous agents that learn via observation and experience. Dr. Carbonell has served on multiple governmental advisory committees, including the Human Genome Committee of the National Institutes of Health, the Oak Ridge National Laboratory Scientific Advisory Board, the National Institute of Standards and Technology Interactive Systems Scientific Advisory Board, and the German Research Center for Artificial Intelligence (DFKI) Scientific Advisory Board. He is also the chairman of Carnegie Speech Company, which produces intelligent language tutoring software.


Program Committee