{"id":372044,"date":"2017-03-18T15:35:43","date_gmt":"2017-03-18T22:35:43","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-event&#038;p=372044"},"modified":"2022-08-08T09:28:27","modified_gmt":"2022-08-08T16:28:27","slug":"nips-2008","status":"publish","type":"msr-event","link":"https:\/\/www.microsoft.com\/en-us\/research\/event\/nips-2008\/","title":{"rendered":"Neural Information Processing Systems (NIPS) 2008"},"content":{"rendered":"\n\n\n\n\n<p>This workshop is intended for researchers interested in machine learning methods for speech and language processing and in unifying approaches to several outstanding speech and language processing issues.<\/p>\n\n\n\n<p>In the last few years, significant progress has been made in both research and commercial applications of speech and language processing. Despite these strong empirical results, however, important theoretical issues remain to be addressed. Theoretical advances are expected to drive further improvements in system performance, which in turn generate a need for in-depth studies of novel learning and modeling methodologies. The main goal of this workshop is to address this need, with a focus on the fundamental issues of emerging approaches and their empirical applications in speech and language processing. Another focus of this workshop is the unification of learning approaches to speech and language processing problems. Many problems in speech processing and in language processing share a wide range of similarities (despite conspicuous differences), and techniques in the two fields can be successfully cross-fertilized. 
It is of great interest to study unifying modeling and learning approaches across these two fields.<\/p>\n\n\n\n<p>In summary, we hope that this workshop will present an opportunity for intensive discussions of emerging learning methods among speech processing, language processing, and machine learning researchers, and will inspire unifying approaches to problems across the speech and language processing fields.<\/p>\n\n\n\n<p>The workshop will feature world-class speakers, paper sessions, and lively discussions. Topics of interest include, but are not limited to:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Discriminative training of statistical classifiers<\/li><li>Unsupervised\/semi-supervised learning algorithms for sequential pattern recognition<\/li><li>Model adaptation to new\/diverse conditions<\/li><li>Multi-engine approaches for sequential pattern recognition<\/li><li>Unifying approaches to speech processing and language processing.<\/li><\/ul>\n\n\n\n\n\n\n\n<p>Geoffrey Hinton, University of Toronto<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> I will describe recent developments in learning algorithms for multilevel nonlinear generative models of sequential data. The models are learned greedily, one layer of features at a time, and each additional layer of nonlinear features improves the overall generative model of the data. In earlier work (Taylor et al., 2006) the basic module used for learning each layer of representation was a restricted Boltzmann machine in which both the hidden and visible units have biases that are dynamically determined by previous frames of data. This simple learning module has now been generalized to allow more complicated, multiplicative interactions so that hidden variables at one level can control the interactions between variables at the level below. 
These models have not yet been applied to speech but they work well on other data such as broadcast video and sequences of joint-angles derived from motion capture markers.\u00a0 (Joint work with Roland Memisevic, Graham Taylor and Ilya Sutskever).<\/p>\n\n\n\n\n\n<p>Ralf Schl\u00fcter, RWTH Aachen University&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/Schlueter-NIPS2008-DiscriminativeTraining.pdf\">slides<\/a><\/p>\n\n\n\n<p><strong>Abstract:<\/strong> The objective of this talk is to establish a log-linear modeling framework in the context of discriminative training criteria, with examples from automatic speech recognition and concept tagging. The talk covers three major aspects. First, the acoustic models of state-of-the-art speech recognition systems conventionally use generative Gaussian HMMs. In the past few years, discriminative models such as Conditional Random Fields (CRFs) have been proposed to refine acoustic models. This talk addresses to what extent such less restricted models add flexibility compared with their generative counterparts. Certain equivalence relations between Gaussian and log-linear HMMs are established, including context conditional models. Second, it will be shown how conventional discriminative training criteria in speech recognition such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion can be extended to incorporate a margin term. As a result, large-margin training in speech recognition can be performed using the same efficient algorithms for accumulation and optimization and using the same software as for conventional discriminative training. We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). 
Third, CRFs are often estimated using an entropy-based criterion in combination with Generalized Iterative Scaling (GIS). GIS offers, among others, the immediate advantages that it is locally convergent, completely parameter-free, and guarantees an improvement of the criterion in each step. Here, GIS is extended to allow for training log-linear models with hidden variables and optimization of discriminative training criteria different from Maximum Entropy\/Maximum Mutual Information, including Minimum Phone Error (MPE). Finally, experimental results are provided for different tasks, including the European Parliament Plenary Sessions task as well as Mandarin Broadcasts.<\/p>\n\n\n\n\n\n<p>Mari Ostendorf, University of Washington<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Local learning methods, such as nearest-neighbor and variants, are known to be very powerful for many problems, particularly for problems where good models are not available. They can also be very useful for problems with a high degree of variability over the input space. In language modeling for speech recognition, local learning has not been particularly useful, in part because of the tremendous power of the n-gram when given large amounts of training data, and in part due to the difficulty of defining distance or similarity measures for word sequences.\u00a0 However, language is quite variable, depending on both topic and genre, such that a model trained in one domain may be of little use in another. With the large amount of data available on the web, and the large number of possible topic\/genre combinations, it is of interest to consider local learning for language model adaptation. 
In this talk, we look at leveraging the similarity function in language model adaptation to benefit from a small neighborhood without losing the power of a large training corpus.<\/p>\n\n\n\n\n\n<p>Yunxin Zhao, University of Missouri&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/Zhao-NIPS08-EnsembleAcousticModeling.pdf\">slides<\/a><\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Improving recognition accuracy of human speech by computers has been a long-standing challenge. Over the past few decades, tremendous research efforts have been made on the optimization of acoustic models. On the other hand, ensemble classifier design is becoming an important direction in machine learning. Unlike the commonly adopted approach of optimizing a single classifier, ensemble methods achieve pattern discrimination through synergistically combining many classifiers that are complementary in nature. Ensemble methods have shown advantages in classification accuracy and robustness in a variety of application contexts. Aligned with this direction, combining output word hypotheses from multiple speech recognition systems is being increasingly used in ASR to boost accuracy.\u00a0 Nonetheless, the complexity of speech sound distributions warrants the exploration of using ensemble methods to build robust and accurate acoustic models, where the component models of an ensemble can be combined in computing the acoustic scores during decoding search, for example, at the speech frame level, so that a single recognition system suffices.\u00a0 Recently, innovative progress has been made in this direction, producing promising results and revealing attractive properties of ensemble acoustic models.\u00a0 This talk will address several basic issues in ensemble acoustic modeling, including constructing acoustic model ensembles, combining acoustic models in an ensemble, measuring the ensemble quality, etc. 
Experimental findings will be provided for a conversational speech recognition task, and research opportunities along this path will be discussed.<\/p>\n\n\n\n\n\n<p>Jeff Bilmes, University of Washington<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Speech recognition systems have historically utilized essentially one of two decoding strategies. Stack decoding (also called asynchronous decoding) allows internal decoding hypotheses to exist that have an end-time that spans over a potentially wide range of time frames. Such strategies are amenable to techniques such as A* search, assuming a reasonable continuation heuristic is available. An alternate decoding strategy is the time-synchronous approach, whereby every active hypothesis has a similar or identical ending time. In this talk, we relate these two decoding strategies to inference procedures in dynamic graphical models (which include Dynamic Bayesian networks and hidden conditional random fields). In particular, we see that under a hybrid search\/belief-propagation inference scheme, the underlying triangulation of the graph determines which of the above two decoding strategies are active. The triangulation, moreover, also suggests decoding strategies that lie somewhere between strictly synchronous and asynchronous approaches.<\/p>\n\n\n\n\n\n<p>Pedro Domingos, University of Washington&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nwsl.pdf\">slides<\/a><\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Language processing systems typically have a pipeline architecture, where errors accumulate as information progresses through the pipeline. The ideal solution is to perform fully joint learning and inference across all stages of the pipeline (part-of-speech tagging, parsing, coreference resolution, semantic role labeling, etc.). 
To make this possible without collapsing under the weight of complexity, we need a modeling language that provides a common representation for all the stages and makes it easy to combine them. Markov logic networks accomplish this by attaching weights to formulas in first-order logic and viewing them as templates for features of Markov random fields. In this talk, I will describe some of the main inference and learning algorithms for Markov logic, show how Markov logic can be used to implement an end-to-end NLP system, and present the state-of-the-art results we have obtained with the components we have implemented so far.<\/p>\n\n\n\n\n\n<p>Robert Moore, Microsoft Research<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Bilingual word alignment is the task of identifying the word tokens that are translations of each other in a corpus of sentence pairs that are translations of each other.\u00a0 After being dominated by generative models since the early 1990s, beginning in 2005 this task has been addressed by a number of discriminative approaches, resulting in substantially reduced alignment error rates.\u00a0 In most cases, these discriminative approaches have used a few hundred parallel sentence pairs with word alignments annotated, plus hundreds of thousands of parallel sentence pairs with no word-level annotation, making this task a prime example of semi-supervised learning.\u00a0 In this talk, we will look in detail at some of the machine learning issues in one of the most successful efforts at discriminative word alignment, including benefits of stacking of learners and refinements of the averaged perceptron approach to learning classifiers with structured outputs.<\/p>\n\n\n\n\n\n<p>Andreas Stolcke, SRI International&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/NIPS08-workshop-stolcke-revised.pdf\">slides<\/a><\/p>\n\n\n\n<p><strong>Abstract:<\/strong> This talk will review some of the main ML techniques 
employed in state-of-the-art speaker recognition systems, in terms of both modeling and feature design.\u00a0 For modeling, the two main paradigms currently in use are Gaussian mixture models with joint factor analysis, and support vector machines.\u00a0 The latter in particular have enabled a wealth of approaches that model speakers via high-dimensional feature vectors drawn from a wide range of observation spaces, including cepstral, phonetic, prosodic, and lexical features. A pervasive problem in feature design is how to collapse a variable-length stream of observations into a fixed-length feature vector. SVM kernels designed for this situation are based on features generated by polynomial expansion, N-gram frequencies, and GMM mixture weights. Other issues include parameter smoothing (prior modeling) and model combination. It is hoped that the talk will give a glimpse into a fascinating application domain for machine learning methods, and encourage ML researchers to contribute to advances in speaker recognition.<\/p>\n\n\n\n\n\n\n\n\n\n<p><strong>Authors:<\/strong>\u00a0 Dong Yu, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\">Li Deng<\/a>, and Alex Acero<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> We present the maximum entropy (MaxEnt) model with continuous features. We show that for continuous features, the weights should be continuous functions rather than single values. 
We propose a spline-interpolation-based solution to the optimization problem that contains continuous weights and illustrate that the optimization problem can be converted into a standard log-linear one without continuous weights in a higher-dimensional space.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_01.pdf\">[PDF]<\/a><\/p>\n\n\n\n\n\n<p><strong>Authors:<\/strong>\u00a0 Sangyun Hahn and Mari Ostendorf<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Recently, semi-supervised learning has been an active research topic in the natural language processing community, to save effort in hand-labeling for data-driven learning and to exploit a large amount of readily available unlabeled text. In this paper, we apply EM-based semi-supervised learning algorithms such as traditional EM, co-EM, and cross-validation EM to the task of agreement\/disagreement classification of multi-party conversational speech, using discriminative models such as support vector machines and multi-layer perceptrons. We experimentally compare and discuss their advantages and weaknesses when used with different amounts of unlabeled data.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_02.pdf\">[PDF]<\/a><\/p>\n\n\n\n\n\n<p><strong>Authors:<\/strong>\u00a0 Hui Lin, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\">Li Deng<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jdroppo\/\">Jasha Droppo<\/a>, Dong Yu, and Alex Acero<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> One key issue in developing learning methods for multilingual acoustic modeling in large vocabulary automatic speech recognition (ASR) applications is to maximize the benefit of boosting the acoustic training data from multiple source languages while minimizing the negative effects of data impurity arising from language \u201cmismatch\u201d. 
In this paper, we introduce two learning methods, semi-automatic unit selection and a global phonetic decision tree, to address this issue via effective utilization of acoustic data from multiple languages. The semi-automatic unit selection aims to combine the merits of both data-driven and knowledge-driven approaches to identifying the basic units in multilingual acoustic modeling. The global decision-tree method allows clustering of cross-center phones and cross-center states in the HMMs, offering the potential to discover a better sharing structure beneath the mixed acoustic dynamics and context mismatch caused by the use of multiple languages\u2019 acoustic data. Our preliminary experimental results show that both of these learning methods improve the performance of multilingual speech recognition.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_03.pdf\">[PDF]<\/a><\/p>\n\n\n\n\n\n<p><strong>Authors:<\/strong>\u00a0 Matthew Miller and Alexander Stoytchev<\/p>\n\n\n\n<p><strong>Abstract:<\/strong> Human beings have an apparently innate ability to segment continuous audio speech into words, and that ability is present in infants as young as 8 months old. This propensity towards audio segmentation seems to lay the groundwork for language learning in human beings. To artificially reproduce this ability would be both practically useful and theoretically enlightening. In this paper we propose an algorithm for the unsupervised segmentation of audio speech, based on the Voting Experts (VE) algorithm, which was originally designed to segment sequences of discrete tokens into categorical episodes. We demonstrate that our procedure is capable of inducing breaks with an accuracy substantially greater than chance, and suggest possible avenues of exploration to further increase the segmentation quality. 
We also show that this algorithm can reproduce results obtained from segmentation experiments performed with 8-month-old infants.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_04.pdf\">[PDF]<\/a><\/p>\n\n\n\n\n\n<p><strong>Authors:<\/strong>\u00a0 Yong Zhao and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiaohe\/\">Xiaodong He<\/a><\/p>\n\n\n\n<p><strong>Abstract:<\/strong> This paper proposes using n-gram posterior probabilities, which are estimated over translation hypotheses from multiple machine translation (MT) systems, to improve the performance of system combination. Two ways of using n-gram posteriors in confusion network decoding are presented. The first is based on an n-gram posterior language model per source sentence, and the second, called n-gram segment voting, boosts word posterior probabilities with n-gram occurrence frequencies. The two n-gram posterior methods are incorporated in the confusion network as individual features of a log-linear combination model. Experiments on the Chinese-to-English MT task show that both methods yield significant improvements in translation performance, and a combination of these two features produces the best translation performance.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_05.pdf\">[PDF]<\/a><\/p>\n\n\n\n\n","protected":false},"excerpt":{"rendered":"<p>This workshop is intended for researchers interested in machine learning methods for speech and language processing and in unifying approaches to several outstanding speech and language processing issues. In the last few years, significant progress has been made in both research and commercial applications of speech and language processing. 
Despite these strong empirical results, however, [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr_startdate":"2008-12-12","msr_enddate":"2008-12-12","msr_location":"Whistler, British Columbia, Canada","msr_expirationdate":"","msr_event_recording_link":"","msr_event_link":"https:\/\/nips.cc\/","msr_event_link_redirect":false,"msr_event_time":"","msr_hide_region":false,"msr_private_event":false,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[13545],"msr-region":[],"msr-event-type":[210063],"msr-video-type":[],"msr-locale":[268875],"msr-program-audience":[],"msr-post-option":[],"msr-impact-theme":[],"class_list":["post-372044","msr-event","type-msr-event","status-publish","hentry","msr-research-area-human-language-technologies","msr-event-type-workshop","msr-locale-en_us"],"msr_about":"<!-- wp:msr\/event-details {\"title\":\"Neural Information Processing Systems (NIPS) 2008\"} \/-->\n\n<!-- wp:msr\/content-tabs -->\n<!-- wp:msr\/content-tab {\"title\":\"About\"} -->\n<!-- wp:paragraph -->\n<p>This workshop is intended for researchers interested in machine learning methods for speech and language processing and in unifying approaches to several outstanding speech and language processing issues.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In the last few years, significant progress has been made in both research and commercial applications of speech and language processing. Despite these strong empirical results, however, important theoretical issues remain to be addressed. Theoretical advances are expected to drive further improvements in system performance, which in turn generate a need for in-depth studies of novel learning and modeling methodologies. 
The main goal of this workshop is to address this need, with a focus on the fundamental issues of emerging approaches and their empirical applications in speech and language processing. Another focus of this workshop is the unification of learning approaches to speech and language processing problems. Many problems in speech processing and in language processing share a wide range of similarities (despite conspicuous differences), and techniques in the two fields can be successfully cross-fertilized. It is of great interest to study unifying modeling and learning approaches across these two fields.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In summary, we hope that this workshop will present an opportunity for intensive discussions of emerging learning methods among speech processing, language processing, and machine learning researchers, and will inspire unifying approaches to problems across the speech and language processing fields.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The workshop will feature world-class speakers, paper sessions, and lively discussions. 
Topics of interest include, but are not limited to:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul><li>Discriminative training of statistical classifiers<\/li><li>Unsupervised\/semi-supervised learning algorithms for sequential pattern recognition<\/li><li>Model adaptation to new\/diverse conditions<\/li><li>Multi-engine approaches for sequential pattern recognition<\/li><li>Unifying approaches to speech processing and language processing.<\/li><\/ul>\n<!-- \/wp:list -->\n<!-- \/wp:msr\/content-tab -->\n\n<!-- wp:msr\/content-tab {\"title\":\"Invited Talks\"} -->\n<!-- wp:moray\/accordion -->\n<!-- wp:moray\/accordion-item {\"title\":\"New Multi-level Models for High-dimensional Sequential Data\"} -->\n<!-- wp:paragraph -->\n<p>Geoffrey Hinton, University of Toronto<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> I will describe recent developments in learning algorithms for multilevel nonlinear generative models of sequential data. The models are learned greedily, one layer of features at a time, and each additional layer of nonlinear features improves the overall generative model of the data. In earlier work (Taylor et al., 2006) the basic module used for learning each layer of representation was a restricted Boltzmann machine in which both the hidden and visible units have biases that are dynamically determined by previous frames of data. This simple learning module has now been generalized to allow more complicated, multiplicative interactions so that hidden variables at one level can control the interactions between variables at the level below. 
These models have not yet been applied to speech but they work well on other data such as broadcast video and sequences of joint-angles derived from motion capture markers.\u00a0 (Joint work with Roland Memisevic, Graham Taylor and Ilya Sutskever).<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Log-linear Approach to Discriminative Training\"} -->\n<!-- wp:paragraph -->\n<p>Ralf Schl\u00fcter, RWTH Aachen University&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/Schlueter-NIPS2008-DiscriminativeTraining.pdf\">slides<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> The objective of this talk is to establish a log-linear modeling framework in the context of discriminative training criteria, with examples from automatic speech recognition and concept tagging. The talk covers three major aspects. First, the acoustic models of state-of-the-art speech recognition systems conventionally use generative Gaussian HMMs. In the past few years, discriminative models such as Conditional Random Fields (CRFs) have been proposed to refine acoustic models. This talk addresses to what extent such less restricted models add flexibility compared with their generative counterparts. Certain equivalence relations between Gaussian and log-linear HMMs are established, including context conditional models. Second, it will be shown how conventional discriminative training criteria in speech recognition such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion can be extended to incorporate a margin term. As a result, large-margin training in speech recognition can be performed using the same efficient algorithms for accumulation and optimization and using the same software as for conventional discriminative training. 
We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). Third, CRFs are often estimated using an entropy-based criterion in combination with Generalized Iterative Scaling (GIS). GIS offers, among others, the immediate advantages that it is locally convergent, completely parameter-free, and guarantees an improvement of the criterion in each step. Here, GIS is extended to allow for training log-linear models with hidden variables and optimization of discriminative training criteria different from Maximum Entropy\/Maximum Mutual Information, including Minimum Phone Error (MPE). Finally, experimental results are provided for different tasks, including the European Parliament Plenary Sessions task as well as Mandarin Broadcasts.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"On the Role of Local Learning for Language Modeling\"} -->\n<!-- wp:paragraph -->\n<p>Mari Ostendorf, University of Washington<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Local learning methods, such as nearest-neighbor and variants, are known to be very powerful for many problems, particularly for problems where good models are not available. They can also be very useful for problems with a high degree of variability over the input space. In language modeling for speech recognition, local learning has not been particularly useful, in part because of the tremendous power of the n-gram when given large amounts of training data, and in part due to the difficulty of defining distance or similarity measures for word sequences.\u00a0 However, language is quite variable, depending on both topic and genre, such that a model trained in one domain may be of little use in another. 
With the large amount of data available on the web, and the large number of possible topic\/genre combinations, it is of interest to consider local learning for language model adaptation. In this talk, we look at leveraging the similarity function in language model adaptation to benefit from a small neighborhood without losing the power of a large training corpus.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Ensemble Machine Learning Methods for Acoustic Modeling of Speech\"} -->\n<!-- wp:paragraph -->\n<p>Yunxin Zhao, University of Missouri&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/Zhao-NIPS08-EnsembleAcousticModeling.pdf\">slides<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Improving recognition accuracy of human speech by computers has been a long-standing challenge. Over the past few decades, tremendous research efforts have been made on the optimization of acoustic models. On the other hand, ensemble classifier design is becoming an important direction in machine learning. Unlike the commonly adopted approach of optimizing a single classifier, ensemble methods achieve pattern discrimination through synergistically combining many classifiers that are complementary in nature. Ensemble methods have shown advantages in classification accuracy and robustness in a variety of application contexts. 
Aligned with this direction, combining output word hypotheses from multiple speech recognition systems is being increasingly used in ASR to boost accuracy.\u00a0 Nonetheless, the complexity of speech sound distributions warrants the exploration of using ensemble methods to build robust and accurate acoustic models, where the component models of an ensemble can be combined in computing the acoustic scores during decoding search, for example, at the speech frame level, so that a single recognition system suffices.\u00a0 Recently, innovative progress has been made in this direction, producing promising results and revealing attractive properties of ensemble acoustic models.\u00a0 This talk will address several basic issues in ensemble acoustic modeling, including constructing acoustic model ensembles, combining acoustic models in an ensemble, measuring the ensemble quality, etc. Experimental findings will be provided for a conversational speech recognition task, and research opportunities along this path will be discussed.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Relations Between Graph Triangulation, Stack Decoding, and Synchronous Decoding\"} -->\n<!-- wp:paragraph -->\n<p>Jeff Bilmes, University of Washington<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Speech recognition systems have historically utilized essentially one of two decoding strategies. Stack decoding (also called asynchronous decoding) allows internal decoding hypotheses to exist that have an end-time that spans over a potentially wide range of time frames. Such strategies are amenable to techniques such as A* search, assuming a reasonable continuation heuristic is available. An alternate decoding strategy is the time-synchronous approach, whereby every active hypothesis has a similar or identical ending time. 
In this talk, we relate these two decoding strategies to inference procedures in dynamic graphical models (which include Dynamic Bayesian networks and hidden conditional random fields). In particular, we see that under a hybrid search\/belief-propagation inference scheme, the underlying triangulation of the graph determines which of the above two decoding strategies is active. The triangulation, moreover, also suggests decoding strategies that lie somewhere between strictly synchronous and asynchronous approaches.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Markov Logic Networks: A Unified Approach to Language Processing\"} -->\n<!-- wp:paragraph -->\n<p>Pedro Domingos, University of Washington&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nwsl.pdf\">slides<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Language processing systems typically have a pipeline architecture, where errors accumulate as information progresses through the pipeline. The ideal solution is to perform fully joint learning and inference across all stages of the pipeline (part-of-speech tagging, parsing, coreference resolution, semantic role labeling, etc.). To make this possible without collapsing under the weight of complexity, we need a modeling language that provides a common representation for all the stages and makes it easy to combine them. Markov logic networks accomplish this by attaching weights to formulas in first-order logic and viewing them as templates for features of Markov random fields. 
In this talk, I will describe some of the main inference and learning algorithms for Markov logic, show how Markov logic can be used to implement an end-to-end NLP system, and present the state-of-the-art results we have obtained with the components we have implemented so far.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Some Machine Learning Issues in Discriminative Bilingual Word Alignment\"} -->\n<!-- wp:paragraph -->\n<p>Robert Moore, Microsoft Research<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Bilingual word alignment is the task of identifying the word tokens that are translations of each other in a corpus of sentence pairs that are translations of each other.\u00a0 After being dominated by generative models since the early 1990s, beginning in 2005 this task has been addressed by a number of discriminative approaches, resulting in substantially reduced alignment error rates.\u00a0 In most cases, these discriminative approaches have used a few hundred parallel sentence pairs with word alignments annotated, plus hundreds of thousands of parallel sentence pairs with no word-level annotation, making this task a prime example of semi-supervised learning.\u00a0 In this talk, we will look in detail at some of the machine learning issues in one of the most successful efforts at discriminative word alignment, including benefits of stacking of learners and refinements of the averaged perceptron approach to learning classifiers with structured outputs.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Machine Learning for Speaker Recognition\"} -->\n<!-- wp:paragraph -->\n<p>Andreas Stolcke, SRI International&nbsp; |&nbsp; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/NIPS08-workshop-stolcke-revised.pdf\">slides<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- 
wp:paragraph -->\n<p><strong>Abstract:<\/strong> This talk will review some of the main ML techniques employed in state-of-the-art speaker recognition systems, in terms of both modeling and feature design.\u00a0 For modeling, the two main paradigms currently in use are Gaussian mixture models with joint factor analysis, and support vector machines.\u00a0 The latter in particular have enabled a wealth of approaches that model speakers via high dimensional feature vectors drawn from a wide range of observation spaces, including cepstral, phonetic, prosodic, and lexical features. A pervasive problem in feature design is how to collapse a variable-length stream of observations into a fixed-length feature vector. SVM kernels designed for this situation are based on features generated by polynomial expansion, N-gram frequencies, and GMM mixture weights. Miscellaneous other issues include parameter smoothing (prior modeling) and model combination. It is hoped that the talk will give a glimpse into a fascinating application domain for machine learning methods, and instigate ML researchers to contribute to advances in speaker recognition.<\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n<!-- \/wp:moray\/accordion -->\n<!-- \/wp:msr\/content-tab -->\n\n<!-- wp:msr\/content-tab {\"title\":\"Accepted Papers\"} -->\n<!-- wp:moray\/accordion -->\n<!-- wp:moray\/accordion-item {\"title\":\"The Maximum Entropy Model with Continuous Features\"} -->\n<!-- wp:paragraph -->\n<p><strong>Authors:<\/strong>\u00a0 Dong Yu, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\">Li Deng<\/a>, and Alex Acero<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> We present the maximum entropy (MaxEnt) model with continuous features. We show that for the continuous features the weights should be continuous functions instead of single values. 
We propose a spline-interpolation-based solution to the optimization problem that contains continuous weights, and illustrate that the optimization problem can be converted into a standard log-linear one without continuous weights in a higher-dimensional space.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_01.pdf\">[PDF]<\/a><\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"A Comparison of Discriminative EM-Based Semi-Supervised Learning Algorithms on Agreement\/Disagreement Classification\"} -->\n<!-- wp:paragraph -->\n<p><strong>Authors:<\/strong>\u00a0 Sangyun Hahn and Mari Ostendorf<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Recently, semi-supervised learning has been an active research topic in the natural language processing community, to save effort in hand-labeling for data-driven learning and to exploit a large amount of readily available unlabeled text. In this paper, we apply EM-based semi-supervised learning algorithms such as traditional EM, co-EM, and cross-validation EM to the task of agreement\/disagreement classification of multi-party conversational speech, using discriminative models such as support vector machines and multi-layer perceptrons. 
We experimentally compare and discuss their advantages and weaknesses when used with different amounts of unlabeled data.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_02.pdf\">[PDF]<\/a><\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Learning Methods in Multilingual Speech Recognition\"} -->\n<!-- wp:paragraph -->\n<p><strong>Authors:<\/strong>\u00a0 Hui Lin, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\">Li Deng<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jdroppo\/\">Jasha Droppo<\/a>, Dong Yu, and Alex Acero<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> One key issue in developing learning methods for multilingual acoustic modeling in large vocabulary automatic speech recognition (ASR) applications is to maximize the benefit of boosting the acoustic training data from multiple source languages while minimizing the negative effects of data impurity arising from language \u201cmismatch\u201d. In this paper, we introduce two learning methods, semi-automatic unit selection and a global phonetic decision tree, to address this issue via effective utilization of acoustic data from multiple languages. The semi-automatic unit selection aims to combine the merits of both data-driven and knowledge-driven approaches to identifying the basic units in multilingual acoustic modeling. The global decision-tree method allows clustering of cross-center phones and cross-center states in the HMMs, offering the potential to discover a better sharing structure beneath the mixed acoustic dynamics and context mismatch caused by the use of multiple languages\u2019 acoustic data. 
Our preliminary experimental results show that both of these learning methods improve the performance of multilingual speech recognition.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_03.pdf\">[PDF]<\/a><\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"Unsupervised Audio Speech Segmentation Using the Voting Experts Algorithm\"} -->\n<!-- wp:paragraph -->\n<p><strong>Authors:<\/strong>\u00a0 Matthew Miller and Alexander Stoytchev<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> Human beings have an apparently innate ability to segment continuous audio speech into words, and that ability is present in infants as young as 8 months old. This propensity towards audio segmentation seems to lay the groundwork for language learning in human beings. To artificially reproduce this ability would be both practically useful and theoretically enlightening. In this paper we propose an algorithm for the unsupervised segmentation of audio speech, based on the Voting Experts (VE) algorithm, which was originally designed to segment sequences of discrete tokens into categorical episodes. We demonstrate that our procedure is capable of inducing breaks with an accuracy substantially greater than chance, and suggest possible avenues of exploration to further increase the segmentation quality. 
We also show that this algorithm can reproduce results obtained from segmentation experiments performed with 8-month-old infants.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_04.pdf\">[PDF]<\/a><\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n\n<!-- wp:moray\/accordion-item {\"title\":\"System Combination for Machine Translation Using N-Gram Posterior Probabilities\"} -->\n<!-- wp:paragraph -->\n<p><strong>Authors:<\/strong>\u00a0 Yong Zhao and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiaohe\/\">Xiaodong He<\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Abstract:<\/strong> This paper proposes using n-gram posterior probabilities, which are estimated over translation hypotheses from multiple machine translation (MT) systems, to improve the performance of system combination. Two ways of using n-gram posteriors in confusion network decoding are presented. The first is based on an n-gram posterior language model per source sentence, and the second, called n-gram segment voting, boosts word posterior probabilities with n-gram occurrence frequencies. The two n-gram posterior methods are incorporated into the confusion network as individual features of a log-linear combination model. 
Experiments on the Chinese-to-English MT task show that both methods yield significant improvements in translation performance, and a combination of these two features produces the best translation performance.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_05.pdf\">[PDF]<\/a><\/p>\n<!-- \/wp:paragraph -->\n<!-- \/wp:moray\/accordion-item -->\n<!-- \/wp:moray\/accordion -->\n<!-- \/wp:msr\/content-tab -->\n<!-- \/wp:msr\/content-tabs -->","tab-content":[{"id":0,"name":"About","content":"This workshop is intended for researchers interested in machine learning methods for speech and language processing and in unifying approaches to several outstanding speech and language processing issues.\r\n\r\nIn the last few years, significant progress has been made in both research and commercial applications of speech and language processing. Despite the superior empirical results, however, there remain important theoretical issues to be addressed. Theoretical advancement is expected to drive greater system performance improvement, which in turn generates a new need for in-depth studies of emerging novel learning and modeling methodologies. The main goal of this workshop is to fill this need, with the main focus on the fundamental issues of new emerging approaches and empirical applications in speech and language processing. Another focus of this workshop is on the unification of learning approaches to speech and language processing problems. Many problems in speech processing and in language processing share a wide range of similarities (despite conspicuous differences), and techniques in speech and language processing fields can be successfully cross-fertilized. 
It is of great interest to study unifying modeling and learning approaches across these two fields.\r\n\r\nIn summary, we hope that this workshop will present an opportunity for intensive discussions of emerging learning methods among speech processing, language processing, and machine learning researchers, and will inspire unifying approaches to problems across the speech and language processing fields.\r\n\r\nThe workshop will feature world-class speakers, paper sessions, and lively discussions. We expect, but are not limited to, the following topics in the workshop:\r\n\r\n-Discriminative training of statistical classifiers\r\n-Unsupervised\/semi-supervised learning algorithms for sequential pattern recognition\r\n-Model adaptation to new\/diverse conditions\r\n-Multi-engine approaches for sequential pattern recognition\r\n-Unifying approaches to speech processing and language processing."},{"id":1,"name":"Invited Talks","content":"[accordion]\r\n[panel header=\"New Multi-level Models for High-dimensional Sequential Data\"]\r\nGeoffrey Hinton, University of Toronto\r\n\r\nAbstract: I will describe recent developments in learning algorithms for multilevel nonlinear generative models of sequential data. The models are learned greedily, one layer of features at a time, and each additional layer of nonlinear features improves the overall generative model of the data. In earlier work (Taylor et al. 2006) the basic module used for learning each layer of representation was a restricted Boltzmann machine in which both the hidden and visible units have biases that are dynamically determined by previous frames of data. This simple learning module has now been generalized to allow more complicated, multiplicative interactions so that hidden variables at one level can control the interactions between variables at the level below. 
These models have not yet been applied to speech but they work well on other data such as broadcast video and sequences of joint-angles derived from motion capture markers.\u00a0 (Joint work with Roland Memisevic, Graham Taylor and Ilya Sutskever).\r\n[\/panel]\r\n[panel header=\"Log-linear Approach to Discriminative Training\"]\r\nRalf Schl\u00fcter, RWTH Aachen University\u00a0 |\u00a0 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/Schlueter-NIPS2008-DiscriminativeTraining.pdf\">slides<\/a>\r\n\r\nAbstract: The objective of this talk is to establish a log-linear modeling framework in the context of discriminative training criteria, with examples from automatic speech recognition and concept tagging. The talk covers three major aspects. First, the acoustic models of state-of-the-art speech recognition systems conventionally use generative Gaussian HMMs. In the past few years, discriminative models such as Conditional Random Fields (CRFs) have been proposed to refine acoustic models. This talk addresses the extent to which such less restricted models add flexibility compared with their generative counterparts. Certain equivalence relations between Gaussian and log-linear HMMs are established, including context conditional models. Second, it will be shown how conventional discriminative training criteria in speech recognition, such as the Minimum Phone Error criterion or the Maximum Mutual Information criterion, can be extended to incorporate a margin term. As a result, large-margin training in speech recognition can be performed using the same efficient algorithms for accumulation and optimization, and using the same software, as for conventional discriminative training. We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions, approximating the non-smooth hinge loss function or the hard error (e.g. phone error). 
Third, CRFs are often estimated using an entropy-based criterion in combination with Generalized Iterative Scaling (GIS). GIS offers, among others, the immediate advantages that it is locally convergent, completely parameter-free, and guarantees an improvement of the criterion in each step. Here, GIS is extended to allow for training log-linear models with hidden variables and optimization of discriminative training criteria different from Maximum Entropy\/Maximum Mutual Information, including Minimum Phone Error (MPE). Finally, experimental results are provided for different tasks, including the European Parliament Plenary Sessions task as well as Mandarin Broadcasts.\r\n[\/panel]\r\n\r\n[panel header=\"On the Role of Local Learning for Language Modeling\"]\r\nMari Ostendorf, University of Washington\r\n\r\nAbstract: Local learning methods, such as nearest-neighbor and variants, are known to be very powerful for many problems, particularly for problems where good models are not available. They can also be very useful for problems with a high degree of variability over the input space. In language modeling for speech recognition, local learning has not been particularly useful, in part because of the tremendous power of the n-gram when given large amounts of training data, and in part due to the difficulty of defining distance or similarity measures for word sequences.\u00a0 However, language is quite variable, depending on both topic and genre, such that a model trained in one domain may be of little use in another. With the large amount of data available on the web, and the large number of possible topic\/genre combinations, it is of interest to consider local learning for language model adaptation. 
In this talk, we look at leveraging the similarity function in language model adaptation to benefit from a small neighborhood without losing the power of a large training corpus.\r\n[\/panel]\r\n\r\n[panel header=\"Ensemble Machine Learning Methods for Acoustic Modeling of Speech\"]\r\nYunxin Zhao, University of Missouri\u00a0 |\u00a0 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/Zhao-NIPS08-EnsembleAcousticModeling.pdf\">slides<\/a>\r\n\r\nAbstract: Improving recognition accuracy of human speech by computers has been a long-standing challenge. Over the past few decades, tremendous research efforts have been made on the optimization of acoustic models. On the other hand, ensemble classifier design is becoming an important direction in machine learning. In contrast to the commonly adopted approach of optimizing a single classifier, ensemble methods achieve pattern discrimination by synergistically combining many classifiers that are complementary in nature. Ensemble methods have shown advantages in classification accuracy and robustness in a variety of application contexts. 
Aligned with this direction, combining output word hypotheses from multiple speech recognition systems is increasingly used in ASR to boost accuracy.\u00a0 Nonetheless, the complexity of speech sound distributions warrants exploring ensemble methods to build robust and accurate acoustic models, where the component models of an ensemble can be combined when computing acoustic scores during the decoding search, for example at the speech frame level, so that a single recognition system suffices.\u00a0 Recently, innovative progress has been made in this direction, producing promising results and revealing attractive properties of ensemble acoustic models.\u00a0 This talk will address several basic issues in ensemble acoustic modeling, including constructing acoustic model ensembles, combining acoustic models in an ensemble, and measuring the ensemble quality. Experimental findings will be provided for a conversational speech recognition task, and research opportunities along this path will be discussed.\r\n[\/panel]\r\n\r\n[panel header=\"Relations Between Graph Triangulation, Stack Decoding, and Synchronous Decoding\"]\r\nJeff Bilmes, University of Washington\r\n\r\nAbstract: Speech recognition systems have historically utilized essentially one of two decoding strategies. Stack decoding (also called asynchronous decoding) allows internal decoding hypotheses whose end times span a potentially wide range of time frames. Such strategies are amenable to techniques such as A*-search, assuming a reasonable continuation heuristic is available. An alternative decoding strategy is the time-synchronous approach, whereby every active hypothesis has a similar or identical ending time. In this talk, we relate these two decoding strategies to inference procedures in dynamic graphical models (which include Dynamic Bayesian networks and hidden conditional random fields). 
In particular, we see that under a hybrid search\/belief-propagation inference scheme, the underlying triangulation of the graph determines which of the above two decoding strategies is active. The triangulation, moreover, also suggests decoding strategies that lie somewhere between strictly synchronous and asynchronous approaches.\r\n[\/panel]\r\n\r\n[panel header=\"Markov Logic Networks: A Unified Approach to Language Processing\"]\r\nPedro Domingos, University of Washington\u00a0 |\u00a0 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nwsl.pdf\">slides<\/a>\r\n\r\nAbstract: Language processing systems typically have a pipeline architecture, where errors accumulate as information progresses through the pipeline. The ideal solution is to perform fully joint learning and inference across all stages of the pipeline (part-of-speech tagging, parsing, coreference resolution, semantic role labeling, etc.). To make this possible without collapsing under the weight of complexity, we need a modeling language that provides a common representation for all the stages and makes it easy to combine them. Markov logic networks accomplish this by attaching weights to formulas in first-order logic and viewing them as templates for features of Markov random fields. 
In this talk, I will describe some of the main inference and learning algorithms for Markov logic, show how Markov logic can be used to implement an end-to-end NLP system, and present the state-of-the-art results we have obtained with the components we have implemented so far.\r\n[\/panel]\r\n\r\n[panel header=\"Some Machine Learning Issues in Discriminative Bilingual Word Alignment\"]\r\nRobert Moore, Microsoft Research\r\n\r\nAbstract: Bilingual word alignment is the task of identifying the word tokens that are translations of each other in a corpus of sentence pairs that are translations of each other.\u00a0 After being dominated by generative models since the early 1990s, beginning in 2005 this task has been addressed by a number of discriminative approaches, resulting in substantially reduced alignment error rates.\u00a0 In most cases, these discriminative approaches have used a few hundred parallel sentence pairs with word alignments annotated, plus hundreds of thousands of parallel sentence pairs with no word-level annotation, making this task a prime example of semi-supervised learning.\u00a0 In this talk, we will look in detail at some of the machine learning issues in one of the most successful efforts at discriminative word alignment, including benefits of stacking of learners and refinements of the averaged perceptron approach to learning classifiers with structured outputs.\r\n[\/panel]\r\n\r\n[panel header=\"Machine Learning for Speaker Recognition\"]\r\nAndreas Stolcke, SRI International\u00a0 |\u00a0 <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/NIPS08-workshop-stolcke-revised.pdf\">slides<\/a>\r\n\r\nAbstract: This talk will review some of the main ML techniques employed in state-of-the-art speaker recognition systems, in terms of both modeling and feature design.\u00a0 For modeling, the two main paradigms currently in use are Gaussian mixture models with joint factor analysis, and support vector 
machines.\u00a0 The latter in particular have enabled a wealth of approaches that model speakers via high dimensional feature vectors drawn from a wide range of observation spaces, including cepstral, phonetic, prosodic, and lexical features. A pervasive problem in feature design is how to collapse a variable-length stream of observations into a fixed-length feature vector. SVM kernels designed for this situation are based on features generated by polynomial expansion, N-gram frequencies, and GMM mixture weights. Miscellaneous other issues include parameter smoothing (prior modeling) and model combination. It is hoped that the talk will give a glimpse into a fascinating application domain for machine learning methods, and instigate ML researchers to contribute to advances in speaker recognition.\r\n[\/panel]\r\n\r\n[\/accordion]"},{"id":2,"name":"Accepted Papers","content":"[accordion]\r\n[panel header=\"The Maximum Entropy Model with Continuous Features\"]\r\nAuthors:\u00a0 Dong Yu, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\">Li Deng<\/a>, and Alex Acero\r\n\r\nAbstract: We present the maximum entropy (MaxEnt) model with continuous features. We show that for the continuous features the weights should be continuous functions instead of single values. 
We propose a spline-interpolation-based solution to the optimization problem that contains continuous weights, and illustrate that the optimization problem can be converted into a standard log-linear one without continuous weights in a higher-dimensional space.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_01.pdf\">[PDF]<\/a>\r\n[\/panel]\r\n\r\n[panel header=\"A Comparison of Discriminative EM-Based Semi-Supervised Learning Algorithms on Agreement\/Disagreement Classification\"]\r\nAuthors:\u00a0 Sangyun Hahn and Mari Ostendorf\r\n\r\nAbstract: Recently, semi-supervised learning has been an active research topic in the natural language processing community, to save effort in hand-labeling for data-driven learning and to exploit a large amount of readily available unlabeled text. In this paper, we apply EM-based semi-supervised learning algorithms such as traditional EM, co-EM, and cross-validation EM to the task of agreement\/disagreement classification of multi-party conversational speech, using discriminative models such as support vector machines and multi-layer perceptrons. 
We experimentally compare and discuss their advantages and weaknesses when used with different amounts of unlabeled data.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_02.pdf\">[PDF]<\/a>\r\n[\/panel]\r\n\r\n[panel header=\"Learning Methods in Multilingual Speech Recognition\"]\r\nAuthors:\u00a0 Hui Lin, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/deng\/\">Li Deng<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jdroppo\/\">Jasha Droppo<\/a>, Dong Yu, and Alex Acero\r\n\r\nAbstract: One key issue in developing learning methods for multilingual acoustic modeling in large vocabulary automatic speech recognition (ASR) applications is to maximize the benefit of boosting the acoustic training data from multiple source languages while minimizing the negative effects of data impurity arising from language \u201cmismatch\u201d. In this paper, we introduce two learning methods, semi-automatic unit selection and a global phonetic decision tree, to address this issue via effective utilization of acoustic data from multiple languages. The semi-automatic unit selection aims to combine the merits of both data-driven and knowledge-driven approaches to identifying the basic units in multilingual acoustic modeling. The global decision-tree method allows clustering of cross-center phones and cross-center states in the HMMs, offering the potential to discover a better sharing structure beneath the mixed acoustic dynamics and context mismatch caused by the use of multiple languages\u2019 acoustic data. 
Our preliminary experimental results show that both of these learning methods improve the performance of multilingual speech recognition.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_03.pdf\">[PDF]<\/a>\r\n[\/panel]\r\n\r\n[panel header=\"Unsupervised Audio Speech Segmentation Using the Voting Experts Algorithm\"]\r\nAuthors:\u00a0 Matthew Miller and Alexander Stoytchev\r\n\r\nAbstract: Human beings have an apparently innate ability to segment continuous audio speech into words, and that ability is present in infants as young as 8 months old. This propensity towards audio segmentation seems to lay the groundwork for language learning in human beings. To artificially reproduce this ability would be both practically useful and theoretically enlightening. In this paper we propose an algorithm for the unsupervised segmentation of audio speech, based on the Voting Experts (VE) algorithm, which was originally designed to segment sequences of discrete tokens into categorical episodes. We demonstrate that our procedure is capable of inducing breaks with an accuracy substantially greater than chance, and suggest possible avenues of exploration to further increase the segmentation quality. We also show that this algorithm can reproduce results obtained from segmentation experiments performed with 8-month-old infants.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_04.pdf\">[PDF]<\/a>\r\n[\/panel]\r\n\r\n[panel header=\"System Combination for Machine Translation Using N-Gram Posterior Probabilities\"]\r\nAuthors:\u00a0 Yong Zhao and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/xiaohe\/\">Xiaodong He<\/a>\r\n\r\nAbstract: This paper proposes using n-gram posterior probabilities, which are estimated over translation hypotheses from multiple machine translation (MT) systems, to improve the performance of system combination. 
Two ways of using n-gram posteriors in confusion network decoding are presented. The first is based on an n-gram posterior language model per source sentence, and the second, called n-gram segment voting, boosts word posterior probabilities with n-gram occurrence frequencies. The two n-gram posterior methods are incorporated into the confusion network as individual features of a log-linear combination model. Experiments on the Chinese-to-English MT task show that both methods yield significant improvements in translation performance, and a combination of these two features produces the best translation performance.\r\n\r\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/03\/nips2008WSL1_05.pdf\">[PDF]<\/a>\r\n[\/panel]\r\n[\/accordion]"}],"msr_startdate":"2008-12-12","msr_enddate":"2008-12-12","msr_event_time":"","msr_location":"Whistler, British Columbia, Canada","msr_event_link":"https:\/\/nips.cc\/","msr_event_recording_link":"","msr_startdate_formatted":"December 12, 2008","msr_register_text":"Watch now","msr_cta_link":"https:\/\/nips.cc\/","msr_cta_text":"Watch now","msr_cta_bi_name":"Event Register","featured_image_thumbnail":null,"event_excerpt":"This workshop is intended for researchers interested in machine learning methods for speech and language processing and in unifying approaches to several outstanding speech and language processing issues. In the last few years, significant progress has been made in both research and commercial applications of speech and language processing. Despite the superior empirical results, however, there remain important theoretical issues to be addressed. 
Theoretical advancement is expected to drive greater system performance improvement, which in&hellip;","msr_research_lab":[],"related-researchers":[],"msr_impact_theme":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-opportunities":[],"related-publications":[],"related-videos":[],"related-posts":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event\/372044","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-event"}],"version-history":[{"count":4,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event\/372044\/revisions"}],"predecessor-version":[{"id":867924,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event\/372044\/revisions\/867924"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=372044"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=372044"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=372044"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=372044"},{"taxonomy":"msr-video-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video-type?post=372044"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=372044"},{"taxonomy":"msr-program-audience","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-program-audience?post=372044"},{"taxonomy":"msr-post-option","embedd
able":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=372044"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=372044"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}