Semi-supervised Document Classification with a Mislabeling Error Model (2009)
Anastasia Krithara, Massih R. Amini, Jean-Michel Renders, Cyril Goutte
This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed...
Learning Machine Translation (2009)
Goutte, Cyril, Cancedda, Nicola, Dymetman, Marc, Foster, George
The Internet gives us access to a wealth of information in languages we don't understand, and instant online translation leaves much to be desired. The investigation of automated or semi-automated...
A Statistical Machine Translation Primer (2009)
Cancedda, Nicola, Dymetman, Marc, Foster, George, Goutte, Cyril
This first chapter is a short introduction to the main aspects of statistical machine translation (SMT). In particular, we cover the issues of automatic evaluation of machine translation output,...
Confidence Estimation for Machine Translation (2008)
Cyril Goutte, Erin Fitzgerald, Alex Kulesza
We present a detailed study of confidence estimation for machine translation. Various methods for determining whether MT output is correct are investigated, for both whole sentences and words. Since...
Data cube approximation and mining using probabilistic modelling (2008)
Cyril Goutte, Rokia Missaoui, Ameur Boujenoui
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction...
A Probabilistic Model for Data Cube Compression and Query Approximation (2008)
Rokia Missaoui, Anicet Kouomou Choupo, Cyril Goutte, Ameur Boujenoui
Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process....
Data cube approximation and mining using probabilistic modelling (2008)
Cyril Goutte, Rokia Missaoui, Ameur Boujenoui
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and different abstraction levels...
Active, Semi-Supervised Learning for Textual Information Access (2008)
Anastasia Krithara, Cyril Goutte, Massih-Reza Amini, Jean-Michel Renders
Machine learning techniques have been used for various tasks of document management and textual information access, such as categorisation, information extraction, or automatic organization of large...
Learning from partially labelled data—with confidence (2008)
In this paper, we propose a unifying treatment of several strategies for training mixture models from label-deficient data. After a review of different approaches to estimating classification models...
Semi-supervised Document Classification with a Mislabeling Error Model (2008)
Krithara, Anastasia, Amini, Massih, Renders, Jean-Michel, Goutte, Cyril
This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is partially labeled. The proposed approach...
A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data (2008)
Amini, Massih, Truong, Vinh, Goutte, Cyril
This paper presents a boosting based algorithm for learning a bipartite ranking function (BRF) with partially labeled data. Until now different attempts had been made to build a BRF in a...
Nicola Cancedda, Nicolo Cesa-Bianchi, Alex Conconi, Cyril Goutte, Yaoyong Li, John Shawe-Taylor, Alexei Vinokourov
Cyril Goutte, Herve Dejean, Eric Gaussier, Nicola Cancedda, Jean-Michel Renders
We address the problem of using partially labelled data, e.g. large collections where only little data is annotated, for extracting biological entities. Our approach relies on a combination of...
Running title: Feature-space clustering (2007)
Lars Kai Hansen, Matthew G. Liptrot, Egill Rostrup, Cyril Goutte
Clustering fMRI time series has emerged in recent years as a possible alternative to parametric modelling approaches. Most of the work has been so far concerned with clustering raw time series. In...
Extraction of the Relevant Delays for Temporal Modelling (2007)
When modelling temporal processes just like in pattern recognition, selecting the optimal number of inputs is a central concern. In this contribution, we take advantage of specific features of...
Modeling with flexible models, such as neural networks, requires careful control of the model complexity and generalization ability of the resulting model. Whereas general asymptotic estimators of...
Both authors wish to acknowledge stimulating discussions with Jan Larsen. Regularization with a pruning prior We investigate the use of a regularization prior and its pruning properties. We...
Pavel B. Dobrokhotov, Cyril Goutte, Anne-Lise Veuthey
Motivation: Searching relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and...
Cyril Goutte, Eric Gaussier, Nicola Cancedda, Jean-Michel Renders
We address the problem of using partially labelled data, e.g. large collections where only little data is annotated, for extracting biological entities. Our approach relies on a combination of...
Semi-Supervised Document Classification with a Mislabeling Error Model (2007)
Krithara, Anastasia, Amini, Massih, Renders, Jean-Michel, Goutte, Cyril
This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is partially labeled. The proposed approach...
Data Cube Approximation and Mining using Probabilistic Modeling (2007)
Goutte, Cyril, Missaoui, Rokia, Boujenoui, Ameur
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction...
Fast & Confident Probabilistic Categorization (2007)
We describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the...
Statistical Phrase-based Post-editing (2007)
Simard, Michel, Goutte, Cyril, Isabelle, Pierre
We propose to use a statistical phrase-based machine translation system in a post-editing task: the system takes as input raw machine translation output (from a commercial rule-based MT system), and...
A Probabilistic Model for Data Cube Compression and Query Approximation (2007)
Missaoui, Rokia, Goutte, Cyril, Kouomou Choupo, Anicet, Boujenoui, Ameur
Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process....
A Probabilistic Model for Fast and Confident Categorisation of Textual Documents (2007)
Detection/Text Mining competition organised at the Text Mining Workshop 2007. This entry relies on a straightforward implementation of a probabilistic categoriser described earlier [4]. This...
Reducing the Annotation Burden in Text Classification (2006)
Krithara, Anastasia, Goutte, Cyril, Amini, Massih, Renders, Jean-Michel
In this paper we describe a method which combines semi-supervised and active learning for the classification task. In particular, we propose a semi-supervised PLSA (Probabilistic Latent Semantic...
Active, Semi-Supervised Learning for Textual Information Access (2006)
Krithara, Anastasia, Goutte, Cyril, Amini, Massih, Renders, Jean-Michel
Machine learning techniques have been used for various tasks of document management and textual information access, such as categorisation, information extraction, or automatic organization of large...
A Resource-Light Approach to Cross-Language Information Retrieval (2006)
Renders, Jean-Michel, Gaussier, Eric, Goutte, Cyril
This paper aims at describing how the combination of light resources such as standard bilingual dictionaries and multilingual corpora can be formalized and exploited in the general and efficient...
Lexical Entailment for Information Retrieval (2006)
Clinchant, Stephane, Goutte, Cyril, Gaussier, Eric
Textual Entailment has recently been proposed as an application independent task of recognising whether the meaning of one text may be inferred from another. This is potentially a key task in many...
Categorization in multiple category systems (2006)
Renders, Jean-Michel, Gaussier, Eric, Goutte, Cyril, Csurka, Gabriela, Pacull, Francois
We explore the situation in which documents have to be categorized into more than one category system, a situation we refer to as multiple-view categorization. More particularly, we address the case...
Lexical entailment for information retrieval (2006)
Stéphane Clinchant, Cyril Goutte, Eric Gaussier
Textual Entailment has recently been proposed as an application independent task of recognising whether the meaning of one text may be inferred from another. This is potentially a key task...
Automatic Evaluation of Machine Translation Quality (2006)
Any scientific endeavour must be evaluated in order to assess its correctness. In many applied sciences it is necessary to check that the theory adequately matches actual observations. In Machine...
Translating with non-contiguous phrases (2005)
Simard, Michel, Cancedda, Nicola, Cavestro, Bruno, Dymetman, Marc, Gaussier, Eric, Goutte, Cyril, ...
This paper presents a phrase-based statistical machine translation method, based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from a word-aligned corpora is...
Co-occurrence Models in Music Genre Classification (2005)
Ahrendt, Peter, Goutte, Cyril, Larsen, Jan
Music genre classification has been investigated using many different methods, but most of them build on probabilistic models of feature vectors x_r which only represent the short time segment with...
Relation between PLSA and NMF and Implications (2005)
The techniques of Non-negative Matrix Factorisation (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document...
Learning from partially labelled data -- with confidence (2005)
In this paper, we propose a unifying treatment of several strategies for training mixture models from label-deficient data. After a review of different approaches to estimating classification models...
We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and F-score,...
Translating with non-contiguous phrases (2005)
Michel Simard, Nicola Cancedda, Bruno Cavestro, Marc Dymetman, Eric Gaussier, Cyril Goutte, ...
This paper presents a phrase-based statistical machine translation method, based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from a word-aligned corpora is...
We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and...
Aligning words using matrix factorisation (2004)
Goutte, Cyril, Yamada, Kenji, Gaussier, Eric
Aligning words from sentences which are mutual translations is an important problem in different settings, such as bilingual terminology extraction, Machine Translation, or projection of linguistic...
A Geometric view on bilingual lexicon extraction from comparable corpora (2004)
Gaussier, Eric, Renders, Jean-Michel, Matveeva, Irina, Goutte, Cyril, Dejean, Herve
We adopt in this study a geometric view on bilingual lexicon extraction from comparable corpora. This view makes it possible to re-interpret the methods proposed so far and identify unresolved...
Corpus-Based vs. Model-Based Selection of Relevant Features (2004)
Goutte, Cyril, Dobrokhotov, Pavel, Gaussier, Eric, Veuthey, Anne-Lise
In this contribution, we review a number of approaches to feature selection, divided in two broad classes. Some are corpus-based, i.e. they use only the data to assess the relevance of each feature,...
Generative vs Discriminative approaches to entity Recognition from label deficient data (2004)
Goutte, Cyril, Gaussier, Eric, Cancedda, Nicola, Dejean, Herve
Annotating biomedical text for Named Entity Recognition (NER) is usually a tedious and expensive process, while unannotated data is freely available in large quantities. It therefore seems relevant...
Cyril Goutte, Éric Gaussier, Nicola Cancedda, Hervé Déjean
Annotating biomedical text for Named Entity Recognition (NER) is usually a tedious and expensive process, while unannotated data is freely available in large quantities. It therefore seems relevant...
Cyril Goutte, Eric Gaussier, Nicola Cancedda, Hervé Déjean
Annotating biomedical text for Named Entity Recognition (NER) is usually a tedious and expensive process, while unannotated data is freely available in large quantities. It therefore seems relevant...
Aligning words using matrix factorisation (2004)
Cyril Goutte, Kenji Yamada, Eric Gaussier
Aligning words from sentences which are mutual translations is an important problem in different settings, such as bilingual terminology extraction, Machine Translation, or projection of linguistic...
Reducing parameter space for word alignment (2003)
Herve Dejean, Eric Gaussier, Cyril Goutte, Kenji Yamada
This paper presents the experimental results of our attempts to reduce the size of the parameter space in word alignment algorithm. We use IBM Model 4 as a baseline....
Nicola Cancedda, Eric Gaussier, Cyril Goutte, Jaz K, Thomas Hofmann, Tomaso Poggio, ...
We address the problem of categorising documents using kernel-based methods such as Support Vector Machines. Since the work of Joachims (1998), there is ample experimental evidence that SVM using the...
A Probabilistic Information Retrieval Approach to Medical Annotation in Swiss-Prot (2003)
Pavel Dobrokhotov, Cyril Goutte, Eric Gaussier
Both authors contributed equally to this work. The goal of medical annotation of human proteins in SWISS-PROT is to add features specifically intended for researchers working on genetic diseases and...
Nicola Cancedda, Eric Gaussier, Cyril Goutte, Jaz K, Thomas Hofmann, Tomaso Poggio, ...
We address the problem of categorising documents using kernel-based methods such as Support Vector Machines.
Kernel Methods for Document Filtering (2003)
Nicola Cancedda, Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, Cyril Goutte, Thore Graepel, ...
This paper describes the algorithms implemented by the KerMIT consortium for its participation in the TREC 2002 Filtering track. The consortium submitted runs for the routing task using a linear SVM,...
Dobrokhotov, Pavel B., Goutte, Cyril, Veuthey, Anne-Lise, Gaussier, Eric
Motivation: Searching relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic...
Kernel Methods for Document Filtering (2002)
Shawe-Taylor, John, Cancedda, Nicola, Cesa-Bianchi, Nicolo, Conconi, Alex, Gentile, Claudio, Goutte, Cyril, ...
This paper describes the algorithms implemented by the KerMIT consortium for its participation in the TREC 2002 Filtering track. The consortium submitted runs for the routing task using a linear SVM,...
We propose a new hierarchical generative model for textual data, where words may be generated by topic specific distributions at any level in the hierarchy. This model is naturally...
Modelling the fMRI response using smooth FIR filters (2000)
Finn Årup Nielsen, A. Nielsen, Lars Kai Hansen, Cyril Goutte, L. K. Hansen
We describe a flexible semi-parametric linear time-invariant convolution model for the fMRI response. It extends the FIR model [1] with a Gaussian process prior over the parameters (the...
On clustering fMRI time series (1999)
Peter Toft, Egill Rostrup, Finn A. Nielsen, Lars Kai Hansen, Cyril Goutte
Running title: Clustering fMRI time series.
On Optimal Data Split For Generalization Estimation And Model Selection (1999)
In this paper, we address a crucial problem of cross-validation estimators: how to split the data into various sets. The set D of all available data is usually split into two parts: the design set, E,...
Optimal cross-validation split ratio: Experimental investigation (1998)
Cross-validation is a widespread method for assessing the generalisation ability of a model in order to tune a regularisation parameter or other hyper-parameters of a learning process. The use of...
Adaptive Regularization of Neural Networks Using Conjugate Gradient (1998)
Recently we suggested a regularization scheme which iteratively adapts regularization parameters by minimizing validation error using simple gradient descent. In this contribution we present an...
Adaptive metric kernel regression (1998)
Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input...
Adaptive metric kernel regression (1998)
Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input dimensions. In...
Note on free lunches and cross-validation (1997)
The "no free lunch" theorems (Wolpert and Macready, 1995) have sparked heated debate in the computational learning community. A recent communication, (Zhu and Rohwer,...
Extracting the relevant delays in time series modelling (1997)
In this contribution, we suggest a convenient way to use generalisation error to extract the relevant delays from a time-varying process, i.e. the delays that lead to the best prediction...
Lag space estimation in time series modelling (1997)
The purpose of this contribution is to investigate some techniques for finding the relevant lag-space, i.e. input information, for time series modelling. This is an important aspect of time series...
Regularization with a Pruning Prior (1997)
We investigate the use of a regularization prior that we show has pruning properties. Analyses are conducted both using a Bayesian framework and with the generalization method, on a simple toy...
Ese De, Cyril Goutte, E Paris, E Paris, Anne Gu, ...
This document was written using a customised version of the L
On The Use Of A Pruning Prior For Neural Networks (1996)
We address the problem of using a regularization prior that prunes unnecessary weights in a neural network architecture. This prior provides a convenient alternative to traditional weight-decay. Two...
Some Computational Complexity Aspects of Neural Network Training (1996)
We address the problem of obtaining an estimate of the computational complexity of a training algorithm, as a function of the number of patterns and parameters. On these grounds, we compare two...
Overview of Connectionist Control Using MLP (1996)
Cyril Goutte, Corinne Ledoux, Inrets Maia
Keywords: Control of dynamic processes, Identification, Connectionist modelling, MultiLayer Perceptrons. In this report, we investigate the application of connectionist techniques to identification...
John Blatz, Erin Fitzgerald, George Foster, Simona G, Cyril Goutte, Alex Kulesza, ...
1.1 Confidence Estimation