Cyril Goutte

Semi-supervised Document Classification with a Mislabeling Error Model (2009)

Anastasia Krithara, Massih R. Amini, Jean-Michel Renders, Cyril Goutte

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed...

Learning Machine Translation (2009)

Goutte, Cyril, Cancedda, Nicola, Dymetman, Marc, Foster, George

The Internet gives us access to a wealth of information in languages we don't understand, and instant online translation leaves much to be desired. The investigation of automated or semi-automated...

A Statistical Machine Translation Primer (2009)

Cancedda, Nicola, Dymetman, Marc, Foster, George, Goutte, Cyril

This first chapter is a short introduction to the main aspects of statistical machine translation (SMT). In particular, we cover the issues of automatic evaluation of machine translation output,...

Confidence Estimation for Machine Translation (2008)

Cyril Goutte, Erin Fitzgerald, Alex Kulesza

We present a detailed study of confidence estimation for machine translation. Various methods for determining whether MT output is correct are investigated, for both whole sentences and words. Since...

Data cube approximation and mining using probabilistic modelling (2008)

Cyril Goutte, Rokia Missaoui, Ameur Boujenoui

On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction...

A Probabilistic Model for Data Cube Compression and Query Approximation (2008)

Rokia Missaoui, Anicet Kouomou Choupo, Cyril Goutte, Ameur Boujenoui

Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process....

Data cube approximation and mining using probabilistic modelling (2008)

Cyril Goutte, Rokia Missaoui, Ameur Boujenoui

On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and different abstraction levels...

Active, Semi-Supervised Learning for Textual Information Access (2008)

Anastasia Krithara, Cyril Goutte, Massih-Reza Amini, Jean-Michel Renders

Machine learning techniques have been used for various tasks of document management and textual information access, such as categorisation, information extraction, or automatic organization of large...

Learning from partially labelled data—with confidence (2008)

Eric Gaussier, Cyril Goutte

In this paper, we propose a unifying treatment of several strategies for training mixture models from label-deficient data. After a review of different approaches to estimating classification models...

Semi-supervised Document Classification with a Mislabeling Error Model (2008)

Krithara, Anastasia, Amini, Massih, Renders, Jean-Michel, Goutte, Cyril

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is partially labeled. The proposed approach...

A Boosting Algorithm for Learning Bipartite Ranking Functions with Partially Labeled Data (2008)

Amini, Massih, Truong, Vinh, Goutte, Cyril

This paper presents a boosting based algorithm for learning a bipartite ranking function (BRF) with partially labeled data. Until now different attempts had been made to build a BRF in a...

Combining labelled and unlabelled data: a case study on Fisher kernels and transductive inference for biological entity recognition (2007)

Cyril Goutte, Herve Dejean, Eric Gaussier, Nicola Cancedda, Jean-Michel Renders

We address the problem of using partially labelled data, e.g. large collections where only little data is annotated, for extracting biological entities. Our approach relies on a combination of...

Feature-space clustering (2007)

Lars Kai Hansen, Matthew G. Liptrot, Egill Rostrup, Cyril Goutte

Clustering fMRI time series has emerged in recent years as a possible alternative to parametric modelling approaches. Most of the work has been so far concerned with clustering raw time series. In...

Extraction of the Relevant Delays for Temporal Modelling (2007)

Cyril Goutte

When modelling temporal processes, just like in pattern recognition, selecting the optimal number of inputs is a central concern. In this contribution, we take advantage of specific features of...

OVERVIEW (2007)

Jan Larsen, Cyril Goutte

Modeling with flexible models, such as neural networks, requires careful control of the model complexity and generalization ability of the resulting model. Whereas general asymptotic estimators of...

Regularization with a pruning prior (2007)

Cyril Goutte, Lars Kai Hansen

Both authors wish to acknowledge stimulating discussions with Jan Larsen. We investigate the use of a regularization prior and its pruning properties. We...

BIOINFORMATICS Combining NLP and Probabilistic Categorisation for Document and Term Selection for Swiss-Prot Medical Annotation (2007)

Pavel B. Dobrokhotov, Cyril Goutte, Anne-Lise Veuthey

Motivation: Searching relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and...

Combining labelled and unlabelled data: a case study on Fisher kernels and transductive inference for biological entity recognition (2007)

Cyril Goutte, Eric Gaussier, Nicola Cancedda, Jean-Michel Renders

We address the problem of using partially labelled data, e.g. large collections where only little data is annotated, for extracting biological entities. Our approach relies on a combination of...

Semi-Supervised Document Classification with a Mislabeling Error Model (2007)

Krithara, Anastasia, Amini, Massih, Renders, Jean-Michel, Goutte, Cyril

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is partially labeled. The proposed approach...

Data Cube Approximation and Mining using Probabilistic Modeling (2007)

Goutte, Cyril, Missaoui, Rokia, Boujenoui, Ameur

On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction...

Fast & Confident Probabilistic Categorization (2007)

Goutte, Cyril

We describe NRC's submission to the Anomaly Detection/Text Mining competition organised at the Text Mining Workshop 2007. This submission relies on a straightforward implementation of the...

Statistical Phrase-based Post-editing (2007)

Simard, Michel, Goutte, Cyril, Isabelle, Pierre

We propose to use a statistical phrase-based machine translation system in a post-editing task: the system takes as input raw machine translation output (from a commercial rule-based MT system), and...

A Probabilistic Model for Data Cube Compression and Query Approximation (2007)

Missaoui, Rokia, Goutte, Cyril, Kouomou Choupo, Anicet, Boujenoui, Ameur

Databases and data warehouses contain an overwhelming volume of information that users must wade through in order to extract valuable and actionable knowledge to support the decision-making process....

A Probabilistic Model for Fast and Confident Categorisation of Textual Documents (2007)

Cyril Goutte

Detection/Text Mining competition organised at the Text Mining Workshop 2007. This entry relies on a straightforward implementation of a probabilistic categoriser described earlier [4]. This...

Reducing the Annotation Burden in Text Classification (2006)

Krithara, Anastasia, Goutte, Cyril, Amini, Massih, Renders, Jean-Michel

In this paper we describe a method which combines semi-supervised and active learning for the classification task. In particular, we propose a semi-supervised PLSA (Probabilistic Latent Semantic...

Active, Semi-Supervised Learning for Textual Information Access (2006)

Krithara, Anastasia, Goutte, Cyril, Amini, Massih, Renders, Jean-Michel

Machine learning techniques have been used for various tasks of document management and textual information access, such as categorisation, information extraction, or automatic organization of large...

A Resource-Light Approach to Cross-Language Information Retrieval (2006)

Renders, Jean-Michel, Gaussier, Eric, Goutte, Cyril

This paper aims at describing how the combination of light resources such as standard bilingual dictionaries and multilingual corpora can be formalized and exploited in the general and efficient...

Lexical Entailment for Information Retrieval (2006)

Clinchant, Stephane, Goutte, Cyril, Gaussier, Eric

Textual Entailment has recently been proposed as an application independent task of recognising whether the meaning of one text may be inferred from another. This is potentially a key task in many...

Categorization in multiple category systems (2006)

Renders, Jean-Michel, Gaussier, Eric, Goutte, Cyril, Csurka, Gabriela, Pacull, Francois

We explore the situation in which documents have to be categorized into more than one category system, a situation we refer to as multiple-view categorization. More particularly, we address the case...

Lexical entailment for information retrieval (2006)

Stéphane Clinchant, Cyril Goutte, Eric Gaussier

Textual Entailment has recently been proposed as an application independent task of recognising whether the meaning of one text may be inferred from another. This is potentially a key task...

Automatic Evaluation of Machine Translation Quality (2006)

Cyril Goutte

Any scientific endeavour must be evaluated in order to assess its correctness. In many applied sciences it is necessary to check that the theory adequately matches actual observations. In Machine...

Translating with non-contiguous phrases (2005)

Simard, Michel, Cancedda, Nicola, Cavestro, Bruno, Dymetman, Marc, Gaussier, Eric, Goutte, Cyril, ...

This paper presents a phrase-based statistical machine translation method, based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from word-aligned corpora is...

Co-occurrence Models in Music Genre Classification (2005)

Ahrendt, Peter, Goutte, Cyril, Larsen, Jan

Music genre classification has been investigated using many different methods, but most of them build on probabilistic models of feature vectors x_r which only represent the short time segment with...

Relation between PLSA and NMF and Implications (2005)

Goutte, Cyril, Gaussier, Eric

The techniques of Non-negative Matrix Factorisation (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document...

Learning from partially labelled data -- with confidence (2005)

Gaussier, Eric, Goutte, Cyril

In this paper, we propose a unifying treatment of several strategies for training mixture models from label-deficient data. After a review of different approaches to estimating classification models...

A Probabilistic Interpretation of Precision, Recall and F-score, with Implication for Evaluation (2005)

Goutte, Cyril, Gaussier, Eric

We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and F-score,...

Translating with non-contiguous phrases (2005)

Michel Simard, Nicola Cancedda, Bruno Cavestro, Marc Dymetman, Eric Gaussier, Cyril Goutte, ...

This paper presents a phrase-based statistical machine translation method, based on non-contiguous phrases, i.e. phrases with gaps. A method for producing such phrases from word-aligned corpora is...

A probabilistic interpretation of precision, recall and F-score, with implication for evaluation (2005)

Cyril Goutte, Eric Gaussier

We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and...

Aligning words using matrix factorisation (2004)

Goutte, Cyril, Yamada, Kenji, Gaussier, Eric

Aligning words from sentences which are mutual translations is an important problem in different settings, such as bilingual terminology extraction, Machine Translation, or projection of linguistic...

A Geometric view on bilingual lexicon extraction from comparable corpora (2004)

Gaussier, Eric, Renders, Jean-Michel, Matveeva, Irina, Goutte, Cyril, Dejean, Herve

We adopt in this study a geometric view on bilingual lexicon extraction from comparable corpora. This view makes it possible to re-interpret the methods proposed so far and identify unresolved...

Corpus-Based vs. Model-Based Selection of Relevant Features (2004)

Goutte, Cyril, Dobrokhotov, Pavel, Gaussier, Eric, Veuthey, Anne-Lise

In this contribution, we review a number of approaches to feature selection, divided in two broad classes. Some are corpus-based, i.e. they use only the data to assess the relevance of each feature,...

Generative vs Discriminative approaches to entity Recognition from label deficient data (2004)

Goutte, Cyril, Gaussier, Eric, Cancedda, Nicola, Dejean, Herve

Annotating biomedical text for Named Entity Recognition (NER) is usually a tedious and expensive process, while unannotated data is freely available in large quantities. It therefore seems relevant...

Generative vs Discriminative Approaches to Entity Extraction from Label Deficient Data. JADT 2004, 7es Journées internationales d'Analyse statistique des Données Textuelles, Louvain-la-Neuve (2004)

Cyril Goutte, Éric Gaussier, Nicola Cancedda, Hervé Déjean

Annotating biomedical text for Named Entity Recognition (NER) is usually a tedious and expensive process, while unannotated data is freely available in large quantities. It therefore seems relevant...

Aligning words using matrix factorisation (2004)

Cyril Goutte, Kenji Yamada, Eric Gaussier

Aligning words from sentences which are mutual translations is an important problem in different settings, such as bilingual terminology extraction, Machine Translation, or projection of linguistic...

Reducing parameter space for word alignment (2003)

Herve Dejean, Eric Gaussier, Cyril Goutte, Kenji Yamada

This paper presents the experimental results of our attempts to reduce the size of the parameter space in word alignment algorithm. We use IBM Model 4 as a baseline....

Word-sequence kernels (2003)

Nicola Cancedda, Eric Gaussier, Cyril Goutte, Jaz K, Thomas Hofmann, Tomaso Poggio, ...

We address the problem of categorising documents using kernel-based methods such as Support Vector Machines. Since the work of Joachims (1998), there is ample experimental evidence that SVM using the...

A Probabilistic Information Retrieval Approach to Medical Annotation in Swiss-Prot (2003)

Pavel Dobrokhotov, Cyril Goutte, Eric Gaussier

Both authors contributed equally to this work. The goal of medical annotation of human proteins in SWISS-PROT is to add features specifically intended for researchers working on genetic diseases and...

Kernel Methods for Document Filtering (2003)

Nicola Cancedda, Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, Cyril Goutte, Thore Graepel, ...

This paper describes the algorithms implemented by the KerMIT consortium for its participation in the TREC 2002 Filtering track. The consortium submitted runs for the routing task using a linear SVM,...

Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation (2003)

Dobrokhotov, Pavel B., Goutte, Cyril, Veuthey, Anne-Lise, Gaussier, Eric

Motivation: Searching relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic...

Kernel Methods for Document Filtering (2002)

Shawe-Taylor, John, Cancedda, Nicola, Cesa-Bianchi, Nicolo, Conconi, Alex, Gentile, Claudio, Goutte, Cyril, ...

This paper describes the algorithms implemented by the KerMIT consortium for its participation in the TREC 2002 Filtering track. The consortium submitted runs for the routing task using a linear SVM,...

Probabilistic models for hierarchical clustering and categorisation : Applications in the information society (2002)

Eric Gaussier, Cyril Goutte

We propose a new hierarchical generative model for textual data, where words may be generated by topic specific distributions at any level in the hierarchy. This model is naturally...

Modelling the fMRI response using smooth FIR filters (2000)

Finn Årup Nielsen, A. Nielsen, Lars Kai Hansen, Cyril Goutte, L. K. Hansen

We describe a flexible semi-parametric linear time-invariant convolution model for the fMRI response. It extends the FIR model [1] with a Gaussian process prior over the parameters (the...

On clustering fMRI time series (1999)

Peter Toft, Egill Rostrup, Finn A. Nielsen, Lars Kai Hansen, Cyril Goutte

Running title: Clustering fMRI time series.

On Optimal Data Split For Generalization Estimation And Model Selection (1999)

Jan Larsen, Cyril Goutte

In this paper, we address a crucial problem of cross-validation estimators: how to split the data into various sets. The set D of all available data is usually split into two parts: the design set, E,...

Optimal cross-validation split ratio: Experimental investigation (1998)

Cyril Goutte, Jan Larsen

Cross-validation is a widespread method for assessing the generalisation ability of a model in order to tune a regularisation parameter or other hyper-parameters of a learning process. The use of...

Adaptive Regularization of Neural Networks Using Conjugate Gradient (1998)

Cyril Goutte, Jan Larsen

Recently we suggested a regularization scheme which iteratively adapts regularization parameters by minimizing validation error using simple gradient descent. In this contribution we present an...

Adaptive metric kernel regression (1998)

Cyril Goutte, Jan Larsen

Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input...

Adaptive metric kernel regression (1998)

Cyril Goutte, Jan Larsen

Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input dimensions. In...

Note on free lunches and cross-validation (1997)

Cyril Goutte

The "no free lunch" theorems (Wolpert and Macready, 1995) have sparked heated debate in the computational learning community. A recent communication (Zhu and Rohwer,...

Extracting the relevant delays in time series modelling (1997)

Cyril Goutte

In this contribution, we suggest a convenient way to use generalisation error to extract the relevant delays from a time-varying process, i.e. the delays that lead to the best prediction...

Lag space estimation in time series modelling (1997)

Cyril Goutte

The purpose of this contribution is to investigate some techniques for finding the relevant lag-space, i.e. input information, for time series modelling. This is an important aspect of time series...

Regularization with a Pruning Prior (1997)

Cyril Goutte, Lars Kai Hansen

We investigate the use of a regularization prior that we show has pruning properties. Analyses are conducted both using a Bayesian framework and with the generalization method, on a simple toy...

On The Use Of A Pruning Prior For Neural Networks (1996)

Cyril Goutte

We address the problem of using a regularization prior that prunes unnecessary weights in a neural network architecture. This prior provides a convenient alternative to traditional weight-decay. Two...

Some Computational Complexity Aspects of Neural Network Training (1996)

Cyril Goutte

We address the problem of obtaining an estimate of the computational complexity of a training algorithm, as a function of the number of patterns and parameters. On these grounds, we compare two...

Overview of Connectionist Control Using MLP (1996)

Cyril Goutte, Corinne Ledoux

Keywords: Control of dynamic processes, Identification, Connectionist modelling, MultiLayer Perceptrons. In this report, we investigate the application of connectionist techniques to identification...

Process and content (1989)

John Blatz, Erin Fitzgerald, George Foster, Simona G, Cyril Goutte, Alex Kulesza, ...

1.1 Confidence Estimation