Giuseppe Carenini

Professor

Relevant Degree Programs

 

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2019)
Visual text analytics for online conversations (2017)

With the proliferation of Web-based social media, asynchronous conversations have become very common for supporting online communication and collaboration. Yet the increasing volume and complexity of conversational data often make it very difficult to get insights about the discussions. This dissertation posits that by integrating natural language processing and information visualization techniques in a synergistic way, we can better support the user's task of exploring and analyzing conversations. Unlike most previous systems, which do not consider the specific characteristics of online conversations, we applied design study methodologies from the visualization literature to uncover the data and task abstractions that guided the development of a novel set of visual text analytics systems.

The first of these systems is ConVis, which supports users in exploring an asynchronous conversation, such as a blog. ConVis offers a visual overview of a conversation by presenting its topics, authors, and thread structure, along with interaction techniques such as brushing and linked highlighting. Broadening from a single conversation to a collection of conversations, MultiConVis combines a novel hierarchical topic model with multi-scale exploration techniques. A series of user studies revealed significant improvements in user performance and subjective measures when these two systems were compared to traditional blog interfaces.

Based on the lessons learned from these studies, this dissertation introduces an interactive topic modeling framework specifically for asynchronous conversations. The resulting systems empower the user to revise the underlying topic models through an intuitive set of interactive features when the current models are noisy and/or insufficient to support their information seeking tasks. Two summative studies suggested that these systems outperformed counterparts without interactive topic modeling along several subjective and objective measures.

Finally, to demonstrate the generality and applicability of our approach, we tailored our previous systems to support information seeking in community question answering forums. The prototype was evaluated through a large-scale Web-based study, which suggests that our approach can be adapted to a specific conversational genre among a diverse range of users.

The dissertation concludes with a critical reflection on our approach and considerations for future research.
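
The hierarchical topic organization that a system like MultiConVis visualizes can be illustrated in miniature. This is not the thesis's topic model; it is a minimal sketch of grouping conversation comments so an overview visualization could be built on top, with all data invented:

```python
# Minimal sketch: grouping conversation comments into topics for an
# overview visualization. NOT the thesis's hierarchical topic model;
# a simple TF-IDF + hierarchical clustering stand-in on toy comments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

comments = [
    "The new budget cuts education funding.",
    "Education funding should be a priority.",
    "Transit fares are going up again.",
    "Higher fares will hurt daily commuters.",
]

# Vectorize comments, then cluster them hierarchically.
X = TfidfVectorizer(stop_words="english").fit_transform(comments).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

for topic_id in sorted(set(labels)):
    members = [c for c, l in zip(comments, labels) if l == topic_id]
    print(f"topic {topic_id}: {members}")
```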

View record

Discourse analysis of asynchronous conversations (2014)

A well-written text is not merely a sequence of independent and isolated sentences, but rather a sequence of structured and related sentences. It addresses a particular topic, often covering multiple subtopics, and is organized in a coherent way that enables the reader to process the information. Discourse analysis seeks to uncover such underlying structures, which can support many applications including text summarization and information extraction.

This thesis focuses on building novel computational models for different discourse analysis tasks in asynchronous conversations, i.e., conversations where participants communicate with each other at different times (e.g., emails, blogs). Effective processing of these conversations can be of great strategic value for both organizations and individuals. We propose novel computational models for topic segmentation and labeling, rhetorical parsing, and dialog act recognition in asynchronous conversations. Our approaches rely on two related computational methodologies: graph theory and probabilistic graphical models.

The topic segmentation and labeling models find the high-level discourse structure, i.e., the global topical structure of an asynchronous conversation. Our graph-based approach extends state-of-the-art methods by integrating fine-grained conversational structure with other conversational features. The rhetorical parser, in turn, captures the coherence structure, a finer discourse structure, by identifying coherence relations between the discourse units within each comment of the conversation. Our parser applies an optimal parsing algorithm to probabilities inferred from a discriminative graphical model, which allows us to represent the structure and the label of a discourse tree constituent jointly, and to capture the sequential and hierarchical dependencies between the constituents. Finally, the dialog act model allows us to uncover the underlying dialog structure of the conversation. We present unsupervised probabilistic graphical models that capture the sequential dependencies between the acts, and show how these models can be trained more effectively using the fine-grained conversational structure.

Together, these structures provide a deep understanding of an asynchronous conversation that can be exploited in the applications mentioned above. For each discourse processing task, we evaluate our approach on different datasets and show that our models consistently outperform the state of the art by a wide margin. Our results often correlate highly with human annotations.
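
The "optimal parsing algorithm" applied to constituent probabilities can be illustrated with a CKY-style dynamic program. Below is a minimal sketch, where `span_prob` is a hypothetical stand-in for the probabilities inferred from the thesis's discriminative graphical model:

```python
# CKY-style sketch of optimal discourse parsing: given P(span is a
# constituent) for every span of discourse units, find the binary tree
# maximizing the product of its constituents' probabilities.
import math

def best_tree(n, span_prob):
    best = {}   # (i, j) -> (log score, best split point)
    for i in range(n):
        best[(i, i + 1)] = (math.log(span_prob(i, i + 1)), None)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score, split = max(
                (best[(i, k)][0] + best[(k, j)][0], k) for k in range(i + 1, j)
            )
            best[(i, j)] = (score + math.log(span_prob(i, j)), split)
    return best

# Toy usage: uniform span probabilities yield some valid binarization.
tree = best_tree(4, lambda i, j: 0.5)
print(tree[(0, 4)])
```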

View record

Master's Student Supervision (2010 - 2018)
Detecting dementia from written and spoken language (2018)

This thesis makes three main contributions to existing work on the automatic detection of dementia from language. First, we introduce a new set of biologically motivated spatial neglect features, and show that their inclusion achieves a new state of the art in classifying Alzheimer's disease (AD) from recordings of patients undergoing the Boston Diagnostic Aphasia Examination. Second, we demonstrate how a simple domain adaptation algorithm can be used to leverage AD data to improve classification of mild cognitive impairment (MCI), a condition characterized by a slight but noticeable decline in cognition that does not meet the criteria for dementia, and for which reliable data is scarce. Third, we investigate whether dementia can be detected from written rather than spoken language, and show that a range of classifiers achieve performance far above baseline. Additionally, we create a new corpus of blog posts written by authors with and without dementia and make it publicly available to future researchers.
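
The abstract does not name the domain adaptation algorithm. One widely used "simple" choice for this setting is feature augmentation (Daumé III, 2007), which triples each feature vector into general, source-only, and target-only blocks. A minimal sketch under that assumption, with synthetic data:

```python
# Sketch of feature augmentation domain adaptation (Daume III, 2007):
# each feature vector is tripled into (general, source-only, target-only)
# blocks, letting one linear classifier share what transfers across
# domains while keeping domain-specific weights for the rest.
# The thesis's exact algorithm may differ; the data below is synthetic.
import numpy as np

def augment(X, is_source):
    zeros = np.zeros_like(X)
    src = np.where(is_source[:, None], X, zeros)
    tgt = np.where(is_source[:, None], zeros, X)
    return np.hstack([X, src, tgt])  # [general | source | target]

X = np.random.rand(6, 4)                        # e.g., AD + MCI feature rows
is_source = np.array([1, 1, 1, 1, 0, 0], bool)  # AD = source, MCI = target
print(augment(X, is_source).shape)              # (6, 12)
```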

View record

Summarization of partial email threads: silver standards and Bayesian surprise (2018)

We define and motivate the problem of summarizing partial email threads. This problem introduces the challenge of generating reference summaries for partial threads when extractive human annotation is available only for the threads as a whole: gold standard annotation intended to summarize a completed email thread may not be equally applicable to each of its partial threads, particularly when the human-selected sentences are not uniformly distributed within the threads. We propose a framework for generating these reference summaries at arbitrary length in an oracular manner by exploiting existing gold standard summaries for completed email threads. We also propose and evaluate two sentence scoring functions that can be used in this "silver standard" framework, and we make the resulting datasets publicly available. In addition, we apply a recent unsupervised method based on Bayesian surprise that incorporates background knowledge to partial thread summarization, extend that method with conversational features, and modify the mechanism by which it handles information redundancy. Experiments with our partial thread summarizers indicate comparable or improved performance relative to a state-of-the-art unsupervised full thread summarizer baseline in most cases, and we identify areas in which potential vulnerabilities in our methods can be avoided or accounted for. Furthermore, our results suggest that the potential benefits of background knowledge for partial thread summarization should be investigated further with larger datasets.
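
Bayesian surprise measures how much an observation shifts a background model, as the KL divergence from prior to posterior. A minimal sketch of sentence scoring with a Dirichlet-multinomial background model follows; this is one plausible instantiation, not necessarily the thesis's exact formulation, and the vocabulary is invented:

```python
# Sketch of Bayesian surprise with a Dirichlet-multinomial language model:
# surprise(sentence) = KL( Dir(alpha + counts) || Dir(alpha) ).
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a_post, a_prior):
    # Closed-form KL divergence between two Dirichlet distributions.
    return (gammaln(a_post.sum()) - gammaln(a_prior.sum())
            - (gammaln(a_post) - gammaln(a_prior)).sum()
            + ((a_post - a_prior)
               * (digamma(a_post) - digamma(a_post.sum()))).sum())

vocab = {"meeting": 0, "budget": 1, "deadline": 2, "lunch": 3}
alpha = np.ones(len(vocab))   # background-knowledge prior (here: uniform)

def surprise(sentence_tokens):
    counts = np.zeros(len(vocab))
    for tok in sentence_tokens:
        counts[vocab[tok]] += 1
    return dirichlet_kl(alpha + counts, alpha)

print(surprise(["budget", "deadline"]))   # higher = more surprising
```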

View record

A semi-joint neural model for sentence level discourse parsing and sentiment analysis (2017)

Discourse parsing and sentiment analysis are two fundamental tasks in natural language processing that have been shown to be mutually beneficial. In this work, we design and compare two neural models for jointly learning both tasks. In the proposed approach, we first create a vector representation for all the segments in the input sentence. Next, we apply three different recursive neural network models: one for discourse structure prediction, one for discourse relation prediction, and one for sentiment analysis. Finally, we combine these networks in two different joint models: multi-tasking and pre-training. Our results on two standard corpora indicate that both methods improve each task, with multi-tasking having a bigger impact than pre-training.
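
As a rough illustration of the multi-tasking variant, the sketch below shares a single segment encoder between two task heads. The thesis composes segments with recursive neural networks; the linear encoder and all dimensions here are simplifying assumptions:

```python
# Simplified multi-task sketch: one shared segment encoder feeding
# separate discourse-relation and sentiment heads, so gradients from
# both tasks update the shared parameters.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, seg_dim=100, hidden=64, n_relations=18, n_sentiments=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seg_dim, hidden), nn.Tanh())
        self.relation_head = nn.Linear(hidden, n_relations)
        self.sentiment_head = nn.Linear(hidden, n_sentiments)

    def forward(self, segment_vecs):
        h = self.encoder(segment_vecs)          # shared representation
        return self.relation_head(h), self.sentiment_head(h)

model = JointModel()
rel_logits, sent_logits = model(torch.randn(5, 100))   # 5 toy segments
# In multi-task training, the per-task losses are simply summed:
# loss = ce(rel_logits, rel_gold) + ce(sent_logits, sent_gold)
```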

View record

Improving classification on infrequent discourse relations via training data enrichment (2017)

Discourse parsing is a popular technique widely used in text understanding, sentiment analysis, and other NLP tasks. However, for most discourse parsers, the performance varies significantly across different discourse relations. In this thesis, we first validate the underfitting hypothesis, i.e., the less frequent a relation is in the training data, the poorer the performance on that relation. We then explore how to increase the number of positive training instances, without resorting to manually creating additional labeled data. We propose a training data enrichment framework that relies on co-training of two different discourse parsers on unlabeled documents. Importantly, we show that co-training alone is not sufficient. The framework requires a filtering step to ensure that only “good quality” unlabeled documents can be used for enrichment and re-training. We propose and evaluate two ways to perform the filtering. The first is to use an agreement score between the two parsers. The second is to use only the confidence score of the faster parser. Our empirical results show that agreement score can help to boost the performance on infrequent relations, and that the confidence score is a viable approximation of the agreement score for infrequent relations.
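
A minimal sketch of the enrichment loop with agreement-based filtering follows; the parser objects and their `parse` interface are hypothetical stand-ins for the two discourse parsers, not a real parser API:

```python
# Sketch of co-training with agreement filtering: two parsers label
# unlabeled documents, and only documents on which they agree strongly
# are kept for enrichment and re-training.
def agreement(labels_a, labels_b):
    # Fraction of discourse-relation labels on which the parsers agree.
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / max(len(labels_a), 1)

def enrich(parser_a, parser_b, unlabeled_docs, threshold=0.8):
    enriched = []
    for doc in unlabeled_docs:
        labels_a = parser_a.parse(doc)   # hypothetical interface
        labels_b = parser_b.parse(doc)
        if agreement(labels_a, labels_b) >= threshold:
            enriched.append((doc, labels_a))   # keep one parser's labels
    # Both parsers are then re-trained on original + enriched data.
    return enriched
```

The abstract's alternative filter replaces `agreement` with the faster parser's own confidence score, trading some filtering quality for speed.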

View record

A study of methods for learning phylogenies of cancer cell populations from binary single nucleotide variant profiles (2015)

An accurate phylogeny of a cancer tumour has the potential to shed light on numerous phenomena, such as key oncogenetic events, relationships between clones, and evolutionary responses to treatment. Most work in cancer phylogenetics to date relies on bulk tissue data, which can resolve only a few genotypes unambiguously. Meanwhile, single-cell technologies have considerably improved our ability to resolve intra-tumour heterogeneity. Furthermore, most cancer phylogenetic methods use classical approaches, such as Neighbor-Joining, which place all extant species at the leaves of the phylogenetic tree; but in cancer, ancestral genotypes may be present in extant populations. There is a need for scalable methods that can capture this phenomenon.

We have made progress on this front by developing the Genotype Tree representation of cancer phylogenies, implementing three methods for reconstructing Genotype Trees from binary single-nucleotide variant profiles, and evaluating these methods under a variety of conditions. Additionally, we have developed a tool that simulates the evolution of cancer cell populations, allowing us to systematically vary evolutionary conditions and observe the effects on tree properties and reconstruction accuracy.

Of the methods we tested, Recursive Grouping and Chow-Liu Grouping appear to be well suited to the task of learning phylogenies over hundreds to thousands of cancer genotypes. Of the two, Recursive Grouping has the strongest and most stable overall performance, while Chow-Liu Grouping has a superior asymptotic runtime that is competitive with Neighbor-Joining.
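
Chow-Liu Grouping builds on the classic Chow-Liu procedure: a maximum spanning tree over pairwise mutual information. A minimal sketch of that underlying step for binary SNV profiles is below; the grouping extension itself, which can place genotypes at internal nodes, is more involved, and the profiles here are random toy data:

```python
# Sketch of the classic Chow-Liu step underlying Chow-Liu Grouping:
# a maximum spanning tree over pairwise mutual information between
# binary SNV profiles of genotypes.
import numpy as np
import networkx as nx

def mutual_information(x, y):
    # Empirical MI between two binary vectors, treating sites as samples.
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

profiles = np.random.randint(0, 2, size=(5, 30))   # 5 genotypes x 30 SNVs
G = nx.Graph()
for i in range(len(profiles)):
    for j in range(i + 1, len(profiles)):
        G.add_edge(i, j, weight=mutual_information(profiles[i], profiles[j]))
tree = nx.maximum_spanning_tree(G)
print(sorted(tree.edges()))
```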

View record

Exploring machine learning design options in discourse parsing (2015)

Discourse parsing has recently attracted increasing interest among researchers, since it is very helpful for text understanding, sentiment analysis, and other NLP tasks. In a well-written text, authors use discourse to better organize the text, and sentences (or clauses) tend to interact with neighboring sentences (or clauses). Each piece of text locally exhibits a finer discourse structure called rhetorical structure, and a document can be organized into a discourse tree that captures this structure and logically binds the sentences (or clauses) together; building this tree is the task of discourse parsing.

However, although intra-sentential discourse parsing already achieves high performance, multi-sentential discourse parsing remains a big challenge in terms of both accuracy and efficiency. Machine learning techniques have proved successful in many NLP tasks, including discourse parsing. In this thesis, we therefore try to enhance the performance (e.g., accuracy and efficiency) of discourse parsing using machine learning techniques. To this end, we propose a novel two-step discourse parsing system, which first builds a discourse tree for a given text by applying optimal probabilistic parsing to probabilities inferred from learned conditional random fields (CRFs), and then uses learned log-linear models to tag the nodes of the discourse tree with discourse relations.

We analyze different aspects of the problem (e.g., sequential vs. non-sequential models, greedy vs. optimal parsing, joint vs. separate models) and discuss their trade-offs. We also carried out extensive experiments to study the usefulness of different feature families and the effects of over-fitting. We find that the most effective feature sets differ across tasks: part-of-speech (POS) and context features are the most effective for intra- and multi-sentential structure prediction respectively, while ngram features are the most effective for both intra- and multi-sentential relation labeling. Moreover, over-fitting does occur in our experiments, so proper regularization is needed. Our final system achieves state-of-the-art F-scores of 86.2, 72.2, and 59.2 in structure, nuclearity, and relation, and is more efficient than Joty's parser (40 times faster in training and 3 times faster at test time).
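
The second step tags relations with learned log-linear models, and the experiments identify ngram features as the most effective features for relation labeling. A minimal sketch of such a tagger as multinomial logistic regression (a log-linear model), on toy spans with invented relation labels:

```python
# Sketch of the relation-labeling step as a log-linear model:
# logistic regression over ngram features of discourse spans.
# Spans and labels below are toy examples, not the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

spans = ["because the vote failed", "but the plan succeeded",
         "after the meeting ended", "although costs rose"]
relations = ["Cause", "Contrast", "Temporal", "Contrast"]

tagger = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    LogisticRegression(max_iter=1000),     # log-linear classifier
)
tagger.fit(spans, relations)
print(tagger.predict(["since the deadline passed"]))
```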

View record

Automatic abstractive summarization of meeting conversations (2014)

Nowadays, there are various ways for people to share and exchange information. Phone calls, e-mails, and social networking applications have made it much easier for us to communicate. Despite these convenient methods for exchanging ideas, meetings are still one of the most important ways for people to collaborate, share information, discuss their plans, and make decisions for their organizations. However, meetings have drawbacks as well: they are generally time-consuming and require the participation of all members, and taking meeting minutes for the benefit of those who miss a meeting also requires considerable time and effort.

To this end, there has been increasing demand for systems that automatically summarize meetings. So far, most summarization systems have applied extractive approaches, whereby summaries are created simply by extracting important phrases or sentences and concatenating them in sequence. However, because meeting transcripts consist of spontaneous utterances containing speech disfluencies such as repetitions and filled pauses, traditional extractive summarization approaches do not work effectively in this domain. To address these issues, we present a novel template-based abstractive meeting summarization system requiring less annotated data than previous abstractive summarization approaches. To generate abstract and robust templates that can guide the summarization process, our system extends a novel multi-sentence fusion algorithm and utilizes lexico-semantic information. It also leverages the relationship between human-authored summaries and their source meeting transcripts to select the best templates for generating abstractive summaries of meetings. In our experiments, we use the AMI corpus to instantiate our framework and compare it with state-of-the-art extractive and abstractive systems as well as human extractive and abstractive summaries. Our comprehensive evaluations, based on both automatic and manual approaches, demonstrate that our system outperforms all baseline systems and human extractive summaries in terms of both readability and informativeness. Furthermore, it achieves a level of quality nearly equal to that of human abstracts, based on a crowd-sourced manual evaluation.
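
A toy sketch of the template instantiation idea follows; the real system derives its templates via multi-sentence fusion and lexico-semantic information, whereas the templates and slot fillers below are hand-written stand-ins:

```python
# Toy sketch of template-based abstractive summarization: pick a
# template for a summary-worthy category and fill its slots with
# content drawn from the meeting transcript. Templates are invented.
templates = {
    "decision": "The group decided to {action}.",
    "action_item": "{person} will {task} before the next meeting.",
}

def realize(category, slots):
    return templates[category].format(**slots)

print(realize("decision", {"action": "adopt the new remote-control design"}))
print(realize("action_item", {"person": "The project manager",
                              "task": "draft the budget"}))
```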

View record

Evaluating open relation extraction over conversational texts (2014)

This thesis presents the first study of the performance of Open IE systems on conversational data. Due to the lack of test datasets in this domain, we propose a method for creating a test dataset covering a wide range of conversational data. Conversational text is more complex and challenging for relation extraction because of its cryptic content and ungrammatical, colloquial language. As a consequence, we use text simplification as a remedy to empower Open IE tools for relation extraction. Experimental results show that text simplification helps OLLIE, a state-of-the-art relation extraction system, find new relations, extract more accurate relations, and assign higher confidence scores to correct relations and lower confidence scores to incorrect relations for most datasets. Results also show that some conversational modalities, such as emails and blogs, are easier for the relation extraction task, while product reviews are the most difficult modality.
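
The claim about confidence scores can be probed by sweeping a confidence threshold over the extracted relations and measuring yield and precision. A minimal sketch with invented extraction judgments:

```python
# Sketch: evaluating confidence calibration of an Open IE system by
# sweeping a threshold over (confidence, is_correct) extraction pairs.
# The extractions below are invented for illustration only.
extractions = [(0.95, True), (0.90, True), (0.70, False),
               (0.65, True), (0.40, False), (0.30, False)]

for threshold in (0.9, 0.6, 0.3):
    kept = [ok for conf, ok in extractions if conf >= threshold]
    precision = sum(kept) / len(kept)
    print(f"threshold={threshold}: yield={len(kept)}, "
          f"precision={precision:.2f}")
```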

View record

Blog comments classification using tree structured conditional random fields (2013)

The Internet provides a variety of ways for people to easily share, socialize, and interact with each other. One of the most popular platforms is the online blog, which generates a vast amount of new text data every day in the form of blog comments and opinions about news, events, and products. However, not all comments have equal quality. Informative, high-quality comments have a greater impact on readers' opinions about the original post's content, such as the benefits of the product discussed in the post or the interpretation of a political event. Therefore, an efficient and effective mechanism to detect the most informative comments is highly desirable. For this purpose, sites like Slashdot, where users volunteer to rate comments based on their informativeness, can be a great resource for building such an automated system using supervised machine learning techniques. Our research concerns building an automatic comment classification system that leverages these freely available, valuable resources. Specifically, we discuss how informative comments in blogs can be detected using Conditional Random Fields (CRFs). Blog conversations typically have a tree-like structure in which an initial post is followed by comments, and each comment can be followed by other comments. In this work, we present an approach using Tree-structured Conditional Random Fields (TCRFs) to capture the dependencies in a tree-like conversational structure. This is in contrast with previous work [5], in which results produced by linear-chain CRF models had to be aggregated heuristically. As an additional contribution, we present a new blog corpus consisting of conversations of different genres from 6 different blog websites. We use this corpus to train and test our classifiers based on TCRFs.
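
TCRF inference exploits the reply tree: exact marginals can be computed by sum-product message passing from the leaves upward. A minimal sketch with a binary informative/not state per comment; the potentials are invented, whereas a real TCRF would derive them from learned feature weights:

```python
# Sketch of upward sum-product message passing on a comment reply tree
# with a binary (informative / not) state per comment.
# children[i] lists the replies to comment i; 0 is the initial post.
import numpy as np

children = {0: [1, 2], 1: [3], 2: [], 3: []}         # toy reply tree
unary = {i: np.array([0.6, 0.4]) for i in children}  # per-comment scores
pairwise = np.array([[0.7, 0.3],                     # parent/child
                     [0.3, 0.7]])                    # compatibility

def upward(node):
    msg = unary[node].copy()
    for child in children[node]:
        # Marginalize the child's states into a message to the parent.
        msg *= pairwise @ upward(child)
    return msg

belief = upward(0)
print(belief / belief.sum())   # normalized root marginal
```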

View record

Domain adaptation for summarizing conversations (2011)

The goal of summarization in natural language processing is to create abridged and informative versions of documents. A popular approach is supervised extractive summarization: given a training source corpus of documents with sentences labeled with their informativeness, train a model to select sentences from a target document and produce an extract. Conversational text is challenging to summarize because it is less formal, its structure depends on the modality or domain, and few annotated corpora exist. We use a labeled corpus of meeting transcripts as the source, and attempt to summarize a different target domain, threaded emails. We study two domain adaptation scenarios: a supervised scenario in which some labeled target domain data is available for training, and an unsupervised scenario with only unlabeled data in the target and labeled data available in a related but different domain. We implement several recent domain adaptation algorithms and perform a comparative study of their performance. We also compare the effectiveness of using a small set of conversation-specific features with a large set of raw lexical and syntactic features in domain adaptation. We report significant improvements of the algorithms over their baselines. Our results show that in the supervised case, given the amount of email data available and the set of features specific to conversations, training directly in-domain and ignoring the out-of-domain data is best. With only the more domain-specific lexical features, though overall performance is lower, domain adaptation can effectively leverage the lexical features to improve in both the supervised and unsupervised scenarios.
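
A minimal sketch of the supervised comparison reported here, training on target-domain email data alone versus merging in out-of-domain meeting data, with synthetic stand-in features and labels:

```python
# Sketch comparing in-domain training against merging out-of-domain
# (meeting) data for sentence-level informativeness classification.
# All features and labels below are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_meet, y_meet = rng.normal(0, 1, (200, 10)), rng.integers(0, 2, 200)
X_email, y_email = rng.normal(0.5, 1, (60, 10)), rng.integers(0, 2, 60)
X_test, y_test = rng.normal(0.5, 1, (40, 10)), rng.integers(0, 2, 40)

in_domain = LogisticRegression().fit(X_email, y_email)
merged = LogisticRegression().fit(np.vstack([X_meet, X_email]),
                                  np.concatenate([y_meet, y_email]))
print("in-domain:", in_domain.score(X_test, y_test))
print("merged:   ", merged.score(X_test, y_test))
```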

View record

 

Membership Status

Member of G+PS

Program Affiliations

 
