Prospective Graduate Students / Postdocs
This faculty member is currently not looking for graduate students or Postdoctoral Fellows. Please do not contact the faculty member with any such requests.
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
Data provenance is any information about the origin of a piece of data and the process that led to its creation. Most database provenance work has focused on creating models and semantics to query and generate this provenance information. While comprehensive, provenance information remains large and overwhelming, which can make it hard for data provenance systems to support data exploration or any meaningful applications. This thesis is focused on facilitating the use of database provenance through visual interfaces, summarization techniques, and curation techniques for real world applications. In the first part, we present visualization techniques for provenance information in relational databases. Our visualizations address every part of provenance information to facilitate user exploration. Through a user experiment, we show that our approach improves the accuracy and efficiency of performing exploration tasks. The next part addresses the challenge of the volume of provenance information, specifically in the case of aggregation queries. The volume increases with the size of the database and creates a "needle in a haystack" problem. We present novel summarization techniques that build on existing summarization literature. Our techniques work to support exploration for users who are not familiar with the data or its provenance. The final part shows our use of our summarization techniques to address the problem of refining aggregate queries. Aggregate queries pose a challenge in that they present ambiguous results to inexperienced users. Query refinement can help users recognize their query errors and fix them. Through a user experiment, we present evidence of the usefulness and usability of our methods. Overall, the goal of this thesis is to facilitate the use of provenance information in relational databases. Through the use of novel techniques and user-centric evaluation, we present novel solutions and user interaction methods to enable new applications in this domain.
View record
In many domains, users interact with data stored in large, and often structured, data sources. This thesis addresses three phases of user interaction: (1) data exploration, (2) query composition, and (3) query answer analysis. It provides methods to assist in each of these phases, though, of course, no single thesis could be broad enough to cover all possible user interaction in these phases. The first part of the thesis focuses on improving data exploration with recommender systems. Standard recommendation models are biased toward popular items in their suggestions. Our approach is to analyze past interaction logs to estimate user preference for exploration and novelty. We present a generic framework that increases the novelty of recommendations based on each user's novelty preference. The next part of the thesis examines ways of facilitating query composition. We study models that analyze past query logs to model and estimate query properties, such as answer size or error type. By predicting these properties prior to query execution, we can help the user tune and optimize their query. Empirical results show that the data-driven machine learning models can accurately perform several of the prediction tasks. The final part of this thesis studies methods for improving the analysis of large or conflicting query answers. This problem is common in integration contexts where data is segmented across several sources with overlapping and conflicting data values. Depending on which combination of sources and values is used, a simple query can have an overwhelming number of correct and conflicting answers. The approach presented is based on efficiently estimating a query answer distribution. Further, it offers a suite of methods for extracting statistics that convey meaningful information about the answer set. Overall, the solutions developed in this thesis aim to increase the efficiency and decision quality of users.
Empirical results on real-world datasets show that the proposed problems and solutions are important steps in the general direction of making information easily accessible to users.
View record
Data coordination is the problem of updating a contingent database C as a result of changes to a database B on which it depends. For example, a general contractor’s construction cost estimate (C) needs to be updated in response to changes made by an architect to a building design (B). Although these two databases are related in a very specific way, they contain information about fundamentally different types of objects: the cost estimate is composed of items which represent work results, and the building design is composed of physical objects. Motivated by scenarios such as design-cost coordination, we propose an approach to coordinating data between autonomous, heterogeneous sources which have a base-contingent relationship. We propose the use of declarative mappings to express exact relationships between the two. Using materialized views to maintain state, we give an overall approach for coordinating sets of updates from B to C through view differencing and view update translation. We adopt ideas from data exchange and incomplete information to generate the set of all possible updates which satisfy a mapping. We propose methods for assisting a user (C’s administrator) in choosing amongst the possible updates, and experimentally evaluate these methods, as well as the overall benefit of semiautomatic data coordination, in a usability study. We then discuss practical challenges in applying our general techniques to the domain of architecture, engineering and construction, by interviewing practitioners and analyzing data from two construction projects.
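As a rough illustration of the view differencing step mentioned in this abstract, the Python sketch below compares two snapshots of a materialized view and reports which rows were inserted and which were deleted. The function name `diff_view` and the sample rows are hypothetical and chosen only for illustration; this is not the thesis's implementation, and it treats a changed row as one deletion plus one insertion.

```python
def diff_view(old_rows, new_rows):
    """Return (inserted, deleted) rows between two snapshots of a view.

    Rows are hashable tuples. A modified row shows up as one deleted
    row (old version) and one inserted row (new version).
    """
    old_set, new_set = set(old_rows), set(new_rows)
    return new_set - old_set, old_set - new_set


# Hypothetical view over the building design B: (element_id, type, area_m2)
before = [(1, "wall", 20.0), (2, "door", 2.1)]
after = [(1, "wall", 25.0), (2, "door", 2.1), (3, "window", 1.5)]

inserted, deleted = diff_view(before, after)
print("inserted:", sorted(inserted))
print("deleted:", sorted(deleted))
```

In a coordination setting, the resulting delta would then be the input to update translation against the contingent database C; that step is not shown here.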
View record
A SEMantic Integration System (SemIS) allows a query over one database to be answered using the knowledge managed in multiple databases in the system. It does so by translating a query across the collaborative databases in which data is autonomously managed in heterogeneous schemas. In this thesis, we investigate the challenges that arise in enabling domain heterogeneous (DH) databases to collaborate in a SemIS. In such a setting, distributed databases modeled as independent data sources are pairwise mapped to form the semantic overlay network (SON) of the SemIS. We study two problems we believe are foremost to allow a SemIS to integrate DH data sources. The first problem tackled in this thesis is to efficiently organize data sources so that query answering is efficient despite the increased level of source heterogeneity. This problem is modeled as an “Acquaintance Selection” problem and our solution helps data sources to choose appropriate acquaintances to create schema mappings with and therefore allows a SemIS to have a single-layered and flexible SON. The second problem tackled in this thesis is to allow aggregate queries to be translated across domain heterogeneous (DH) data sources where objects are usually represented and managed at different granularity. We focus our study on relational databases and propose novel techniques that allow a (non-aggregate) query to be answered by aggregations over objects at a finer granularity. The new query answering framework, named “decomposition aggregation query (DAQ)” processing, integrates data sources holding information in different domains and different granularity. New challenges are identified and tackled in a systematic way. We studied query optimizations for DAQ to provide efficient and scalable query processing. The solutions for both problems are evaluated empirically using real-life data and synthetic data sets.
The empirical studies verified our theoretical claims and showed the feasibility, applicability (for real-life applications) and scalability of the techniques and solutions.
View record
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Metadata helps users understand the data contents in a table. Metadata tags can describe the contents in the table and allow a user to easily browse, search, and filter data. However, metadata is less useful when there is heterogeneity and incompleteness in a table. It is difficult to find all tables related to a given table by only examining the tags, because the user is typically looking for overlap of tags between two or more tables and there are no such overlaps in the heterogeneous metadata. We use Open Data tables in a case study and develop strategies to augment the tags in table metadata to increase the number of tag overlaps among the metadata of different tables. As an initialization step, we perform semantic enrichment of words in attributes of the table schema and in tags, and perform schema matching between attributes and tags of a table to create a semantic labeling, where an attribute is labeled with zero or more tags. We provide one base table, and search for tables using the semantic labeling we created to quickly find related tables. We integrate the table searching step and a schema matching step into an iterative framework, which incrementally adds tags to a table's metadata for all the tables related to the base table. The tags added to the metadata are discovered by semantic overlap during the schema matching step in the iterative framework, based on a composite score with evidence from multiple pairwise value comparison criteria. We evaluate two approaches using a gold standard we created, and compare the accuracy of the augmented tags and the runtime with the two baseline approaches. We show that the augmented tags have relatively high accuracy and that the runtime of our iterative approach is reasonable. We argue that an approach that creates approximate matching in a pay-as-you-go fashion has good precision and recall, and is the more realistic option in a real-world scenario.
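To make the notion of tag overlap concrete, the following Python sketch scores two tables' tag sets with the Jaccard coefficient and shows how adding one tag to the base table raises the score. Jaccard similarity is an assumption made here for illustration; the abstract does not say which overlap measure the thesis uses, and the tag values and the function name `tag_overlap` are hypothetical.

```python
def tag_overlap(tags_a, tags_b):
    """Jaccard overlap between two tag collections, ignoring case."""
    a = {t.lower() for t in tags_a}
    b = {t.lower() for t in tags_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical metadata tags for two Open Data tables.
base_tags = ["transit", "bus"]
other_tags = ["Bus", "routes", "schedule"]

score_before = tag_overlap(base_tags, other_tags)
# Augmenting the base table with one extra tag increases the overlap.
score_after = tag_overlap(base_tags + ["routes"], other_tags)
print(score_before, score_after)
```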
View record
Provenance refers to information about the origin of a piece of data and the process that led to its creation. Provenance information has been a focus of database research for quite some time. In this field, most of the focus has been on the sub-problem of finding the source data that contributed to the results of a query. More formally, the problem is defined as follows: given a query q and a tuple t in the results of q, which tuples from the relation R accessed by q caused t to appear in the results of q? The most studied aspect of this problem has been the development of models and semantics that allow this provenance information to be generated and queried. The motivations for studying provenance in databases vary across domains; provenance information is relevant to curated databases, data integration systems, and data warehouses for updating and maintaining views. In this thesis, I look extensively at provenance models as well as different system implementations. I compare the different approaches, analyze them, and point out the advantages and disadvantages of each approach. Based on my findings, I develop a provenance system based on the most attractive features of the previous systems, built on top of a relational database management system. My focus is on identifying areas that could potentially make provenance information easier to understand for users, using visualization techniques to extend the system with a provenance browsing component. I provide a case study using my provenance explorer, looking at a large dataset of financial data that comes from multiple sources. Provenance information helps with tracking the sources and transformations this data went through and explains them to the users in a way they can trust and reason about. There has not been much work focused on presenting and explaining provenance information to database users. Some of the current approaches support limited facilities for visualizing and reporting provenance information.
Other approaches simply rely on the user to query and explore the results via different data manipulation languages. My approach presents novel techniques for the user to interact with provenance information.
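The provenance question stated in this abstract (given a query q and a result tuple t, which tuples of R caused t to appear) can be sketched for a simple select-project query. The Python example below is a minimal illustration under assumed names: the relation `R`, the helper `query_with_lineage`, and the sample rows are all hypothetical, and the code is not the system described in the thesis.

```python
from collections import defaultdict


def query_with_lineage(rows, predicate, project):
    """Evaluate a select-project query over `rows` (row_id -> row dict).

    Returns a dict mapping each distinct result tuple to the set of
    source row ids that satisfied the predicate and projected onto it.
    """
    lineage = defaultdict(set)
    for row_id, row in rows.items():
        if predicate(row):
            lineage[project(row)].add(row_id)
    return dict(lineage)


# Hypothetical relation R, keyed by row id.
R = {
    1: {"name": "Ann", "dept": "CS", "salary": 90},
    2: {"name": "Bob", "dept": "CS", "salary": 70},
    3: {"name": "Cy", "dept": "EE", "salary": 95},
}

# q: SELECT DISTINCT dept FROM R WHERE salary > 60
result = query_with_lineage(
    R,
    predicate=lambda r: r["salary"] > 60,
    project=lambda r: (r["dept"],),
)
for t, sources in sorted(result.items()):
    print(t, "<-", sorted(sources))
```

Here the result tuple ("CS",) is traced back to rows 1 and 2, which is the kind of information a provenance browser would then need to present to the user.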
View record
XML is a markup language popularly used for data exchange across different applications. Its flexibility and simplicity have made it easy to use. However, this flexibility makes it difficult for large XML files to be easily comprehensible. Most XML files have complex schemas and these schemas differ across domains. In this work, we have taken a specific type of XML file: ifcXML. IfcXML files are domain specific XML files generated from building information models (BIM). The organization of ifcXML files is hard to follow; elements in the ifcXML file are identified through unique identifiers, which are used to connect one element to another. This results in long chains of connections. Currently there is no effective method of extracting and understanding these connections. The only way a user can see how one element is connected to another is by following the path of connections through the ifcXML file. We address this gap by introducing ifcXMLExplorer. IfcXMLExplorer is a visualization tool that enables users to better understand the different systems in a BIM model along with the connections within elements of the system by extracting necessary information from the ifcXML file.
View record
Cleaning data (i.e., making sure data contains no errors) can take a large part of a project’s lifetime and cost. As dirty data can be introduced into a system through user actions (e.g., accidental rewrite of a value or simply incorrect information), or through the process of data integration, datasets require a constant iterative process of collecting, transforming, storing, and cleaning. In fact, it has been estimated that 80% of a project’s development and cost is spent on data cleaning. The research we are undertaking seeks to improve this process for users who are using a centralized database. While expert users may be able to write a script or use a database to help manage, verify, and correct their data, non-computer experts often lack these skills and thus, trawling through a large dataset is no easy feat for them. Non-expert users may lack the skills to effectively find what they need and often may not even be able to efficiently find the starting point of their data exploration task. They may look at a piece of data and be unsure of whether or not this piece of data is worth trusting (i.e., how reliable and accurate is it?). This thesis focuses on a system that facilitates this data verification and update process to help minimize the amount of effort and time put in to help clean the data. Most of our effort concentrated on building this system and working on the details needed to make it work. The system has a small visualization component designed to help users determine the transformation process that a piece of data has gone through. We want to show users when a piece of data was created along with what changes users have made to it along the way. To evaluate this system, an accuracy test was run on the system to determine if it could successfully manage updates. A user study was run to evaluate the visualization portion of the system.
View record
Dealing with dirty data is an expensive and time consuming task. Estimates suggest that up to 80% of the total cost of large data projects is spent on data cleaning alone. This work is often done manually by domain experts in data applications, working with data copies and limited database access. We propose a new system of update propagation to manage data cleaning transformations in such data sharing scenarios. By spreading the changes made by one user to all users working with the same data, we hope to reduce repeated manual labour and improve overall data quality. We describe a modular system design, drawing from different research areas of data management, and highlight system requirements and challenges for implementation. Our goal is not to achieve full synchronization, but to propagate updates that individual users consider valuable to their operation.
View record
Integrating building design data (in the form of Building Information Models - BIMs) and 3D City Models is a promising solution to retrieving both detailed building information and exploring the relationships among building components and buildings in a city area. It could also help to facilitate decision making in building facility management and maintenance operations. Among other challenges, performing such an integration is difficult because BIMs and 3D city models vary in both semantic and geometric representations. Our first attempt was to try to convert the BIM into the city model, but the results were very disappointing. It is infeasible to convert every necessary component in the BIM into the city model, and the converted files cannot be visualized by the current city model applications. Next, we implemented a data integration system to incorporate information from both models. We used a novel approach that applies arithmetic expressions to express the overlapping information, not only for the semantic representation, but also for the geometric representation of the building components. This approach improved query answering on components from both models. Finally, we describe future challenges that must be addressed to improve the accuracy of our current approach.
View record
There is a huge body of domain-specific knowledge embedded in free-text repositories such as engineering documents, instruction manuals, medical references and legal files. Extracting ontological relationships (e.g., ISA and HASA) from this kind of corpus can improve users’ queries and improve navigation through the corpus, as well as benefiting applications built for these domains. Current methods to extract ontological relationships from text data usually fail to capture many meaningful relationships because they concentrate on single-word-terms or very short phrases. This is particularly problematic in a smaller corpus, where it is harder to find statistically meaningful relationships. We propose a novel pattern-based algorithm that finds ontological relationships between complex concepts by exploiting parsing information to extract concepts consisting of multi-word and nested phrases. Our procedure is iterative: we tailor the constrained sequential pattern mining framework to discover new patterns. We compare our algorithm with previous representative ontology extraction algorithms on four real data sets and achieve consistently and significantly better results.
View record
Organizations of all sizes collect a vast quantity of data every year. A data warehouse facilitates strategic multidimensional analysis on these data, by providing a single integrated queryable interface for multiple heterogeneous data sources. However, this interface is at the logical schema level, which fails to take advantage of the fundamental concepts of multidimensional data analysis such as facts, dimensions, and hierarchies. In this thesis, we discuss a conceptual modeling language from the Conceptual Integration Modeling framework that serves multidimensional data to the users at a much higher level of abstraction. We not only provide the formal semantics for the language, but we also supply conceptual models depicting real world data analysis scenarios. These models are backed up with rigorous mathematical definitions to dispel any ambiguity in their interpretation. We developed a fully functional graphical editor, called the Model manager, to enable users to draw conceptual models visually.
View record
The volume of disseminated digital spatial data has exploded, generating demand for tools to support interoperability and the extraction of usable knowledge. Previous work on spatial interoperability has focused on semi-automatically generating the mappings to mediate multi-modal spatial data. We present a case study in the Architecture, Engineering and Construction (AEC) domain that demonstrates that even after this level of semantic interoperability has been achieved, mappings from the integrated spatial data to concepts desired by the domain experts must be articulated. We propose the Semantic Spatial Interoperability Framework to provide the next layer of semantic interoperability: GML provides the syntactic glue for spatial and non-spatial data integration, and an ontology provides the semantic glue for domain-specific knowledge extraction. Mappings between the two are created by extending XQuery with spatial query predicates.
View record
Ontologies are a core building block of the emerging semantic web, and taxonomies, which contain class-subclass relationships between concepts, are a key component of ontologies. A taxonomy that relates the tags in a collaborative tagging system makes the collaborative tagging system's underlying structure easier to understand. Automatic construction of taxonomies from various data sources such as text data and collaborative tagging systems has been an interesting topic in the field of data mining. This thesis introduces a new algorithm for building a taxonomy of keywords from tags in collaborative tagging systems. This algorithm is also capable of detecting has-a relationships between tags. The proposed method, the TECTAS algorithm, uses association rule mining to detect is-a relationships between tags and can be used in an automatic or semi-automatic framework. The TECTAS algorithm is based on the hypothesis that users tend to assign both "child" and "parent" tags to a resource. The proposed method leverages association rule mining algorithms, bi-gram pruning using search engines, discovering relationships when pairs of tags have a common child, and lexico-syntactic patterns to detect meronyms. In addition to proposing the TECTAS algorithm, several experiments are reported using four real data sets: Del.icio.us, LibraryThing, CiteULike, and IMDb. Based on these experiments, the following topics are addressed in this thesis: (1) verifying the necessity of building domain specific taxonomies; (2) analyzing the tagging behavior of users in collaborative tagging systems; (3) verifying the effectiveness of our algorithm compared to previous approaches; and (4) using additional quality and richness metrics for the evaluation of automatically extracted taxonomies.
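The hypothesis that users tend to assign both "child" and "parent" tags to a resource suggests a simple association-rule reading: if nearly every resource tagged with X is also tagged with Y, then X is a candidate child of Y. The Python sketch below illustrates only that idea and is not the TECTAS algorithm itself. The thresholds `min_support` and `min_conf`, the extra requirement that the parent be the more frequent tag, the function name `candidate_is_a`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def candidate_is_a(resources, min_support=2, min_conf=0.8):
    """Return sorted (child, parent, confidence) candidates.

    `resources` is an iterable of tag collections, one per resource.
    A rule child -> parent is kept when the two tags co-occur on at
    least `min_support` resources, confidence P(parent | child) is at
    least `min_conf`, and the parent tag is more frequent overall
    (an assumed heuristic, since parents are expected to be broader).
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in resources:
        unique = set(tags)
        tag_count.update(unique)
        pair_count.update(permutations(unique, 2))

    rules = []
    for (child, parent), both in pair_count.items():
        conf = both / tag_count[child]
        if (both >= min_support and conf >= min_conf
                and tag_count[parent] > tag_count[child]):
            rules.append((child, parent, conf))
    return sorted(rules)


# Hypothetical tag assignments, one set per bookmarked resource.
resources = [
    {"python", "programming"},
    {"python", "programming", "tutorial"},
    {"java", "programming"},
    {"programming"},
    {"java", "programming"},
]
rules = candidate_is_a(resources)
for child, parent, conf in rules:
    print(f"{child} is-a {parent} (confidence {conf:.2f})")
```

Candidates produced this way would still need the kind of pruning and validation the abstract describes before being placed in a taxonomy.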
View record