Rachel Pottinger

 
Prospective Graduate Students / Postdocs

This faculty member is currently not looking for graduate students or Postdoctoral Fellows. Please do not contact the faculty member with any such requests.

Associate Professor

Research Classification

Computer Science and Statistics

Research Interests

databases
data management
data integration
metadata management

Relevant Degree Programs

 

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2019)
Data coordination (2013)

Data coordination is the problem of updating a contingent database C as a result of changes to a database B on which it depends. For example, a general contractor’s construction cost estimate (C) needs to be updated in response to changes made by an architect to a building design (B). Although these two databases are related in a very specific way, they contain information about fundamentally different types of objects: the cost estimate is composed of items which represent work results, and the building design is composed of physical objects. Motivated by scenarios such as design-cost coordination, we propose an approach to coordinating data between autonomous, heterogeneous sources which have a base-contingent relationship. We propose the use of declarative mappings to express exact relationships between the two. Using materialized views to maintain state, we give an overall approach for coordinating sets of updates from B to C through view differencing and view update translation. We adopt ideas from data exchange and incomplete information to generate the set of all possible updates which satisfy a mapping. We propose methods for assisting a user (C’s administrator) in choosing amongst the possible updates, and experimentally evaluate these methods, as well as the overall benefit of semiautomatic data coordination, in a usability study. We then discuss practical challenges in applying our general techniques to the domain of architecture, engineering and construction, by interviewing practitioners and analyzing data from two construction projects.
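The view-differencing step described above can be illustrated with a minimal sketch. This is not the thesis's actual implementation: the relations, the mapping, and all names are invented, and the "view" here is a toy aggregation from a building design B whose deltas would be translated into candidate updates on a cost estimate C.

```python
# Toy sketch of view differencing for data coordination.
# B: building-design rows (element, material). The materialized view V
# counts elements per material; changes to V must be translated into
# candidate updates on the cost estimate C. All names are made up.

def view(design_rows):
    """Hypothetical mapping: count design elements per material."""
    counts = {}
    for _element, material in design_rows:
        counts[material] = counts.get(material, 0) + 1
    return counts

before = [("wall-1", "concrete"), ("wall-2", "concrete"), ("door-1", "wood")]
after = [("wall-1", "concrete"), ("door-1", "wood"), ("door-2", "wood")]

v_old, v_new = view(before), view(after)

# View differencing: the deltas are the updates to push from B to C.
deltas = {m: v_new.get(m, 0) - v_old.get(m, 0)
          for m in set(v_old) | set(v_new)
          if v_new.get(m, 0) != v_old.get(m, 0)}
print(sorted(deltas.items()))  # [('concrete', -1), ('wood', 1)]
```

In the thesis the harder step follows this diff: translating each view delta into one of possibly many valid updates on C, with the user choosing among them.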


Supporting domain heterogeneous data sources for semantic integration (2011)

A SEMantic Integration System (SemIS) allows a query over one database to be answered using the knowledge managed in multiple databases in the system. It does so by translating a query across the collaborative databases, in which data is autonomously managed in heterogeneous schemas. In this thesis, we investigate the challenges that arise in enabling domain heterogeneous (DH) databases to collaborate in a SemIS. In such a setting, distributed databases modeled as independent data sources are pairwise mapped to form the semantic overlay network (SON) of the SemIS. We study two problems we believe are foremost in allowing a SemIS to integrate DH data sources.

The first problem tackled in this thesis is to organize data sources so that query answering is efficient despite the increased level of source heterogeneity. This problem is modeled as an "Acquaintance Selection" problem, and our solution helps data sources choose appropriate acquaintances to create schema mappings with, allowing a SemIS to have a single-layered and flexible SON.

The second problem tackled in this thesis is to allow aggregate queries to be translated across DH data sources, where objects are usually represented and managed at different granularities. We focus our study on relational databases and propose novel techniques that allow a (non-aggregate) query to be answered by aggregations over objects at a finer granularity. The new query answering framework, named "decomposition aggregation query" (DAQ) processing, integrates data sources holding information in different domains and at different granularities. New challenges are identified and tackled in a systematic way, and we study query optimizations to make DAQ processing efficient and scalable.

The solutions for both problems are evaluated empirically using real-life and synthetic data sets. The empirical studies verify our theoretical claims and show the feasibility, applicability (for real-life applications), and scalability of the techniques and solutions.
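The decomposition aggregation idea can be sketched in miniature. The schema and data below are invented for illustration (not the thesis's framework): a non-aggregate query over a coarse-grained source is answered by aggregating finer-grained objects from another source.

```python
# Sketch of a decomposition aggregation query (DAQ): a query about a
# coarse-grained object (a building's area) is answered by aggregating
# finer-grained objects (its rooms) held in another source.
# Schema and data are hypothetical.

rooms = [("b1", "r1", 30.0),   # (building, room, area)
         ("b1", "r2", 45.5),
         ("b2", "r3", 60.0)]

def building_area(building_id):
    """Answer a non-aggregate query on buildings via SUM over mapped rooms."""
    return sum(area for b, _room, area in rooms if b == building_id)

print(building_area("b1"))  # 75.5
```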


Master's Student Supervision (2010 - 2018)
A study of provenance in databases and improving the usability of provenance database systems (2016)

Provenance refers to information about the origin of a piece of data and the process that led to its creation. Provenance information has been a focus of database research for quite some time. Most of this work addresses the sub-problem of finding the source data that contributed to the results of a query. More formally, the problem is defined as follows: given a query q and a tuple t in the results of q, which tuples from the relation R accessed by q caused t to appear in the results of q? The most studied aspect of this problem has been developing models and semantics that allow this provenance information to be generated and queried. The motivations for studying provenance in databases vary across domains; provenance information is relevant to curated databases, data integration systems, and data warehouses for updating and maintaining views.

In this thesis, I look extensively at provenance models as well as different system implementations. I compare the different approaches, analyze them, and point out the advantages and disadvantages of each. Based on my findings, I develop a provenance system, built on top of a relational database management system, that combines the most attractive features of the previous systems. My focus is on identifying areas that could make provenance information easier for users to understand, using visualization techniques to extend the system with a provenance browsing component.

I provide a case study using my provenance explorer, looking at a large dataset of financial data that comes from multiple sources. Provenance information helps track the sources and transformations this data went through and explains them to users in a way they can trust and reason about. There has not been much work on presenting and explaining provenance information to database users: some current approaches support limited facilities for visualizing and reporting provenance information, while others simply rely on the user to query and explore the results via different data manipulation languages. My approach presents novel techniques for the user to interact with provenance information.
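The lineage question stated in the abstract (which tuples of R caused t to appear in the results of q?) can be illustrated with a generic why-provenance sketch for a select-project-join query. This is not the system built in the thesis; the relations and data are illustrative only.

```python
# Minimal why-provenance sketch for a select-project-join query:
# alongside each output tuple, record the source tuples that produced it.
# Schemas and data are invented for illustration.

emp = [(1, "Ana", 10), (2, "Bo", 20)]    # (id, name, dept)
dept = [(10, "Sales"), (20, "Eng")]      # (dept, dname)

provenance = {}  # output tuple -> set of contributing source tuples
for e in emp:
    for d in dept:
        if e[2] == d[0]:                 # join condition: emp.dept = dept.dept
            out = (e[1], d[1])           # projection: (name, dname)
            provenance.setdefault(out, set()).update({("emp", e), ("dept", d)})

# Why did ("Ana", "Sales") appear in the result?
print(sorted(provenance[("Ana", "Sales")]))
# [('dept', (10, 'Sales')), ('emp', (1, 'Ana', 10))]
```

Real provenance systems record this information declaratively (e.g., as annotations propagated through query operators) rather than by instrumenting a hand-written evaluator, but the question answered is the same.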


IfcXMLExplorer: a visualization tool for exploring and understanding ifcXML data (2016)

XML is a markup language popularly used for data exchange across different applications. Its flexibility and simplicity have made it easy to use. However, this flexibility makes large XML files difficult to comprehend. Most XML files have complex schemas, and these schemas differ across domains. In this work, we have taken a specific type of XML file: ifcXML. IfcXML files are domain-specific XML files generated from building information models (BIM). The organization of ifcXML files is hard to follow; elements in the ifcXML file are identified through unique identifiers, which are used to connect one element to another. This results in long chains of connections. Currently there is no effective method of extracting and understanding these connections: the only way a user can see how one element is connected to another is by following the path of connections through the ifcXML file. We address this gap by introducing ifcXMLExplorer, a visualization tool that enables users to better understand the different systems in a BIM model, along with the connections within elements of a system, by extracting the necessary information from the ifcXML file.
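The identifier chains described above can be followed programmatically. The sketch below uses the Python standard library XML parser on a drastically simplified stand-in document; the element and attribute names are hypothetical, not the actual ifcXML schema.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for ifcXML: elements carry an "id" and point to
# other elements via a "ref" attribute, forming chains of connections.
# Element and attribute names are invented for illustration.
doc = ET.fromstring("""
<model>
  <wall id="w1" ref="s1"/>
  <storey id="s1" ref="b1"/>
  <building id="b1"/>
</model>
""")

by_id = {el.get("id"): el for el in doc if el.get("id")}

def follow(start_id):
    """Walk the reference chain from one element to its root."""
    chain, el = [start_id], by_id[start_id]
    while el.get("ref"):
        el = by_id[el.get("ref")]
        chain.append(el.get("id"))
    return chain

print(follow("w1"))  # ['w1', 's1', 'b1']
```

Real ifcXML uses typed relationship entities rather than a single `ref` attribute, which is precisely why a dedicated visualization tool helps.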


Managing data updates and transformations: a study of the what and how (2016)

Cleaning data (i.e., making sure data contains no errors) can take a large part of a project’s lifetime and cost; it has been estimated that 80% of a project’s development time and cost is spent on data cleaning. Dirty data can be introduced into a system through user actions (e.g., an accidental rewrite of a value or simply incorrect information) or through the process of data integration, so datasets require a constant, iterative process of collecting, transforming, storing, and cleaning. The research we are undertaking seeks to improve this process for users of a centralized database. While expert users may be able to write a script or use a database to help manage, verify, and correct their data, non-expert users often lack these skills, and trawling through a large dataset is no easy feat for them. They may not be able to efficiently find the starting point of their data exploration task, and they may look at a piece of data and be unsure whether it is worth trusting (i.e., how reliable and accurate is it?). This thesis focuses on a system that facilitates this data verification and update process to minimize the effort and time needed to clean the data. Most of our effort concentrated on building this system and working out the details needed to make it work. The system has a small visualization component designed to help users determine the transformation process that a piece of data has gone through: we show users when a piece of data was created, along with the changes users have made to it along the way. To evaluate this system, an accuracy test was run to determine whether it could successfully manage updates, and a user study was run to evaluate the visualization portion of the system.


Managing updates and transformations in data sharing systems (2014)

Dealing with dirty data is an expensive and time-consuming task: estimates suggest that up to 80% of the total cost of large data projects is spent on data cleaning alone. This work is often done manually by domain experts in data applications, working with data copies and limited database access. We propose a new system of update propagation to manage data cleaning transformations in such data sharing scenarios. By spreading the changes made by one user to all users working with the same data, we hope to reduce repeated manual labour and improve overall data quality. We describe a modular system design, drawing from different research areas of data management, and highlight system requirements and challenges for implementation. Our goal is not full synchronization, but to propagate the updates that individual users consider valuable to their operation.


Integrating building-level and campus-level data (2013)

Integrating building design data (in the form of Building Information Models, or BIMs) and 3D city models is a promising way to retrieve detailed building information while exploring the relationships among building components and buildings in a city area. It can also help facilitate decision making in building facility management and maintenance operations. Among other challenges, performing such an integration is difficult because BIMs and 3D city models differ in both their semantic and geometric representations.

Our first attempt was to convert the BIM into the city model, but the results were very disappointing: it is infeasible to convert every necessary component of the BIM into the city model, and the converted files cannot be visualized by current city model applications. Next, we implemented a data integration system to incorporate information from both models. We used a novel approach that applies arithmetic expressions to express the overlapping information, not only in the semantic representation but also in the geometric representation of the building components. This approach improved query answering on components from both models. Finally, we describe the challenges that must be addressed to improve the accuracy of our current approach.


Efficient extraction of ontologies from domain specific text corpora (2012)

There is a huge body of domain-specific knowledge embedded in free-text repositories such as engineering documents, instruction manuals, medical references, and legal files. Extracting ontological relationships (e.g., ISA and HASA) from this kind of corpus can improve users’ queries and navigation through the corpus, as well as benefit applications built for these domains. Current methods for extracting ontological relationships from text usually fail to capture many meaningful relationships because they concentrate on single-word terms or very short phrases. This is particularly problematic in a smaller corpus, where it is harder to find statistically meaningful relationships. We propose a novel pattern-based algorithm that finds ontological relationships between complex concepts by exploiting parsing information to extract concepts consisting of multi-word and nested phrases. Our procedure is iterative: we tailor the constrained sequential pattern mining framework to discover new patterns. We compare our algorithm with previous representative ontology extraction algorithms on four real data sets and achieve consistently and significantly better results.
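Pattern-based ISA extraction of the kind described above is often explained through the classic Hearst-style lexico-syntactic patterns. The sketch below shows only that flavor on toy text with a single hand-written regex; the thesis's algorithm instead *mines* its patterns and handles multi-word nested phrases via parsing, so this is an illustration of the general idea, not the proposed method.

```python
import re

# Classic lexico-syntactic ("Hearst-style") pattern for ISA extraction:
# "NP such as NP1, NP2 and NP3" suggests NP1/NP2/NP3 ISA NP.
# One hand-written pattern on invented text, for illustration only.

text = ("Engineering documents such as instruction manuals, "
        "design specifications and legal files contain domain knowledge.")

pattern = re.compile(r"(\w+ \w+) such as ((?:\w+ \w+(?:, | and )?)+)")
match = pattern.search(text)
hypernym = match.group(1)
hyponyms = re.split(r", | and ", match.group(2))
for h in hyponyms:
    print(f"ISA({h}, {hypernym})")
```

Note how even this toy pattern must match two-word phrases ("instruction manuals") to find anything useful, which is the limitation of single-word-term methods that the thesis targets.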


The model manager: a multidimensional conceptual modeling tool in the CIM framework (2011)

Organizations of all sizes collect a vast quantity of data every year. A data warehouse facilitates strategic multidimensional analysis of these data by providing a single, integrated, queryable interface over multiple heterogeneous data sources. However, this interface is at the logical schema level, which fails to take advantage of the fundamental concepts of multidimensional data analysis, such as facts, dimensions, and hierarchies. In this thesis, we discuss a conceptual modeling language from the Conceptual Integration Modeling framework that serves multidimensional data to users at a much higher level of abstraction. We not only provide the formal semantics for the language, but also supply conceptual models depicting real-world data analysis scenarios. These models are backed by rigorous mathematical definitions to dispel any ambiguity in their interpretation. We developed a fully functional graphical editor, called the Model Manager, to enable users to draw conceptual models visually.


Semantic spatial interoperability framework: a case study in the architecture, engineering and construction (AEC) domain (2010)

The volume of disseminated digital spatial data has exploded, generating demand for tools to support interoperability and the extraction of usable knowledge. Previous work on spatial interoperability has focused on semi-automatically generating the mappings to mediate multi-modal spatial data. We present a case study in the Architecture, Engineering and Construction (AEC) domain that demonstrates that even after this level of semantic interoperability has been achieved, mappings from the integrated spatial data to concepts desired by the domain experts must be articulated. We propose the Semantic Spatial Interoperability Framework to provide the next layer of semantic interoperability: GML provides the syntactic glue for spatial and non-spatial data integration, and an ontology provides the semantic glue for domain-specific knowledge extraction. Mappings between the two are created by extending XQuery with spatial query predicates.


TECTAS: bridging the gap between collaborative tagging systems and structured data (2010)

Ontologies are a core building block of the emerging semantic web, and taxonomies, which contain class-subclass relationships between concepts, are a key component of ontologies. A taxonomy that relates the tags in a collaborative tagging system makes the system's underlying structure easier to understand. Automatic construction of taxonomies from data sources such as text and collaborative tagging systems has been an active topic in data mining.

This thesis introduces a new algorithm, TECTAS, for building a taxonomy of keywords from the tags in collaborative tagging systems. TECTAS uses association rule mining to detect is-a relationships between tags, is also capable of detecting has-a relationships, and can be used in an automatic or semi-automatic framework. The algorithm is based on the hypothesis that users tend to assign both "child" and "parent" tags to a resource. It leverages association rule mining, bi-gram pruning using search engines, the discovery of relationships when pairs of tags have a common child, and lexico-syntactic patterns to detect meronyms.

In addition to proposing the TECTAS algorithm, this thesis reports several experiments using four real data sets: Del.icio.us, LibraryThing, CiteULike, and IMDb. Based on these experiments, the thesis addresses the following topics: (1) verifying the necessity of building domain-specific taxonomies; (2) analyzing the tagging behavior of users in collaborative tagging systems; (3) verifying the effectiveness of our algorithm compared to previous approaches; and (4) using additional quality and richness metrics to evaluate automatically extracted taxonomies.
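The child/parent co-tagging hypothesis behind TECTAS can be captured with association-rule confidence. The sketch below is a simplified illustration with invented tagging data; the actual algorithm adds pruning steps (e.g., bi-gram checks) and meronym detection that are omitted here.

```python
# Sketch of the co-tagging intuition behind is-a detection: if tag A
# frequently co-occurs with tag B but not vice versa, the asymmetry
# hints that A is-a B. Tagging data and interpretation are invented.

taggings = [  # each set: the tags one user assigned to one resource
    {"jazz", "music"}, {"jazz", "music", "saxophone"},
    {"rock", "music"}, {"music"}, {"jazz"},
]

def confidence(a, b):
    """Estimate P(b is assigned | a is assigned) from the tagging sets."""
    with_a = [tags for tags in taggings if a in tags]
    return sum(b in tags for tags in with_a) / len(with_a)

# Asymmetric confidence suggests the direction of the is-a edge.
print(round(confidence("jazz", "music"), 2))  # 0.67: jazz usually co-tagged music
print(round(confidence("music", "jazz"), 2))  # 0.5: the reverse is weaker
```

In an association-rule framing these are the confidences of the rules jazz ⇒ music and music ⇒ jazz; thresholding them is what proposes "jazz is-a music" rather than the reverse.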


