Relevant Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Complete these steps before you reach out to a faculty member!
- Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
- Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.
- Identify specific faculty members who are conducting research in your specific area of interest.
- Establish that your research interests align with the faculty member’s research interests.
- Read up on the faculty members in the program and the research being conducted in the department.
- Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.
- Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
- Do not send non-specific, mass emails to everyone in the department hoping for a match.
- Address the faculty members by name. Your contact should be genuine rather than generic.
- Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Ensure to craft compelling answers to these questions.
- Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
- Demonstrate that you are familiar with their research:
- Convey the specific ways you are a good fit for the program.
- Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
- Be enthusiastic, but don’t overdo it.
G+PS regularly provides virtual sessions that focus on admission requirements and procedures and tips how to improve your application.
Graduate Student Supervision
Doctoral Student Supervision
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
We study the problem of decision-making under uncertainty in the bandit setting. This thesis goes beyond the well-studied multi-armed bandit model to consider structured bandit settings and their applications. In particular, we learn to make better decisions by leveraging the application-specific problem-structure in the form of features or graph information. We investigate the use of structured bandits in two practical applications: online recommender systems with an available network of users and viral marketing in social networks. For each of these applications, we design efficient bandit algorithms and theoretically characterize their performance. We experimentally evaluate the efficiency and effectiveness of these algorithms on real-world datasets. For applications that require modelling complex non-linear feature-reward relationships, we propose a bootstrapping approach and establish theoretical regret bounds for it. Furthermore, we consider the application of multi-class classification with bandit feedback as a test-bed for evaluating our bootstrapping approach.
As social websites like Twitter greatly influence people's digital life,unstructured social streams become prevalent, which are fast surging textualpost streams without formal structure or schema between posts or inside the postcontent. Modeling and miningunstructured social streams in Twitter become a challenging and fundamentalproblem in social web analysis, which leads to numerous applications, e.g., recommending social feeds like "what's happening right now?" or "what arerelated stories?".Current social stream analysis in response to queriesmerely return an overwhelming list of posts, with little aggregation or semantics. The design ofthe next generation social stream mining algorithms faces various challenges,especially, the effective organization of meaningful information from noisy, unstructured,and streaming social content.The goal of this dissertation is to address the most critical challenges insocial stream mining using graph-based techniques.We model a social stream as a post network, and use "event" and"story" to capture a group of aggregated social posts presenting similarcontent in different granularities, where an event may containa series of stories.We highlight our contributions on social stream mining from a structuralperspective as follows. We first model a story as a quasi-clique, which iscohesion-persistent regardless of the story size, and propose two solutions, DIMand SUM, to search the largest story containing given query posts, bydeterministic and stochastic means, respectively. To detect all stories in the time window of asocial stream and support the context-aware story-telling, we propose CAST,which defines a story as a (k,d)-Core in post network and tracks therelatedness between stories.We propose Incremental Cluster Evolution Tracking (ICET),which is an incremental computation framework for event evolution onevolving post networks, with the ability to track evolution patterns of socialevents as time rolls on. Approaches in this dissertation are based on twohypotheses: users prefer correlated posts to individual posts in poststream modeling, and a structural approach is better thanfrequency/LDA-based approaches in event and story modeling. We verify these hypotheses bycrowdsourcing based user studies.
Social influence is a ubiquitous phenomenon in human life. Fueled by the extreme popularity of online social networks and social media, computational social influence has emerged as a subfield of data mining whose goal is to analyze and understand social influence using computational frameworks such as theoretical modeling and algorithm design. It also entails substantial application potentials for viral marketing, recommender systems, social media analysis, etc. In this dissertation, we present our research achievements that take significant steps toward bridging the gap between elegant theories in computational social influence and the needs of two real-world applications: viral marketing and recommender systems. In Chapter 2, we extend the classic Linear Thresholds model to incorporate price and valuation to model the diffusion process of new product adoption; we design a greedy-style algorithm that finds influential users from a social network as well as their corresponding personalized discounts to maximize the expected total profit of the advertiser. In Chapter 3, we propose a novel business model for online social network companies to sell viral marketing as a service to competing advertisers, for which we tackle two optimization problems: maximizing total influence spread of all advertisers and allocating seeds to advertisers in a fair manner. In Chapter 4, we design a highly expressive diffusion model that can capture arbitrary relationship between two propagating entities to arbitrary degrees. We then study the influence maximization problem in a novel setting consisting of two complementary entities and design efficient approximation algorithms. Next, in Chapter 5, we apply social influence into recommender systems. We model the dynamics of user interest evolution using social influence, as well as attraction and aversion effects. As a result, making effective recommendations are substantially more challenging and we apply semi-definite programming techniques to achieve near-optimal solutions. Chapter 6 concludes the dissertation and outlines possible future research directions.
Classical recommender systems provide users with a list of recommendations where each recommendation consists of a single item, e.g., a book or DVD. However, many applications can benefit from a system which is capable of recommending packages of items. Sample applications include travel planning, e-commerce, and course recommendation. In these contexts, there is a need for a system that can recommend the most relevant packages for the user to choose from.In this thesis we highlight our research achievements for the composite recommendation problem. We first consider the problem of composite recommendation under hard constraint, e.g., budget. It is clear that this is a very common paradigm for the composite recommendation problem. In Chapter 3, we first discuss how given a fixed package schema, we can efficiently find the top-k most relevant packages with hard constraints. The proposed algorithm is shown to be instance optimal, which means that no algorithm in a reasonable class can perform more than a constant times better, for some fixed constant. And we also propose relaxed solutions based on probabilistic reasoning. In Chapter 4, we lift the constraint on the package schema, and discuss how efficient algorithms can be derived to solve the more general problem with a flexible package schema. For this problem, again we propose both instance optimal algorithm and heuristics-based solution which have been verified to be effective and efficient through our extensive empirical study. Then in Chapter 5, motivated by the fact that hard constraints sometimes might lead to unfavorable results, and following the recent paradigm on “softening” the constraints, we study the problem of how to handle top-k query processing with soft constraints. Finally, in Chapter 6, we discuss a general performance tuning solution based on cached views which can be leveraged to further optimize the various algorithms proposed in this thesis.
Social influence occurs when one's actions are affected by others. If leveraged carefully, social influence can be exploited in many applications like viral marketing (or targeted advertising in general), recommender systems, social network analysis, events detection, experts finding, link prediction, ranking of feeds etc. One of the fundamental problems in this fascinating field is the problem of influence maximization, primarily motivated by the application of viral marketing. The objective is to identify a small set of users in a social network, who when convinced to adopt a product will influence others in the network leading to a large number of adoptions. The vision of our work is to take the algorithmic and data mining aspects of viral marketing out of the lab. We organize ours goals and contributions into four categories: (i) With the ever-increasing scale of online social networks, it is extremely important to develop efficient algorithms for influence maximization. We propose two algorithms -- CELF++ and SIMPATH that significantly improve the scalability. (ii) We remark that previous studies often make unrealistic assumptions and rely on simulations, instead of validating models against real world data. For instance, they assume an arbitrary assignment of influence probabilities in their studies, which focused more on algorithms than on validity with respect to real data. We attack the problem of learning influence probabilities. In another work, we propose a novel data driven approach to influence models and show that it predicts influence diffusion with much better accuracy. (iii) Next, we propose alternative problem formulations -- MINTSS and MINTIME and show interesting theoretical results. These problem formulations capture the problem of deploying viral campaigns on budget and time constraints. In an additional work, we take a fresh perspective on identifying community leaders using a pattern mining approach. (iv) Finally, we examine applications of social influence. We begin with the application of viral marketing. We show that product adoption is not exactly influence. Given this, we develop a product adoption model and study the problem of maximizing product adoption. Furthermore, we propose and investigate a novel problem in recommender systems, for targeted advertising -- RECMAX.
Privacy preservation is a key issue in outsourcing of data mining.When we seek approaches to protect the sensitive information containedin the original data, it is also important to preserve the mining outcome.We study the problem of privacy preservation in outsourcing of classifications, including decision tree classification, support vector machine (SVM), and linear classifications. We investigate the possibility of guaranteeing no-outcome-change (NOC) and consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach using two central ideas of breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for pieces satisfy the global (anti-)monotonicity. We empirically show that the proposed piecewise transformation approach can deliver a secured level of privacy and reduce disclosure risk substantially.We then propose two transformation approaches, (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation, for outsourcing SVM. We show that POT always guarantees no-outcome-change for both linear and non-linear SVM. The TNP approach gives the same guaranteewhen the data set is linearly separable. For linearly non-separable data sets, we show that no-outcome-change is not always possible and proposea variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results showthat the two approaches are effective to counter powerful attack models.In the last part, we extend the POT approach to linear classification models and propose to combine POT and random perturbation. We conduct a detailed set of experiments and show that the proposed combination approach could reduce the change on the mining outcome while still providing high level of protection on privacy by adding less noise.We further investigate the POT approach and propose a heuristic to break down the correlations between the original values and the corresponding transformed values of subsets. We show that the proposed approach could significantly improve the protection level on privacy in the worst cases.
Master's Student Supervision
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Geospatial databases are gaining tremendous popularity, driven by the need to have digital maps to support various individual and industrial use cases like navigation, logistics, fleet management, ride-sharing, and food delivery. Clearly, maps need to be accurate and up-to-date. While most geospatial databases are proprietary, OpenStreetMap (OSM) is a collaborative good quality crowd-sourced database. The quality of OSM is comparable with many private datasets: OSM is currently being used by many companies worldwide, including Facebook, Apple, Amazon and Microsoft. The contributors volunteer to "draw" the missing spatial features on OSM. Despite continuous meticulous work for more than 15 years now, OSM data still lacks good coverage in many parts of the world. Extending its coverage is a laborious and time-consuming task. There is a pressing need for an automated approach to conflate the missing features from other spatial databases and satellite images. We address road network conflation between two vector geospatial databases and propose MAYUR, solving the major subtasks -- map matching and map merging. We represent geospatial databases as a graph of road intersections (vertices) and lines (edges) with spatial attributes. We propose a novel map matching framework by adapting the Rank Join in databases, where each edge of the reference database graph is a relation. Our algorithm finds the target database graph that best matches the reference database, respecting the connectivity of the road network. Classic Rank Join in databases has been tested on very few relations, and it gets quickly inefficient on instances with many relations. We introduce several optimizations that boost the algorithm's efficiency, making it scalable for our problem setting featuring hundreds to thousands of relations. We establish the node mappings between the two databases using the mapping obtained from the matching step. Our map merging uses rubbersheeting to merge the missing features. We demonstrate the performance of our map conflation approach on sidewalks in OSM and Boston Open Data. Manual evaluation shows that our map matching achieves an impressive 98.65% precision and 99.55% recall, with our map merging causing very small perturbation in the merged features, outperforming Hootenanny, the leading map conflation approach.
The proliferation of financial news sources reporting on companies, markets, currencies and stocks presents an opportunity for strategic decision making by mining data with the goal of extracting structured representations about financial entities and their inter-relations. These representations can be conveniently stored as (subject, predicate, object) triples in a knowledge graph that can be used drive new in-sights through answering complex queries using high level declarative languages.Towards this goal, we develop a high precision knowledge extraction pipeline tailored for the financial domain. This pipeline combines multiple information ex-traction techniques with a financial dictionary that we built, all working together to produce over 342,000 compact extractions from over 288,000 financial news articles, with a precision of 78% at the top-100 extractions. These extractions are stored in a knowledge graph readily available for use in downstream applications. Our pipeline outperforms existing work in terms of precision, the total number of extractions and the coverage of financial predicates.
Multidimensional data is prevalent in data warehousing and OLAP. Changes to the data in a data warehouse are of particular interest to analysts as well as knowledge workers since they may be instrumental in understanding trends in a business or enterprise. At the same time, an explicit enumeration of all changes to the fact tuples in a data warehouse is too verbose to be useful; instead, a summary of the changes is more desirable since it allows the user to quickly understand the trends and patterns. In this thesis, we study the problem of summarizing changes in hierarchical multidimensional data with non-overlapping but containable hyperrectangles. An advantage of such summaries is that they naturally follow a tree structure and therefore are arguably easy to interpret. Hierarchies are naturally present in data warehouses and our constructed summaries are designed to leverage the hierarchical structure along each dimension to provide concise summaries of the changes.We study the problem of generating lossless as well as lossy summaries of changes. While lossless summaries allow the exact changes to be recovered, lossy summaries trade accuracy of reconstruction for conciseness. In a lossy summary, the maximum amount of lossiness per tuple can be regulated with a single parameter α. We provide a detailed analysis of the algorithm, then we empirically evaluate its performance, compare it to existing alternative methods under various settings and demonstrate with a detailed set of experiments on real and synthetic data that our algorithm outperforms the baselines in terms of conciseness of summaries or accuracy of reconstruction w.r.t. the size, dimensionality and level of correlation in data.
For tackling the well known cold-start user problem in collaborative filtering recommender systems, one approach is to recommend a few items to a cold-start user and use the feedback to learn her preferences. This can then be used to make good recommendations to the cold user. In the absence of a good initial estimate of the preferences, the recommendations are like random probes. If the items are not chosen judiciously, both bad recommendations and too many recommendations may turn off a user. We study the cold-start user problem for the two main types of collaborative filtering methods -- neighbourhood-based and matrix factorization. We formalize the problem by asking what are the b (typically a small number) items we should recommend to a cold-start user, in order to learn her preferences best, and define what it means to do that under each framework. We cast the problem as a discrete optimization problem, called the optimal interview design (OID) problem, and study two variants -- OID-NB and OID-MF -- for the two frameworks. We present multiple non-trivial results, including NP-hardness as well as hardness of approximation for both. We further study supermodularity/submodularity and monotonicity properties for the objective functions of the two variants. Finally, we propose efficient algorithms and comprehensively evaluate them. For OID-NB, we propose a greedy algorithm, and experimentally evaluate it on 2 real datasets, where it outperforms all the baselines. For OID-MF, we discuss several scalable heuristic approaches for identifying the b best items to recommend to the user and experimentally evaluate their performance on 4 real datasets. Our experiments show that our proposed accelerated algorithms significantly outperform the prior art, while obtaining similar or lower error in the learned user profile as well as in the rating predictions made, on all the datasets tested.
There is often a need to recover the “missing” query that produced a particular outputfrom a data stream. As an example, since a data stream is constantly evolving,the analyst may be curious about using the query from the past to evaluate it on thecurrent state of the data stream, for further analysis. Previous research has studiedthe problem of reverse engineering a query that would produce a given result at aparticular database state.We study the following problem. Given a streaming database D=
The objective of viral marketing is to leverage a social network to spread awareness about a specific product in the market through information propagation via word-of-mouth. A closely related problem is that of influence maximization which aims to identify the `best' set of `influential' users (seeds) to give discounts or free products to, such that awareness about the product is maximized. We study two relatively unexplored variants of influence maximization (IM) in social networks.Conventional IM algorithms assume the structure of the social network and edge weights to be known. Though the structure of the network is usually known, edge weights need to be estimated from past data. In the first part of this thesis, we tackle the real but difficult problems of (i) learning these edge weights online and (ii) maximizing influence in the network when no past data is available as input. We adopt a combinatorial multi-armed bandit (CMAB) paradigm and formulate the above problems respectively as (i) network exploration, i.e. incrementally minimizing the error in learned edge weights and (ii) regret minimization i.e. minimizing the loss in influence from choosing suboptimal seed sets over multiple attempts. Most previous work on influence maximization in social networks is limited to the non-adaptive setting in which the marketer is supposed to select all of the seed users up front. A disadvantage of this setting is that she is forced to select all the seeds based solely on a diffusion model. If the selected seeds do not perform well, there is no opportunity to course-correct. A more practical setting is the adaptive setting in which the marketer initially selects a batch of users and observes how seeding those users leads to a diffusion of product adoptions. Based on this market feedback, she formulates a policy for choosing the remaining seeds. We study adaptive offline strategies for two problems: (a) MAXSPREAD - given a budget on number of seeds and a time horizon, maximize the spread of influence and (b) MINTSS - given a time horizon and an expected number of target users, minimize the number of required seeds.
Online platforms, such as Meetup and Plancast, have recently become popular for planning gatherings and event organization. However, there is a surprising lack of studies on how to effectively and efficiently organize social events for a large group of people through those platforms. This thesis provides the first systematic study on key computational problem involved in organization of social events. We understand the Social Event Organization problem as assigning a set of events for a group of users to attend, where the users are socially connected with each other and have innate levels of interest in those events. We then introduce a set of formal definition of a restricted version of the problem and show that it is NP-hard and is hard to approximate. We propose efficient heuristic algorithms that improve upon simple greedy algorithms by incorporating the notion of phantom events and by using look-ahead estimation. Using synthetic datasets and three real datasets including those from the platforms Meetup and Plancast, we experimentally demonstrate that our greedy heuristics are scalable and furthermore outperform the baseline algorithms significantly in terms of achieving superior social welfare.
Nowadays, more and more websites are providing users with the functionality to create item lists. For example, in Yelp, users can create restaurant lists he/she likes. In Goodreads, users can create booklists consisting of books they like. These user-generated item lists complement the main functionality of the corresponding application and intuitively become an alternative way for users to browse and discover interesting items. Existing recommender systems are not designed for recommending user-generated item lists directly. In this work, we study properties of these user-generated item lists and propose two different models for recommending user-generated lists. We first model the problem as a recommendation problem and propose a Bayesian ranking model, called Lire. The proposed model takes into consideration users' previous interactions with both item lists and with individual items. Furthermore, we propose in Lire a novel way of weighting items within item lists based on both positions of items, and personalized list consumption pattern. Through extensive experiments on a real item list dataset from Goodreads, we demonstrate the effectiveness of our proposed Lire model. We then summarize the lessons learned from the recommendation model and modelled the problem as an optimization problem. We show that the list recommendation optimization can be reduced to Maximum k-Coverage Problem, which is an NP-hard problem. We derive three different problem variants and propose algorithms for each of them. We then conduct experiments to compare their results. Lessons learned and possible application scenarios are also discussed in the hope of helping us and more people improve the work. Furthermore, we propose several potential directions for future work.
Traditional search focuses on keyword matching and document ranking, thus users will only get a overload of related news articles or videos, with little semantics aggregation, when they input keywords among which they are interested in exploring potential connections. To capture these connections, one good way is to model news articles into a complex network of events, where an event itself is a complex network of interrelated actions (or assertions). Event search enables users to get a big picture of essential actions involved without going through overwhelming documents. Further on, Query-driven event search, in which users can access key actions that occurred in various events with respect to input query and construct a new event capturing corresponding actions and connections between them, is able to inherently revolutionize search from its traditional keyword matching to a higher level of abstraction. Thus, we propose a natural and useful paradigm, Query- driven event search in News Information Network, to allow users to search through news articles and obtain events as answers. We construct the news information network with nodes representing actions and edges indicating relatedness between nodes. We define a good answer as a structurally densely linked and semantically related subgraph, describing a concise and informative event. Thus our objective function combines both the linkage structure of graph and semantic content relatedness. We then formally define our problem to build a query-driven event search engine that processes users query and generates a subgraph satisfying following conditions: 1) covering all keywords but without noisy nodes 2) maximizes the objective function. We prove this problem is NP-complete and propose heuristics algorithms. Our experimental evaluation on real-life news datasets demonstrates algorithms efficiency and meaningful solutions we obtain.
Cricket is a popular sport played in over a hundred countries, is the second most watched sport in the world after soccer, and enjoys a multi-million dollar industry. There is tremendous interest in simulating cricket and more importantly in predicting the outcome of games, particularly in their one-day international format. The complex rules governing the game, along with the numerous natural phenomena affecting the outcome of a cricket match present significant challenges for accurate prediction. Multiple diverse parameters, including but not limited to cricketing skills and performances, match venues and even weather conditions can significantly affect the outcome of a game. The sheer number of parameters, along withtheir interdependence and variance create a non-trivial challenge to create an accurate quantitative model of a game. Unlike other sports such as basketball and baseball which are well researched from a sports analytics perspective, for cricket, these tasks have yet to be investigated in depth. The goal of this work is to predict the game progression and winner of a yet-to-begin or an ongoing game. The game is modeled using a subset of match parameters, using a combination of linearregression and nearest-neighbor classification-aided attribute bagging algorithm. The prediction system takes in historical match data as well as the instantaneous state of a match, and predicts the score at key points in the future, culminatingin a prediction of victory or loss. Runs scored at the end of an innings, the key factor in determining the winner, are predicted at various points in the game. Our experiments based on actual cricket game data, shows that our method predicts the winner with an accuracy of approximately 70%.
As users increasingly rely on collaborative rating sites to achieve mundanetasks such as purchasing a product or renting a movie, they are facing thedata deluge of ratings and reviews. Traditionally, the exploration of rateddata sets has been enabled by rating averages that allow user-centric, itemcentricand top-k exploration. More speci cally, canned queries on userdemographics aggregate opinion for an item or a collection of items such as18-29 year old males in CA rated the movie The Social Network at 8:2 onaverage. Combining ratings, demographics, and item attributes is a powerfulexploration mechanism that allows operations such as comparing theopinion of the same users for two items, comparing two groups of users ontheir opinion for a given class of items, and nding a group whose ratingdistribution is nearly unanimous for an item. To enable those operations,it is necessary to (i) adopt the right measure to compare ratings, and to(ii) develop e cient algorithms to nd relevant
Recommender systems exploit user feedback over items they have experienced formaking recommendations of other items that are most likely to appeal to them.However, users and items are but two of the three types of entities participating inthis ecosystem of recommender systems. The third type of entities are the manufacturersof the products, and users are really their customers. Traditional recommendersystems research ignores the role of this third entity type and exclusivelyfocuses on the other two. What might item producers bring to recommender systemsresearch? Their objectives are related to their business and are captured byquestions such as “what kind of (new) products should I manufacture that willmaximize their popularity?” These questions are not asked in a vacuum: manufacturershave constraints, e.g., a budget. The idea is that the user feedback data (e.g.,ratings) capture users’ preferences. The question is whether we can learn enoughintelligence from it, so as to recommend new products to manufacturers that willhelp meet their business objectives.We propose the novel problem of new product recommendation for manufacturers.We collect real data by crawling popular e-commerce websites, and modelcost and popularity as a function of product attributes and their values. We incorporatecost constraints into our problem formulation: the cost of the new productsshould fall within the desired range while maximizing the popularity. We showthat the above problem is NP-hard and develop a pseudo-polynomial time algorithmfor the recommendations generation. Finally, we conduct a comprehensiveexperimental analysis where we compare our algorithm with several natural heuristicson three real data sets and perform scalability experiments on a synthetic dataset.
Keyword based search interfaces are extremely popular as a means for efficiently discovering items of interest from a huge collection, as evidenced by the success of search engines like Google and Bing. However, most of the current search services still return results as a flat ranked list of items. Considering the huge number of items which can match a query, this list based interface can be very difficult for the user to explore and find important items relevant to their search needs. In this work, we consider a search scenario in which each item is annotated with a set of keywords. E.g., in Web 2.0 enabled systems such as flickr and del.icio.us, it is common for users to tag items with keywords. Based on this annotation information, we can automatically group query result items into different expansions of the query corresponding to subsets of keywords. We formulate and motivate this problem within a top-k query processing framework, but as that of finding the top-k most important expansions. Then we study additional desirable properties for the set of expansions returned, and formulate the problem as an optimization problem of finding the best k expansions satisfying all the desirable properties. We propose several efficient algorithms for this problem. Our problem is similar in spirit to recent works on automatic facets generation, but has the important difference and advantage that we don’t need to assume the existence of pre-defined categorical hierarchy which is critical for these works. Through extensive experiments on both real and synthetic datasets, we show our proposed algorithms are both effective and efficient.
Many bioinformatics studies require combined use of data sets and software developed by different research labs. At the current time, accomplishing such studies requires the development of custom scripts that act as “glue” for the independent resources, performing transformations on the data sets that will allow them to be loaded into a single database and/or shuttled through different pieces of software. Due to the tedium and inefficiency of manual data/software integration, many institutions and research groups have sought to find a more reliable and automatic approach. The most significant integration project in recent years has been the Semantic Web activity of the World Wide Web Consortium (W3C), which aims to automate data integration not only in bioinformatics, but on the WWW as a whole. The goal of the Semantic Web is to interlink data on the web in a manner that is similar to the way that HTML pages are linked, while at the same time making the data available in a universal form that can be easily processed by software. In this thesis, the author describes a distributed query system called SHARE (Semantic Health and Research Environment) which demonstrates how the available standards and tools of the Semantic Web can be assembled into a framework for automating data and software integration in bioinformatics. We find that while SHARE has a similar architecture to existing query systems, the use of Semantic Web technologies has important advantages for the implementation, maintenance, and addition of new data sources to the system. After reviewing the mechanics of SHARE, we examine the crucial problem of optimizing queries in an environment where statistics about the data sources are typically not available. A query evaluation procedure called GREEDY is presented that addresses this challenge by: i) interleaving the planning and execution phases of a query, and ii) learning statistics from the execution of previous queries. We conclude by highlighting the unique strengths of SHARE and GREEDY in relation to other integration systems, and review important areas for future work.
Social relatedness measures such as the Katz score and the commute time betweenpairs of nodes have been subject of significant research effort motivated by socialnetwork problems including link prediction, anomalous link detection, and collaborativefiltering.In this thesis, we are interested in computing: (1) the score for a given pair ofnodes, and (2) the top-k nodes with the highest scores from a specific source node.Unlike most traditional approaches, ours scale to large networks with hundreds ofthousands of nodes.We introduce an efficient iterative algorithm which calculates upper and lowerbounds for the pairwise measures. For the top-k problem, we propose an algorithmthat only has access to a small subset of nodes. Our approaches rely on techniquesdeveloped in numerical linear algebra and personalized PageRank computing. Usingthree real-world networks, we examine scalability and accuracy of our algorithmsas in a short time as milliseconds to seconds.We also hypothesize that incorporating item based tags into a recommendersystem will improve its performance. We model such a system as a tri-partite graphof users, items and tags and use this graph to define a scoring function making useof graph-based proximity measures.Exactly calculating the item scores is computationally expensive, so we use theproposed top-k algorithm to calculate the scores. The usefulness and efficiency ofthe approaches are compared to a simple, non-graph based, approach. We evaluatethese approaches on a combination of the Netflix ratings data and the IMDb tagdata.