Laks Lakshmanan

Professor

Research Interests

data management and data cleaning
data warehousing and OLAP
data and text mining
analytics on big graphs and news
social networks and media
recommender systems

Relevant Thesis-Based Degree Programs


Recruitment

Master's students
Doctoral students
2020
I am open to hosting Visiting International Research Students (non-degree, up to 12 months).

Complete these steps before you reach out to a faculty member!

Check requirements
  • Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
  • Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.
Focus your search
  • Identify specific faculty members who are conducting research in your specific area of interest.
  • Establish that your research interests align with the faculty member’s research interests.
    • Read up on the faculty members in the program and the research being conducted in the department.
    • Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.
Make a good impression
  • Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
    • Do not send non-specific, mass emails to everyone in the department hoping for a match.
    • Address the faculty members by name. Your contact should be genuine rather than generic.
  • Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Be sure to craft compelling answers to these questions.
  • Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
  • Demonstrate that you are familiar with their research:
    • Convey the specific ways you are a good fit for the program.
    • Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
  • Be enthusiastic, but don’t overdo it.
Attend an information session

G+PS regularly provides virtual sessions that focus on admission requirements and procedures, as well as tips on how to improve your application.

 

ADVICE AND INSIGHTS FROM UBC FACULTY ON REACHING OUT TO SUPERVISORS

These videos contain some general advice from faculty across UBC on finding and reaching out to a potential thesis supervisor.

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Methods for design of efficient on-device natural language processing architectures (2024)

Deep learning based models often achieve state-of-the-art performance in a wide range of natural language processing (NLP) tasks, which include open-ended tasks (e.g., story generation, brainstorming, and chat) and closed-ended tasks (e.g., summarization, question answering, and rewriting). To further enhance quality, there is a growing interest in scaling the model size and the amount of data used for training. These research efforts often overlook the impact of footprint metrics, such as high latency, high memory usage, and high energy consumption, on these deep learning models. A high footprint makes these models significantly inefficient for deployment on servers and devices such as tablets, handhelds, and wearables. Methods for improving model efficiency often come at the cost of degrading model quality. In this dissertation, we address the central question: how can we push the envelope in improving the efficiency-quality tradeoff of deep learning models for on-device NLP tasks? To this end, we propose methods that take on-device efficiency constraints (e.g., ≤ 16 MB memory or ≤ 200 ms latency) to inform the design of the model architecture. We propose methods for the manual design of architecture for the auto-completion task (generate continuations for user-written prompts) that enjoy a better memory-accuracy tradeoff than existing auto-completion models (Chapter 2). Additionally, we introduce methods that can directly take efficiency constraints to automatically search for efficient sparsely activated architectures for machine translation tasks (Chapter 3) and efficient pretrained (task-agnostic) language modeling architectures (Chapter 4). Finally, in Chapter 5, we explore a novel use case of employing large language models to speed up architecture search, while maintaining the efficiency and quality of state-of-the-art neural architecture search algorithms.

View record

Towards alleviating human supervision for document-level relation extraction (2024)

Motivated by various downstream applications, there is tremendous interest in the automatic construction of knowledge graphs (KG) by extracting relations from text corpora. Relation Extraction (RE) from unstructured data sources is a key component for building large-scale KGs. In this thesis, I focus on research centered on document-level relation extraction. One challenge of document-level relation extraction is the lack of labeled training data, since the construction of a large in-domain labeled dataset would require a large amount of human labor. To alleviate human supervision on document-level relation extraction, I propose 1) an unsupervised RE method, CIFRE, which enhances the recall of pipeline-based approaches while keeping high precision; and 2) a semi-supervised RE method, DuRE, for when few labeled data are available, by leveraging self-training to generate pseudo text. In order to improve the quality of pseudo text, I also propose two methods (DuNST and KEST) to improve the controllability and diversity of semi-supervised text generation, solving the challenges of inadequate unlabeled data, overexploitation, and training deceleration. Comprehensive experiments on real datasets demonstrate that our proposed methods significantly outperform all baselines, proving the effectiveness of our methods in unsupervised and semi-supervised document-level relation extraction.

View record

Efficient computations on uncertain graphs by group testing, streaming and recycling (2022)

Uncertain graphs, where the presence of connections between nodes is probabilistic, have received a great deal of attention in a wide range of fields. Despite the progress made by prior works, the application of existing algorithms to solve the important problems of coverage maximization, also known as Influence Maximization (IM), and reachability estimation, also known as reliability, on uncertain graphs remains constrained by their computational costs in the form of both the running time and memory needed for one to achieve high quality solutions. In Chapter 2 we address the issue that when performing sampling on large networks the majority of random draws of edges are being effectively wasted as the majority of edges will not be live. We resolve this by introducing an approach that through the application of group testing of edges scales only logarithmically with the number of failed edges in the worst case as opposed to linearly. In Chapter 3 we tackle the problem of the exorbitant memory required to store the Reverse Influence Sample (RIS) collection used by existing approaches to solve the IM problem becoming prohibitive. We avoid this by developing a new non-greedy approach that avoids storing the RIS collection by streaming it instead. In Chapter 4 we note that a key cause of Monte Carlo simulation, along with existing approaches that build upon it, becoming ineffective for estimating low probability reachability is caused by the number of samples it needs to attain a desired relative error depending on the probability that is being estimated. We remedy this by developing a technique of recycling of random draws which enables us to develop an algorithm whose relative error does not depend on the probability being estimated and as such does not suffer from this limitation. In all cases we perform experiments on real datasets to empirically validate the effectiveness of the algorithms and techniques we develop.
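For context, the sketch below shows the plain Monte Carlo baseline for estimating s-t reachability on an uncertain graph: sample a possible world by keeping each edge with its probability, then test reachability with BFS. It is the standard estimator whose limitations the chapters above address, not the algorithms developed in the thesis; the graph encoding, function names, and toy example are illustrative.

```python
import random
from collections import deque

def reachability_mc(edges, source, target, num_samples=10_000, seed=0):
    """Estimate the probability that `target` is reachable from `source`
    in an uncertain graph given as (u, v, p) triples, where p is the
    probability that the directed edge (u, v) exists."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # Draw one possible world: keep each edge independently with prob p.
        adj = {}
        for u, v, p in edges:
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        # BFS in the sampled world.
        seen, queue = {source}, deque([source])
        reached = False
        while queue:
            u = queue.popleft()
            if u == target:
                reached = True
                break
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        hits += reached
    return hits / num_samples

# Two edge-disjoint 2-hop paths from 'a' to 'd', each edge live with prob 0.5.
toy = [("a", "b", 0.5), ("b", "d", 0.5), ("a", "c", 0.5), ("c", "d", 0.5)]
print(reachability_mc(toy, "a", "d"))   # true value is 1 - (1 - 0.25)^2 = 0.4375
```

To keep a fixed relative error, the number of samples such an estimator needs grows roughly like 1/p, which is why low-probability reachability is the hard regime that the recycling technique targets.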

View record

Welfare maximization for complementary and competing items and their applications: analysis and algorithm using a utility driven model for achieving better social influence (2022)

Motivated by applications such as viral marketing, the problem of influence maximization (IM) has been extensively studied in the literature. The goal is to select a small number of users to adopt an item such that it results in a large cascade of adoptions by others. Existing works have three key limitations. (1) They do not account for the economic considerations of a user in buying/adopting items. (2) They cannot model the complex interactions between multiple items. (3) For the network owner, maximizing social welfare is important to ensure customer loyalty, which is not addressed in prior work in the IM literature. In this work, we address all three limitations and propose a novel model called Utility driven Independent Cascade (UIC) that combines utility-driven item adoption with influence propagation over networks. We focus on several types of items such as mutually complementary only, competing only, and a mix of the two in the context of the filter bubble problem. We formulate the problem of social welfare maximization under each of these settings. We show that while the objective function is neither submodular nor supermodular, a constant or instance-dependent approximation can still be achieved. With comprehensive experiments on real and synthetic datasets, we demonstrate that our algorithms significantly outperform all baselines on large real social networks.

View record

Structured bandits and applications: exploiting problem structure for better decision-making under uncertainty (2019)

We study the problem of decision-making under uncertainty in the bandit setting. This thesis goes beyond the well-studied multi-armed bandit model to consider structured bandit settings and their applications. In particular, we learn to make better decisions by leveraging the application-specific problem-structure in the form of features or graph information. We investigate the use of structured bandits in two practical applications: online recommender systems with an available network of users and viral marketing in social networks. For each of these applications, we design efficient bandit algorithms and theoretically characterize their performance. We experimentally evaluate the efficiency and effectiveness of these algorithms on real-world datasets. For applications that require modelling complex non-linear feature-reward relationships, we propose a bootstrapping approach and establish theoretical regret bounds for it. Furthermore, we consider the application of multi-class classification with bandit feedback as a test-bed for evaluating our bootstrapping approach.
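As background for the feature-based bandit settings mentioned above, here is a minimal LinUCB-style sketch (one ridge-regression model per arm plus an upper-confidence bonus on the predicted reward). It is a generic illustration of exploiting feature structure, not the algorithms or regret analyses developed in the thesis; the synthetic reward model and all names are invented.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB: a ridge-regression estimate per arm,
    with an upper-confidence bonus for exploration."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]      # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]    # X^T y per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(x @ theta + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy simulation with a hidden linear reward function per arm.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))
bandit = LinUCB(n_arms=3, dim=5)
for t in range(2000):
    x = rng.normal(size=5)
    arm = bandit.select(x)
    reward = true_theta[arm] @ x + rng.normal(scale=0.1)
    bandit.update(arm, x, reward)
```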

View record

Mining unstructured social streams: cohesion, context and evolution (2017)

As social websites like Twitter greatly influence people's digital life, unstructured social streams become prevalent, which are fast surging textual post streams without formal structure or schema between posts or inside the post content. Modeling and mining unstructured social streams in Twitter become a challenging and fundamental problem in social web analysis, which leads to numerous applications, e.g., recommending social feeds like "what's happening right now?" or "what are related stories?". Current social stream analysis in response to queries merely returns an overwhelming list of posts, with little aggregation or semantics. The design of the next generation social stream mining algorithms faces various challenges, especially the effective organization of meaningful information from noisy, unstructured, and streaming social content. The goal of this dissertation is to address the most critical challenges in social stream mining using graph-based techniques. We model a social stream as a post network, and use "event" and "story" to capture a group of aggregated social posts presenting similar content in different granularities, where an event may contain a series of stories. We highlight our contributions on social stream mining from a structural perspective as follows. We first model a story as a quasi-clique, which is cohesion-persistent regardless of the story size, and propose two solutions, DIM and SUM, to search the largest story containing given query posts, by deterministic and stochastic means, respectively. To detect all stories in the time window of a social stream and support context-aware story-telling, we propose CAST, which defines a story as a (k,d)-Core in the post network and tracks the relatedness between stories. We propose Incremental Cluster Evolution Tracking (ICET), which is an incremental computation framework for event evolution on evolving post networks, with the ability to track evolution patterns of social events as time rolls on. Approaches in this dissertation are based on two hypotheses: users prefer correlated posts to individual posts in post stream modeling, and a structural approach is better than frequency/LDA-based approaches in event and story modeling. We verify these hypotheses by crowdsourcing-based user studies.

View record

Computational Social Influence: Models, Algorithms, and Applications (2016)

Social influence is a ubiquitous phenomenon in human life. Fueled by the extreme popularity of online social networks and social media, computational social influence has emerged as a subfield of data mining whose goal is to analyze and understand social influence using computational frameworks such as theoretical modeling and algorithm design. It also has substantial application potential for viral marketing, recommender systems, social media analysis, etc. In this dissertation, we present our research achievements that take significant steps toward bridging the gap between elegant theories in computational social influence and the needs of two real-world applications: viral marketing and recommender systems. In Chapter 2, we extend the classic Linear Thresholds model to incorporate price and valuation to model the diffusion process of new product adoption; we design a greedy-style algorithm that finds influential users from a social network as well as their corresponding personalized discounts to maximize the expected total profit of the advertiser. In Chapter 3, we propose a novel business model for online social network companies to sell viral marketing as a service to competing advertisers, for which we tackle two optimization problems: maximizing the total influence spread of all advertisers and allocating seeds to advertisers in a fair manner. In Chapter 4, we design a highly expressive diffusion model that can capture arbitrary relationships between two propagating entities to arbitrary degrees. We then study the influence maximization problem in a novel setting consisting of two complementary entities and design efficient approximation algorithms. Next, in Chapter 5, we apply social influence to recommender systems. We model the dynamics of user interest evolution using social influence, as well as attraction and aversion effects. As a result, making effective recommendations is substantially more challenging, and we apply semi-definite programming techniques to achieve near-optimal solutions. Chapter 6 concludes the dissertation and outlines possible future research directions.
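For readers unfamiliar with the diffusion machinery underlying these chapters, the sketch below estimates the expected spread of a seed set under the classic Independent Cascade model by Monte Carlo simulation. It is the standard primitive that greedy seed-selection algorithms call as a subroutine, not the price- and valuation-aware models proposed in the dissertation; the toy graph and probabilities are invented.

```python
import random

def ic_spread(graph, seeds, num_simulations=1000, seed=0):
    """Monte Carlo estimate of expected spread under Independent Cascade.

    graph: dict mapping node -> list of (neighbor, activation_probability).
    seeds: iterable of initially active nodes.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(num_simulations):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in graph.get(u, []):
                    # Each newly active node gets one chance to activate
                    # each inactive neighbor, independently with prob p.
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / num_simulations

toy = {"a": [("b", 0.5), ("c", 0.5)], "b": [("d", 0.5)], "c": [("d", 0.5)]}
print(ic_spread(toy, {"a"}))   # ≈ 2.44 (1 + 0.5 + 0.5 + 0.4375) on this toy graph
```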

View record

Composite Recommendation: Semantics and Efficiency (2015)

Classical recommender systems provide users with a list of recommendations where each recommendation consists of a single item, e.g., a book or DVD. However, many applications can benefit from a system which is capable of recommending packages of items. Sample applications include travel planning, e-commerce, and course recommendation. In these contexts, there is a need for a system that can recommend the most relevant packages for the user to choose from. In this thesis we highlight our research achievements for the composite recommendation problem. We first consider the problem of composite recommendation under hard constraints, e.g., a budget. It is clear that this is a very common paradigm for the composite recommendation problem. In Chapter 3, we first discuss how, given a fixed package schema, we can efficiently find the top-k most relevant packages with hard constraints. The proposed algorithm is shown to be instance optimal, which means that no algorithm in a reasonable class can perform more than a constant times better, for some fixed constant. We also propose relaxed solutions based on probabilistic reasoning. In Chapter 4, we lift the constraint on the package schema, and discuss how efficient algorithms can be derived to solve the more general problem with a flexible package schema. For this problem, we again propose both an instance-optimal algorithm and a heuristics-based solution, which have been verified to be effective and efficient through our extensive empirical study. Then in Chapter 5, motivated by the fact that hard constraints sometimes might lead to unfavorable results, and following the recent paradigm of “softening” the constraints, we study the problem of how to handle top-k query processing with soft constraints. Finally, in Chapter 6, we discuss a general performance tuning solution based on cached views which can be leveraged to further optimize the various algorithms proposed in this thesis.

View record

Social influence and its applications: an algorithmic and data mining study (2013)

Social influence occurs when one's actions are affected by others. If leveraged carefully, social influence can be exploited in many applications like viral marketing (or targeted advertising in general), recommender systems, social network analysis, events detection, experts finding, link prediction, ranking of feeds, etc. One of the fundamental problems in this fascinating field is the problem of influence maximization, primarily motivated by the application of viral marketing. The objective is to identify a small set of users in a social network, who when convinced to adopt a product will influence others in the network, leading to a large number of adoptions. The vision of our work is to take the algorithmic and data mining aspects of viral marketing out of the lab. We organize our goals and contributions into four categories: (i) With the ever-increasing scale of online social networks, it is extremely important to develop efficient algorithms for influence maximization. We propose two algorithms -- CELF++ and SIMPATH -- that significantly improve the scalability. (ii) We remark that previous studies often make unrealistic assumptions and rely on simulations, instead of validating models against real world data. For instance, they assume an arbitrary assignment of influence probabilities in their studies, which focused more on algorithms than on validity with respect to real data. We attack the problem of learning influence probabilities. In another work, we propose a novel data driven approach to influence models and show that it predicts influence diffusion with much better accuracy. (iii) Next, we propose alternative problem formulations -- MINTSS and MINTIME -- and show interesting theoretical results. These problem formulations capture the problem of deploying viral campaigns under budget and time constraints. In an additional work, we take a fresh perspective on identifying community leaders using a pattern mining approach. (iv) Finally, we examine applications of social influence. We begin with the application of viral marketing. We show that product adoption is not exactly influence. Given this, we develop a product adoption model and study the problem of maximizing product adoption. Furthermore, we propose and investigate a novel problem in recommender systems, for targeted advertising -- RECMAX.
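To illustrate the kind of scalability lever that CELF++-style algorithms pull, here is a generic lazy-greedy sketch for maximizing a monotone submodular objective (such as expected influence spread): stale marginal gains sit in a max-heap and only the top candidate is re-evaluated against the current seed set. This is an illustration of lazy evaluation under submodularity, not the thesis's CELF++ or SIMPATH algorithms; `spread` stands for any spread oracle, and the weighted-coverage toy objective is invented.

```python
import heapq

def lazy_greedy(candidates, spread, k):
    """Greedy selection for a monotone submodular objective with lazy
    (CELF-style) re-evaluation of marginal gains."""
    selected, current_value = [], 0.0
    # Max-heap of (-marginal_gain, element, round_when_gain_was_computed).
    heap = [(-spread({c}), c, 0) for c in candidates]
    heapq.heapify(heap)
    while len(selected) < k and heap:
        neg_gain, c, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is fresh w.r.t. the current selection: take the element.
            selected.append(c)
            current_value += -neg_gain
        else:
            # Gain is stale: recompute it and push the element back.
            gain = spread(set(selected) | {c}) - current_value
            heapq.heappush(heap, (-gain, c, len(selected)))
    return selected, current_value

# Toy submodular objective: weighted coverage of a small universe.
weights = {"x": 3.0, "y": 2.0, "z": 1.0}
covers = {"a": {"x"}, "b": {"x", "y"}, "c": {"z"}}

def cover_value(chosen):
    covered = set().union(*(covers[s] for s in chosen)) if chosen else set()
    return sum(weights[e] for e in covered)

print(lazy_greedy(list(covers), cover_value, 2))   # (['b', 'c'], 6.0)
```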

View record

Patterns and privacy preservation with prior knowledge for classification (2010)

Privacy preservation is a key issue in outsourcing of data mining. When we seek approaches to protect the sensitive information contained in the original data, it is also important to preserve the mining outcome. We study the problem of privacy preservation in outsourcing of classifications, including decision tree classification, support vector machine (SVM), and linear classifications. We investigate the possibility of guaranteeing no-outcome-change (NOC) and consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach using two central ideas of breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for pieces satisfy the global (anti-)monotonicity. We empirically show that the proposed piecewise transformation approach can deliver a secured level of privacy and reduce disclosure risk substantially. We then propose two transformation approaches, (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation, for outsourcing SVM. We show that POT always guarantees no-outcome-change for both linear and non-linear SVM. The TNP approach gives the same guarantee when the data set is linearly separable. For linearly non-separable data sets, we show that no-outcome-change is not always possible and propose a variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results show that the two approaches are effective to counter powerful attack models. In the last part, we extend the POT approach to linear classification models and propose to combine POT and random perturbation. We conduct a detailed set of experiments and show that the proposed combination approach could reduce the change on the mining outcome while still providing a high level of protection on privacy by adding less noise. We further investigate the POT approach and propose a heuristic to break down the correlations between the original values and the corresponding transformed values of subsets. We show that the proposed approach could significantly improve the protection level on privacy in the worst cases.
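The intuition behind the principled orthogonal transformation (POT) idea can be checked numerically: an orthogonal map preserves all pairwise inner products (and hence distances), so any learner that depends on the data only through dot products, such as a linear SVM, sees an equivalent problem. The snippet below verifies only that invariant on random data; it is a sketch of the underlying geometry, not the POT protocol or its privacy analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))             # 50 points in 8 dimensions

# Random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
X_transformed = X @ Q.T                   # the data one would publish

# The Gram matrix (all pairwise dot products an SVM depends on) is unchanged.
assert np.allclose(X @ X.T, X_transformed @ X_transformed.T)
print("Gram matrix preserved under orthogonal transformation")
```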

View record

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Exploring the potential of LLMs for biomedical relation extraction (2024)

Though the current state-of-the-art models for biomedical relation extraction rely on encoder-only LLMs that are pre-trained on domain-specific data and then fine-tuned using large amounts of annotated data, there is a compelling argument to explore decoder-only LLMs and alternative task formulation paradigms to investigate if similar or superior results can be achieved for this task without the need for large amounts of annotated data and computational resources. Surprisingly, there has been limited exploration of decoder-only LLMs for biomedical relation extraction. This study aims to address this gap, presenting BioREPS, a novel method that enhances decoder-only LLMs through semantic similarity and chain-of-thought prompting, yielding promising outcomes. This study highlights the effectiveness of instruction training and self-generated chain-of-thought prompts in enhancing the reasoning abilities of decoder-only LLMs for biomedical relation extraction. Additionally, this research investigates various task formulation paradigms and the empirical advantages of domain-specific training for biomedical relation extraction through a series of experiments. The results confirm that "general" decoder-only LLMs hold immense potential for the task of biomedical relation extraction and come close to state-of-the-art performance with a fraction of the data and no additional training.

View record

Improving language models with novel contrastive learning objectives (2024)

Contrastive learning (CL) has recently emerged as an effective technique in natural language processing, especially in the important area of language modeling. In this work, we offer novel methods for deploying CL in both pretraining and finetuning of language models. First, we present PACT (Pretraining with Adversarial Contrastive Learning for Text Classification), a novel self-supervised framework for text classification. Instead of contrasting against in-batch negatives, a popular approach in the literature, PACT mines negatives closer to the anchor representation. PACT operates by endowing the standard pretraining mechanisms of BERT with adversarial contrastive learning objectives, allowing for effective joint optimization of token- and sentence-level pretraining of the BERT model. Our experiments on 13 diverse datasets including token-level, single-sentence, and sentence-pair text classification tasks show that PACT achieves consistent improvements over SOTA baselines. We further show that PACT regularizes both token-level and sentence-level embedding spaces into more uniform representations, thereby alleviating the undesirable anisotropic phenomenon of language models. Subsequently, in the context of finetuning, we apply CL in tackling cross-platform abusive language detection. The prevalence of abusive language on different online platforms has been a major concern that raises the need for automated cross-platform abusive language detection. However, prior works focus on concatenating data from multiple platforms, inherently adopting Empirical Risk Minimization (ERM) method. In our work, we address this challenge from the perspective of domain generalization objective. We design SCL-Fish, a supervised contrastive learning integrated meta-learning algorithm to detect abusive language on unseen platforms. Our experimental analysis shows that SCL-Fish achieves better performance over ERM and the existing state-of-the-art models. We also show that SCL-Fish is data-efficient and achieves comparable performance with the large-scale pretrained models upon finetuning for the abusive language detection task.
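For context, the snippet below computes a generic in-batch contrastive (InfoNCE-style) loss over paired embeddings, the standard objective that adversarial and supervised variants such as those in PACT and SCL-Fish build upon. It is not the proposed methods themselves; the embeddings here are random placeholders.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """In-batch contrastive loss: each anchor should score its own positive
    higher than every other example in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # cosine similarity matrix
    # Softmax cross-entropy with the diagonal (the true pair) as the label.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 128))
positives = anchors + 0.05 * rng.normal(size=(16, 128))   # near-duplicates
print(info_nce_loss(anchors, positives))   # small loss: positives are easy to identify
```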

View record

Map conflation via knowledge graph representations of digital maps (2023)

Digital maps play a crucial role in various applications such as navigation, fleet management, and ride-sharing, necessitating their accuracy and timely updates. While the majority of geospatial databases (GDBs) provide high-quality information, their data is (i) limited to specific regions and/or (ii) missing some entities even in their covered areas. Map conflation is the process of the augmentation of a GDB using another GDB to conflate missing spatial features. Existing map conflation methods suffer from two main limitations: (1) They are designed for the conflation of linear objects (e.g., road networks) and cannot simply be extended to non-linear objects, thus missing information about most entities in the map. (2) They are heuristic algorithmic approaches that are based on pre-defined rules, unable to learn entity matching in a data-driven manner. To address these limitations, we design MINOR (Map Conflation via Knowledge Graph Representation), a machine-learning approach consisting of three parts: (1) Knowledge Graph Construction, where each GDB is represented by a knowledge graph; (2) Map Matching, where we use a knowledge graph alignment method as well as a geospatial feature encoder to match entities in the obtained knowledge graphs; and (3) Map Merging, where, to consistently merge the entities matched in the previous modules, we use a mixed integer linear programming formulation that fully merges the GDBs without adding any inconsistencies. Our experimental evaluation shows that not only does MINOR achieve outstanding performance compared to the state-of-the-art and baselines in map conflation tasks, but each of its modules (e.g., Map Matching and Map Merging) separately outperforms traditional matching and merging methods.

View record

A principled approach to automated road network conflation (2021)

Geospatial databases are gaining tremendous popularity, driven by the need to have digital maps to support various individual and industrial use cases like navigation, logistics, fleet management, ride-sharing, and food delivery. Clearly, maps need to be accurate and up-to-date. While most geospatial databases are proprietary, OpenStreetMap (OSM) is a collaborative good quality crowd-sourced database. The quality of OSM is comparable with many private datasets: OSM is currently being used by many companies worldwide, including Facebook, Apple, Amazon and Microsoft. The contributors volunteer to "draw" the missing spatial features on OSM. Despite continuous meticulous work for more than 15 years now, OSM data still lacks good coverage in many parts of the world. Extending its coverage is a laborious and time-consuming task. There is a pressing need for an automated approach to conflate the missing features from other spatial databases and satellite images. We address road network conflation between two vector geospatial databases and propose MAYUR, solving the major subtasks -- map matching and map merging. We represent geospatial databases as a graph of road intersections (vertices) and lines (edges) with spatial attributes. We propose a novel map matching framework by adapting the Rank Join in databases, where each edge of the reference database graph is a relation. Our algorithm finds the target database graph that best matches the reference database, respecting the connectivity of the road network. Classic Rank Join in databases has been tested on very few relations, and it gets quickly inefficient on instances with many relations. We introduce several optimizations that boost the algorithm's efficiency, making it scalable for our problem setting featuring hundreds to thousands of relations. We establish the node mappings between the two databases using the mapping obtained from the matching step. Our map merging uses rubbersheeting to merge the missing features. We demonstrate the performance of our map conflation approach on sidewalks in OSM and Boston Open Data. Manual evaluation shows that our map matching achieves an impressive 98.65% precision and 99.55% recall, with our map merging causing very small perturbation in the merged features, outperforming Hootenanny, the leading map conflation approach.

View record

Financial knowledge graph construction (2020)

The proliferation of financial news sources reporting on companies, markets, currencies and stocks presents an opportunity for strategic decision making by mining data with the goal of extracting structured representations about financial entities and their inter-relations. These representations can be conveniently stored as (subject, predicate, object) triples in a knowledge graph that can be used to drive new insights through answering complex queries using high-level declarative languages. Towards this goal, we develop a high precision knowledge extraction pipeline tailored for the financial domain. This pipeline combines multiple information extraction techniques with a financial dictionary that we built, all working together to produce over 342,000 compact extractions from over 288,000 financial news articles, with a precision of 78% at the top-100 extractions. These extractions are stored in a knowledge graph readily available for use in downstream applications. Our pipeline outperforms existing work in terms of precision, the total number of extractions and the coverage of financial predicates.
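As a toy illustration of the (subject, predicate, object) representation mentioned above, the sketch below stores extractions as triples and answers a simple pattern query. The triples are invented examples, not output of the pipeline, and the tiny in-memory store is only a stand-in for a real knowledge graph engine.

```python
from collections import defaultdict

class TripleStore:
    """Minimal in-memory store for (subject, predicate, object) triples."""
    def __init__(self):
        self.by_predicate = defaultdict(list)

    def add(self, subject, predicate, obj):
        self.by_predicate[predicate].append((subject, obj))

    def query(self, predicate, subject=None):
        """Return (subject, object) pairs for the predicate,
        optionally filtered by subject."""
        return [(s, o) for s, o in self.by_predicate[predicate]
                if subject is None or s == subject]

kg = TripleStore()
kg.add("AcmeCorp", "acquired", "WidgetCo")          # invented example triples
kg.add("AcmeCorp", "headquartered_in", "Toronto")
kg.add("WidgetCo", "trades_on", "TSX")
print(kg.query("acquired", subject="AcmeCorp"))     # [('AcmeCorp', 'WidgetCo')]
```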

View record

Hierarchical summaries of change in multidimensional data (2019)

Multidimensional data is prevalent in data warehousing and OLAP. Changes to the data in a data warehouse are of particular interest to analysts as well as knowledge workers since they may be instrumental in understanding trends in a business or enterprise. At the same time, an explicit enumeration of all changes to the fact tuples in a data warehouse is too verbose to be useful; instead, a summary of the changes is more desirable since it allows the user to quickly understand the trends and patterns. In this thesis, we study the problem of summarizing changes in hierarchical multidimensional data with non-overlapping but containable hyperrectangles. An advantage of such summaries is that they naturally follow a tree structure and therefore are arguably easy to interpret. Hierarchies are naturally present in data warehouses and our constructed summaries are designed to leverage the hierarchical structure along each dimension to provide concise summaries of the changes. We study the problem of generating lossless as well as lossy summaries of changes. While lossless summaries allow the exact changes to be recovered, lossy summaries trade accuracy of reconstruction for conciseness. In a lossy summary, the maximum amount of lossiness per tuple can be regulated with a single parameter α. We provide a detailed analysis of the algorithm, then we empirically evaluate its performance, compare it to existing alternative methods under various settings and demonstrate with a detailed set of experiments on real and synthetic data that our algorithm outperforms the baselines in terms of conciseness of summaries or accuracy of reconstruction w.r.t. the size, dimensionality and level of correlation in data.

View record

Combating the cold-start user problem in collaborative filtering recommender systems (2017)

For tackling the well known cold-start user problem in collaborative filtering recommender systems, one approach is to recommend a few items to a cold-start user and use the feedback to learn her preferences. This can then be used to make good recommendations to the cold user. In the absence of a good initial estimate of the preferences, the recommendations are like random probes. If the items are not chosen judiciously, both bad recommendations and too many recommendations may turn off a user. We study the cold-start user problem for the two main types of collaborative filtering methods -- neighbourhood-based and matrix factorization. We formalize the problem by asking what are the b (typically a small number) items we should recommend to a cold-start user, in order to learn her preferences best, and define what it means to do that under each framework. We cast the problem as a discrete optimization problem, called the optimal interview design (OID) problem, and study two variants -- OID-NB and OID-MF -- for the two frameworks. We present multiple non-trivial results, including NP-hardness as well as hardness of approximation for both. We further study supermodularity/submodularity and monotonicity properties for the objective functions of the two variants. Finally, we propose efficient algorithms and comprehensively evaluate them. For OID-NB, we propose a greedy algorithm, and experimentally evaluate it on 2 real datasets, where it outperforms all the baselines. For OID-MF, we discuss several scalable heuristic approaches for identifying the b best items to recommend to the user and experimentally evaluate their performance on 4 real datasets. Our experiments show that our proposed accelerated algorithms significantly outperform the prior art, while obtaining similar or lower error in the learned user profile as well as in the rating predictions made, on all the datasets tested.

View record

Validation of SQL queries over streaming warehouses (2017)

There is often a need to recover the “missing” query that produced a particular output from a data stream. As an example, since a data stream is constantly evolving, the analyst may be curious about using the query from the past to evaluate it on the current state of the data stream, for further analysis. Previous research has studied the problem of reverse engineering a query that would produce a given result at a particular database state. We study the following problem. Given a streaming database D = ⟨D1, D2, …⟩, a result Rout, and a set of candidate queries Q, efficiently find all queries Qi ∈ Q such that for some state Dji of the stream, Qi(Dji) = Rout, and report the pair (Q, witQ), where witQ is the witness of (in)validity. A witness for a valid query Qval is a state Di s.t. Qval(Di) = Rout. For an invalid query Qinval, a witness is a pair of consecutive states (Di, Di+1) such that Rout \ Qinval(Di) ≠ Ø ≠ Qinval(Di+1) \ Rout. We allow any PTIME computable monotone query to be included in Q. While techniques developed in previous research can be used to generate the candidate query set Q, we focus on developing a scalable strategy for quickly determining the witness. We establish theoretical worst-case performance guarantees for our proposed approach and show that it is no more than a factor of O(log |DRDS|) of the optimal “Lucky guess” strategy, where Q(DRDS) = Rout. We empirically evaluate our technique and compare with natural baselines inspired from previous research. We show that the baselines either fail to scale or incur an inordinate amount of overhead by failing to take advantage of natural properties of a data stream. By contrast, our strategy scales effortlessly for very large data streams. Moreover, it never performs more than a small constant times the optimal amount of work, regardless of the state of the data stream that may have led to Rout.
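The logarithmic guarantee above leans on monotonicity: for a monotone query over an append-only stream, Q(Di) ⊆ Q(Di+1), so the states whose output already contains Rout form a suffix of the stream and can be located by binary search. The sketch below shows only that intuition over fully materialized states; it is a simplification for illustration, not the witness-finding strategy developed in the thesis.

```python
def find_matching_state(states, query, r_out):
    """Binary-search sketch for validating a monotone query over an
    append-only stream: states[i] is the i-th database state (a set of
    rows here), and query(states[i]) grows with i. Returns the index of
    a state whose output equals r_out, or None if no such state exists."""
    r_out = frozenset(r_out)
    lo, hi = 0, len(states)
    # Find the first state whose output already contains all of r_out;
    # the predicate is monotone in i, so binary search applies.
    while lo < hi:
        mid = (lo + hi) // 2
        if r_out <= query(states[mid]):
            hi = mid
        else:
            lo = mid + 1
    if lo < len(states) and query(states[lo]) == r_out:
        return lo      # a witness of validity
    return None        # no state produces exactly r_out

# Toy monotone query over a growing set of integers: "select the even numbers".
stream = [{1}, {1, 2}, {1, 2, 3, 4}, {1, 2, 3, 4, 6}]
evens = lambda state: frozenset(x for x in state if x % 2 == 0)
print(find_matching_state(stream, evens, {2, 4}))   # 2
```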

View record

Influence Maximization in Bandit and Adaptive Settings (2016)

The objective of viral marketing is to leverage a social network to spread awareness about a specific product in the market through information propagation via word-of-mouth. A closely related problem is that of influence maximization which aims to identify the `best' set of `influential' users (seeds) to give discounts or free products to, such that awareness about the product is maximized. We study two relatively unexplored variants of influence maximization (IM) in social networks. Conventional IM algorithms assume the structure of the social network and edge weights to be known. Though the structure of the network is usually known, edge weights need to be estimated from past data. In the first part of this thesis, we tackle the real but difficult problems of (i) learning these edge weights online and (ii) maximizing influence in the network when no past data is available as input. We adopt a combinatorial multi-armed bandit (CMAB) paradigm and formulate the above problems respectively as (i) network exploration, i.e. incrementally minimizing the error in learned edge weights and (ii) regret minimization i.e. minimizing the loss in influence from choosing suboptimal seed sets over multiple attempts. Most previous work on influence maximization in social networks is limited to the non-adaptive setting in which the marketer is supposed to select all of the seed users up front. A disadvantage of this setting is that she is forced to select all the seeds based solely on a diffusion model. If the selected seeds do not perform well, there is no opportunity to course-correct. A more practical setting is the adaptive setting in which the marketer initially selects a batch of users and observes how seeding those users leads to a diffusion of product adoptions. Based on this market feedback, she formulates a policy for choosing the remaining seeds. We study adaptive offline strategies for two problems: (a) MAXSPREAD - given a budget on number of seeds and a time horizon, maximize the spread of influence and (b) MINTSS - given a time horizon and an expected number of target users, minimize the number of required seeds.

View record

On Social Event Organization (2016)

Online platforms, such as Meetup and Plancast, have recently become popular for planning gatherings and event organization. However, there is a surprising lack of studies on how to effectively and efficiently organize social events for a large group of people through those platforms. This thesis provides the first systematic study of the key computational problems involved in the organization of social events. We understand the Social Event Organization problem as assigning a set of events for a group of users to attend, where the users are socially connected with each other and have innate levels of interest in those events. We then introduce a formal definition of a restricted version of the problem and show that it is NP-hard and is hard to approximate. We propose efficient heuristic algorithms that improve upon simple greedy algorithms by incorporating the notion of phantom events and by using look-ahead estimation. Using synthetic datasets and three real datasets, including those from the platforms Meetup and Plancast, we experimentally demonstrate that our greedy heuristics are scalable and furthermore outperform the baseline algorithms significantly in terms of achieving superior social welfare.

View record

Recommending user-generated item lists (2016)

Nowadays, more and more websites are providing users with the functionality to create item lists. For example, in Yelp, users can create lists of restaurants they like. In Goodreads, users can create book lists consisting of books they like. These user-generated item lists complement the main functionality of the corresponding application and intuitively become an alternative way for users to browse and discover interesting items. Existing recommender systems are not designed for recommending user-generated item lists directly. In this work, we study properties of these user-generated item lists and propose two different models for recommending user-generated lists. We first model the problem as a recommendation problem and propose a Bayesian ranking model, called Lire. The proposed model takes into consideration users' previous interactions with both item lists and with individual items. Furthermore, we propose in Lire a novel way of weighting items within item lists based on both the positions of items and personalized list consumption patterns. Through extensive experiments on a real item list dataset from Goodreads, we demonstrate the effectiveness of our proposed Lire model. We then summarize the lessons learned from the recommendation model and model the problem as an optimization problem. We show that the list recommendation optimization can be reduced to the Maximum k-Coverage problem, which is NP-hard. We derive three different problem variants and propose algorithms for each of them. We then conduct experiments to compare their results. Lessons learned and possible application scenarios are also discussed in the hope of guiding further improvements to this work. Furthermore, we propose several potential directions for future work.

View record

Query-Driven Event Search in News Information Network (2014)

Traditional search focuses on keyword matching and document ranking; thus, when users input keywords whose potential connections they are interested in exploring, they only get an overload of related news articles or videos, with little semantic aggregation. To capture these connections, one good way is to model news articles as a complex network of events, where an event itself is a complex network of interrelated actions (or assertions). Event search enables users to get a big picture of the essential actions involved without going through overwhelming documents. Going further, query-driven event search, in which users can access key actions that occurred in various events with respect to an input query and construct a new event capturing the corresponding actions and the connections between them, is able to inherently revolutionize search from its traditional keyword matching to a higher level of abstraction. Thus, we propose a natural and useful paradigm, query-driven event search in a news information network, to allow users to search through news articles and obtain events as answers. We construct the news information network with nodes representing actions and edges indicating relatedness between nodes. We define a good answer as a structurally densely linked and semantically related subgraph, describing a concise and informative event; our objective function thus combines both the linkage structure of the graph and semantic content relatedness. We then formally define our problem as building a query-driven event search engine that processes a user's query and generates a subgraph satisfying the following conditions: 1) it covers all keywords without noisy nodes, and 2) it maximizes the objective function. We prove this problem is NP-complete and propose heuristic algorithms. Our experimental evaluation on real-life news datasets demonstrates the efficiency of our algorithms and the meaningfulness of the solutions we obtain.

View record

Towards a Time-Lapse Prediction System for Cricket Matches (2014)

Cricket is a popular sport played in over a hundred countries, is the second most watched sport in the world after soccer, and enjoys a multi-million dollar industry. There is tremendous interest in simulating cricket and, more importantly, in predicting the outcome of games, particularly in their one-day international format. The complex rules governing the game, along with the numerous natural phenomena affecting the outcome of a cricket match, present significant challenges for accurate prediction. Multiple diverse parameters, including but not limited to cricketing skills and performances, match venues and even weather conditions, can significantly affect the outcome of a game. The sheer number of parameters, along with their interdependence and variance, create a non-trivial challenge to create an accurate quantitative model of a game. Unlike other sports such as basketball and baseball, which are well researched from a sports analytics perspective, for cricket these tasks have yet to be investigated in depth. The goal of this work is to predict the game progression and winner of a yet-to-begin or an ongoing game. The game is modeled using a subset of match parameters, using a combination of linear regression and a nearest-neighbor classification-aided attribute bagging algorithm. The prediction system takes in historical match data as well as the instantaneous state of a match, and predicts the score at key points in the future, culminating in a prediction of victory or loss. Runs scored at the end of an innings, the key factor in determining the winner, are predicted at various points in the game. Our experiments, based on actual cricket game data, show that our method predicts the winner with an accuracy of approximately 70%.
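As a minimal illustration of the regression component described above, the sketch below fits an ordinary least-squares model that maps a mid-innings match state to a final innings total. The features, numbers, and 30-over snapshot are all invented; the thesis combines such regression with nearest-neighbor classification and attribute bagging, which this sketch omits.

```python
import numpy as np

# Hypothetical historical data: [runs_so_far, wickets_lost, overs_bowled]
# at the 30-over mark, with the final innings total as the target.
X = np.array([[150, 2, 30], [120, 4, 30], [180, 1, 30],
              [100, 6, 30], [140, 3, 30]], dtype=float)
y = np.array([290, 230, 330, 180, 265], dtype=float)

# Ordinary least squares with an intercept term.
X_design = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

current_state = np.array([135, 3, 30, 1], dtype=float)   # ongoing match + intercept
print(round(float(current_state @ coef)))                 # predicted final total
```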

View record

Flexible and efficient exploration of rated datasets (2013)

As users increasingly rely on collaborative rating sites to achieve mundane tasks such as purchasing a product or renting a movie, they are facing a data deluge of ratings and reviews. Traditionally, the exploration of rated data sets has been enabled by rating averages that allow user-centric, item-centric and top-k exploration. More specifically, canned queries on user demographics aggregate opinion for an item or a collection of items, such as 18-29 year old males in CA rated the movie The Social Network at 8.2 on average. Combining ratings, demographics, and item attributes is a powerful exploration mechanism that allows operations such as comparing the opinion of the same users for two items, comparing two groups of users on their opinion for a given class of items, and finding a group whose rating distribution is nearly unanimous for an item. To enable those operations, it is necessary to (i) adopt the right measure to compare ratings, and to (ii) develop efficient algorithms to find relevant groups. We argue that the rating average is a weak measure for capturing such comparisons. We propose contrasting and querying rating distributions, instead, using the Earth Mover's Distance (EMD), a measure that intuitively reflects the amount of work needed to transform one distribution into another. We show that the problem of finding groups matching given rating distributions is NP-hard under different settings and develop appropriate heuristics. Our experiments on real and synthetic datasets validate the utility of our approach and demonstrate the scalability of our algorithms.
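For two rating distributions over the same ordered scale (say 1-5 stars), the Earth Mover's Distance has a simple closed form: the sum of absolute differences between the two cumulative distributions. The sketch below computes exactly that; it illustrates the measure, not the thesis's group-finding heuristics, and the example histograms are invented.

```python
import numpy as np

def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two histograms over the same
    ordered bins (e.g., 1-5 star ratings), assuming unit bin spacing."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a, b = a / a.sum(), b / b.sum()          # normalize to distributions
    return float(np.abs(np.cumsum(a) - np.cumsum(b)).sum())

unanimous = [0, 0, 0, 0, 10]     # everyone gave 5 stars
polarized = [5, 0, 0, 0, 5]      # half gave 1 star, half gave 5
mildly_off = [0, 0, 0, 10, 0]    # everyone gave 4 stars
print(emd_1d(unanimous, mildly_off))   # 1.0: all mass moves by one star
print(emd_1d(unanimous, polarized))    # 2.0: half the mass moves four stars
```

Note that the plain average cannot distinguish, say, a unanimous 3-star group from a polarized 1-and-5-star group, whereas EMD can, which is the motivation stated above.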

View record

Revisiting recommendations: From customers to manufacturers (2013)

Recommender systems exploit user feedback over items they have experienced for making recommendations of other items that are most likely to appeal to them. However, users and items are but two of the three types of entities participating in this ecosystem of recommender systems. The third type of entities are the manufacturers of the products, and users are really their customers. Traditional recommender systems research ignores the role of this third entity type and exclusively focuses on the other two. What might item producers bring to recommender systems research? Their objectives are related to their business and are captured by questions such as “what kind of (new) products should I manufacture that will maximize their popularity?” These questions are not asked in a vacuum: manufacturers have constraints, e.g., a budget. The idea is that the user feedback data (e.g., ratings) capture users’ preferences. The question is whether we can learn enough intelligence from it, so as to recommend new products to manufacturers that will help meet their business objectives. We propose the novel problem of new product recommendation for manufacturers. We collect real data by crawling popular e-commerce websites, and model cost and popularity as a function of product attributes and their values. We incorporate cost constraints into our problem formulation: the cost of the new products should fall within the desired range while maximizing the popularity. We show that the above problem is NP-hard and develop a pseudo-polynomial time algorithm for the recommendations generation. Finally, we conduct a comprehensive experimental analysis where we compare our algorithm with several natural heuristics on three real data sets and perform scalability experiments on a synthetic data set.

View record

Elevating search results from flat lists to structured expansions (2012)

Keyword-based search interfaces are extremely popular as a means for efficiently discovering items of interest from a huge collection, as evidenced by the success of search engines like Google and Bing. However, most current search services still return results as a flat ranked list of items. Considering the huge number of items which can match a query, this list-based interface can be very difficult for the user to explore and find important items relevant to their search needs. In this work, we consider a search scenario in which each item is annotated with a set of keywords. E.g., in Web 2.0 enabled systems such as flickr and del.icio.us, it is common for users to tag items with keywords. Based on this annotation information, we can automatically group query result items into different expansions of the query corresponding to subsets of keywords. We formulate and motivate this problem within a top-k query processing framework, but as that of finding the top-k most important expansions. Then we study additional desirable properties for the set of expansions returned, and formulate the problem as an optimization problem of finding the best k expansions satisfying all the desirable properties. We propose several efficient algorithms for this problem. Our problem is similar in spirit to recent works on automatic facet generation, but has the important difference and advantage that we don’t need to assume the existence of a pre-defined categorical hierarchy, which is critical for those works. Through extensive experiments on both real and synthetic datasets, we show our proposed algorithms are both effective and efficient.

View record

The SHARE system: A semantic web based approach for evaluating queries across distributed bioinformatics databases and software (2011)

Many bioinformatics studies require combined use of data sets and software developed by different research labs. At the current time, accomplishing such studies requires the development of custom scripts that act as “glue” for the independent resources, performing transformations on the data sets that will allow them to be loaded into a single database and/or shuttled through different pieces of software. Due to the tedium and inefficiency of manual data/software integration, many institutions and research groups have sought to find a more reliable and automatic approach. The most significant integration project in recent years has been the Semantic Web activity of the World Wide Web Consortium (W3C), which aims to automate data integration not only in bioinformatics, but on the WWW as a whole. The goal of the Semantic Web is to interlink data on the web in a manner that is similar to the way that HTML pages are linked, while at the same time making the data available in a universal form that can be easily processed by software. In this thesis, the author describes a distributed query system called SHARE (Semantic Health and Research Environment) which demonstrates how the available standards and tools of the Semantic Web can be assembled into a framework for automating data and software integration in bioinformatics. We find that while SHARE has a similar architecture to existing query systems, the use of Semantic Web technologies has important advantages for the implementation, maintenance, and addition of new data sources to the system. After reviewing the mechanics of SHARE, we examine the crucial problem of optimizing queries in an environment where statistics about the data sources are typically not available. A query evaluation procedure called GREEDY is presented that addresses this challenge by: i) interleaving the planning and execution phases of a query, and ii) learning statistics from the execution of previous queries. We conclude by highlighting the unique strengths of SHARE and GREEDY in relation to other integration systems, and review important areas for future work.

View record

Efficient Approximation of Social Relatedness over Large Social Networks and Application to Query-Enabled Recommender Systems (2010)

Social relatedness measures such as the Katz score and the commute time between pairs of nodes have been the subject of significant research effort motivated by social network problems including link prediction, anomalous link detection, and collaborative filtering. In this thesis, we are interested in computing: (1) the score for a given pair of nodes, and (2) the top-k nodes with the highest scores from a specific source node. Unlike most traditional approaches, ours scale to large networks with hundreds of thousands of nodes. We introduce an efficient iterative algorithm which calculates upper and lower bounds for the pairwise measures. For the top-k problem, we propose an algorithm that only has access to a small subset of nodes. Our approaches rely on techniques developed in numerical linear algebra and personalized PageRank computing. Using three real-world networks, we examine the scalability and accuracy of our algorithms, which run in as short a time as milliseconds to seconds. We also hypothesize that incorporating item-based tags into a recommender system will improve its performance. We model such a system as a tri-partite graph of users, items and tags and use this graph to define a scoring function making use of graph-based proximity measures. Exactly calculating the item scores is computationally expensive, so we use the proposed top-k algorithm to calculate the scores. The usefulness and efficiency of the approaches are compared to a simple, non-graph based, approach. We evaluate these approaches on a combination of the Netflix ratings data and the IMDb tag data.
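As a concrete reference point for the Katz score, the sketch below computes it directly by truncating the matrix power series Σ_{l≥1} β^l A^l, which counts paths of every length with geometric damping. This brute-force computation is exactly what iterative bounding approaches avoid having to do on large networks; the 4-node graph and parameters are invented.

```python
import numpy as np

def katz_scores(adjacency, beta=0.1, max_len=20):
    """Pairwise Katz scores: sum over l >= 1 of beta^l * (A^l)[i, j].
    Truncated power series; converges when beta < 1 / spectral_radius(A)."""
    A = np.asarray(adjacency, dtype=float)
    scores = np.zeros_like(A)
    power = np.eye(len(A))
    for length in range(1, max_len + 1):
        power = power @ A                      # A^length
        scores += (beta ** length) * power
    return scores

# Invented 4-node undirected graph: a path 0-1-2-3 plus a chord 0-2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S = katz_scores(A)
print(S[2, 3] > S[0, 3])   # True: a direct neighbor of node 3 scores higher than a 2-hop node
```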

View record
