Doctor of Philosophy in Electrical and Computer Engineering (PhD) 
Distributed Computing Systems
The University of British Columbia
Future high-performance computing systems will be hybrid; they will include processors optimized for sequential processing alongside massively parallel accelerators. Platforms based on Graphics Processing Units (GPUs) are an example of this hybrid architecture: they integrate commodity CPUs and GPUs. This architecture promises intriguing opportunities: within the same dollar or energy budget, GPUs offer a significant increase in peak processing power and memory bandwidth compared to traditional CPUs, while remaining generally programmable. The adoption of GPU-based platforms, however, faces a number of challenges, including characterizing time/space/power tradeoffs, developing new algorithms that efficiently harness the platform, and abstracting the accelerators in a generic yet efficient way to simplify the task of developing applications on such hybrid platforms. This dissertation explores solutions to the above challenges in the context of an important class of applications, namely irregular applications. Compared to regular applications, irregular applications have unpredictable memory access patterns and typically use reference-based data structures, such as trees or graphs; moreover, new applications in this class operate on massive datasets. Using novel workload partitioning techniques and data structures that better match the hybrid platform's characteristics, this work demonstrates that significant gains, in terms of both time to solution and energy, can be obtained by partitioning the irregular workload to be processed concurrently on the CPU and the GPU.
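To make the partitioning idea concrete, here is a minimal Python sketch. The degree-based heuristic and the throughput-proportional split are assumptions chosen for illustration, not the dissertation's exact algorithm; cpu_rate and gpu_rate stand for throughputs measured offline.

def partition_vertices(degrees, gpu_capacity, cpu_rate, gpu_rate):
    """Split an irregular (graph) workload between CPU and GPU.

    Heuristic (an illustrative assumption): low-degree vertices are more
    uniform and suit the GPU's massive parallelism; high-degree vertices
    stay on the CPU. The split point is biased by each processor's
    measured throughput so both sides finish at roughly the same time.
    """
    order = sorted(range(len(degrees)), key=degrees.__getitem__)
    target = len(order) * gpu_rate / (cpu_rate + gpu_rate)
    cut = min(int(target), gpu_capacity)    # GPU memory is the hard cap
    return order[:cut], order[cut:]         # (gpu_vertices, cpu_vertices)

# Example: GPU processes vertices ~4x faster but fits only 600,000 of them:
# gpu_part, cpu_part = partition_vertices(degree_list, 600_000, 1.0, 4.0)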
The open nature of the Web, online social networks (OSNs) in particular, makes it possible to design socialbots: automation software that controls fake accounts in a target OSN and has the ability to perform basic activities similar to those of real users. In the wrong hands, socialbots can be used to infiltrate online communities, build up trust over time, and then engage in various malicious activities. This dissertation presents an in-depth security analysis of malicious socialbots on the Web, OSNs in particular. The analysis focuses on two main goals: (1) to characterize and analyze the vulnerability of OSNs to cyber attacks by malicious socialbots, social infiltration in particular, and (2) to design and evaluate a countermeasure to efficiently and effectively defend against socialbots. To achieve these goals, we first studied social infiltration as an organized campaign operated by a socialbot network (SbN): a group of programmable socialbots that are coordinated by an attacker in a botnet-like fashion. We implemented a prototypical SbN consisting of 100 socialbots and operated it on Facebook for 8 weeks. Among various findings, we observed that some users are more likely to become victims than others, depending on factors related to their social structure. Moreover, we found that traditional OSN defenses are not effective at identifying automated fake accounts or their social infiltration campaigns. Based on these findings, we designed Íntegro, an infiltration-resilient defense system that helps OSNs detect automated fake accounts via a user ranking scheme. In particular, Íntegro relies on a novel approach that leverages victim classification for robust graph-based fake account detection, with provable security guarantees. We implemented Íntegro on top of widely used, open-source distributed systems, in which it scaled nearly linearly. We evaluated Íntegro against SybilRank, the state of the art in graph-based fake account detection, using real-world datasets and a large-scale, production-class deployment at Tuenti, the largest OSN in Spain with more than 15 million users. We showed that Íntegro significantly outperforms SybilRank in ranking quality, allowing Tuenti to detect at least 10 times more fake accounts than their current abuse detection system.
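The ranking idea can be sketched compactly. The following Python is a toy reconstruction in the spirit of victim-aware, early-terminated trust propagation; the edge-weighting rule, the iteration count, and all inputs (edges, victim_prob, trusted_seeds) are illustrative assumptions, not Íntegro's actual algorithm.

import math
from collections import defaultdict

def rank_users(edges, victim_prob, trusted_seeds):
    # Discount edges incident to likely victims so that trust leaking
    # from the real region into fake accounts (via victims) is throttled.
    graph = defaultdict(list)
    for u, v in edges:
        w = max(0.05, 1.0 - max(victim_prob.get(u, 0.0), victim_prob.get(v, 0.0)))
        graph[u].append((v, w))
        graph[v].append((u, w))

    nodes = list(graph)
    trust = {n: 0.0 for n in nodes}
    for s in trusted_seeds:
        trust[s] = 1.0 / len(trusted_seeds)

    # Early termination (O(log n) iterations) keeps trust concentrated
    # near the seeds instead of fully mixing into weakly attached regions.
    for _ in range(max(1, math.ceil(math.log2(len(nodes))))):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            total = sum(w for _, w in graph[u])
            for v, w in graph[u]:
                nxt[v] += trust[u] * w / total
        trust = nxt

    # Degree-normalize so well-connected real users are ranked fairly.
    return {n: trust[n] / sum(w for _, w in graph[n]) for n in nodes}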
This dissertation focuses on supporting the provisioning and configuration of distributed storage systems in clusters of computers designed to provide a high-performance computing platform for batch applications. These platforms typically offer a centralized, persistent backend storage system. To avoid the potential bottleneck of accessing the platform's backend storage, intermediate storage systems aggregate resources allocated to the application to provide a shared temporary storage space dedicated to the application's execution. Configuring an intermediate storage system, however, is increasingly complex: as a distributed storage system, it can employ a wide range of storage techniques that enable workload-dependent trade-offs over interrelated success metrics such as response time, throughput, storage space, and energy consumption. Because it is co-deployed with the application, it offers the user the opportunity to tailor its provisioning and configuration to extract the maximum performance from the infrastructure. For example, the user can optimize performance by deciding the total number of nodes in an allocation, splitting these nodes, or not, between the application and the intermediate storage, and choosing the values of several configuration parameters for storage techniques with different trade-offs. This dissertation targets the problem of supporting the configuration and provisioning of intermediate storage systems in the context of workflow-based scientific applications that communicate via files -- also known as many-task computing -- as well as checkpointing applications. Specifically, this study proposes performance prediction mechanisms that estimate the performance of the overall application or of individual storage operations (e.g., the application's turnaround time, its energy consumption, or the response time of write operations). By relying on the target application's characteristics, the proposed mechanisms can accelerate the exploration of the configuration space. The mechanisms use monitoring information available at the application level, requiring neither changes to the storage system nor specialized monitoring systems. The effectiveness of these mechanisms is evaluated in a number of scenarios, including different system scales, hardware platforms, and configuration choices. Overall, the mechanisms provide accuracy high enough to support the user's configuration and provisioning decisions, while being 200x to 2000x less resource-intensive than running the actual applications.
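As a concrete (toy) illustration of such a predictor, the Python sketch below estimates turnaround time for different splits of an allocation between compute and storage nodes. The stage profile, the per-node bandwidth constant, and the additive compute/I-O model are all assumptions made for illustration; the dissertation's actual mechanisms are driven by application-level monitoring data.

def predict_turnaround(total_nodes, storage_nodes, stages, node_bw=500e6):
    """Toy model: each workflow stage computes, then reads/writes through
    the intermediate storage, whose aggregate bandwidth scales with the
    number of nodes dedicated to it. All constants are placeholders."""
    compute_nodes = total_nodes - storage_nodes
    t = 0.0
    for cpu_seconds, bytes_read, bytes_written in stages:
        t += cpu_seconds / compute_nodes                     # perfectly parallel compute
        t += (bytes_read + bytes_written) / (storage_nodes * node_bw)
    return t

# Explore the configuration space: how many of 16 nodes to give to storage?
stages = [(3200.0, 40e9, 80e9), (1600.0, 80e9, 10e9)]        # made-up stage profile
best_split = min(range(1, 16), key=lambda s: predict_turnaround(16, s, stages))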
Commons-based peer production systems are marked by three main characteristics: they are radically decentralized, non-proprietary, and collaborative. Peer production stands in stark contrast to production based on markets or on a centralized organization (e.g., carpooling vs. car rental; couch surfing vs. hotels; Wikipedia vs. Encyclopedia Britannica). Social tagging systems represent a class of web systems in which peer production is central to the design. In these systems, decentralized users collect, share, and annotate (or tag) content collaboratively to produce a public pool of annotated content. This uncoordinated effort helps fill the demand for labeling an ever-increasing amount of user-generated content on the web with textual information. Moreover, these labels (or simply tags) can be valuable as input to mechanisms such as personalized search or content promotion. Assessing the value of individuals' contributions to peer production systems is key to designing user incentives that bring in high-quality contributions. However, quantifying the value of peer-produced information such as tags is intrinsically challenging, as the value of information is inherently contextual and multidimensional. This research aims to address these two issues in the context of social tagging systems. To this end, this study sets forth the following hypothesis: assessing the value of peer-produced information in social tagging systems can be achieved by harnessing context and user behavior characteristics. The following questions guide the investigation. Characterization: (Q1) What are the characteristics of individual user activity? (Q2) What are the characteristics of social user activity? (Q3) What aspects influence users' perception of tag value? Design: (Q4) How can the value of tags for exploratory search be assessed? (Q5) What is the value of peer-produced information for content promotion? This study applies a mixed-methods approach. The findings show that patterns of user activity can inform the design of supporting mechanisms for tagging systems. The results also suggest that the proposed method for assessing tag value can differentiate valuable tags from less valuable ones, as perceived by users. Finally, the analysis of the value of peer-produced information for content promotion shows that peer-produced sources can oftentimes outperform expert-produced sources.
Distributed storage system middleware acts as a bridge between the upper-layer applications and the lower-layer storage resources available in the deployment platform. Storage systems are expected to efficiently support the applications' workloads while reducing the cost of the storage platform. In this context, two factors increase the complexity of storage system design. First, application workloads are diverse along a number of axes: read/write access patterns, data compressibility, and security requirements, to mention only a few. Second, a storage system should provide high performance within a given dollar budget. This dissertation addresses two interrelated issues in this design space. First, can the computational power of commodity massively multicore devices be exploited to accelerate storage system operations without increasing the platform cost? Second, is it possible to build a storage system that can support a diverse set of applications yet be optimized for each one of them? This work provides evidence that, for some system designs and workloads, significant performance gains can be obtained by exploiting massively multicore devices and by optimizing the storage system for a specific application. Further, this work demonstrates that these gains are possible while still supporting the POSIX API and without requiring changes to the application. Finally, while these two issues can be addressed independently, a system that includes solutions to both of them enables significant synergies.
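One storage operation that benefits from this kind of offloading is content fingerprinting for deduplication. The Python sketch below uses a process pool as a stand-in for a massively multicore accelerator; fixed-size chunking and SHA-1 are simplifying assumptions made for illustration, not the dissertation's design.

import hashlib
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024   # fixed-size chunks; real systems may chunk by content

def fingerprint(chunk: bytes) -> str:
    # The hash doubles as the chunk's identity for deduplication.
    return hashlib.sha1(chunk).hexdigest()

def fingerprint_file(path: str) -> list:
    with open(path, "rb") as f:
        chunks = iter(lambda: f.read(CHUNK_SIZE), b"")
        # The pool stands in for the accelerator: hashing is embarrassingly
        # parallel, so throughput scales with the number of hashing engines.
        with ProcessPoolExecutor() as pool:
            return list(pool.map(fingerprint, chunks))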
Authorization protects application resources by allowing only authorized entities to access them. Existing authorization solutions are widely based on the request-response model, where a policy enforcement point intercepts application requests, obtains authorization decisions from a remote policy decision point, and enforces those decisions. This model enables sharing the decision point as an authorization service across multiple applications. However, with many requests and resources, using a remote shared decision point leads to increased latency and presents the risk of introducing a bottleneck and/or a single point of failure. This dissertation presents three approaches to addressing these problems. The first approach introduces and evaluates mechanisms for authorization recycling in role-based access control systems. The algorithms that support these mechanisms allow a local secondary decision point not only to reuse previously cached decisions but also to infer new, correct decisions based on two simple rules, thereby masking possible failures of the central authorization service and reducing network delays. Our evaluation results suggest that authorization recycling improves the availability and performance of distributed access control solutions. The second approach explores a cooperative authorization recycling system, where each secondary decision point shares its ability to make decisions with others through a discovery service. Our system does not require cooperating secondary decision points to trust each other. To maintain cache consistency at multiple secondary decision points, we propose alternative mechanisms for propagating update messages. Our evaluation results suggest that cooperation further improves the availability and performance of authorization infrastructures. The third approach examines the use of a publish-subscribe channel for delivering authorization requests and responses between policy decision points and enforcement points. By removing the enforcement points' dependence on a particular decision point, this approach helps improve system availability, which our analysis confirms, and reduces system administration and development overhead. We also propose several subscription schemes for different deployment environments and study them using a prototype system. We finally show that combining these three approaches can further improve authorization system availability and performance, for example, by achieving a unified cooperation framework and using speculative authorizations.
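To make the recycling idea concrete, here is a minimal Python sketch of a secondary decision point. The role-level caching rule and the PDP interface (a callable returning a decision together with the role it was granted under) are hypothetical simplifications for illustration, not the dissertation's exact algorithms.

class SecondaryDecisionPoint:
    def __init__(self, user_roles):
        self.user_roles = user_roles    # user -> set of role names
        self.role_cache = {}            # (role, resource, action) -> bool

    def check(self, user, resource, action, remote_pdp):
        # Recycling rule (illustrative): a decision observed for one user
        # acting in role r applies to every user who holds r, so it can
        # be reused locally without contacting the central service.
        for role in self.user_roles.get(user, ()):
            cached = self.role_cache.get((role, resource, action))
            if cached is not None:
                return cached           # no network round trip; PDP may even be down
        try:
            # remote_pdp is assumed to return (decision, deciding_role).
            decision, role = remote_pdp(user, resource, action)
        except ConnectionError:
            return False                # fail-safe: deny when the PDP is unreachable
        self.role_cache[(role, resource, action)] = decision
        return decision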
The importance of high-performance graph processing for solving big data problems in high-impact applications is greater than ever before. Graphs incur highly irregular memory accesses, which lead to poor data locality, load imbalance, and data-dependent parallelism. Distributed graph processing frameworks, such as Google's Pregel, that employ memory-parallel, shared-nothing systems have experienced tremendous success in terms of scale and performance. Modern shared-memory systems embrace the so-called Non-Uniform Memory Access (NUMA) architecture, which has proven more scalable (in terms of numbers of cores and memory modules) than the Symmetric Multiprocessing (SMP) architecture. In many ways, a NUMA system resembles a shared-nothing distributed system: it has physically distinct processing cores and memory regions (although cache-coherent in NUMA), and memory accesses to remote NUMA domains are more expensive than local accesses. This presents the opportunity, surprisingly little explored so far, to transfer the know-how and design of distributed graph processing to shared-memory graph processing solutions optimized for NUMA systems. In this dissertation, we explore whether a distributed-memory-like middleware that makes graph partitioning and communication between partitions explicit can improve performance on a NUMA system. We design and implement a NUMA-aware graph processing framework that treats the NUMA platform as a distributed system and embraces its design principles, in particular explicit partitioning and inter-partition communication. We further explore design trade-offs to reduce communication overhead and propose a solution that embraces the design philosophies of distributed graph processing systems while exploiting optimization opportunities specific to single-node systems. We demonstrate up to 13.9x speedup over a state-of-the-art NUMA-aware framework, Polymer, and up to 3.7x scalability on a four-socket machine using graphs with tens of billions of edges.
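The following Python sketch illustrates the shared-nothing treatment of a NUMA machine: each partition owns its vertices and communicates only through per-destination message buffers, mirroring the explicit bulk exchange a NUMA runtime would perform. Everything here (the class layout, the superstep loop, the accumulate-style messages) is a toy illustration rather than the framework's actual implementation.

from collections import defaultdict

class Partition:
    """One NUMA domain's share of the graph (shared-nothing style)."""
    def __init__(self, pid, state, out_edges, owner):
        self.pid = pid
        self.state = state            # vertex id -> value (e.g., rank)
        self.out_edges = out_edges    # vertex id -> list of destination ids
        self.owner = owner            # vertex id -> owning partition id
        self.outbox = defaultdict(list)

    def superstep(self):
        # Compute locally; route messages into per-partition buffers so
        # that remote-domain traffic can later be flushed in bulk.
        for v, value in self.state.items():
            targets = self.out_edges.get(v, [])
            for dst in targets:
                self.outbox[self.owner[dst]].append((dst, value / len(targets)))

def exchange(partitions):
    # In a real NUMA runtime each buffer would live in the destination
    # domain's memory to amortize remote-access cost; here we move lists.
    for src in partitions:
        for dst in partitions:
            for v, msg in src.outbox.pop(dst.pid, []):
                dst.state[v] += msg   # accumulate incoming contributions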
As workflow-based, data-intensive applications have become increasingly popular, the lack of support tools to aid resource provisioning decisions, to estimate the energy cost of running such applications, or simply to support configuration choices has become increasingly evident. The goal of this thesis is to design techniques and tools to predict the energy consumption of these workflow-based applications, evaluate different optimization techniques from an energy perspective, and explore energy/performance tradeoffs. This thesis proposes a methodology to predict the energy consumption of workflow applications. More concretely, it makes three key contributions. First, it proposes a simple analytical energy consumption model that enables adequately accurate energy consumption predictions. This makes it possible not only to estimate energy consumption but also to reason about the relative benefits that different system configuration and provisioning decisions offer. Second, an empirical evaluation of energy consumption is carried out using synthetic benchmarks and real workflow applications. This evaluation quantifies the energy savings of performance optimizations for the distributed storage system, as well as the energy and performance impact of power-centric tuning techniques. Third, it demonstrates the predictor's ability to expose energy/performance tradeoffs for the synthetic benchmarks and workflow applications by evaluating the accuracy of the energy consumption predictions. Overall, the predictions achieved an average accuracy of more than 85% and a median of 90% across different scenarios, while using 200x fewer resources than running the actual applications.
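A minimal analytical model in this spirit can be written in a few lines of Python. The linear idle-plus-dynamic structure follows the description above; the power coefficients and the time breakdown are invented placeholders, not the thesis's calibrated values.

def predict_energy(nodes, t_compute, t_io, t_idle,
                   p_idle=70.0, p_cpu_extra=45.0, p_io_extra=15.0):
    """Energy (joules) = static energy drawn for the whole makespan plus
    dynamic increments while computing or doing storage I/O.
    Times are in seconds per node, powers in watts (placeholder values)."""
    makespan = t_compute + t_io + t_idle
    e_static = nodes * p_idle * makespan
    e_dynamic = nodes * (p_cpu_extra * t_compute + p_io_extra * t_io)
    return e_static + e_dynamic

# Compare two provisioning choices for the same workload (made-up numbers):
baseline = predict_energy(nodes=16, t_compute=600, t_io=300, t_idle=100)
tuned    = predict_energy(nodes=20, t_compute=480, t_io=180, t_idle=60)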
This thesis is motivated by the urgent need to run scientific many-task workflow applications efficiently and easily on large-scale machines. These applications run at large scale on supercomputers and perform large amounts of storage I/O. The storage system is identified as the main bottleneck for many-task workflow applications on large-scale computers. The goal of this thesis is to identify opportunities and recommend solutions to improve the performance of many-task workflow applications. To achieve this goal, the thesis proposes a two-step solution. As the first step, it recommends and designs an intermediate storage system that aggregates the resources available on compute nodes (local disks, SSDs, memory, and the network) and provides the minimal POSIX API required by workflow applications. An intermediate storage system provides a high-performance scratch space for workflow applications and allows the applications to scale transparently compared to a regular shared storage system. As the second step, the thesis performs a limit study of a workflow-aware storage system: an intermediate storage system that is tuned to the I/O characteristics of a workflow application. Evaluation with synthetic and real workflow applications highlights the significant performance gains attainable by an intermediate storage system and a workflow-aware storage system. The evaluation shows that an intermediate storage system can bring up to a 2x performance gain compared to a central storage system. Further, a workflow-aware storage system can bring up to a 3x performance gain compared to a vanilla distributed storage system that is unaware of possible file-level optimizations. The findings of this research show that an intermediate storage system with a minimal POSIX API is a promising direction for providing a high-performance, scalable storage system for workflow applications. The findings also strongly advocate for, and provide design recommendations toward, a workflow-aware storage system that achieves further performance gains.
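The kind of file-level optimization a workflow-aware storage system can apply is easy to sketch. In the Python fragment below, the pattern names and the policy knobs are hypothetical examples of per-file hints; they illustrate the idea rather than reproduce the thesis's design.

# Map a file's role in the workflow to a storage policy (all values illustrative).
POLICIES = {
    "pipeline":  {"placement": "local-to-producer", "replicas": 1},  # consumed on the same node
    "broadcast": {"placement": "replicate",         "replicas": 4},  # one writer, many readers
    "reduce":    {"placement": "collocate",         "replicas": 1},  # many files, one reader
    "scatter":   {"placement": "stripe",            "replicas": 1},  # disjoint regions, many readers
}

def policy_for(file_hint):
    # Fall back to a generic striped layout when no hint is available,
    # i.e., behave like a vanilla distributed storage system.
    return POLICIES.get(file_hint, {"placement": "stripe", "replicas": 2})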
The retrieval and analysis of malicious content is an essential task for security researchers. Security labs use automated HTTP clients known as client honeypots to visit hundreds of thousands of suspicious URLs daily. The dynamic nature of malware distribution networks necessitates periodic re-evaluation of a subset of the confirmed malicious sites, which introduces two problems: 1) the number of URLs requiring re-evaluation exhausts available resources, and 2) repeated evaluation exposes the system to adversarial blacklisting, which affects the accuracy of the content collected. To address these problems, I propose optimizations to the re-evaluation logic that reduce the number of re-evaluations while maintaining a constant sample discovery rate during URL re-evaluation. I study these problems in two adversarial scenarios: 1) monitoring malware repositories where no provenance is available, and 2) monitoring Fake Anti-Virus (AV) distribution networks. I perform a study of the adversary by repeatedly fetching content from the distribution networks. This reveals trends in the update patterns and lifetimes of the distribution sites and malicious executables. Using these observations, I propose optimizations that reduce the number of re-evaluations necessary to maintain a high malicious sample discovery rate. In the first scenario, the proposed techniques, when evaluated against a fixed-interval scheduler, are shown to reduce the number of re-evaluations by 80-93% (assuming a re-evaluation interval of 1 hour to 1 day), with a corresponding impact on the sample discovery rate of only 2-7%. In the second scenario, the proposed optimizations are shown to reduce fetch volume by orders of magnitude and, more importantly, reduce the likelihood of blacklisting. During direct evaluation of malware repositories I observe multiple instances of blacklisting, but on the whole, less than 1% of the repositories studied show evidence of blacklisting. Fake AV distribution networks actively blacklist IPs; I encountered repeated occurrences of IP blacklisting while monitoring these networks.
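An adaptive re-evaluation policy of the kind described can be sketched in a few lines of Python. The multiplicative back-off on unchanged content is an illustrative stand-in for the proposed optimizations; the interval bounds and the content-digest signal are assumptions, not the dissertation's exact scheduler.

import time

class AdaptiveScheduler:
    """Back off re-evaluation of stable URLs; re-engage when content changes."""
    def __init__(self, base=3600.0, ceiling=86400.0, factor=2.0):
        self.base, self.ceiling, self.factor = base, ceiling, factor
        self.state = {}   # url -> (interval_s, next_due_ts, last_digest)

    def report(self, url, digest, now=None):
        # Call after every fetch with a digest of the retrieved content.
        now = now if now is not None else time.time()
        interval, _, last = self.state.get(url, (self.base, 0.0, None))
        if digest == last:
            interval = min(interval * self.factor, self.ceiling)  # stable: fetch less often
        else:
            interval = self.base                                  # changed: watch closely again
        self.state[url] = (interval, now + interval, digest)

    def due(self, now=None):
        # URLs whose next evaluation time has passed.
        now = now if now is not None else time.time()
        return [u for u, (_, t, _) in self.state.items() if t <= now]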