Alan Wagner

Associate Professor

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - Nov 2019)
FG-MPI: Fine-Grain MPI (2013)

The Message Passing Interface (MPI) is widely used to write sophisticated parallel applications ranging from cognitive computing to weather prediction, and it is almost universally adopted for High Performance Computing (HPC). Many popular MPI implementations bind MPI processes to OS processes. This runtime model closely matched single- and multi-processor compute clusters. Since 2008, however, clusters of multicore nodes have been the predominant architecture for HPC, offering the opportunity for parallelism inside one compute node. There are a number of popular parallel programming languages for multicore that use message passing. One notable difference between MPI and these languages is the granularity of the MPI processes. Processes written using MPI tend to be coarse-grained and designed to match the number of processes to the available hardware rather than to the program structure. Binding MPI processes to OS processes fails to take full advantage of the finer-grain parallelism available on today's multicore systems. Our goal was to take the type of runtime system used by fine-grain languages and integrate it into MPI to obtain the best of both programming models: the ability to have fine-grain parallelism while maintaining MPI's rich support for communication inside clusters.

Fine-Grain MPI (FG-MPI) is a system that extends the execution model of MPI to include interleaved concurrency through integration into the MPI middleware. FG-MPI is integrated into the MPICH2 middleware, an open-source, production-quality implementation of MPI. The FG-MPI runtime uses coroutines to implement lightweight MPI processes that are non-preemptively scheduled by its MPI-aware scheduler. The use of coroutines enables fast context switching and low communication and synchronization overhead.
FG-MPI enables the expression of finer-grain function-level parallelism, which allows for flexible process mapping and scalability, and can lead to better program performance. We have demonstrated FG-MPI's ability to scale to over 100 million MPI processes on a large cluster of 6,480 cores. This is the first time any system has executed such a large number of MPI processes, and this capability will be useful in exploring scalability issues of the MPI middleware as systems move towards compute clusters with millions of processor cores.
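The coroutine scheduling idea described above can be sketched with Python generators standing in for coroutines. This is an illustrative sketch only, not FG-MPI's actual API (FG-MPI is implemented in C inside MPICH2): the `Scheduler` class, `spawn`, and the yield protocol are assumptions chosen to show how lightweight processes are non-preemptively descheduled on a blocking receive and resumed once a matching send arrives.

```python
from collections import deque

class Scheduler:
    """Non-preemptive scheduler for lightweight 'MPI processes' (coroutines).

    A coroutine yields ('send', dst, msg) to deliver a message and keep
    running, or ('recv', src) to block until a message arrives. The source
    rank is ignored in this sketch: each rank has a single FIFO mailbox.
    """
    def __init__(self):
        self.ready = deque()   # (rank, coroutine, value to resume with)
        self.waiting = {}      # rank blocked on recv -> coroutine
        self.mailbox = {}      # rank -> FIFO of pending messages

    def spawn(self, rank, fn):
        self.mailbox[rank] = deque()
        self.ready.append((rank, fn(rank), None))  # None primes the generator

    def run(self):
        while self.ready:
            rank, co, value = self.ready.popleft()
            try:
                op = co.send(value)
            except StopIteration:
                continue                            # process finished
            if op[0] == 'send':
                _, dst, msg = op
                self.mailbox[dst].append(msg)
                if dst in self.waiting:             # unblock the receiver
                    wco = self.waiting.pop(dst)
                    self.ready.append((dst, wco, self.mailbox[dst].popleft()))
                self.ready.append((rank, co, None))  # sender stays runnable
            else:  # ('recv', src)
                if self.mailbox[rank]:               # message already queued
                    self.ready.append((rank, co, self.mailbox[rank].popleft()))
                else:
                    self.waiting[rank] = co          # deschedule until a send

log = []

def worker(rank):
    if rank == 0:
        yield ('send', 1, 'hello')
    else:
        msg = yield ('recv', 0)
        log.append((rank, msg))

sched = Scheduler()
sched.spawn(0, worker)
sched.spawn(1, worker)
sched.run()
```

Because the processes are cooperatively scheduled inside one OS process, a "context switch" is just a generator resume, which is the sense in which coroutines make context switching fast and synchronization cheap.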

Transport level features for commodity clusters (2011)

There is a need for systems to provide additional processing to extract useful information from the growing amounts of data. High Performance Computing (HPC) techniques use large clusters comprised of commodity hardware and software to provide the necessary computation when a single machine does not suffice. In addition to the increase in data, there have been other architectural changes, like the advent of multicore and the presence of multiple networks on a single compute node, yet the commodity transport protocols in use have not adapted. It is therefore an opportune time to revisit the question of which transport features are necessary to best support today's applications. We use the Message Passing Interface (MPI), popular in HPC, to provide support for large-scale parallel applications. We propose transport-protocol features to overcome the problems with reliability, performance, and design simplicity that exist in Ethernet-based commodity clusters. We use the Stream Control Transmission Protocol (SCTP) as a vehicle to implement tools having the proposed transport features for MPI. We develop several SCTP-based MPI implementations and a full-featured userspace SCTP stack, and we enable the execution of unmodified MPI programs over a simulated network and SCTP implementation. The tools themselves provide the HPC and networking communities the means to utilize improved transport features for MPI by way of SCTP. The tools developed in this thesis are used to show that the proposed transport features enable further capabilities regarding the performance, reliability, and design simplicity of MPI applications running on Ethernet-based cluster systems constructed out of commodity components.
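One concrete transport feature in play here is SCTP's multi-streaming. A minimal sketch, under assumptions (the function name and mixing constants below are illustrative, not the thesis's implementation): MPI messages can be mapped onto SCTP streams by context and tag, so that SCTP's per-stream ordering preserves MPI's ordering guarantees where they are required, while loss on one stream does not head-of-line block unrelated message flows.

```python
def tag_to_stream(context_id, tag, num_streams=10):
    """Map an MPI message's (context, tag) pair onto an SCTP stream number.

    Hypothetical mixing scheme: any deterministic mapping works, as long as
    messages with the same context and tag always share a stream, so their
    relative order is preserved by SCTP's per-stream delivery.
    """
    return (context_id * 8191 + tag) % num_streams

# Same (context, tag) -> same stream, so those messages stay ordered:
s_a = tag_to_stream(0, 42)
s_b = tag_to_stream(0, 42)
# A different tag is free to land on a different stream, avoiding
# head-of-line blocking between unrelated flows when a packet is lost:
s_c = tag_to_stream(0, 43)
```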

Master's Student Supervision (2010 - 2018)
CMCMPI: Compose-Map-Configure MPI (2015)

In order to manage the complexities of Multiple Program, Multiple Data (MPMD) program deployment and to optimize for performance, we propose (CM)²PI as a specification and tool that employs a four-stage approach to create a separation of concerns between distinct decisions: architecture interactions, software size, resource constraints, and function. With function-level parallelism in mind, we use multi-level compositions to create a scalable architecture specification and to improve re-usability and encapsulation. We explore different ways to abstract communication away from the tight coupling of MPI ranks and placement. One of the methods proposed is flow-controlled channels, which also aim at tackling the common issues of buffer limitations and termination. The specification increases compatibility with optimization tools, which enables the automatic optimization of program run time with respect to resource constraints. Together these features simplify the development of MPMD MPI programs.
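The flow-controlled channel idea can be illustrated with a small sketch. This is an assumption-laden simplification, not the (CM)²PI specification itself: a bounded buffer stands in for flow control (a send fails when no buffer credit is available), and a sentinel value signals termination so the receiver knows the stream is done.

```python
from collections import deque

class FlowChannel:
    """Bounded channel sketch with credit-style flow control and explicit
    termination. Class and method names are illustrative assumptions."""
    _DONE = object()                  # termination sentinel

    def __init__(self, capacity=4):
        self.capacity = capacity      # free buffer slots = sender credits
        self.buf = deque()
        self.closed = False

    def send(self, item):
        if len(self.buf) >= self.capacity:
            raise BufferError("no credits: receiver must drain the buffer")
        self.buf.append(item)

    def close(self):
        # the sentinel bypasses flow control so close() always succeeds
        self.buf.append(self._DONE)

    def recv(self):
        # caller must not recv from an empty, unclosed channel in this sketch
        item = self.buf.popleft()     # draining returns a credit to the sender
        if item is self._DONE:
            self.closed = True
            return None
        return item

ch = FlowChannel(capacity=2)
ch.send("a")
ch.send("b")        # buffer now full: a third send would raise BufferError
first = ch.recv()   # draining one item frees a slot for the sender
ch.close()
```

The bounded buffer addresses the buffer-limitation issue (a producer cannot outrun a slow consumer unboundedly), and the sentinel addresses termination (the consumer learns the channel is finished in-band).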

Distributed skip list in fine-grain message passing interface (2014)

With scalability in mind, we have implemented a pure message-passing distributed data structure ideal for range queries. The structure is a distributed skip list that takes full advantage of Fine-Grain MPI to execute at a variety of scales, from a single multicore machine to multiple machines across clusters. The skip list data structure provides a parallel implementation of range queries. Our implementation architecture is based on several services layered on top of each other, which simplifies the effort required in building distributed code. Unlike concurrent data structures, the distributed skip list's operations are deterministic and atomic. The layered service architecture of our implementation exposes several control parameters that make it easy to distribute load, apply operation flow control, vary granularity at different layers, and tune performance to the underlying machine and network. Range queries are implemented in a way that parallelizes the operation and takes advantage of the recursive properties of the skip list structure. We investigate a shortcut mechanism that alleviates the bottleneck at the head and introduces semantic trade-offs between performance and consistency. We report on the performance of the skip list on a medium-sized cluster of two hundred cores with twenty thousand processes.
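The range-query idea at the heart of the structure can be shown on a sequential skip list. This sketch is only the single-process data structure (the thesis distributes the list across FG-MPI processes and parallelizes the query), but the traversal is the same: descend the levels to the first key at or above the lower bound, then walk the bottom level until the upper bound is passed.

```python
import random

class SkipList:
    """Sequential skip list sketch supporting insert and range queries."""
    MAX_LEVEL = 8

    class Node:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * (level + 1)

    def __init__(self):
        self.head = self.Node(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = self.Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def range_query(self, lo, hi):
        node = self.head
        for i in range(self.level, -1, -1):   # descend to first key >= lo
            while node.forward[i] and node.forward[i].key < lo:
                node = node.forward[i]
        node = node.forward[0]
        out = []
        while node and node.key <= hi:        # walk the bottom level
            out.append(node.key)
            node = node.forward[0]
        return out

random.seed(1)
sl = SkipList()
for k in [30, 10, 50, 20, 40]:
    sl.insert(k)
# The result is independent of the random level choices: [20, 30, 40]
result = sl.range_query(15, 45)
```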

Non-linear contextual bandits (2012)

The multi-armed bandit framework can be motivated by any problem where there is an abundance of choice and the utility of trying something new must be balanced with that of going with the status quo. This is a trade-off present in the everyday problem of where and what to eat: should I try a new restaurant or go to that Chinese place on the corner? In this work, a multi-armed bandit algorithm is presented which uses a non-parametric non-linear data model (a Gaussian process) to solve problems of this sort. The advantages of this method over existing work are highlighted through experiments. The method is also capable of modelling correlations between separate instances of problems, e.g., between similar dishes at similar restaurants. To demonstrate this, a few experiments are performed. The first, a synthetic example where the reward function is actually sampled from a Gaussian process, begs the question but helps pin down the properties of the algorithm in a controlled environment. The second, a problem where the objective is to aim a cannon at a distant target, shows how a well-defined objective, i.e., hitting the target, can be used to speed up convergence. Finally, the third, an experiment with photographic post-processing, shows how the algorithm can learn from experience. The experiments demonstrate both the flexibility and the computational complexity of the model. This complexity means that problems such as the aforementioned restaurant problem, among others, remain future work.
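The explore/exploit trade-off described above can be made concrete with the classic UCB1 rule on independent discrete arms. To be clear, this is a simplified stand-in, not the thesis's method: the thesis replaces the independent-arms assumption with a Gaussian process so that information is shared between correlated arms (similar dishes at similar restaurants).

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1 sketch: pick the arm with the best mean reward plus an
    exploration bonus that shrinks as the arm is visited more often.
    `pull(arm)` returns a stochastic reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t                       # try every arm once first
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                              + math.sqrt(2 * math.log(t + 1) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm)
    return counts

random.seed(0)
means = [0.2, 0.5, 0.8]   # Bernoulli arms; arm 2 is best
counts = ucb1(lambda a: 1.0 if random.random() < means[a] else 0.0, 3, 500)
```

Over the horizon, the arm with the highest mean reward typically accumulates most of the pulls while the others are sampled just often enough to rule them out; the Gaussian-process variant generalizes this bonus to a continuum of correlated choices.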

Supporting a Process-oriented Model in MPI Through Fine-Grain Mapping (2012)

The need for intuitive parallel programming designs has grown with the rise of modern many-core processors. Process-oriented models promote high scalability by using small isolated components that interact to produce complex applications. Such models are intuitive because they force scalability to be a design requirement. The popular MPI messaging library has not exploited fine-grain models due to its coarse-grain implementations. The binding of processes often uses a coarse-grain management system, in which ranks are sequentially assigned to a list of machines. This may be suitable for coarse-grain applications, but it is inadequate for fine-grain applications with large process-grouping demands; a more flexible, manageable and scalable specification is necessary to support a process-oriented model.

The use of FG-MPI exposes additional concurrency through a new layer of mapping by providing smaller units of parallelism: a desirable feature in function-level programming. This collocation layer requires a fine-grain mapping mechanism to optimize communication. A graph specification is proposed that allows communication patterns, collocation of MPI processes, and binding optimizations to be extracted from such a structure.

The work presented extends MPI to a fine-grain process-oriented model and provides a graphical mapping and binding specification. Evaluation of function-level applications is done through Pilot, a CSP-like library for MPI. The smallest unit of parallelism in this architecture is evaluated, showing that only small communication overheads occur when comparing hundreds of large tasks to hundreds of thousands of fine-grain tasks. The graph representation is based on Kahn Process Networks. This provides a simple and intuitive model to represent and organize large function-level applications.
A tool is developed that reads in a graph structure and performs operations such as auto-constructing wiring-diagram code, determining optimal collocated maps based on communication, and producing a binding specification. This tool is modular and extensible to other graph-related operations.
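One of the tool's operations, determining collocated maps from communication patterns, can be sketched as a greedy grouping. The function below is a hypothetical simplification, not the thesis's tool: given the communication graph as weighted edges, it merges the most heavily communicating MPI processes into the same OS process (up to a group-size limit) so their messages stay in shared memory.

```python
def collocate(n_procs, edges, group_size):
    """Greedy collocation sketch: edges are (p, q, weight) triples from the
    communication graph; heavier edges are collocated first."""
    group_of = {p: None for p in range(n_procs)}
    groups = []
    for p, q, _w in sorted(edges, key=lambda e: -e[2]):   # heaviest first
        if group_of[p] is None and group_of[q] is None:
            group_of[p] = group_of[q] = len(groups)
            groups.append([p, q])
        elif group_of[p] is None and len(groups[group_of[q]]) < group_size:
            groups[group_of[q]].append(p)
            group_of[p] = group_of[q]
        elif group_of[q] is None and len(groups[group_of[p]]) < group_size:
            groups[group_of[p]].append(q)
            group_of[q] = group_of[p]
    for p in range(n_procs):            # processes with no heavy neighbours
        if group_of[p] is None:
            group_of[p] = len(groups)
            groups.append([p])
    return groups

# A 4-process chain where 0<->1 and 2<->3 carry most of the traffic:
groups = collocate(4, [(0, 1, 100), (1, 2, 5), (2, 3, 100)], group_size=2)
```

With `group_size=2` the heavy pairs end up collocated and the light 1-2 edge crosses OS-process boundaries, which is the intended trade: intra-group messages avoid the network entirely.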

Scalability of Communicators in MPI (2011)

This thesis offers a novel framework for representing groups and communicators in Message Passing Interface (MPI) middleware. MPI is a widely used paradigm in cluster environments that supports communication between the nodes. In our framework, we have implemented and evaluated scalable techniques for groups and communicators in MPI. We have tested this framework using FG-MPI, a fine-grain version of MPI that scales to millions of MPI processes.

Groups in MPI are the primary means for creating communicators. A group map is the underlying structure that stores the processes participating in the communication. We introduce a framework for concise representations of the group map. This framework is based on the observation that a map can be decomposed into a set and a permutation. This decomposition allows us to use a compact set representation for the cases where a specific mapping is not required, i.e., lists with monotonically increasing order. In other cases, the representation adds a permutation as well.

A variety of set compression techniques have been used, and the framework is open to the integration of new representations. One advantage of such a decomposition is the ability to implicitly represent a set with set representations such as BDDs. BDDs and similar representations are well suited to the types of operations used in the construction of communicators. In addition to set representations for unordered maps, we incorporated Wavelet Trees on Runs, a library designed to represent permutations. We have also included general compression techniques in the framework, such as BWT. This allows some degree of compression in memory-constrained environments where there is no discernible pattern in the group structure.

We have investigated time and space trade-offs among the representations to develop strategies available to the framework. The strategies tune the framework based on the user's requirements. The first, the time strategy, optimizes the framework to be fast. The second optimizes the framework with regard to space. The final strategy is a hybrid of the two and tries to strike a reasonable trade-off between time and space. These strategies let the framework accommodate a wider range of applications and users.
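The set-plus-permutation decomposition can be sketched in a few lines. The representations here are deliberately simplified assumptions (the thesis uses BDDs, Wavelet Trees on Runs, and other compressed structures); the sketch shows only the decomposition itself: a strided member set compresses to a range triple, and a permutation is stored only when the map is not already in increasing order.

```python
def decompose(group_map):
    """Decompose a communicator's rank->process map into a compact set
    representation plus an optional permutation (sketch, not the thesis's
    framework API)."""
    ordered = sorted(group_map)
    # permutation taking the sorted set to the original map; omitted (None)
    # when the map is already monotonically increasing
    perm = (None if list(group_map) == ordered
            else [ordered.index(x) for x in group_map])
    # compact set representation when members form an arithmetic progression
    if len(ordered) > 1:
        stride = ordered[1] - ordered[0]
        if all(b - a == stride for a, b in zip(ordered, ordered[1:])):
            return ('range', ordered[0], stride, len(ordered)), perm
    return ('list', ordered), perm

# A block of consecutive ranks compresses to a range with no permutation:
rep1, perm1 = decompose([4, 5, 6, 7])
# A reordered map keeps the same compact set plus an explicit permutation:
rep2, perm2 = decompose([7, 4, 6, 5])
```

The point of the decomposition is that communicator construction operates mostly on the set part, where compressed representations compose cheaply, while the permutation is paid for only by communicators that actually reorder ranks.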

Solving correlation matrix completion problems using parallel differential evolution (2011)

Matrix completion problems have been receiving increased attention due to their varied applicability in different domains. Correlation matrices arise often in studying multiple streams of time-series data, such as in technical analysis of stock market data. Often some of the values in the matrix are unknown, and reasonable replacements have to be found quickly to avert an unwanted consequence or to keep pace with the business. After reviewing background research related to this problem, we propose a new parallel technique that can solve general correlation matrix completion problems over a set of computers connected by a high-speed network. We present results showing reduced execution time.
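A minimal sequential sketch of the approach, under stated assumptions: a single unknown entry of a 3x3 correlation matrix is recovered by differential evolution maximizing the determinant of the completed matrix (the classical max-determinant completion criterion). The thesis's contribution is running this kind of population-based search in parallel across networked computers, which this sketch does not show; the scalar DE/rand/1 scheme and all constants here are illustrative.

```python
import random

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fitness(x):
    """Higher is better: determinant of the completed correlation matrix,
    with x filling the symmetric missing pair (1,3)/(3,1)."""
    m = [[1.0, 0.6, x],
         [0.6, 1.0, 0.5],
         [x,   0.5, 1.0]]
    return det3(m)

def differential_evolution(f, lo, hi, pop_size=20, gens=60, F=0.7, CR=0.9):
    """Scalar DE/rand/1 sketch (sequential; the parallel version distributes
    the population evaluation across machines)."""
    pop = [random.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = random.sample(
                [p for j, p in enumerate(pop) if j != i], 3)
            trial = a + F * (b - c)              # mutation
            trial = min(hi, max(lo, trial))      # keep within bounds
            if random.random() > CR:             # crossover (scalar case)
                trial = pop[i]
            if f(trial) > f(pop[i]):             # greedy selection
                pop[i] = trial
    return max(pop, key=f)

random.seed(42)
# det(x) = 0.39 + 0.6x - x^2, so the search should converge near x = 0.3
best = differential_evolution(fitness, -1.0, 1.0)
```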

