Relevant Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - Nov 2019)
The Message Passing Interface (MPI) is widely used to write sophisticated parallel applications, ranging from cognitive computing to weather prediction, and is almost universally adopted for High Performance Computing (HPC). Many popular MPI implementations bind MPI processes to OS processes. This runtime model has closely matched single- or multi-processor compute clusters. Since 2008, however, clusters of multicore nodes have been the predominant architecture for HPC, with the opportunity for parallelism inside one compute node. There are a number of popular parallel programming languages for multicore that use message passing. One notable difference between MPI and these languages is the granularity of the MPI processes. Processes written using MPI tend to be coarse-grained and designed to match the number of processes to the available hardware, rather than the program structure. Binding MPI processes to OS processes fails to take full advantage of the finer-grain parallelism available on today's multicore systems. Our goal was to take advantage of the type of runtime systems used by fine-grain languages and integrate that into MPI to obtain the best of both programming models: the ability to have fine-grain parallelism, while maintaining MPI's rich support for communication inside clusters.
Fine-Grain MPI (FG-MPI) is a system that extends the execution model of MPI to include interleaved concurrency through integration into the MPI middleware. FG-MPI is integrated into the MPICH2 middleware, an open-source, production-quality implementation of MPI. The FG-MPI runtime uses coroutines to implement lightweight MPI processes that are non-preemptively scheduled by its MPI-aware scheduler. The use of coroutines enables fast context switching and low communication and synchronization overhead.
FG-MPI enables the expression of finer-grain, function-level parallelism, which allows for flexible process mapping and scalability and can lead to better program performance. We have demonstrated FG-MPI's ability to scale to over 100 million MPI processes on a large cluster of 6,480 cores. This is the first time any system has executed such a large number of MPI processes, and this capability will be useful in exploring scalability issues of the MPI middleware as systems move towards compute clusters with millions of processor cores.
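The coroutine runtime described above can be sketched in miniature. The following is a hedged Python analogy, not FG-MPI's actual C implementation: each "MPI process" is a generator, and a message-aware scheduler interleaves them non-preemptively, parking a rank when it receives with an empty mailbox and waking it when a message arrives. All names here are ours, for illustration only.

```python
from collections import deque

class FGScheduler:
    """Non-preemptive scheduler interleaving many lightweight "ranks"
    inside one OS process, in the spirit of FG-MPI's coroutine model."""

    def __init__(self):
        self.ready = deque()   # runnable: (rank, coroutine, value to resume with)
        self.inbox = {}        # rank -> messages delivered but not yet received
        self.blocked = {}      # rank -> coroutine parked on a receive

    def spawn(self, rank, body):
        self.inbox[rank] = deque()
        self.ready.append((rank, body(rank), None))

    def _deliver(self, dst, msg):
        if dst in self.blocked:                      # wake a parked receiver
            self.ready.append((dst, self.blocked.pop(dst), msg))
        else:
            self.inbox[dst].append(msg)

    def run(self):
        while self.ready:
            rank, co, value = self.ready.popleft()
            try:
                op = co.send(value)                  # resume the coroutine
            except StopIteration:
                continue                             # this rank finished
            if op[0] == "send":                      # ("send", dst, msg)
                self._deliver(op[1], op[2])
                self.ready.append((rank, co, None))  # send does not block
            else:                                    # ("recv",): may block
                if self.inbox[rank]:
                    self.ready.append((rank, co, self.inbox[rank].popleft()))
                else:
                    self.blocked[rank] = co

log = []

def worker(rank):
    if rank == 0:
        yield ("send", 1, "hello from 0")
        reply = yield ("recv",)
        log.append(reply)
    else:
        msg = yield ("recv",)
        yield ("send", 0, msg.upper())

sched = FGScheduler()
sched.spawn(0, worker)
sched.spawn(1, worker)
sched.run()
```

Because a rank only yields at message operations, context switches are cheap and no locking is needed between collocated ranks, which is the property the FG-MPI design exploits.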
There is a need for systems to provide additional processing to extract useful information from the growing amounts of data. High Performance Computing (HPC) techniques use large clusters of commodity hardware and software to provide the necessary computation when a single machine does not suffice. In addition to the increase in data, there have been other architectural changes, like the advent of multicore and the presence of multiple networks on a single compute node, yet the commodity transport protocols in use have not adapted. It is therefore an opportune time to revisit the question of which transport features are necessary to best support today's applications. We use the Message Passing Interface (MPI), popular in HPC, to provide support for large-scale parallel applications. We propose transport-protocol features to overcome the problems with reliability, performance, and design simplicity that exist in Ethernet-based commodity clusters. We use the Stream Control Transmission Protocol (SCTP) as a vehicle to implement tools having the proposed transport features for MPI. We develop several SCTP-based MPI implementations and a full-featured userspace SCTP stack, and we enable the execution of unmodified MPI programs over a simulated network and SCTP implementation. The tools themselves provide the HPC and networking communities a means to utilize improved transport features for MPI by way of SCTP. The tools developed in this thesis are used to show that the proposed transport features enable further capabilities in the performance, reliability, and design simplicity of MPI applications running on Ethernet-based cluster systems constructed out of commodity components.
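One SCTP feature relevant to the abstract above is multistreaming: an association carries several independently ordered streams, so a loss on one stream does not stall messages on another (TCP's head-of-line blocking). The toy model below is our own illustration of that property, including an assumed tag-to-stream mapping; it is not the thesis's actual scheme.

```python
from collections import deque

NUM_STREAMS = 8   # streams per SCTP association: an assumed value

def stream_for(context_id, tag):
    # Deterministic tag-to-stream mapping (illustrative only): messages
    # with different tags can travel on different streams.
    return (context_id * 31 + tag) % NUM_STREAMS

class ToyAssociation:
    """Per-stream FIFO delivery; no ordering imposed across streams."""
    def __init__(self):
        self.streams = [deque() for _ in range(NUM_STREAMS)]
        self.stalled = set()              # streams awaiting retransmission

    def send(self, context_id, tag, payload):
        self.streams[stream_for(context_id, tag)].append(payload)

    def deliverable(self):
        out = []
        for sid, q in enumerate(self.streams):
            if sid not in self.stalled:   # head-of-line blocking is per-stream
                out.extend(q)
                q.clear()
        return out

assoc = ToyAssociation()
assoc.send(0, 1, "m1")                    # lands on one stream
assoc.send(0, 2, "m2")                    # lands on a different stream
assoc.stalled.add(stream_for(0, 1))       # simulate a loss on m1's stream
ready = assoc.deliverable()               # m2 is not held up behind m1
```

Under TCP, with one ordered byte stream, the lost segment carrying `m1` would delay `m2` as well; the per-stream model above is what makes SCTP attractive for independent MPI messages.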
Master's Student Supervision (2010 - 2018)
To manage the complexities of Multiple Program, Multiple Data (MPMD) program deployment and optimize for performance, we propose (CM)²PI, a specification and tool that employs a four-stage approach to separate distinct concerns: architecture interactions, software size, resource constraints, and function. With function-level parallelism in mind, we use multi-level compositions to create a scalable architecture specification with improved re-usability and encapsulation. We explore different ways to decouple communication from the tight coupling of MPI ranks and placement. One of the methods proposed is flow-controlled channels, which also tackle the common issues of buffer limitations and termination. The specification increases compatibility with optimization tools, enabling automatic optimization of program run time with respect to resource constraints. Together these features simplify the development of MPMD MPI programs.
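The flow-controlled channel mentioned above combines two ideas: a bounded buffer (so a fast sender cannot exhaust receive buffers) and an explicit close (so termination is signalled in-band). A minimal sketch of that combination, with names of our own choosing rather than (CM)²PI's actual API:

```python
import threading
from queue import Queue

class Channel:
    """Bounded channel: send() blocks when full (backpressure), close()
    marks termination so the receiver's iteration ends cleanly."""
    _DONE = object()                       # sentinel marking end-of-stream

    def __init__(self, capacity):
        self.q = Queue(maxsize=capacity)   # bounded: this is the flow control

    def send(self, item):
        self.q.put(item)                   # blocks if capacity is reached

    def close(self):
        self.q.put(self._DONE)

    def __iter__(self):
        while True:
            item = self.q.get()
            if item is self._DONE:
                return                     # termination without a side channel
            yield item

ch = Channel(capacity=2)

def produce():
    for i in range(5):                     # producer outpaces capacity 2,
        ch.send(i)                         # so it blocks until drained
    ch.close()

threading.Thread(target=produce).start()
received = list(ch)                        # consumer drains until close()
```

The capacity bound is what decouples correctness from buffer sizing: the producer is paced by the consumer instead of by how much unacknowledged data the middleware can hold.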
No abstract available.
No abstract available.
The need for intuitive parallel programming designs has grown with the rise of modern many-core processors. Process-oriented models promote high scalability by using small isolated components that interact to produce complex applications. Such models are intuitive because they force scalability to be a design requirement. The popular MPI messaging library has not exploited fine-grain models due to its coarse-grain implementations. The binding of processes often uses a coarse-grain management system, done by sequentially assigning ranks to a list of machines. This may be suitable for coarse-grain applications, but it is inadequate for fine-grain applications with large process-grouping demands; a more flexible, manageable, and scalable specification is necessary to support a process-oriented model.
The use of FG-MPI exposes additional concurrency through a new layer of mapping by providing smaller units of parallelism: a desirable feature in function-level programming. This collocation layer requires a fine-grain mapping mechanism to optimize communication. A graph specification is proposed that allows communication patterns, collocation of MPI processes, and binding optimizations to be extracted from such a structure.
The work presented extends and evaluates MPI in a fine-grain process-oriented model and provides a graphical mapping and binding specification. Evaluation of function-level applications is done through Pilot, a CSP-like library for MPI. The smallest unit of parallelism in this architecture is evaluated, showing that only small communication overheads occur when comparing hundreds of large tasks to hundreds of thousands of fine-grain tasks. The graph representation is based on Kahn Process Networks, which provide a simple and intuitive model to represent and organize large function-level applications.
A tool is developed that reads in a graph structure and performs operations such as auto-constructing wiring-diagram code, determining optimal collocated maps based on communication, and producing a binding specification. The tool is modular and extensible to other graph-related operations.
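The "optimal collocated maps based on communication" step can be illustrated with a toy version of the underlying idea: given a process graph whose edge weights are message volumes, place heavily communicating processes in the same collocation group so their traffic stays inside one OS process. The greedy merge below is our own simplification, not the tool's actual algorithm.

```python
def collocate(edges, slots_per_group):
    """edges: {(a, b): weight}. Returns collocation groups of at most
    slots_per_group processes, merging the heaviest edges first."""
    group_of = {}                  # process -> its current group (a list)
    groups = []
    for (a, b), _w in sorted(edges.items(), key=lambda e: -e[1]):
        for p in (a, b):           # start unseen processes as singletons
            if p not in group_of:
                g = [p]
                groups.append(g)
                group_of[p] = g
        ga, gb = group_of[a], group_of[b]
        if ga is not gb and len(ga) + len(gb) <= slots_per_group:
            ga.extend(gb)          # merge: a and b now share an OS process
            for p in gb:
                group_of[p] = ga
            groups.remove(gb)
    return groups

# Hypothetical pipeline: A<->B and C<->D exchange most of the traffic.
edges = {("A", "B"): 10, ("B", "C"): 1, ("C", "D"): 9}
groups = collocate(edges, slots_per_group=2)
```

With two slots per group, the heavy A-B and C-D edges are collapsed into shared-memory pairs and only the light B-C edge crosses group boundaries.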
This thesis offers a novel framework for representing groups and communicators in Message Passing Interface (MPI) middleware. MPI is a widely used paradigm in a cluster environment that supports communication between the nodes. In our framework, we have implemented and evaluated scalable techniques for groups and communicators in MPI. We have tested this framework using FG-MPI, a fine-grain version of MPI that scales to millions of MPI processes.
Groups in MPI are the primary means for creating communicators. A group map is the underlying structure that stores the processes participating in the communication. We introduce a framework for concise representations of the group map. This framework is based on the observation that a map can be decomposed into a set and a permutation. This decomposition allows us to use a compact set representation for the cases where a specific mapping is not required, i.e., lists with monotonically increasing order. In other cases, the representation adds a permutation as well.
A variety of set compression techniques have been used, and the framework is open to the integration of new representations. One advantage of such decomposition is the ability to implicitly represent a set with set representations such as BDDs. BDDs and similar representations are well-suited for the types of operations used in the construction of communicators. In addition to set representations for unordered maps, we incorporated Wavelet Trees on Runs, a library designed to represent permutations. We have also included general compression techniques in the framework, such as BWT. This allows some degree of compression in memory-constrained environments where there is no discernible pattern in the group structure.
We have investigated time and space trade-offs among the representations to develop strategies available to the framework. The strategies tune the framework based on the user's requirements.
The first strategy, the time strategy, optimizes the framework for speed. The second optimizes it for space. The final strategy is a hybrid of the two and tries to strike a reasonable trade-off between time and space. These strategies let the framework accommodate a wider range of applications and users.
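The set-plus-permutation decomposition at the heart of the framework above can be shown in a few lines. This is a minimal sketch of the observation, assuming distinct process ids; the helper names and the "store the set only when the map is monotone" shortcut are ours for illustration, not the framework's actual representations (which use BDDs, Wavelet Trees on Runs, etc.).

```python
def decompose(group_map):
    """Split a group map (rank -> process id, ids assumed distinct) into
    a sorted member set and a permutation over that set."""
    members = sorted(group_map)
    index = {p: i for i, p in enumerate(members)}
    perm = [index[p] for p in group_map]   # how the set is reordered
    if perm == list(range(len(perm))):     # monotone map: set alone suffices
        perm = None
    return members, perm

def reconstruct(members, perm):
    order = perm if perm is not None else range(len(members))
    return [members[i] for i in order]

members, perm = decompose([40, 10, 30])         # needs a permutation
mono_members, mono_perm = decompose([2, 5, 9])  # monotone: no permutation
```

The payoff is that `members` can use any compact set representation (ranges, BDDs) independently of how `perm` is stored, and the common monotone case pays for no permutation at all.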
Matrix completion problems have been receiving increased attention due to their varied applicability in different domains. Correlation matrices arise often in studying multiple streams of time-series data, as in technical analysis of stock-market data. Often some of the values in the matrix are unknown, and reasonable replacements have to be found quickly, whether to avert an unwanted consequence or to keep pace with the business. After reviewing background research related to this problem, we propose a new parallel technique that can solve general correlation matrix completion problems over a set of computers connected by a high-speed network. We present results in which we reduce the execution time.
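The thesis's contribution is the parallel technique; the serial core of the problem can nonetheless be sketched. The code below is a much-simplified, hedged illustration in the spirit of alternating-projection methods for nearest-correlation problems (not the thesis's algorithm): fill the unknown entries, then alternate between projecting onto positive semidefinite matrices and restoring the known entries and unit diagonal.

```python
import numpy as np

def complete_correlation(partial, known_mask, iters=500):
    """partial: symmetric matrix with arbitrary values where known_mask is
    False; returns an (approximately) valid correlation matrix agreeing
    with partial wherever known_mask is True."""
    A = np.where(known_mask, partial, 0.0)       # initial guess for unknowns
    np.fill_diagonal(A, 1.0)
    for _ in range(iters):
        w, V = np.linalg.eigh((A + A.T) / 2)
        A = (V * np.clip(w, 0, None)) @ V.T      # project onto the PSD cone
        A = np.where(known_mask, partial, A)     # restore the known entries
        np.fill_diagonal(A, 1.0)                 # correlation: unit diagonal
    return A

# Hypothetical example: the (0, 2) correlation is missing.
partial = np.array([[1.0, 0.9, 0.0],
                    [0.9, 1.0, 0.4],
                    [0.0, 0.4, 1.0]])
known = np.array([[True,  True,  False],
                  [True,  True,  True],
                  [False, True,  True]])
C = complete_correlation(partial, known)
```

Each iteration is dominated by an eigendecomposition, which is the kind of dense linear-algebra kernel that a cluster-parallel implementation would distribute.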