First Technology Transfer

Standard and Advanced Technical Training, Consultancy and Mentoring

MPI C Programming for NVidia Jetson TX

Duration: 5 Days

Intended Audience

This course is for experienced C/C++ programmers, with some familiarity with CUDA, who need not only to get up to speed with MPI programming, but also to explore its practical use on networks of multiple NVidia Jetson TX2 devices.

Course Overview

Parallel programming by definition involves co-operation between processes to solve a common task. It is up to the programmer to define the tasks that will be executed by the processors, and how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All of these parallel operations are performed via calls to a message-passing interface that is entirely responsible for interfacing with the physical communication network linking the processors together. The Message Passing Interface (MPI) is the de facto standard for message passing. This course covers the key aspects of MPI programming, such as point-to-point communication, non-blocking operations, derived datatypes, virtual topologies and collective communication, as well as general parallel programming code design issues. It also covers applications that combine MPI and CUDA. The course is taught using a class network of NVidia Jetson TX2 devices and PC computers running Linux.
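
As a taster of the message-passing style described above, the following minimal sketch (not part of the course materials, and assuming an MPI implementation such as Open MPI or MPICH is installed) shows each process reporting its rank within MPI_COMM_WORLD:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down the MPI runtime */
        return 0;
    }

Compiled with mpicc and launched with, for example, mpirun -np 4 ./hello, each of the four processes prints its own rank.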

Course Contents

  • Parallel Programming - Concepts and Idioms
    • Distributed memory and shared memory computing models
    • Message-Passing Concepts
  • Message passing paradigms
    • Features of message passing programs
    • Point-to-Point Communications and Messages
    • Communication Modes and Completion Criteria
    • Blocking and Nonblocking Communication
    • Collective Communications
    • Broadcast Operations
    • Scatter and Gather Operations
    • Reduction Operations
  • MPI Program Structure
    • MPI Routines and Return Values
    • MPI Handles
    • MPI Datatypes
    • Communicators
    • Tags
    • Modes
  • Point to Point Communication (see the example sketch following the course contents)
    • Sending and Receiving
    • Blocking and Completion
    • Deadlock and Deadlock Avoidance
    • Nonblocking Sends and Receives
      • Posting, Completion, and Request Handles
      • Posting Sends and Receives without Blocking
      • Completion - Waiting and Testing
    • Send Modes
      • Standard Mode Send
      • Synchronous Mode Send
      • Ready Mode Send
      • Buffered Mode Send
  • Derived Data Types
    • Buffer filling and MPI_Pack
    • MPI_Type_struct and Mapping of C Structs to MPI Derived Types
    • MPI_Type_contiguous
    • MPI_Type_vector
    • MPI_Type_hvector
    • MPI_Type_indexed
    • MPI_Type_hindexed
    • Controlling the Extent of a Derived Type
  • Collective Communication
    • MPI_Barrier - Barrier Synchronisation
    • MPI_Bcast - Broadcast
    • MPI_Reduce - Reduction
    • MPI_Gather - Gathering
    • MPI_Allgather
    • MPI_Scatter - Scattering
    • MPI_Allreduce
    • MPI_Gatherv
    • MPI_Scatterv
    • MPI_Scan
    • MPI_Reduce_scatter
  • Communicators
    • MPI_COMM_WORLD
    • MPI_Comm_group
    • MPI_Group_incl
    • MPI_Group_excl
    • MPI_Group_rank
    • MPI_Group_free
    • MPI_Comm_create
    • MPI_Comm_split
  • Virtual Topologies API
    • MPI_Cart_create
    • MPI_Cart_coords
    • MPI_Cart_rank
    • MPI_Cart_shift
    • MPI_Cart_sub
    • MPI_Cartdim_get
    • MPI_Cart_get
  • Virtual Topologies API - Applications
    • Matrix Transposition
    • Iterative Solvers
  • Parallel I/O
    • Characteristics of Serial I/O
    • Characteristics of Parallel I/O
    • Introduction to MPI-2 Parallel I/O
    • MPI-2 File Structure
    • Initializing MPI-2 File I/O
    • File Views
    • Data Access - Reading Data
    • Data Access - Writing Data
    • Closing MPI-2 File I/O
  • Parallel Numerical Libraries - An Overview
    • PBLAS - Parallel Basic Linear Algebra Subprograms
    • ScaLAPACK - Scalable Linear Algebra PACkage
  • MPI application design and design for performance
    • Domain decomposition
    • Functional decomposition
    • Load balancing
    • Minimising Communication
    • Designing for Performance
  • Timer synchronisation
  • Overview of CUDA parallel programming
  • Experimenting with mixed CUDA and MPI programming - an introduction
  • OpenMP - An Overview
  • OpenMP, MPI and CUDA compared
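
The short sketch below (illustrative only, not part of the course materials) pulls together two of the topics listed above - blocking point-to-point communication and a collective reduction - and assumes the program is launched with at least two processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, token = 0, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Blocking point-to-point: rank 0 posts the send and rank 1 the
           matching receive, so neither process blocks waiting on the other
           indefinitely (the classic deadlock scenario). */
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received token %d\n", token);
        }

        /* Collective communication: every rank contributes its rank number
           and rank 0 receives the sum. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, sum);

        MPI_Finalize();
        return 0;
    }

Run with mpirun -np 2 (or more), rank 1 prints the received token and rank 0 prints the sum of all the ranks.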