This page is a static HTML representation of the slides at
https://peteasa.github.io/parapg/parapg.html




from open source
by

Peter Saunderson

2016




Contents

PeterSaunderson 23rd August 2016 at 7:44pm

Proof Of Concept


  • A proof of concept supercomputer was created in 2015 (Supercomputer.io)
  • Used the Epiphany-III 16-core processor (32 GFLOPS with an 8Gbps interface)
    • note: early Parallella tests did not use the 8Gbps OH! e-link interface
  • Can start small at little cost (<£100) and grow as required
  • Not restricted to the MPMD architecture

Changing Environment

New Software Stack

Building the System

System Requirements

  • Need to keep up to date with the latest
    • interrupt-driven OH! e-link interface
    • updated kernel drivers and the latest eSDK or PAL libraries
  • Easy method of distributing the software to the cluster
  • Would like easy extension of the fpga, kernel and software
  • Would like to build Epiphany software on the build machine

Building Open Hardware fpga

The programmable Xilinx Zynq 7020 is flexible: it can contain the 8Gbps link to the Epiphany-III and a RISC-V core, or the e-link and Analog Devices HDMI, or something else entirely

  • Necessary tools are readily available; see the getting started guide "Building 7020_hdmi"
  • OH! HDL code for the 8Gbps link is available
  • All that is needed is hard work to add functionality and keep the rest of the system up to date (see parallella playground)
  • Adding hardware requires updates to the Linux drivers

Parallella Yocto Build

All the recipes to build a Linux distribution in one place:
https://github.com/peteasa/parallella-yoctobuild/wiki

meta-exotic

meta-exotic: a Yocto layer to support cross compiler creation for exotic or "foreign" microcontrollers

  • Off-the-shelf Yocto has no support for foreign microcontrollers
    • the ARM cores are the main target of Yocto
    • the Epiphany chip does not run ARM code, so it is "foreign"
    • the compiler needed to build "foreign" code can't normally be used within Yocto
  • The meta-exotic layer changes this (see meta-exotic/wiki)
  • A generic layer that could be used for other microcontrollers
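For orientation, a Yocto layer is registered through its conf/layer.conf file. The snippet below is a generic illustration of that standard shape only; the collection name "exotic" and the paths are placeholders, not the actual meta-exotic contents:

```
# conf/layer.conf - generic illustrative shape of a Yocto layer (not the real meta-exotic file)
BBPATH .= ":${LAYERDIR}"
BBFILES += "${LAYERDIR}/recipes-*/*/*.bb ${LAYERDIR}/recipes-*/*/*.bbappend"
BBFILE_COLLECTIONS += "exotic"
BBFILE_PATTERN_exotic = "^${LAYERDIR}/"
BBFILE_PRIORITY_exotic = "6"
```

With a file of this shape in place, the layer is enabled by adding its path to BBLAYERS in the build's bblayers.conf.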

Testing the System

https://wiki.yoctoproject.org/wiki/Regression_Test

Future Work

In no particular order:

  • Built-in support for the PAL libraries
  • A cluster management tool such as SLURM
  • Update of meta-exotic for the gcc 5.x tools and the addition of gdb
    • also prove it with another processor (RISC-V)
  • North / south direct Epiphany connection
  • Updating the various repositories that make up https://github.com/peteasa/parallella/wiki takes time

Contributors or sponsors for this work are always welcome!

@paracpg and #parapg on Twitter; Peter on GitHub

The End

Thank you

for reading


Glossary
References

Additional Material


Architectures


Taxonomy first published by Michael J Flynn (1972), now extended:

  1. Single Instruction Single Data
    SISD: simple single core computer
  2. Multiple Instruction Single Data
    MISD: fault tolerant system, or pipeline system
  3. Single Instruction Multiple Data
    SIMD: one microinstruction can operate at the same time on multiple data items
  4. Multiple Instruction Multiple Data
    MIMD: machines have a number of processors that function asynchronously and independently
  5. Single Program Multiple Data
    SPMD: multiple autonomous processors simultaneously execute the same program at independent points
  6. Multiple Program Multiple Data
    MPMD: server farms are a good example of a MPMD system

Multiple Instruction Multiple Data

MIMD: Multiple instruction streams, multiple data streams

  1. Multiple instructions
    • it is possible to get a lockout condition where two processors compete for the same data resource
    • one processor must delay to allow the other processor to finish its operation on the shared data
  2. Multiple Data streams may share storage at some level in the system to enable cooperative execution of a multi task program
  3. Symmetric Multiprocessors (SMP) and Massively Parallel Processors (MPP) are examples of MIMD architectures

Multiple Instruction Single Data

MISD: Multiple instruction streams, single data stream

Examples:

  1. a pipeline where one data stream is accessed by a series of processors and each processor performs a different operation on the data stream
  2. a redundant system where the two instruction streams are deliberately chosen to be the same; the processors then act on the same data and should, in theory, generate the same results

Multiple Program Multiple Data

MPMD: possibly unrelated work is split up into multiple programs and run simultaneously on multiple processors with different inputs in order to obtain multiple results

  • server farms are a good example of a MPMD system

Single Instruction Multiple Data

SIMD: Single Instruction Stream, Multiple Data Streams.

  1. Multiple data streams may access the same data at some point in the system
  2. A single instruction: so there is no lockout condition where two ALUs try to access the same data
  3. It is possible to implement an efficient data move instruction where data is moved from one memory location to another with one instruction

Single Instruction Multiple Data - 2

Standardised in OpenMP 4.0
What usually requires a loop of instructions can be performed in one instruction

Examples:

 #pragma omp simd
 for (i = 0; i < n; i++){
     a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
 }

Single Instruction Single Data

SISD: Single Instruction Stream, Single Data Stream

  1. The Harvard Architecture: separate instruction and data memory
  2. The Von Neumann Architecture: a single combined data and instruction memory

Single Program Multiple Data

SPMD: tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster.

  1. standardized in MPI and OpenSHMEM
  2. the most common style of parallel programming
  3. different from SIMD, which requires a Vector Processor to manipulate data streams
  4. Shared memory could be used for message passing
  5. SIMD and SPMD are not mutually exclusive

Using Multiple Cores

A Programming Language

APL: array-oriented programming language that uses pictorial symbols for its language constructs

Bulk Synchronous Parallel

BSP: library to assist in running tasks in a parallel processing system

  • Relies on the processing units being tightly coupled
    • a network that routes messages between pairs of components
    • a facility (barrier) that allows for the synchronisation of all or a subset of components
  • Processing proceeds as follows
    • processors perform local computations making use of values stored in the local fast memory
    • asynchronous calculations proceed in parallel with put or get data calls to other processors (one sided)
    • the barrier is used to ensure that shared resources are handled atomically
    • the barrier causes all other processes to wait until they reach the same barrier avoiding deadlock

C++ Accelerated Massive Parallelism

C++AMP: a Microsoft library built on DirectX 11

  • hardware-agnostic interface for exploiting accelerator hardware
  • language extensions and an STL-like library component

Cal Actor Language with Networking

CAL: a dataflow language geared towards applications such as multimedia processing

Erlang Language

Erlang: a programming language originally developed at the Ericsson Computer Science Laboratory

Message Passing Interface

MPI: specification used in a distributed memory model

https://www.mpi-forum.org/

  • well-structured interface for message-passing between parallel systems
  • "industry standard" for writing message passing programs on HPC platforms
  • MPI is a specification for how an MPI-conformant library should be used

Message Passing Interface 2

Example API use:

#include "mpi.h"
..
   MPI_Status Stat;   // required variable for receive routines
   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
..
     MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
     MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);

Open Accelerators

OpenACC: extends OpenMP for support of accelerators

Ref: http://www.openacc.org/

  • open standard started by CAPS, Cray, NVIDIA, and PGI

Examples of the pragma used are:

#pragma acc parallel
#pragma acc kernels

Open Computing Language

OpenCL: using task-based and data-based parallelism

https://www.khronos.org/opencl/

  • Rich interface specification that enables remote tasks to be managed
  • Message passing used to send and receive data

Open Computing Language 2

Example data transfer

// Create buffers on host and device
size_t size = 100000 * sizeof(int);
int* h_buffer = (int*)malloc(size);
cl_mem d_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, NULL);
...
// Write to buffer object from host memory
clEnqueueWriteBuffer(cmd_queue, d_buffer, CL_FALSE, 0, size, h_buffer, 0, NULL, NULL);
...
// Read from buffer object to host memory
clEnqueueReadBuffer(cmd_queue, d_buffer, CL_TRUE, 0, size, h_buffer, 0, NULL, NULL);

Open Hybrid Multicore Parallel Programming

OpenHMPP: provides extensions for hardware accelerators

  • Hard to find specifications
  • Creates codelets for NVIDIA CUDA

Example of the pragma used to create a codelet:

 #pragma hmpp simple1 codelet, args[outv].io=inout, target=CUDA

Pragma used to run the codelet:

 #pragma hmpp simple1 callsite, args[outv].size={n}

Open Multi Processing

OpenMP: shared-memory parallel programming interface

http://openmp.org/

  1. Simple and easy to demonstrate
  2. Built in to most compilers (gcc -fopenmp)
  3. Building with or without OpenMP requires stubs or a pre-processor switch in the code

Parallel Architectures Library

PAL: library for coprocessors and parallel machines

https://github.com/parallella/pal

  • Think of it as a POSIX library for coprocessors and parallel machines
  • Plan to implement maths, fft, dsp libraries
    • redesign c libraries from bottom up to support parallel architectures
    • functions such as load, open, close, read, write would be used to implement OpenMP, MPI, and OpenCL
  • The fine-grained nature of the PAL libraries required for the Epiphany-III will likely scale easily to larger, more sophisticated processor architectures

Parallel Architectures Library 2

Example host application

#include "pal_base.h"
...
    team0 = p_open(dev0, 0, NUMBEROFCORES);              // open a team of processors
    results_mem = p_map(dev0, MEM_RESULTS, SIZERESULTS); // map the shared results memory
    err = p_run(prog0, func, team0, 0, NUMBEROFCORES, 0, NULL, 0);
    size = p_read(&results_mem, results, 0, sizeof(*results) * NUMBEROFCORES, 0);

Example device library calls

#include "pal_math.h"
...
    rank = p_team_rank(P_TEAM_DEFAULT);
...
    for (i = 0; i < n; i++)
        out[i] = p_sqrt(ai[i]);

Using Directive Pragma

The compiler preprocessor is given additional instructions to identify parts of the software that can be run in parallel

Using Libraries

Additional software libraries are provided that enable the application code to run in parallel

Open Multi Processing 2

Example (http://sc14.supercomputing.org/program/tutorials.html):

#include <omp.h>
...
    #pragma omp parallel

#ifdef _OPENMP
    printf("Hello, world from thread %d\n", omp_get_thread_num());
#else
    printf("Hello, world!\n");
#endif

Glossary

APL
A Programming Language: a high-level, concise, array-oriented programming language that uses pictorial symbols for its language constructs; also known as an Array Manipulation Language
BSP
Bulk Synchronous Parallel: library to assist in running tasks in a parallel processing system
C++AMP
C++ Accelerated Massive Parallelism: a Microsoft library built on DirectX 11 used to exploit accelerator hardware
CAL
Cal Actor Language with Networking: a dataflow language geared towards applications such as multimedia processing
Erlang
Erlang Language: a programming language for building robust fault-tolerant distributed applications
MIMD
Multiple Instruction Multiple Data: machines have a number of processors that function asynchronously and independently
MISD
Multiple Instruction Single Data: fault tolerant system, or pipeline system
MPI
Message Passing Interface: specification used in a distributed memory model
MPMD
Multiple Program Multiple Data: server farms are a good example of a MPMD system
OpenACC
Open Accelerators: extends OpenMP for support of accelerators
OpenCL
Open Computing Language: one of the most dominant libraries for parallel computing using task-based and data-based parallelism
OpenHMPP
Open Hybrid Multicore Parallel Programming: provides extensions for hardware accelerators. "What OpenMP is for multi-threaded programming"
OpenMP
Open Multi Processing: shared-memory parallel programming interface
OpenSHMEM
Open Symmetric Hierarchical Memory access: parallel programming libraries
PAL
Parallel Architectures Library: library for coprocessors and parallel machines
SHMEM
Symmetric Hierarchical Memory access: parallel programming libraries
SIMD
Single Instruction Multiple Data: one microinstruction can operate at the same time on multiple data items
SISD
Single Instruction Single Data: simple single core computer
SLURM
Simple Linux Utility for Resource Management: a fault-tolerant and highly scalable cluster management and job scheduling system
SPMD
Single Program Multiple Data: multiple autonomous processors simultaneously execute the same program at independent points
Vector Processor
CPU with instructions that act on one dimensional arrays of data (vectors)



References

Andreas Olofsson
Andreas Olofsson, Tomas Nordström, Zain Ul-Abdin "Kickstarting high-performance energy-efficient manycore architectures with Epiphany" 2014 48th Asilomar Conference on Signals, Systems and Computers
David A. Richie
David A. Richie and James A. Ross "OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture", OpenSHMEM 2016, Third workshop on OpenSHMEM.
Elias Kouskoumvekakis
"RISC-V port to Parallella", Google Summer of Code 2016
M. Mitchell Waldrop
article: "The chips are down for Moore's law", Nature weekly journal of science, 2016
Michael J Flynn
"Some Computer Organizations and Their Effectiveness", IEEE Transactions on Computers. Vol. c-21, No.9, September 1972
Simon McIntosh-Smith
slides: "It's the end of the world as we know it ...", University of Bristol, HPC Research Group, 2015
wikipedia
various articles