Resorting to C for Parallel Computing
Using Multicore Machines
Multicore machines can leverage parallelism to perform computations more efficiently by distributing tasks across multiple CPU cores. In C, this is often achieved with a threading library such as POSIX Threads (pthreads) or with a higher-level abstraction such as OpenMP.
Example: Using POSIX Threads (pthreads)
Here’s a basic example of how to use POSIX threads to parallelize a simple task in C:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4

void* print_hello(void* threadid) {
    long tid;
    tid = (long)threadid;
    printf("Hello from thread %ld\n", tid);
    pthread_exit(NULL);
}

int main() {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;

    for (t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create(&threads[t], NULL, print_hello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    pthread_exit(NULL);
}
In this example:
- pthread_create() is used to create threads.
- Each thread runs the print_hello function.
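Since main() ends with pthread_exit(NULL), the process stays alive until the worker threads have finished. An equally common pattern is to join each thread explicitly; a minimal sketch of that variant, reusing the print_hello function above (compile with something like gcc -pthread -o hello hello.c, where the file name is just illustrative):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4

void* print_hello(void* threadid) {
    long tid = (long)threadid;
    printf("Hello from thread %ld\n", tid);
    return NULL;  /* returning from the thread function is equivalent to pthread_exit(NULL) */
}

int main() {
    pthread_t threads[NUM_THREADS];
    long t;

    for (t = 0; t < NUM_THREADS; t++) {
        if (pthread_create(&threads[t], NULL, print_hello, (void *)t) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            exit(EXIT_FAILURE);
        }
    }

    /* Wait for every worker to finish before main() returns */
    for (t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
    return 0;
}

Joining makes the shutdown order explicit and is usually preferred when main() needs results produced by the threads.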
Running the OpenMP Code
OpenMP (Open Multi-Processing) is a popular API for parallel programming in C, C++, and Fortran. It provides a set of compiler directives, library routines, and environment variables that can be used to specify parallel regions in a program.
Example: Basic OpenMP Code
Here’s a simple example of using OpenMP to parallelize a for-loop:
#include <omp.h>
#include <stdio.h>

int main() {
    int i;

    #pragma omp parallel for
    for (i = 0; i < 10; i++) {
        printf("Thread %d is working on iteration %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
In this example:
- #pragma omp parallel for is a directive that tells the compiler to parallelize the for-loop.
- omp_get_thread_num() returns the ID of the thread executing the current iteration.
To compile this code with OpenMP support, you might use:
gcc -fopenmp -o myprogram myprogram.c
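The number of threads is then chosen at run time; for example, the OMP_NUM_THREADS environment variable controls the default team size when launching the program compiled above:

OMP_NUM_THREADS=4 ./myprogram

Alternatively, omp_set_num_threads() can be called before a parallel region, or a num_threads(...) clause can be added to the directive itself.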
OpenMP Code Analysis
When analyzing OpenMP code, you should consider several factors:
Performance Metrics
- Speedup: Measure the execution time with and without OpenMP to determine speedup.
- Scalability: Check how the performance scales with the number of threads.
Example: Measuring Execution Time
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i;
    double start_time, end_time;
    int n = 10000000;
    int* array = (int*)malloc(n * sizeof(int));

    // Initialize the array
    for (i = 0; i < n; i++) {
        array[i] = i;
    }

    start_time = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        array[i] = array[i] * 2;
    }
    end_time = omp_get_wtime();

    printf("Elapsed time: %f seconds\n", end_time - start_time);

    free(array);
    return 0;
}
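The example above only times the parallel loop. To report speedup directly, the same loop can be timed once sequentially and once with the directive, and the ratio printed; a minimal sketch along those lines (note that for a memory-bound loop like this the measured speedup is often modest):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i;
    int n = 10000000;
    int* array = (int*)malloc(n * sizeof(int));

    for (i = 0; i < n; i++) {
        array[i] = i;
    }

    /* Sequential baseline */
    double t0 = omp_get_wtime();
    for (i = 0; i < n; i++) {
        array[i] = array[i] * 2;
    }
    double serial = omp_get_wtime() - t0;

    /* Parallel version of the same loop (the array contents differ after this
       second pass, but only the timing matters here) */
    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        array[i] = array[i] * 2;
    }
    double parallel = omp_get_wtime() - t0;

    printf("Serial: %f s, Parallel: %f s, Speedup: %.2f\n",
           serial, parallel, serial / parallel);

    free(array);
    return 0;
}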
Correctness
- Race Conditions: Ensure that shared variables are protected by synchronization mechanisms if needed.
- Deadlocks: Avoid situations where threads wait indefinitely for each other.
Example: Using Critical Section
#include <omp.h>
#include <stdio.h>

int main() {
    int i, sum = 0;

    #pragma omp parallel private(i) shared(sum)
    {
        #pragma omp for
        for (i = 0; i < 100; i++) {
            #pragma omp critical
            sum += i;
        }
    }

    printf("Sum is %d\n", sum);
    return 0;
}
In this example:
- #pragma omp critical ensures that only one thread updates sum at a time.
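For a single scalar update like this, #pragma omp atomic is a lighter-weight alternative to a critical section (and the reduction clause shown below is usually better still). A sketch of the atomic variant:

#include <omp.h>
#include <stdio.h>

int main() {
    int i, sum = 0;

    #pragma omp parallel for
    for (i = 0; i < 100; i++) {
        /* atomic protects only this single update to sum */
        #pragma omp atomic
        sum += i;
    }

    printf("Sum is %d\n", sum);
    return 0;
}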
Other OpenMP Pragmas
OpenMP offers various pragmas for different parallelization needs:
Parallel Regions
#pragma omp parallel
{
    printf("Hello from thread %d\n", omp_get_thread_num());
}
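As a stand-alone program, the fragment above needs omp.h for omp_get_thread_num() and a main() around the parallel region; every thread in the team executes the block once:

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}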
Reduction
#include <omp.h>
#include <stdio.h>

int main() {
    int i, sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 100; i++) {
        sum += i;
    }

    printf("Sum is %d\n", sum);
    return 0;
}
In this example:
- reduction(+:sum) ensures that sum is correctly accumulated across all threads.
Sections
#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            printf("Section 1\n");
        }
        #pragma omp section
        {
            printf("Section 2\n");
        }
    }
    return 0;
}
In this example:
- #pragma omp sections allows different sections of code to be executed in parallel.
GPU Programming
GPU programming leverages the massive parallelism available on modern graphics cards. CUDA (Compute Unified Device Architecture) is one popular framework for this purpose.
Example: Basic CUDA Code
#include <stdio.h>

__global__ void add(int* a, int* b, int* c) {
    int index = threadIdx.x;
    c[index] = a[index] + b[index];
}

int main() {
    const int N = 10;
    int size = N * sizeof(int);
    int h_a[N], h_b[N], h_c[N];
    int *d_a, *d_b, *d_c;

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }

    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    add<<<1, N>>>(d_a, d_b, d_c);

    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", h_a[i], h_b[i], h_c[i]);
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
In this example:
- __global__ marks a kernel, i.e. a function that runs on the GPU and is launched from host code.
- cudaMalloc() and cudaMemcpy() manage memory allocation and transfers between the host and the device.
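CUDA source files are typically compiled with nvcc, for example nvcc -o add add.cu (the file name is illustrative). The launch configuration <<<1, N>>> above uses a single block of N threads, which only works while N stays within the per-block thread limit; for larger arrays the conventional pattern is to launch several blocks and compute a global index, roughly as sketched here:

__global__ void add(int* a, int* b, int* c, int n) {
    /* Global index across all blocks */
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {               /* guard threads that fall past the end */
        c[index] = a[index] + b[index];
    }
}

/* Launch with enough blocks to cover n elements, e.g. 256 threads per block:
   int threads = 256;
   int blocks  = (n + threads - 1) / threads;
   add<<<blocks, threads>>>(d_a, d_b, d_c, n);                              */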
Conclusion
Resorting to C for parallel computing involves various techniques, from using multicore CPUs with pthreads or OpenMP to leveraging GPU capabilities with CUDA. Each method has its own set of pragmas, directives, and considerations:
- Multicore Machines: Use pthreads or OpenMP for CPU parallelism.
- OpenMP: Provides directives for easy parallelism in C.
- GPU Programming: Utilizes CUDA for high-performance computing on GPUs.