May 17, 2015
Natal-RN
Problems solved: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 17, 19, 20, 21, 22, 24, 25, 26
Problem 1
What happens in the greetings program if, instead of strlen(greeting) + 1, we use strlen(greeting) for the length of the message being sent by processes 1, 2, . . . , comm_sz − 1? What happens if we use MAX_STRING instead of strlen(greeting) + 1? Can you explain these results?
Solution:
Problem 3
Determine which of the variables in the trapezoidal rule program are local and which are global.
Solution:
Problem 4
Modify the program that just prints a line of output from each process (mpi_output.c) so that the output is printed in process rank order: process 0's output first, then process 1's, and so on.
Solution:
The idea for printing in order is to exploit the fact that MPI_Recv is blocking: every rank other than zero sends its rank to process zero, which is responsible for all the printing.
int main(void) {
    int my_rank, comm_sz, i, aux;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    if (my_rank == 0) {
        printf("Proc %d of %d > Does anyone have a toothpick?\n",
               my_rank, comm_sz);
        for (i = 1; i < comm_sz; i++) {
            /* blocking receive from rank i forces rank order */
            MPI_Recv(&aux, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Proc %d of %d > Does anyone have a toothpick?\n",
                   aux, comm_sz);
        }
    } else {
        MPI_Send(&my_rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
Input: n = 4
Output:
Proc 0 of 4 > Does anyone have a toothpick?
Proc 1 of 4 > Does anyone have a toothpick?
Proc 2 of 4 > Does anyone have a toothpick?
Proc 3 of 4 > Does anyone have a toothpick?
Problem 5
In a binary tree, there is a unique shortest path from each node to the root. The length of this path is often called the depth of the node. A binary tree in which every nonleaf has two children is called a full binary tree, and a full binary tree in which every leaf has the same depth is sometimes called a complete binary tree. See Figure 3.14. Use the principle of mathematical induction to prove that if T is a complete binary tree with n leaves, then the depth of the leaves is log2(n).
Solution:
Following these steps:
1. Base case: show that the statement holds for n = 1.
2. Inductive step: show that if the statement holds for a complete binary tree with k leaves, then it also holds for one with 2k leaves.
For the base case, when the tree has only one leaf, that is, the root itself is the leaf, the depth is 0, which matches log2(1) = 0. Now note that the depth of a node is equal to the depth of its parent plus 1, and that each subtree of the root is a complete binary tree holding half of the leaves. For a tree with n leaves we can therefore deduce that:
depth(n) = depth(n/2) + 1 = log2(n/2) + 1 = log2(n) - log2(2) + 1 = log2(n) - 1 + 1 = log2(n).
This proves by induction that the leaf depth of a complete binary tree with n leaves equals log2(n).
Problem 6
Suppose comm_sz = 4 and suppose that x is a vector with n = 14 components.
a. How would the components of x be distributed among the processes in
a program that used a block distribution?
b. How would the components of x be distributed among the processes in
a program that used a cyclic distribution?
c. How would the components of x be distributed among the processes in
a program that used a block-cyclic distribution with blocksize b = 2?
You should try to make your distributions general so that they could be used regardless of what comm_sz and n are. You should also try to make your distributions fair so that if q and r are any two processes, the difference between the number of components assigned to q and the number of components assigned to r is as small as possible.
Solution:
a) Block distribution
Process 0: x0, x1, x2, x3
Process 1: x4, x5, x6, x7
Process 2: x8, x9, x10
Process 3: x11, x12, x13
b) Cyclic distribution
Process 0: x0, x4, x8, x12
Process 1: x1, x5, x9, x13
Process 2: x2, x6, x10
Process 3: x3, x7, x11
c) Block-cyclic distribution (b = 2)
Process 0: x0, x1, x8, x9
Process 1: x2, x3, x10, x11
Process 2: x4, x5, x12, x13
Process 3: x6, x7
Problem 7
What do the various MPI collective functions do if the communicator contains a single process?
Solution:
If the communicator contains only one process, the collective behaves like a serial program: the single process is both the source and the destination of the data. If there is some blocking situation involving other ranks, the program may end up blocked.
The functions tested that behaved normally with a single process were MPI_Bcast, MPI_Gather, MPI_Scatter, MPI_Allgather, MPI_Reduce and MPI_Allreduce.
Problem 8
Suppose comm_sz = 8 and n = 16.
a. Draw a diagram that shows how MPI_Scatter can be implemented using tree-structured communication with comm_sz processes when process 0 needs to distribute an array containing n elements.
b. Draw a diagram that shows how MPI_Gather can be implemented using tree-structured communication when an n-element array that has been distributed among comm_sz processes needs to be gathered onto process 0.
a)
b)
Problem 9
Write an MPI program that implements multiplication of a vector by a scalar and dot product. The user should enter two vectors and a scalar, all of which are read in by process 0 and distributed among the processes. The results are calculated and collected onto process 0, which prints them. You can assume that n, the order of the vectors, is evenly divisible by comm_sz.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

void ler_tam(int *local_n_p, int *n_p, int my_rank, int comm_sz,
             MPI_Comm comm);
void Ler_val(double *local_vetor1, double *local_vetor2,
             double *escalar_p, int local_n, int my_rank, int comm_sz,
             MPI_Comm comm);
void Imprime_vetor(double local_vec[], int local_n, int n,
                   int my_rank, MPI_Comm comm);

/* ... */
    Imprime_vetor(local_escalar_mult1, local_n, n, my_rank, comm);
    printf("\nProduct of the second vector by the scalar:\n");
    Imprime_vetor(local_escalar_mult2, local_n, n, my_rank, comm);
    free(local_escalar_mult2);
    free(local_escalar_mult1);
    free(local_vetor2);
    free(local_vetor1);
    MPI_Finalize();
    return 0;
}

void ler_tam(int *local_n_p, int *n_p, int my_rank, int comm_sz,
             MPI_Comm comm) {
    if (my_rank == 0) {
        printf("Enter the size of the vector\n");
        scanf("%d", n_p);
    }
    /* n is an int, so it must be broadcast as MPI_INT, not MPI_DOUBLE */
    MPI_Bcast(n_p, 1, MPI_INT, 0, comm);
    *local_n_p = *n_p / comm_sz;
}
Problem 10
Solution:
It is correct, because the aliasing error occurs only with output or input/output parameters. Since the program uses the two arguments only as input parameters, there is no aliasing problem, which arises when two parameters access the same block of memory.
Problem 12
An alternative to a butterfly-structured allreduce is a ring-pass structure. In a ring-pass, if there are p processes, each process q sends data to process q + 1, except that process p − 1 sends data to process 0. This is repeated until each process has the desired result. Thus, we can implement allreduce with the following code:
sum = temp_val = my_val;
for (i = 1; i < p; i++) {
    MPI_Sendrecv_replace(&temp_val, 1, MPI_INT, dest,
        sendtag, source, recvtag, comm, &status);
    sum += temp_val;
}
a. Write an MPI program that implements this algorithm for allreduce. How
does its performance compare to the butterfly-structured allreduce?
b. Modify the MPI program you wrote in the first part so that it implements
prefix sums.
Solution:
a) With the input values
Proc 0 - x
Proc 1 - y
Proc 2 - z
Proc 3 - w
the result is
Proc 0 - x+y+z+w
Proc 1 - x+y+z+w
Proc 2 - x+y+z+w
Proc 3 - x+y+z+w
b)
int i, sum, temp_val;
sum = temp_val = my_val;
for (i = 1; i < comm_sz; i++) {
    /* after this call, temp_val holds the value that originated on
       process (my_rank - i + comm_sz) % comm_sz */
    MPI_Sendrecv_replace(&temp_val, 1, MPI_INT,
        (my_rank + 1) % comm_sz, 0,
        (my_rank - 1 + comm_sz) % comm_sz, 0,
        comm, MPI_STATUS_IGNORE);
    /* for a prefix sum, only add values that came from lower ranks */
    if (i <= my_rank) sum += temp_val;
}
Output:
Input values:
Proc 0 - x
Proc 1 - y
Proc 2 - z
Proc 3 - w
Prefix sums:
Proc 0 - x
Proc 1 - x+y
Proc 2 - x+y+z
Proc 3 - x+y+z+w
Problem 14
a. Write a serial C program that defines a two-dimensional array in the main function. Just use numeric constants for the dimensions:
int two_d[3][4];
Initialize the array in the main function. After the array is initialized, call a function that attempts to print the array. The prototype for the function should look something like this:
void Print_two_d(int two_d[][], int rows, int cols);
After writing the function, try to compile the program. Can you explain why it won't compile?
b. After consulting a C reference (e.g., Kernighan and Ritchie [29]), modify the program so that it will compile and run, but so that it still uses a two-dimensional C array.
Solution:
a)
int main(void) {
    int two_d[3][4];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++) {
            two_d[i][j] = i + j*2;
        }
    Print_two_d(two_d, 3, 4);
    return 0;
}
/* Does not compile: int two[][] gives the compiler no column
   dimension, so it cannot compute the offset of two[i][j]. */
void Print_two_d(int two[][], int rows, int cols) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            printf("%d ", two[i][j]);
        }
        printf("\n");
    }
}
b)
Two approaches work. The first passes the column size in the parameter declaration, so only the function's parameter list changes:
void Print_two_d(int two[][4], int rows, int cols);
int main(void) {
    int two_d[3][4];
    /* ... initialize as in part (a) ... */
    Print_two_d(two_d, 3, 4);
    return 0;
}
The second allocates the array dynamically:
int main(void) {
    int i, j;
    /* We have a vector that references each row, and each row in
       turn is a vector that references the columns of that row.
       In short, one vector per row; we saw that we can reference
       vectors through pointers. */
    int **two = (int**)malloc(3 * sizeof(int*));
    for (i = 0; i < 3; i++) {
        two[i] = (int*)malloc(4 * sizeof(int));
    }
    for (i = 0; i < 3; i++)
        for (j = 0; j < 4; j++) {
            two[i][j] = i * 2;
        }
    Print_two_d(two, 3, 4);
    for (i = 0; i < 3; i++)
        free(two[i]);   /* free each row before the row vector */
    free(two);
    return 0;
}
void Print_two_d(int **two, int rows, int cols) {
    int i, j;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++)
            printf("%d ", two[i][j]);
        printf("\n");
    }
}
Problem 15
What is the relationship between the row-major storage for two-dimensional arrays that we discussed in Section 2.2.3 and the one-dimensional storage we use in Section 3.4.9?
Solution:
The two storage schemes are the same. Both take a two-dimensional matrix and store it as a one-dimensional array: the first row is stored, then the second row, and so on.
Problem 16
Suppose comm_sz = 8 and the vector x = (0, 1, 2, . . . , 15) has been distributed among the processes using a block distribution. Draw a diagram illustrating the steps in a butterfly implementation of allgather of x.
Solution:
Problem 17
MPI_Type_contiguous can be used to build a derived datatype from a collection of contiguous elements in an array. Its syntax is
int MPI_Type_contiguous(
      int           count        /* in  */,
      MPI_Datatype  old_mpi_t    /* in  */,
      MPI_Datatype* new_mpi_t_p  /* out */);
Modify the Read_vector and Print_vector functions so that they use an MPI datatype created by a call to MPI_Type_contiguous and a count argument of 1 in the calls to MPI_Scatter and MPI_Gather.
Solution:
void Read_vector(double local_vec[], int local_n, int n,
                 char vec_name[], MPI_Datatype tipo) {
    /* ... */
    if (my_rank == 0) {
        vetor = malloc(n * sizeof(double));
        printf("Enter the vector %s\n", vec_name);
        for (i = 0; i < n; i++)
            scanf("%lf", &vetor[i]);
        /* count of 1: tipo already describes local_n doubles */
        MPI_Scatter(vetor, 1, tipo, local_vec, 1, tipo, 0, comm);
        free(vetor);
    } else {
        MPI_Scatter(NULL, 1, tipo, local_vec, 1, tipo, 0, comm);
    }
}
void Print_vector(double local_vec[], int local_n, int n,
                  char title[], MPI_Datatype tipo) {
    /* ... */
    if (my_rank == 0) {
        vec = malloc(n * sizeof(double));
        MPI_Gather(local_vec, 1, tipo, vec, 1, tipo, 0, comm);
        for (i = 0; i < n; i++)
            printf("%f ", vec[i]);
        free(vec);
    } else {
        MPI_Gather(local_vec, 1, tipo, NULL, 1, tipo, 0, comm);
    }
}
Result:
Vector size: 4
Vector 1 = 4, 4, 4, 4
Vector 1 is
4.000000 4.000000 4.000000 4.000000
Vector 2 = 4 4 4 4
Vector 2 is
4.000000 4.000000 4.000000 4.000000
The sum is
8.000000 8.000000 8.000000 8.000000
Problem 19
MPI_Type_indexed can be used to build a derived datatype from arbitrary array elements. Its syntax is
int MPI_Type_indexed(
      int           count                     /* in  */,
      int           array_of_blocklengths[]   /* in  */,
      int           array_of_displacements[]  /* in  */,
      MPI_Datatype  old_mpi_t                 /* in  */,
      MPI_Datatype* new_mpi_t_p               /* out */);
Unlike MPI_Type_create_struct, the displacements are measured in units of old_mpi_t, not bytes. Use MPI_Type_indexed to create a derived datatype that corresponds to the upper triangular part of a square matrix. For example, in the 4 × 4 matrix, the upper triangular part is the elements 0, 1, 2, 3, 5, 6, 7, 10, 11, 15. Process 0 should read in an n × n matrix as a one-dimensional array, create the derived datatype, and send the upper triangular part with a single call to MPI_Send. Process 1 should receive the upper triangular part with a single call to MPI_Recv and then print the data it received.
Solution:
//mpicc -g -Wall -std=c99 -o problema3_19 problema3_19.c
//mpiexec -n 2 ./problema3_19
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

void ler_tam(int *n_p, int my_rank, MPI_Comm comm);

int main(void) {
    int my_rank, comm_sz, n;
    double *matriz;
    int *bloco_disp;
    int *bloco_tam;
    MPI_Comm comm;
    MPI_Datatype tipo;

    MPI_Init(NULL, NULL);
    comm = MPI_COMM_WORLD;
    MPI_Comm_size(comm, &comm_sz);
    MPI_Comm_rank(comm, &my_rank);
    ler_tam(&n, my_rank, comm);
    matriz = malloc(n*n*sizeof(double));
    bloco_disp = malloc(n*sizeof(int));
    bloco_tam = malloc(n*sizeof(int));
    /* row i of the upper triangle starts at element i*n + i and has
       n - i elements */
    int disp = 0;
    int tam = n;
    for (int i = 0; i < n; i++) {
        bloco_tam[i] = tam;
        bloco_disp[i] = disp + i*n;
        disp++; tam--;
    }
    MPI_Type_indexed(n, bloco_tam, bloco_disp, MPI_DOUBLE, &tipo);
    MPI_Type_commit(&tipo);
    free(bloco_disp);
    free(bloco_tam);
    if (my_rank == 0) {
        printf("Enter the matrix values\n");
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                scanf("%lf", &matriz[j + i*n]);
        MPI_Send(matriz, 1, tipo, 1, 0, comm);
    } else if (my_rank == 1) {
        /* zero the matrix first (ran with this line, but there was
           no error without it) */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                matriz[j + i*n] = 0;
        MPI_Recv(matriz, 1, tipo, 0, 0, comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%.0f ", matriz[j + i*n]);
            printf("\n");
        }
    }
    free(matriz);
    MPI_Type_free(&tipo);
    MPI_Finalize();
    return 0;
}

void ler_tam(int *n_p, int my_rank, MPI_Comm comm) {
    if (my_rank == 0) {
        printf("Enter the size of n:\n");
        scanf("%d", n_p);
    }
    MPI_Bcast(n_p, 1, MPI_INT, 0, comm);
}
Problem 20
The functions MPI_Pack and MPI_Unpack provide an alternative to derived datatypes for grouping data. MPI_Pack copies the data to be sent, one block at a time, into a user-provided buffer. The buffer can then be sent and received. After the data is received, MPI_Unpack can be used to unpack it from the receive buffer. The syntax of MPI_Pack is
int MPI_Pack(
      void*         in_buf        /* in     */,
      int           in_buf_count  /* in     */,
      MPI_Datatype  datatype      /* in     */,
      void*         pack_buf      /* out    */,
      int           pack_buf_sz   /* in     */,
      int*          position_p    /* in/out */,
      MPI_Comm      comm          /* in     */);
We could therefore pack the input data to the trapezoidal rule program with the following code:
char pack_buf[100];
int position = 0;
MPI_Pack(&a, 1, MPI_DOUBLE, pack_buf, 100, &position, comm);
MPI_Pack(&b, 1, MPI_DOUBLE, pack_buf, 100, &position, comm);
MPI_Pack(&n, 1, MPI_INT, pack_buf, 100, &position, comm);
The key is the position argument. When MPI_Pack is called, position should refer to the first available slot in pack_buf. When MPI_Pack returns, it refers to the first available slot after the data that was just packed, so after process 0 executes this code, all the processes can call MPI_Bcast.
The syntax of MPI_Unpack is
int MPI_Unpack(
      void*         pack_buf       /* in     */,
      int           pack_buf_sz    /* in     */,
      int*          position_p     /* in/out */,
      void*         out_buf        /* out    */,
      int           out_buf_count  /* in     */,
      MPI_Datatype  datatype       /* in     */,
      MPI_Comm      comm           /* in     */);
This can be used by reversing the steps in MPI_Pack; that is, the data is unpacked one block at a time, starting with position = 0.
Write another Get_input function for the trapezoidal rule program. This one should use MPI_Pack on process 0 and MPI_Unpack on the other processes.
Solution:
/* ... in main, after Get_input: each process's subinterval of
   integration starts at: */
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
local_int = Trap(local_a, local_b, local_n, h);
/* Get_input */
/*------------------------------------------------------------------
 * Function:    Trap
 * Purpose:     Serial function for estimating a definite integral
 *              using the trapezoidal rule
 * Input args:  left_endpt, right_endpt, trap_count, base_len
 * Return val:  Trapezoidal rule estimate of integral from
 *              left_endpt to right_endpt using trap_count trapezoids
 */
double Trap(
      double left_endpt  /* in */,
      double right_endpt /* in */,
      int    trap_count  /* in */,
      double base_len    /* in */) {
double estimate, x;
int i;
estimate = (f(left_endpt) + f(right_endpt))/2.0;
for (i = 1; i <= trap_count-1; i++) {
x = left_endpt + i*base_len;
estimate += f(x);
}
estimate = estimate*base_len;
return estimate;
} /* Trap */
/*------------------------------------------------------------------
 * Function:    f
 * Purpose:     Compute value of function to be integrated
 * Input args:  x
 */
double f(double x) {
return x*x;
} /* f */
Result:
Enter the values of a, b, n respectively
2
10
2
With n = 2 trapezoids, our estimate
of the integral from 2.000000 to 10.000000 = 6.400000000000000e+01
Problem 21
How does your system compare to ours? What run-times does your system get for matrix-vector multiplication? What kind of variability do you see in the times for a given value of comm_sz and n? Do the results tend to cluster around the minimum, the mean, or the median?
Problem 22
Time our implementation of the trapezoidal rule that uses MPI_Reduce. How will you choose n, the number of trapezoids? How do the minimum times compare to the mean and median times? What are the speedups? What are the efficiencies? On the basis of the data you collected, would you say that the trapezoidal rule is scalable?
Solution:
n was chosen so that the run-times are on the order of milliseconds, because if the measurements are not of comparable magnitude, information may be lost.
The mean is 4.13 and the median is 2.3 (2.26–2.43).
From the data we can deduce that the problem is weakly scalable, since the efficiency decreases when we increase the number of processes.
Problem 24
Take a look at Programming Assignment 3.7. The code that we outlined for
timing the cost of sending messages should work even if the count argument
is zero. What happens on your system when the count argument is 0? Can
you explain why you get a nonzero elapsed time when you send a zero-byte
message?
Solution:
The elapsed time will be nonzero because, even though the message carries no data, the envelope still contains the tag and the communicator, so an MPI connection is still maintained even when there is no data to send.
Problem 25
If comm_sz = p, we mentioned that the ideal speedup is p. Is it possible to do better?
a. Consider a parallel program that computes a vector sum. If we only time the vector sum, that is, we ignore input and output of the vectors, how might this program achieve speedup greater than p?
b. A program that achieves speedup greater than p is said to have superlinear speedup. Our vector sum example only achieved superlinear speedup by overcoming certain resource limitations. What were these resource limitations? Is it possible for a program to obtain superlinear speedup without overcoming resource limitations?
Solution:
Problem 26
Serial odd-even transposition sort of an n-element list can sort the list in
considerably fewer than n phases. As an extreme example, if the input list
is already sorted, the algorithm requires 0 phases.
a. Write a serial Is_sorted function that determines whether a list is sorted.
b. Modify the serial odd-even transposition sort program so that it checks
whether the list is sorted after each phase.
c. If this program is tested on a random collection of n-element lists,
roughly what fraction get improved performance by checking whether the
list is sorted?
Solution:
a)
int Is_sorted(int vetor[], int tam)
{
    for (int i = 1; i < tam; i++) {
        /* ascending order: a decrease means the list is not sorted */
        if (vetor[i-1] > vetor[i]) {
            return 0;
        }
    }
    return 1;
}
Result: the list is sorted
b)
int a[];
int n;
int phase, i, temp;
for (phase = 0; phase < n; phase++) {
    if (Is_sorted(a, n)) {
        break;
    }
    if (phase % 2 == 0) {
        for (i = 1; i < n; i += 2) {
            if (a[i-1] > a[i]) {
                temp = a[i];
                a[i] = a[i-1];
                a[i-1] = temp;
            }
        }
    } else {
        for (i = 1; i < n-1; i += 2) {
            if (a[i] > a[i+1]) {
                temp = a[i];
                a[i] = a[i+1];
                a[i+1] = temp;
            }
        }
    }
}