
Data Analysis using WinBUGS: A Brief Tutorial

Om Prakash Singh
singhop_upc@yahoo.com

Udai Pratap (Autonomous) College Varanasi

Overview
BUGS stands for Bayesian Inference Using Gibbs Sampling. The BUGS family includes:
- Classic BUGS
- WinBUGS (Windows version)
- GeoBUGS (spatial models)
- PKBUGS (pharmacokinetic modelling)
- OpenBUGS

The Classic BUGS program uses a text-based model description and a command-line interface, and versions are available for the major computer platforms. However, it is no longer being developed. WinBUGS is software for Bayesian inference using the MCMC (Markov Chain Monte Carlo) method. It has an easy interface and requires only simple programming skills. WinBUGS is free and can be downloaded from: http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

WinBUGS (version 1.4.3) is a Windows program with a graphical user interface, the standard point-and-click Windows interface, and on-line monitoring and convergence diagnostics. WinBUGS is a stand-alone program, although it can at present be called from other statistical software such as R, SAS, STATA, Excel and Matlab; the list of such software is continually growing. Follow this link for detailed and more recent information: http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/remote14.shtml

WinBUGS is a joint endeavour between the MRC Biostatistics Unit in Cambridge and the Department of Epidemiology and Public Health of Imperial College, London.

GeoBUGS is an add-on to WinBUGS that fits spatial models and produces a range of maps as output. PKBUGS is an efficient and user-friendly interface for specifying complex population pharmacokinetic and pharmacodynamic (PK/PD) models within WinBUGS.

For a version of BUGS with open source code, visit the OpenBUGS site, which is maintained by the Department of Mathematics and Statistics, University of Helsinki, Finland. It contains more features than WinBUGS, but does not provide a support/discussion forum. http://mathstat.helsinki.fi/openbugs/

Why WinBUGS ?
- It can analyse complex statistical models with missing data, or models without a closed form for the posterior distribution.
- It has built-in support for many popular probability distributions.
- It provides facilities for censoring and truncation of distributions.
- It offers two different ways to specify a Bayesian probability model:
  - DoodleBUGS: direct graphics
  - the BUGS language
- The software is free to download.
- Many examples and tutorials are included with the WinBUGS software.
- Many online resources are available at http://www.mrc-bsu.cam.ac.uk/bugs/weblinks/webresource.shtml
- The (actively managed) BUGS email discussion list of worldwide WinBUGS users provides a platform for serious and beneficial discussion of the problems faced by researchers at all levels.
- It interfaces to R, with potentially useful packages for convergence diagnostics:
  - CODA (Convergence Diagnostic and Output Analysis) for S+ or R
  - BOA (Bayesian Output Analysis) for S+ or R

! CAUTION !
The WinBUGS manual carries a caution to users: MCMC sampling can be dangerous! It notes that WinBUGS might simply crash (which is not very good), but it might instead carry on and produce answers that are wrong (which is even worse).

Installation
Download the package WinBUGS14.exe for the Windows OS from the BUGS site, exit all other programs currently running, then install it as follows:
1. Double-click on the downloaded WinBUGS14.exe.
2. Follow the instructions in the dialog box.
3. You should now have a new directory called WinBUGS14 within Program Files.
4. Inside the WinBUGS14 directory you will find a program called WinBUGS14.exe with a pretty WinBUGS icon.
5. Right-click on this WinBUGS icon, select 'create shortcut', then drag the shortcut to the desktop.
6. Double-click on WinBUGS14.exe or its shortcut to run WinBUGS14.
7. Obtain the key for unrestricted use by registering at http://www.mrcbsu.cam.ac.uk/bugs/winbugs/register.shtml
8. During registration, check the box to join the BUGS email discussion list.
9. Follow the instructions in the automatic response to your registration. You will receive a key via email to remove the restrictions in WinBUGS 1.4.
10. Download and install the patch for 1.4.3, which removes many bugs in the main version.

Compound Documents
The WinBUGS software has been designed to produce its output directly to a compound document and to take its input directly from a compound document. A compound document contains various types of information (formatted text, tables, formulae, plots, graphs, etc.) displayed in a single window and stored in a single file. The tools needed to create and manipulate these information types are always available, so there is no need to move continually between different programs.

Text is selected by holding down the left mouse button while dragging the mouse over a region of text. A selection can be moved to a new position by dragging it with the mouse. To copy the selection, hold down the "control" key while releasing the mouse button. These operations work across windows and across applications, so the problem specification and the output can both be pasted into a single document, which can then be copied into another word processor or presentation package. The style, size, font and colour of selected text can be changed using the Attributes menu.


MAIN STEPS IN DATA ANALYSIS
1. Build a suitable probability model for the data.
2. Estimate the parameters of the probability model.
3. Fit the model to the observed data.
4. Evaluate the quality of the model's fit to the observed data.
5. Compare different models to choose a suitable one.
6. Use the selected model for decision making, forecasting, etc.

BAYESIAN DATA ANALYSIS
The first step is much the same in both classical and Bayesian methods, but the Bayesian method differs from the classical in the other steps of data analysis. A Bayesian approach to a problem starts with the formulation of a probability model thought adequate to describe the underlying mechanism of the process, based on past studies and the sample collection process. The next step is to assign prior distributions to the parameters θ of the model. Here θ is completely general and may be a vector-valued quantity: if the model is binomial, θ may be n and p; if the model is Poisson, θ may be λ. The parameters of the model are unobservable quantities of ultimate interest. The prior is intended to capture beliefs about the situation, based on past experience, before seeing the data. After observing the data, Bayes' rule is applied to obtain a posterior distribution for these unobserved parameters: the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data. It takes into account both the prior knowledge about the parameters and the observed data.

The idea can be expressed symbolically as follows.

1. Suppose we have data x, and unknowns θ, which might be model parameters, missing data, or events we did not observe directly or exactly. The Bayesian approach is concerned with making statements about θ given the data x (the posterior of θ).

2. Let the prior distribution of θ be p(θ). This probability quantifies the uncertainty about θ before taking the observed data into account.

3. Let the likelihood function be p(x | θ). The likelihood function provides the distribution of the data x given the parameter value θ. It may be a binomial likelihood, a normal likelihood, a likelihood from a regression equation with an associated normal residual variance, a logistic regression model, etc.

4. The posterior distribution is p(θ | x). It summarizes the information in the data x together with the information in the prior distribution p(θ). Thus it summarizes what is known about the parameter of interest θ after the data are collected, using Bayes' theorem:

   posterior = (prior × likelihood) / marginal

   p(θ | x) = p(θ) p(x | θ) / ∫ p(θ) p(x | θ) dθ ∝ p(θ) p(x | θ)

5. Once we have found the posterior distribution of the parameter θ, we may easily obtain an estimate of the expectation of f(θ), a function of θ. Suppose we have obtained a random sample θ1, θ2, ..., θn of size n from the posterior distribution of θ. Then the expectation E(f(θ)) can be approximated by the average, provided n is sufficiently large:

   E(f(θ)) ≈ (1/n) Σ(i=1..n) f(θi)

6. The final step in the analysis is the evaluation of the fit of the model to the data, the implications of the resulting posterior distributions, and the sensitivity of the conclusions to the assumptions. This stage answers queries such as: Are the substantive conclusions obtained reasonable? How sensitive are the results to changes in the various modelling assumptions?
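Steps 4 and 5 above can be sketched numerically. The following Python snippet (an illustrative sketch, not part of WinBUGS) applies Bayes' theorem on a grid for hypothetical binomial data (7 successes in 20 trials, with a flat prior), then approximates a posterior expectation by averaging draws:

```python
import numpy as np

# Hypothetical data: 7 successes in n = 20 Bernoulli trials.
x, n = 7, 20

# Discretise theta on a grid and apply Bayes' theorem numerically:
# posterior ∝ prior × likelihood, normalised by the marginal (here a sum).
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)                    # flat prior p(theta)
likelihood = theta**x * (1 - theta)**(n - x)   # binomial kernel p(x | theta)
unnorm = prior * likelihood
posterior = unnorm / unnorm.sum()              # normalise -> p(theta | x)

# Step 5: approximate E[f(theta)] by averaging f over posterior draws.
rng = np.random.default_rng(0)
draws = rng.choice(theta, size=50_000, p=posterior)
print(np.mean(draws))   # E[theta | x]; with a flat prior this is near 8/22
```

With a flat prior the exact posterior is Beta(8, 14), so the printed mean should be close to 8/22 ≈ 0.364; the grid and sampling steps mirror what MCMC does for models where no such closed form exists.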

MCMC (Markov Chain Monte Carlo) METHOD
The MCMC method is an iterative numerical method used to evaluate difficult integrals of complex expressions. The basic idea behind MCMC has been known in physics for about 50 years; its use in statistics began only as late as the 1980s. The main difference between the standard Monte Carlo method and the MCMC method is the dependence structure between successive simulated values drawn from the distribution. The standard Monte Carlo method produces a set of independent simulated values, whereas the MCMC method produces a chain of simulated values in which each value depends on the preceding one. The basic principle, based on the ergodic theorem for Markov chains, is that once the chain has run sufficiently long and has converged, it will approximate the desired integral. The exact sampling method used by WinBUGS varies across different types of models: the program itself chooses the sampling method for the given set of conditions, and this choice may be modified by the user. To learn more about these sampling schemes, see the MCMC methods section in the Introduction chapter of the WinBUGS user manual that comes with the program (in electronic form). Note that WinBUGS does not allow improper priors.

Model specification through DoodleBUGS


Doodle graphics consist of three elements: nodes, plates and edges. The graph is built up from these elements using the mouse and keyboard, controlled from the Doodle menu. Doodle Help provides quick tips.

Nodes: point the mouse cursor at an empty region of the Doodle window and click to create a node. There are three types of node:
(1) Logical nodes: deterministic functions of other nodes.
(2) Stochastic nodes: variables having a probability distribution.
(3) Constants: numerical values or unknown constants.
An ellipse denotes a stochastic or deterministic node; a rectangle denotes a constant.

Edges come in two types: a solid arrow denotes a stochastic dependence; a hollow arrow denotes a logical function.

A plate represents repeated structure (ctrl+click to create one).

The Doodle menu options are used to design the graphical model:
1. New: opens a new window for drawing Doodles. A dialog box allows a choice of the size of the Doodle graphic and the size of the graph's nodes.
2. Grid: when the snap grid is on, the centre of each node and each corner of each plate is constrained to lie on the grid.
3. Scale Model: shrinks the Doodle so that the size of each node and plate, plus the separation between them, is reduced by a constant factor.
4. Remove Selection: removes the highlighting from the selected node or plate of the Doodle, if any.
5. Write Code: opens a window containing the BUGS-language equivalent of the Doodle. After constructing a Doodle, you are strongly recommended to use Write Code to check that the structure is what you intended.

[Doodle screenshots: a graph containing nodes alpha, beta, tau, mu[i], sigma and Y[i], with mu[i] and Y[i] inside a plate for(i IN 1 : N). In one screenshot the node dialog shows Y[i] as a stochastic node with density dnorm, mean mu[i] and precision tau, plus optional lower and upper bounds; in another, tau is shown as a constant node. The plate dialog has fields index, from and up to.]

Model Specification Through the BUGS Language
Open a compound document by clicking the New option in the File menu and write the BUGS program. Some basic elements of the BUGS language:

1. Lexical conventions. Node names may contain letters, numbers and periods; they must start with a letter and must not end with a period (.). Upper case, lower case or mixed case may be used, but note that names are case sensitive. Names are restricted to 32 characters.

2. Indexing. Vectors and matrices are indexed within square brackets [, ,]. There is no limit on the dimensionality of arrays.

3. Numbers. Standard or exponential notation may be used. A decimal point must be included in the exponential format. Legal notations for 0.0001: .0001, 0.0001, 1.0E-4, 1.0e-4, 1.E-4, 1.0E-04, but NOT 1E-4.

4. Model statement.
   model { text-based statements describing the model in the BUGS language }
Arbitrary spaces may be used in statements. Multiple statements may appear on a single line, and one statement may extend over several lines. Comments begin with a hash #.

5. Model description.
Constants are fixed by the design of the study and may be known or unknown.
Stochastic nodes are variables (data or parameters) that are given a distribution, e.g. x ~ dbin(p, n), where the tilde ~ means 'is distributed as'. The parameters of a distribution must be explicit: scalar parameters can be numerical constants, but function expressions are not allowed. Currently, WinBUGS can handle 20 distributions (see the distributions chapter of the WinBUGS manual).
Deterministic nodes are logical functions of other nodes, e.g. pred <- beta0 + beta1 * x or log(p) <- Q, where <- means 'is to be replaced by'.

Logical expressions can be built using +, -, *, / and unary minus. Built-in functions may also be used in logical expressions (see the list of functions in Table I of the Model Specification chapter of the WinBUGS manual).

for loops:
   for (name in expression1:expression2) { statements }
The two expressions must be fixed integer quantities.

Example for loop for simple regression, with response Y, explanatory variable x, N independent observations and a normality assumption:
   for (i in 1:N) {
      Y[i] ~ dnorm(mu[i], tau)
      mu[i] <- alpha + beta*x[i]
   }
where tau is the precision of the normal distribution, i.e. 1/variance.

The declarative structure of model specification in the BUGS language requires that each node appear once and only once on the left-hand side of a statement. Data transformation is the only exception.
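The precision parameterisation trips up users coming from mean/standard-deviation notation. This Python sketch simulates data from the regression model above (the values of alpha, beta and tau are arbitrary illustrations) to make concrete that dnorm(mu, tau) uses tau = 1/variance, so sd = 1/sqrt(tau):

```python
import numpy as np

# Simulate Y[i] ~ Normal(mu[i], precision tau), mu[i] = alpha + beta*x[i].
# NumPy's normal() takes a standard deviation, so sd = 1/sqrt(tau).
rng = np.random.default_rng(2)
alpha, beta, tau, N = 2.0, 0.5, 4.0, 1000     # illustrative values
x = np.arange(1, N + 1, dtype=float)
mu = alpha + beta * x
Y = rng.normal(loc=mu, scale=1.0 / np.sqrt(tau), size=N)

print(round((Y - mu).std(), 2))   # residual sd, near 1/sqrt(4) = 0.5
```

Misreading tau as a variance or a standard deviation here would produce residuals twice or four times too spread out, which is a common source of silently wrong WinBUGS models.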

Data transformation
Suppose we have data y and want to make a node for √y; the following statements could be used:
   for (i in 1:N) {
      z[i] <- sqrt(y[i])
      z[i] ~ dnorm(mu, tau)
   }
This construction is only possible when transforming observed data with no missing values.

Censoring
   interval censored: y ~ ddist(theta)I(lower, upper)
   right censored: y ~ ddist(theta)I(lower, )
   left censored: y ~ ddist(theta)I( , upper)
This construct should NOT be used for modelling truncated distributions. A truncated distribution may be handled by working out a (usually complex) algebraic form for the likelihood and using the techniques for arbitrary distributions (the zeros trick or the ones trick) discussed in the manual, in the section 'Specifying a new sampling distribution' of the chapter 'Tricks: Advanced Use of the BUGS Language'.

Indexing
The operators +, -, *, / and appropriate bracketing can be used in an index, e.g. y[(I+J)*K]. However, functions of unobserved nodes are not permitted to appear directly as an index term.

Implicit indexing: n:m represents n, n+1, ..., m; x[] represents all values of a vector x; y[,3] indicates all values of the third column of a two-dimensional array y.

Formatting of data
Scalars and arrays are named and given values in a single structure in S-Plus format, headed by the keyword list (there must be no space after list). Example:
   list( xbar = 22, N = 30, T = 5, x = c(8.0, 15.0, 22.0, 29.0, 36.0) )
The whole of the data must be specified; it is not possible to specify only selected components. Missing values are represented as NA. All variables in a data file must be defined in the model. Alternatively, use rectangular format. Initial values are specified following the same rules as data, and may also be given in S-Plus format.

Writing a program in the BUGS language
From the File menu, click New to open a new compound document, or click Open to open an existing document. WinBUGS works with several file types, with extensions .odc, .txt, .rtf, etc. However, a compound document should be saved with the .odc extension to store data, program, graphics, etc. together.

Running the model in WinBUGS
Under the Model menu, click Specification and you will see the following window.

[Specification Tool dialog with buttons: check model, load data, compile, load inits, gen inits; and fields: num of chains and for chain.]

check model: highlight the word model in the BUGS program (or bring the Doodle to the top of the active screen) and click the button. The message 'model is syntactically correct', or error information, appears on the status line (lower left corner of the screen).
load data: highlight the word list in the data format and click the button to load the data. The message 'data loaded', or error information, appears on the status line.
num of chains: enter the number of chains you wish to simulate.
compile: click to build the data structures needed to carry out Gibbs sampling. The message 'model compiled', or error information, appears on the status line.
load inits: click to load initial values for the nodes, in exactly the same way as data. Status line information: 'initial values loaded: model contains uninitialized nodes --- try gen inits', or 'initial values loaded: model initialized --- now model ready to update', or error information.
gen inits: generates initial values by sampling from the prior.

Once the model has been compiled and initialized, it is ready to run MCMC and obtain samples from the posterior distribution. Under the Model menu, click Update.
[Update Tool dialog with fields: updates, refresh, thin and iteration; the update button; and over relax and adapting checkboxes.]

Enter the number of MCMC updates to be carried out in the updates field, then click the update button to start updating the model.
burn-in: the first thousand or more samples are discarded to overcome the influence of the initial values and let the transitional distribution converge to the stationary distribution.
refresh: the number of updates between redrawings of the screen.
thin: enter k in this field to store every kth iteration. The samples within each chain are not independent; this dependence becomes smaller as samples are taken further apart.
over relax: at each iteration, generates multiple samples and selects one that is negatively correlated with the current value. This reduces within-chain autocorrelation, though it is not always effective.
adapting: indicates the initial tuning phase of the Metropolis or slice-sampling algorithm. The first 4000 and 500 iterations, respectively, are discarded.
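The effect of burn-in and thinning on a stored chain is simple to state in code. This Python sketch (illustrative only; WinBUGS does this internally) drops the first B iterations and keeps every kth of the remainder:

```python
import numpy as np

# Drop the first `burn_in` iterations, then keep every `thin`-th sample.
def burn_and_thin(chain, burn_in=1000, thin=5):
    return chain[burn_in::thin]

chain = np.arange(10_000)            # stand-in for 10,000 stored updates
kept = burn_and_thin(chain)
print(kept.size, kept[0], kept[1])   # 1800 draws: iterations 1000, 1005, ...
```

With 10,000 updates, a burn-in of 1000 and thin = 5, 1800 samples remain: iterations 1000, 1005, 1010 and so on.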

Monitoring chains
Under the Inference menu, click Samples:


[Sample Monitor Tool dialog with fields: node, beg, end, thin, chains ... to, and a percentiles selection list; buttons: set, clear, coda, stats, trace, history, density, quantiles, auto cor and bgr diag.]

node: the variable of interest must be typed into this text field before updating in order to be monitored. * serves as shorthand for all stored nodes. deviance (minus twice the log-likelihood) is automatically calculated at each iteration and can be monitored by typing in deviance.
set: click to finish specifying a variable of interest.
beg and end: select a subset of the stored samples for analysis, particularly for discarding burn-in.
chains ... to: select the chains to be analysed.
clear: removes the stored values of the variable.
trace: a dynamic plot of the variable's value against iteration number, redrawn each time the screen is redrawn.
quantiles: plots the running mean, with a running 95% interval (credible set), against iteration number.
history: a complete trace of the variable.

auto cor: plots the autocorrelation function of the variable out to 50 lags.
bgr diag: the Gelman-Rubin convergence statistic, as modified by Brooks and Gelman (1998). The green line shows the width of the central 80% interval of the pooled runs; the blue line shows the average width of the 80% intervals within the individual runs; and the red line shows the ratio R of pooled to within widths, calculated in bins of length 50. For convergence, R should converge to 1, and both the pooled and within interval widths should converge to stability.
stats: summary statistics for the variable, pooled over the chains selected.

Checking convergence
Checking convergence requires considerable care. It is very difficult to say conclusively that a chain (simulation) has converged; the various convergence diagnostics can only tell you when it definitely has not! The following are practical guidelines for assessing convergence:
* For models with many parameters, it is impractical to check convergence for every parameter, so choose a random selection of relevant parameters to monitor. For example, rather than checking convergence for every element of a vector of random effects, just choose a random subset (say, the first 5 or 10).
* Examine trace plots of the sampled values against iteration number, looking for evidence of when the simulation appears to have stabilised.
* With multiple chains, we can be reasonably confident that convergence has been achieved if all the chains appear to be overlapping one another.
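The idea behind the bgr diag button can be sketched with a simplified Gelman-Rubin statistic: compare between-chain and within-chain variability, and look for a ratio near 1. The Python below is an illustrative reimplementation, not the exact calculation WinBUGS performs (WinBUGS uses the interval-width version of Brooks and Gelman, 1998):

```python
import numpy as np

# Simplified Gelman-Rubin R-hat for m chains of n draws each.
# Values near 1 suggest the chains are sampling the same distribution.
def rhat(chains):                             # chains: array of shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)                 # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(3)
mixed = rng.normal(size=(2, 5000))            # two overlapping chains
stuck = mixed + np.array([[0.0], [5.0]])      # two chains far apart
print(round(rhat(mixed), 2), round(rhat(stuck), 2))   # ~1.0 vs much larger
```

The overlapping chains give an R-hat near 1, while chains stuck in different regions give a value far above 1 - the numerical counterpart of "all the chains appear to be overlapping one another".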

The following plots are examples of: (i) chains for which convergence (in the pragmatic sense) looks reasonable; and (ii) chains which have clearly not reached convergence. A second pair of plots shows the same contrast for multiple chains run in parallel.

[Trace plots of alpha0, chains 1:2, over iterations 101-600: in the well-behaved example the two chains overlap within roughly -1.5 to 0.5; in the poorly-behaved example they spread over roughly -2.5 to 10.0 and have clearly not mixed.]

Improving convergence
Possible solutions include:
a) better parameterisation to improve the orthogonality of the joint posterior;
b) standardisation of covariates to have mean 0 and standard deviation 1;
c) use of over-relaxation.

How many iterations after convergence?
Once you are happy that convergence has been achieved, you will need to run the simulation for a further number of iterations to obtain samples that can be used for posterior inference. The more samples you save, the more accurate your posterior estimates will be. One way to assess the accuracy of the posterior estimates is to calculate the Monte Carlo error for each parameter: an estimate of the difference between the mean of the sampled values (which we are using as our estimate of the posterior mean) and the true posterior mean. As a rule of thumb, the simulation should be run until the Monte Carlo error for each parameter of interest is less than about 5% of the sample standard deviation. The Monte Carlo error (MC error) and sample standard deviation (SD) are reported in the summary statistics table.
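The 5% rule of thumb can be checked by hand. WinBUGS reports MC error itself; the Python below is an illustrative batch-means estimate (one standard way of computing MC error for correlated draws) applied to a stand-in chain:

```python
import numpy as np

# Batch-means estimate of the Monte Carlo error of a chain's sample mean:
# split the chain into batches, and use the spread of the batch means.
def mc_error(chain, n_batches=50):
    batches = np.array_split(chain, n_batches)
    batch_means = np.array([b.mean() for b in batches])
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

rng = np.random.default_rng(4)
chain = rng.normal(size=100_000)     # stand-in for post-convergence samples
err, sd = mc_error(chain), chain.std(ddof=1)
print(err < 0.05 * sd)               # rule of thumb: MC error < 5% of SD
```

For this long, well-mixing chain the comparison prints True; a short or highly autocorrelated chain would fail the check, signalling that more iterations are needed.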

Some trap messages
a) 'undefined real result' indicates numerical overflow. Possible reasons include:
- initial values generated from a 'vague' prior distribution may be numerically extreme - specify appropriate initial values;
- numerically impossible values, such as the log of a non-positive number - check, for example, that no zero expectations have been given when Poisson modelling;
- numerical difficulties in sampling. Possible solutions include: better initial values; more informative priors (uniform priors might still be used, but with their range restricted to plausible values); better parameterisation to improve orthogonality; standardisation of covariates to have mean 0 and standard deviation 1;
- this trap can also occur if all initial values are equal. It can sometimes be escaped from by simply clicking on the update button.
b) 'index array out of range' - possible reasons include: attempting to assign values beyond the declared length of an array; or a logical expression that is too long to evaluate - break it down into smaller components.
c) 'stack overflow' can occur if there is a recursive definition of a logical node.
d) 'NIL dereference (read)' can occur at compilation in some circumstances when an inappropriate transformation is made, for example of an array into a scalar.
e) Trap messages referring to 'DFreeARS' indicate numerical problems with the derivative-free adaptive rejection algorithm used for log-concave distributions. One possibility is to change to 'Slice' sampling. The sampling methods are held in Updater/Rsrc/Methods.odc and can be edited: for example, if there are problems with WinBUGS' adaptive rejection sampler (DFreeARS), the method 'UpdaterDFreeARS' for 'log concave' could be replaced by 'UpdaterSlice'.

Model Comparison: A General Problem
'All models are wrong, but some are useful.' - Box
We can fit several models to the data, but how do we know which model is best?

DIC (Deviance Information Criterion)
The DIC Tool dialog box under the Inference menu is used to evaluate the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002) and related statistics, which can be used to assess model complexity and compare different models.
set: starts calculating DIC and related statistics. The user should ensure that convergence has been achieved before pressing set, as all subsequent iterations will be used in the calculation.
clear: clears the DIC calculation from memory, so that it may be restarted later.
Dbar: the posterior mean of the deviance, exactly the same as if the node 'deviance' had been monitored. The deviance is defined as -2 * log(p( y | theta )) and is a measure of goodness of fit.
Dhat: a point estimate of the deviance obtained by substituting in the posterior mean theta.bar of theta: Dhat = -2 * log(p( y | theta.bar )).
pD: 'the effective number of parameters', given by pD = Dbar - Dhat; a penalty for model complexity.
DIC: the 'Deviance Information Criterion', given by DIC = Dbar + pD = Dhat + 2 * pD. The model with the smallest DIC is estimated to be the model that would best predict a replicate dataset of the same structure as that currently observed.
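The Dbar/Dhat/pD/DIC definitions above can be computed by hand for a toy model. In this Python sketch the model is a normal with unknown mean and known variance 1, and the "posterior draws" of the mean are generated directly rather than by MCMC; the data and all numbers are illustrative only:

```python
import numpy as np

# Deviance -2 * log p(y | theta) for y_i ~ Normal(theta, 1).
def deviance(y, theta):
    return np.sum((y - theta)**2 + np.log(2 * np.pi))

rng = np.random.default_rng(5)
y = rng.normal(loc=1.0, size=50)                 # toy observed data
# Posterior of the mean under a flat prior: Normal(ybar, 1/n).
draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(len(y)), size=10_000)

Dbar = np.mean([deviance(y, t) for t in draws])  # posterior mean deviance
Dhat = deviance(y, draws.mean())                 # deviance at posterior mean
pD = Dbar - Dhat                                 # effective no. of parameters
DIC = Dbar + pD
print(round(pD, 1))                              # near 1: one free parameter
```

Because the model has a single free parameter (the mean), pD comes out close to 1, which is exactly what 'effective number of parameters' is meant to capture; comparing DIC across candidate models would then favour the smallest value.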

Thank You All
