ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.17 No.1 pp.120-127

Probabilistic Graphical Framework for Predicting Software Project Risk

Gilseung Ahn, Minsung Kwon, Changwook Kang, Sun Hur*
Department of Industrial and Management Engineering, Hanyang University, Ansan, Republic of Korea
* Corresponding author
Received December 14, 2016; revised August 22, 2017; accepted November 20, 2017


Project risk management is currently one of the main topics of interest for researchers and practitioners working in the area of project management. Risk management has been designated as one of the ten subject areas of the Project Management Body of Knowledge by the Project Management Institute. Since project risk management is closely associated with other project management areas, it is important to manage project risk in detail. In this paper, we suggest a method to predict software project risk by means of a probabilistic graphical model. Concretely, we identify the software development processes with reference to ISO/IEC 12207, an international standard for software lifecycle processes, and construct a probabilistic model to predict risks. The framework we suggest not only forecasts the risks, but also finds the critical factors needed to analyze project risk.



    The failure rate of software projects is very high compared with that of other projects, even though substantial investments have been made in new information systems (Altuwaijri and Khorsheed, 2012). Industry surveys show that only one-quarter of software projects are successful outright, and that billions of dollars are lost annually due to software project failures (Charette, 2005). Many factors lead to software project failure. For example, customers often change their requests, the specifications of software projects are more ambiguous than those of other projects, and budgets are insufficient (Lehtinen et al., 2014).

    Because of these difficulties, numerous studies of software project risk have been conducted. Sharma et al. (2011) explored the risk dimensions of software projects in India and identified software requirement specification variability, team composition, control processes, and dependability. Liu et al. (2011) found that unstable requirements lead to potential interpersonal conflict that negatively affects the performance of software projects. Fu et al. (2012) analyzed the impact of requirement changes on software project risk and established a probabilistic model using a design structure matrix to evaluate the risk of requirement changes.

    Quantitative methods can be used to reduce bias, and machine learning methods in particular have been applied to predict project risk. Hu et al. (2009) applied artificial neural networks (ANN) and support vector machines (SVM) to build an intelligent model to predict and control software project risk according to the synthetic outcomes of project quality, time, and cost. Neumann (2002) suggested a principal component analysis (PCA)-ANN technique to estimate software risk and improve the ability to identify high-risk software. Hu et al. (2015) constructed a software risk prediction model based on a classifier ensemble. The candidate classifiers composing the ensemble include decision tree, SVM, ANN, Bayesian network, and random forest; among them, SVM achieved the best accuracy, whereas the decision tree performed best in the cost analysis. Hu et al. (2013) employed a Bayesian network with causality constraints to build a software project risk prediction model, and this model predicted more accurately than other machine learning models such as logistic regression and decision tree. Verner et al. (2007) employed a logistic regression model to predict software project success in different cultural contexts, including the United States and Australia. Reyes et al. (2011) developed a genetic-algorithm-based model to predict the success probability of software projects.

    Since existing studies that adopt machine learning methods to predict project risk did not consider the processes, their findings may not generalize, because it remains unknown which processes are the actual causes of risk. In addition, project risk during previous processes can affect risk during current or subsequent processes, and few previous studies have dealt effectively with this dependency issue. In order to resolve these problems, we present a method to predict project risk at each stage of the process given the status of risk factors that are potential causes of risk. This is done by introducing a probabilistic graphical modeling method that has been widely applied to sequence labeling and has demonstrated good performance in various fields. Because predicting the degree of risk for each process can be regarded as a sequence labeling problem, we adopt this method to predict software development risk at each stage. The novelty of our research can therefore be summarized as follows: we develop a probabilistic graphical framework to predict software project risk that takes into account that (1) a project consists of several processes, and (2) a linear relationship exists between processes. Our model not only forecasts the risks, but also finds the critical factors needed to analyze project risk.

    The suggested framework in the paper is composed of five steps: (1) defining software project processes, (2) identifying software project risk factors as potential causes of risk, (3) modeling processes as a linear chain probabilistic graphical model by introducing two feature functions, (4) learning the model to estimate parameters, and (5) predicting the risk probability of each process and determining significant factors at each stage of the project. More precisely, we refer to ISO/IEC 12207, an international standard for software life cycle processes, to define the software project process in step (1). In step (2), we also identify project risk factors by referring to existing studies of software project risk. In step (3), feature functions representing the relationships between processes and corresponding risk factors that consider dependency among the processes are introduced. In step (4), we train and infer the model designed in step (3). Finally, methods used to calculate risk probabilities of processes when risk factors are realized and determine critical factors for project risk at each stage are described in step (5).

    The remainder of this paper is organized as follows. In section 2, we provide a framework to predict software project risk. In section 3, we present our model using an artificially constructed data set and compare our results with those of other models. Finally, section 4 concludes the paper.


    In this section, a framework to predict software project risk is provided. Each subsection describes one step of the framework in detail.

    2.1 Defining Software Project Risk Processes

    ISO/IEC 12207 is an international standard for software lifecycle processes and aims to define all tasks required to develop or maintain software. It divides a software project into 12 processes along with their specific tasks: process implementation; system requirements analysis; system architecture design; software requirements analysis; software architectural design; software coding and testing; software integration; software qualification testing; system integration; system qualification testing; software installation; and software acceptance support. In this paper, we formulate a mathematical model by adopting these processes and determine risk factors for each process. Regarding the relationship between process and factor risk, a probabilistic graphical model of linear chain type is constructed in order to predict risk for each process (Figure 1). This hypothetical software project includes 12 processes and several risk factors that are related to each process.

    2.2 Identifying Software Project Risk Factors

    We identify risk factors that could affect a software project based on the existing literature (Wallace and Keil, 2004; Schmidt et al., 2001; Addison and Vallabh, 2002; Addison, 2003; Han and Huang, 2007; Paré et al., 2008). After eliminating overlapping factors, a total of 61 factors are identified and regarded as potential causes of risk (Table 1). We examine all risk factors and allocate each factor to the process to which it is most closely related and which it potentially affects. For example, the risk factors “Users with negative attitudes toward the project,” “Change in organizational management during the project,” and “Late changes to requirement” are closely related to the implementation activity of the software project and are therefore assigned to process implementation. Note that a risk factor may be allocated to more than one process. For example, the factor “Lack of cooperation from users” is assigned to process implementation, software requirements analysis, and software design.

    Then, each risk factor is translated into a variable, the type of which is either binary or ordinal. Specifically, if the risk factor can be described as a binary variable, then its value is 1 in cases of adverse outcomes, and 0 otherwise. Likewise, if an ordinal variable is necessary to describe the risk factor, then a five-point measure is applied, ranging from 1 ( = “nothing happens”) to 5 ( = “very risky issue happens”). Table 1 shows a partial list of risk factors and their variable types.

    These factor variables are treated as random variables that take values with given probabilities. A detailed illustrative example is provided in Section 3.

    2.3 Probabilistic Graphical Model

    Let $x = (x_1, x_2, \ldots, x_{12})$ be the set of variable vectors, where each $x_i$, $i = 1, 2, \ldots, 12$, is the vector of risk factors at process $i$, and let $y = (y_1, y_2, \ldots, y_{12})$ be the vector of risk degrees of the processes, where $y_i \in \{H, M, L\}$, $i = 1, 2, \ldots, 12$, is the degree of risk of process $i$. Here, H, M, and L stand for “highly risky,” “medium risky,” and “low risky,” respectively. It should be noted that the $x_i$'s are not independent of each other because a risk factor may appear in more than one process, as described earlier.

    Furthermore, project risk for the previous process usually affects risk for the current process or subsequent processes and, therefore, the values of the $y_i$'s may also depend upon each other. Because of these relationships among input variables (risks of factors) and among response variables (risks of processes), project risk analysis often becomes intractable. For example, if one uses a statistical model with many variables to estimate risk, multicollinearity, a phenomenon in which two or more predictor variables are highly linearly related, might cause the estimation results to be unstable or unreliable. We resolve these problems by introducing two kinds of feature functions (state and transition feature functions) to build a probabilistic graphical model.

    Let $(\lambda, \mu) = (\lambda_1, \lambda_2, \ldots, \lambda_K, \mu_1, \mu_2, \ldots, \mu_L)$ be the parameter set to be learned from a given training data set of size $N$. Then the probability of the process risks of a software project, conditional on the risk factors, can be represented as a linear-chain conditional random field (CRF):

    $$p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{i=2}^{12} \sum_{j=1}^{K} \lambda_j \, g_j(y_{i-1}, y_i, x, i) + \sum_{i=1}^{12} \sum_{k=1}^{L} \mu_k \, f_k(y_i, x, i) \right\} \tag{1}$$

    where $g_j(\cdot)$ is the $j$th transition feature function of the risk factors and processes $i$ and $i-1$. This set of transition feature functions captures the dependency structure of risks between two adjacent processes $i-1$ and $i$. $f_k(\cdot)$ is the $k$th state feature function of process $i$ and the risk factors. Any relationship that may exist between the risks of factors and the corresponding processes can be reflected in the model by means of these state feature functions. Lastly, $Z(x)$ is a normalization function that ensures $\sum_{y} p(y \mid x) = 1$.
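    The linear-chain CRF probability defined above can be evaluated without enumerating all label sequences. The following is a minimal Python sketch, not the authors' implementation: the function names, the toy feature functions `same` and `agrees`, and the toy weights are illustrative assumptions, and the normalizer is computed with the standard forward recursion for linear-chain models.

```python
import math
from itertools import product

LABELS = ("H", "M", "L")  # risk degrees used in the paper


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def state_score(label, x, i, state_feats, mu):
    """Weighted sum of state features for one process."""
    return sum(w * f(label, x, i) for w, f in zip(mu, state_feats))


def trans_score(prev, cur, x, i, trans_feats, lam):
    """Weighted sum of transition features between two adjacent processes."""
    return sum(w * g(prev, cur, x, i) for w, g in zip(lam, trans_feats))


def log_score(y, x, trans_feats, state_feats, lam, mu):
    """Exponent of the CRF: state terms for every process, transition terms from the second process on."""
    total = sum(state_score(y[i], x, i, state_feats, mu) for i in range(len(y)))
    total += sum(
        trans_score(y[i - 1], y[i], x, i, trans_feats, lam) for i in range(1, len(y))
    )
    return total


def log_partition(x, n, trans_feats, state_feats, lam, mu):
    """log Z(x) via the forward recursion, avoiding enumeration of 3**n sequences."""
    alpha = {lbl: state_score(lbl, x, 0, state_feats, mu) for lbl in LABELS}
    for i in range(1, n):
        alpha = {
            cur: state_score(cur, x, i, state_feats, mu)
            + logsumexp(
                [alpha[p] + trans_score(p, cur, x, i, trans_feats, lam) for p in LABELS]
            )
            for cur in LABELS
        }
    return logsumexp(list(alpha.values()))


def probability(y, x, trans_feats, state_feats, lam, mu):
    """p(y | x) for one label sequence y."""
    return math.exp(
        log_score(y, x, trans_feats, state_feats, lam, mu)
        - log_partition(x, len(y), trans_feats, state_feats, lam, mu)
    )


# Toy usage (assumption): x[i] is a hypothetical classifier guess for process i.
same = lambda prev, cur, x, i: 1.0 if prev == cur else 0.0
agrees = lambda label, x, i: 1.0 if x[i] == label else 0.0
x_toy = ["H", "H", "M", "L"]
total = sum(
    probability(y, x_toy, [same], [agrees], [1.0], [2.0])
    for y in product(LABELS, repeat=len(x_toy))
)
print(total)  # sanity check: probabilities over all sequences should sum to about 1
```

    The forward recursion costs on the order of n × 3² operations for n processes, which is why the normalizer stays cheap even for the full chain of 12 processes.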

    CRF is a framework for building probabilistic models to segment and label sequential data (Lafferty et al., 2001). It is used to encode known relationships between observations and construct consistent interpretations and has been applied in fields such as text processing, computer vision, and bioinformatics. Detailed explanations of CRF and its applications can be found in Chen et al. (2015), Blunsom and Cohn (2006), McCallum (2003), and Sha and Pereira (2003) and references therein.

    Here, for simplicity, we consider one status feature function and one transition feature function (that is, K = L = 1). We utilize an artificial neural network (ANN) to capture the complex relationships between risk factors and process risk. ANN is a well-known machine learning method that mimics the properties of biological neural networks (Figure 2). The main advantages of this method are its nonlinearity, which allows a better fit to data, and its high parallelism. In particular, it can handle various types of data and obtain good results in complex areas, including project risk management. Han (2015) and López-Martín (2015) adopted ANN to predict risk in software projects. The ANN employed in this paper has one hidden layer with five hidden nodes, as seen in Figure 2.

    Then the status feature function $f(y_i, x, i)$ used in our CRF model is given as follows:

    $$f(y_i, x, i) = \begin{cases} 1, & \text{if } \mathrm{ANN}(x_i) = y_i, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

    where $\mathrm{ANN}(x_i)$ is the response value of our ANN model when the input is $x_i$ for process $i$.

    In addition, project risks for the processes within the same project do not fluctuate sharply, since the processes interact with each other (PMBOK® Guide 5th Edition, 2013, p. 48). Therefore, the following transition feature function is adopted in our CRF model:

    $$g(y_i, y_{i+1}, x, i) = \begin{cases} 1, & \text{if } y_i = y_{i+1}, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

    Note. More than one feature function can be introduced. For example, if a project manager wants to use a decision tree to represent the complex relationship between risk factors and process risk, then one can add another feature function such as:

    $$f_2(y_i, x, i) = \begin{cases} 1, & \text{if } \mathrm{DT}(x_i) = y_i, \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

    where DT(xi) is an output of the decision tree given xi. Likewise, an additional transition feature function that may focus more on the serial processes of high risk can be introduced as follows:

    $$g_2(y_i, y_{i+1}, x, i) = \begin{cases} 1, & \text{if } y_i = y_{i+1} = H, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$
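    The indicator-style feature functions of this subsection are simple to express in code. The sketch below is illustrative only: `make_state_feature`, `g_same`, `g_both_high`, and especially `toy_classifier` are hypothetical names, and `toy_classifier` is a made-up threshold rule standing in for a trained ANN or decision tree, not the model used in the paper.

```python
def make_state_feature(classifier):
    """Build f(y_i, x, i): 1 if the classifier's label for x[i] equals y_i, else 0."""
    def feature(y_i, x, i):
        return 1 if classifier(x[i]) == y_i else 0
    return feature


def g_same(y_i, y_next, x, i):
    """Transition feature: 1 if two adjacent processes share the same risk degree."""
    return 1 if y_i == y_next else 0


def g_both_high(y_i, y_next, x, i):
    """Optional transition feature: 1 only if both adjacent processes are high risk."""
    return 1 if y_i == y_next == "H" else 0


def toy_classifier(x_i):
    """Hypothetical stand-in for ANN(x_i) or DT(x_i): thresholds the share of adverse binary factors."""
    share = sum(x_i) / len(x_i)
    if share >= 0.6:
        return "H"
    if share >= 0.3:
        return "M"
    return "L"


f = make_state_feature(toy_classifier)
x = [[1, 1, 0], [0, 0, 0]]  # binary risk-factor vectors for two processes (toy data)
print(f("H", x, 0), f("H", x, 1), g_same("M", "M", x, 0), g_both_high("M", "M", x, 0))
# prints: 1 0 1 0
```

    Wrapping the classifier in a closure keeps the CRF code independent of how the underlying predictor was trained, so an ANN-based and a tree-based state feature can be used side by side.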

    2.4 Model Learning

    Learning the model (actually, estimating the parameters) involves finding the parameter set (λ*, μ*) that maximizes the log likelihood of the training data,

    $$L(\lambda, \mu) = \sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}\right) \tag{6}$$
    $$(\lambda^*, \mu^*) = \arg\max_{\lambda, \mu} L(\lambda, \mu) \tag{7}$$

    where $x^{(n)}$ and $y^{(n)}$ denote the vectors of input variables (either binary or ordinal) and response variables (degrees of process risk) of the $n$th sample. $L(\lambda, \mu)$ is a concave function, so every local optimum is a global optimum; therefore, methods such as matrix computation, dynamic programming, and gradient ascent can be used to find the global optimum. In this paper, we adopt the gradient ascent method (Roth and Yih, 2005), which starts from a randomly selected initial point and repeatedly takes steps proportional to the gradient of the function in Eq. (6) to find λ* and μ*.
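    For reference, the gradient used by such an ascent scheme has a standard form for linear-chain CRFs: the observed feature counts minus the feature counts expected under the current model. The step size $\eta$ and the iteration index $t$ below are notational assumptions introduced for illustration; the paper does not state them.

```latex
\frac{\partial L}{\partial \lambda_j}
  = \sum_{n=1}^{N} \sum_{i=2}^{12}
    \Big[ g_j\big(y^{(n)}_{i-1}, y^{(n)}_{i}, x^{(n)}, i\big)
        - \mathbb{E}_{y' \sim p(\cdot \mid x^{(n)})}\,
          g_j\big(y'_{i-1}, y'_{i}, x^{(n)}, i\big) \Big],
\qquad
\frac{\partial L}{\partial \mu_k}
  = \sum_{n=1}^{N} \sum_{i=1}^{12}
    \Big[ f_k\big(y^{(n)}_{i}, x^{(n)}, i\big)
        - \mathbb{E}_{y' \sim p(\cdot \mid x^{(n)})}\,
          f_k\big(y'_{i}, x^{(n)}, i\big) \Big],

\lambda_j^{(t+1)} = \lambda_j^{(t)} + \eta \, \frac{\partial L}{\partial \lambda_j},
\qquad
\mu_k^{(t+1)} = \mu_k^{(t)} + \eta \, \frac{\partial L}{\partial \mu_k}.
```

    At the optimum both gradients vanish, meaning the model's expected feature counts match those observed in the training data.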

    2.5 Inference and Determining Significant Factors at Each Process of the Project

    The inference task is to find the most likely sequence $y^*$ of degrees of process risk given the observed risk factor values $\tilde{x}$, as follows:

    $$y^* = \arg\max_{y} \, p(y \mid \tilde{x}) \tag{8}$$

    that is, if a set of values $\tilde{x}$ of all risk factors for a software project is given, then we can estimate the probabilities of the risk of each process, $p(y = (y_1, y_2, \ldots, y_{12}) \mid \tilde{x})$, from which the risk probability of the project may be derived.
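    One standard way to compute this arg max for a linear-chain model is the Viterbi algorithm. The sketch below is an assumption about how the maximization could be carried out, not a description of the authors' code. The weights 17.94 and 15.16 are the transition and state weights that appear in the paper's worked example; the input `x`, a list of hypothetical per-process ANN outputs, and all function names are illustrative.

```python
LABELS = ("H", "M", "L")
LAMBDA, MU = 17.94, 15.16  # transition and state weights from the paper's example


def state_score(label, x, i):
    """MU times the state feature: does the (hypothetical) ANN output x[i] equal the label?"""
    return MU if x[i] == label else 0.0


def trans_score(prev, cur, x, i):
    """LAMBDA times the transition feature: do adjacent processes share a risk degree?"""
    return LAMBDA if prev == cur else 0.0


def viterbi(x, labels, state_fn, trans_fn):
    """Return the label sequence with the highest total state-plus-transition score."""
    best = {lbl: state_fn(lbl, x, 0) for lbl in labels}
    backpointers = []
    for i in range(1, len(x)):
        new_best, pointers = {}, {}
        for cur in labels:
            # Best predecessor for label `cur` at position i.
            prev = max(labels, key=lambda p: best[p] + trans_fn(p, cur, x, i))
            new_best[cur] = best[prev] + trans_fn(prev, cur, x, i) + state_fn(cur, x, i)
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda lbl: best[lbl])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


ann_outputs = ["H", "H", "M", "H"]  # toy per-process classifier outputs
print(viterbi(ann_outputs, LABELS, state_score, trans_score))
# prints: ['H', 'H', 'H', 'H']
```

    In this toy run the isolated "M" is smoothed to "H" because three matching transitions (3 × 17.94) outweigh the single state feature that is given up (15.16), which reflects the modeling assumption that process risk does not fluctuate sharply.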

    Since there are numerous factors which affect project risk for each process, it is in general almost impossible to manage all risk factors. For efficient management, therefore, it would be better to determine significant factors that are critical for project risk. Olden and Jackson (2002) suggest a connection weight method that calculates the sum of products of raw weights of the connections from input node to hidden nodes and from hidden nodes to output nodes in an ANN. The larger the sum for a given input node is, the more important the corresponding input variable is. Relative importance, RI , of a given input can be defined as:

    $$\mathrm{RI} = \sum_{H=1}^{h} W_{IH} \times W_{HO} \tag{9}$$

    where $h$ is the total number of hidden nodes, $W_{IH}$ is the weight of the connection between input node $I$ and hidden node $H$, and $W_{HO}$ is the weight of the connection between hidden node $H$ and output node $O$. In this study, we determine significant factors based on the values of RI. A detailed explanation of the method is given in the next section with an illustrative example.
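    The connection weight calculation is a sum of products over hidden nodes and can be sketched in a few lines. In the example below the weight values are invented for illustration (three inputs, five hidden nodes, one output node) and `relative_importance` is a hypothetical helper name. With several output nodes, the same computation would be repeated per output node.

```python
def relative_importance(w_input_hidden, w_hidden_output):
    """Connection-weight RI per input: sum over hidden nodes of W_IH * W_HO.

    w_input_hidden: one row per input node, each holding h input-to-hidden weights.
    w_hidden_output: h hidden-to-output weights for a single output node.
    """
    return [
        sum(w_ih * w_ho for w_ih, w_ho in zip(row, w_hidden_output))
        for row in w_input_hidden
    ]


# Invented weights: 3 risk factors (inputs), 5 hidden nodes, 1 output node.
w_ih = [
    [0.5, -1.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
w_ho = [2.0, 1.0, -3.0, 0.5, 0.0]

ri = relative_importance(w_ih, w_ho)
ranking = sorted(range(len(ri)), key=lambda idx: ri[idx], reverse=True)
print(ri, ranking)  # larger RI means a more important input
```

    Because raw weights are used rather than absolute values, positive and negative contributions can cancel, so the sign of RI also carries information about the direction of an input's effect.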


    This section illustrates and tests the model on a hypothetical project case in order to demonstrate its applicability and describe its methodology. For our model to be applied to a real project, data should be collected at the process level, and the risk of each process and the related risk factors should be recorded. Such real data, however, are not available because this kind of model has never been utilized in real project risk management situations. Therefore, we developed an illustrative test based on hypothetical software projects, in which the data set is produced based on reasonable assumptions.

    We provide an overview of the artificially produced data set. As described in the subsections 2.1 and 2.2, risk factors are assigned to each process, where the variable representing each factor is either binary (0 or 1) or ordinal (5 point scale).

    A total of 300 software projects and their corresponding processes are artificially generated. For every project, one of the three risk degrees, H (high risk), M (medium risk), or L (low risk), is assigned to the first process with equal probability. The risks of the subsequent processes are then determined as follows: the risk degree of process i+1 is the same as that of process i with probability 0.7, and otherwise takes one of the other two degrees with equal probability. For example, if the risk of process i is M, then that of process i+1 is M with probability 0.7, and H or L with probability 0.15 each.
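    The label-generation rule described above is a simple first-order Markov chain and can be sketched as follows. This is a minimal illustration under the stated probabilities; the function names and the random seed are assumptions, and the risk-factor values, which depend on Table 2, are not generated here.

```python
import random

LABELS = ("H", "M", "L")


def generate_risk_sequence(rng, n_processes=12, stay_prob=0.7):
    """Draw one project's process risk degrees: uniform start, then 'sticky' transitions."""
    sequence = [rng.choice(LABELS)]
    for _ in range(n_processes - 1):
        previous = sequence[-1]
        if rng.random() < stay_prob:
            sequence.append(previous)  # same degree with probability 0.7
        else:
            others = [lbl for lbl in LABELS if lbl != previous]
            sequence.append(rng.choice(others))  # 0.15 for each other degree
    return sequence


def generate_projects(n_projects=300, seed=0):
    """Generate risk-degree sequences for all hypothetical projects."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return [generate_risk_sequence(rng) for _ in range(n_projects)]


projects = generate_projects()
stays = sum(a == b for seq in projects for a, b in zip(seq, seq[1:]))
print(len(projects), stays / (300 * 11))  # share of unchanged transitions, expected near 0.7
```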

    Next, we assign probabilities to the random variables of the risk factors according to variable type, as shown in Table 2. For example, if the risk degree of the process System Requirement Analysis is H, then an adverse issue related to a binary risk factor occurs (value = 1) with probability 60% and does not occur (value = 0) with probability 40%.

    With the dataset produced according to these rules, we construct a probabilistic graphical model. We also build ANN, decision tree, naïve Bayes, and logistic regression models so that the proposed model can be compared with representative classification models in terms of accuracy. Accuracies are calculated by 5-fold cross-validation. The results are summarized in Table 3.

    Table 3 shows that the accuracies of our model are the highest among all five models. For example, our model could predict degrees of risk of the process System Requirement Analysis with an accuracy of 97.33%, while the other four machine learning methods show accuracies of 80.67~86.33% for our 300 software project case. On average, our model shows 92.81% accuracy, compared to 78.00~87.56% for other models. This may be a natural consequence because our model considers not only relationships between inputs and outputs using status feature functions, but also relationships among outputs by means of transition feature functions.

    Now, we illustrate the inference process using an example. Let $x_1^{(72)}$ and $x_2^{(72)}$ be the vectors of risk factors of the first process (process implementation) and the second process (System Requirement Analysis) of the arbitrarily chosen 72nd software project, respectively. The output of the ANN given $x_1^{(72)}$ is H, and therefore, the risk degree of the first process is H. Because $Z(x)$ is independent of $y$, it can be ignored, and the following three values, one for each candidate risk degree $r$ of the second process, are computed and compared:

    $$\operatorname*{argmax}_{r = H, M, L} \exp\left\{ 17.94 \times g\left(H, r, x_2^{(72)}, 1\right) + 15.16 \times f\left(r, x_2^{(72)}\right) \right\} = \operatorname{argmax}\left( \exp(33.1),\ 1,\ 1 \right) = H \tag{10}$$

    Therefore, the most plausible risk degree of the second process is H. This inference procedure is repeated for all 12 processes.
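    The arithmetic of this example can be checked directly. The short sketch below assumes, as the value exp(33.1) for r = H implies, that the ANN output for the second process is also H; the variable names are illustrative.

```python
import math

lam, mu = 17.94, 15.16      # learned weights quoted in the example
previous_degree = "H"       # risk degree of the first process
ann_second = "H"            # assumption implied by the exp(33.1) term

scores = {}
for r in ("H", "M", "L"):
    g = 1 if r == previous_degree else 0   # transition feature
    f = 1 if r == ann_second else 0        # status feature
    scores[r] = math.exp(lam * g + mu * f)

best = max(scores, key=scores.get)
print(best, math.log(scores["H"]), scores["M"], scores["L"])
```

    For r = H both features fire, giving an exponent of 17.94 + 15.16 = 33.1, while for r = M or r = L neither fires and the value is exp(0) = 1, so H is selected.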

    Additionally, we calculate the relative importance RI of risk factors of each process to identify critical factors for software project risk using Eq. (9). Table 4 lists the results for our dataset.


    Project risk management is a major topic of interest in the field of project management. Many standards of project management adopt and apply procedural approaches. Project management is an integrative undertaking that requires each project process to be appropriately aligned and connected with other processes to facilitate coordination.

    Since actions taken during one process typically affect that process and other related processes, we suggest a software project risk prediction model based on the probabilistic graphical model, which is designed to predict risks for each process. We applied an ANN to construct a feature defining the state of the process, expressing relationships between project risk factors and process risk. We also defined a transition feature function expressing relationships among process risks, under the assumption that project risks do not fluctuate sharply.

    We artificially generated a dataset to validate our model while considering risk probability. As a first step, we compared the accuracy of our model with that of four other well-known machine learning models and found that our model outperforms all of them. In addition, the factors of each process that most affect project risk can be identified by their relative importance, which allows overall risk management to be carried out effectively and efficiently.

    As real data suitable for testing our model are not currently available, an illustrative test problem was developed based on hypothetical software projects. In the future, our model can be applied to real data when they are available by following the steps below:

    (1) For all completed software projects, separate the software project into processes according to ISO/IEC 12207; (2) identify risk factors for each process; (3) assign a risk value to each risk factor by referring to Table 1; (4) make a data structure for each process using the values assigned in (3); (5) construct an ANN model to predict process risk level using the data for each process made in (4); (6) construct a CRF model referring to section 2.3 by using the ANN model as feature f(yi, x, i) referring to Eq. (2); (7) assign a score to each risk factor for a software project and using this information, predict risk for each process; and (8) determine significant risk factors for each process by referring to section 2.5.

    The managerial implications of the framework can be summarized as follows. First, risk management of a software project should be carried out at the process level. Through the literature review and the model application, we found that the previous process usually affects risk for the current process or subsequent processes. In addition, our model, which is built on this fact, shows better prediction performance than other machine learning methods that do not consider the relationship between two consecutive processes. Second, the degree of risk at each process should be measured in order to develop and employ the model in the real world. Finally, the model should be developed and updated with recent project data to identify the critical factors at each process.


    This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (2017R1A2B4006643).



    Figure 1. Structure of the software project.

    Figure 2. Diagram of the artificial neural network.

    Table 1. A partial list of project processes and corresponding factors with variable types

    Table 2. Probability according to project risk of each process

    Table 3. Accuracy of the suggested model and other models

    Table 4. Critical factors at each process based on relative importance


    1. T. Addison (2003) E-Commerce project development risks: Evidence from a delphi survey., Int. J. Inf. Manage., Vol.23 (1) ; pp.25-40
    2. T. Addison , S. Vallabh (2002) Controlling software project risks: An empirical study of methods used by experienced project managers, Proceedings of the South African Institute of Computer Scientists and Information Technologists on Enablement Through Technology, ; pp.128-140
    3. M.M. Altuwaijri , M.S. Khorsheed (2012) InnoDiff: A project-based model for successful it innovation diffusion., Int. J. Proj. Manag., Vol.30 (1) ; pp.37-47
    4. P. Blunsom , T. Cohn (2006) Discriminative word alignment with conditional random fields., International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, ; pp.65-72
    5. R.N. Charette (2005) Why software fails?, IEEE Spectr., Vol.42 (9) ; pp.42-49
    6. L. Chen , Z. Mu , B. Nan (2015) Semantic image segmentation based on hierarchical conditional random field model., J. Comput. Inf. Syst., Vol.11 (2) ; pp.527-534
    7. Y. Fu , M. Li , F. Chen (2012) Impact propagation and risk assessment of requirement changes for software development projects based on design structure matrix., Int. J. Proj. Manag., Vol.30 (3) ; pp.363-373
    8. W.M. Han (2015) Discriminating risky software project using neural networks., Comput. Stand. Interfaces, Vol.40 ; pp.15-22
    9. W.M. Han , S.J. Huang (2007) An empirical analysis of risk components and performance on software projects., J. Syst. Softw., Vol.80 (1) ; pp.42-50
    10. Y. Hu , B. Feng , X. Mo , X. Zhang , E.W.T. Ngai , M. Fan , M. Liu (2015) Cost-sensitive and ensemble-based prediction model for outsourced software project risk prediction., Decis. Support Syst., Vol.72 ; pp.11-23
    11. Y. Hu , X. Zhang , E.W.T. Ngai , R. Cai , M. Liu (2013) Software project risk analysis using Bayesian networks with causality constraints., Decis. Support Syst., Vol.56 ; pp.439-449
    12. Y. Hu , X. Zhang , X. Sun , M. Liu , G. Du (2009) An intelligent model for software project risk prediction, International Conference on Information Management Innovation Management and Industrial Engineering, ; pp.629-632
    13. J. Lafferty , A. McCallum , F. Pereira (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proceedings of the 18th International Conference on Machine Learning, ; pp.282-289
    14. T.O. Lehtinen , M.V. Mäntylä , J. Vanhanen , J. Itkonen , C. Lassenius (2014) Perceived causes of software project failures: An analysis of their relationships., Inf. Softw. Technol., Vol.56 (6) ; pp.623-643
    15. J.Y.C. Liu , H.G. Chen , C.C. Chen , T.S. Sheu (2011) Relationships among interpersonal conflict, requirements uncertainty, and software project performance., Int. J. Proj. Manag., Vol.29 (5) ; pp.547-556
    16. C. López-Martín (2015) Predictive accuracy comparison between neural networks and statistical regression for development effort of software projects., Appl. Soft Comput., Vol.27 ; pp.434-449
    17. A. McCallum (2003) Efficiently inducing features of conditional random fields, Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence, ; pp.403-410
    18. D.E. Neumann (2002) An enhanced neural network technique for software risk analysis., IEEE Trans. Softw. Eng., Vol.28 (9) ; pp.904-912
    19. J.D. Olden , D.A. Jackson (2002) Illuminating the “black box”: A randomization approach for understanding variable contributions in artificial neural networks., Ecol. Modell., Vol.154 (1-2) ; pp.135-150
    20. G. Paré , C. Sicotte , M. Jaana , D. Girouard (2008) Prioritizing clinical information system project risk factors: A delphi study, Proceedings of the 41st Hawaii Conference on System Science, ; pp.1-10
    21. Project Management Institute (2013) A Guide to the Project Management Body of Knowledge., Project Management Institute,
    22. F. Reyes , N. Cerpa , A. Candia-Véjar , M. Bardeen (2011) The optimization of success probability for software projects using genetic algorithms., J. Syst. Softw., Vol.84 (5) ; pp.775-785
    23. D. Roth , W.T. Yih (2005) Integer linear programming inference for conditional random fields, Proceedings of the 22nd International Conference on Machine Learning, ; pp.736-743
    24. R. Schmidt , M. Lyytinen , M. Keil , P. Cule (2001) Identifying software project risks: An international delphi study., J. Manage. Inf. Syst., Vol.17 (4) ; pp.5-36
    25. F. Sha , F. Pereira (2003) Shallow parsing with conditional random fields, Conference on Human Language Technology and North American Association for Computational Linguistics, ; pp.134-141
    26. A. Sharma , S. Sengupta , A. Gupta (2011) Exploring risk dimensions in the Indian software industry., Proj. Manage. J., Vol.42 (5) ; pp.78-91
    27. J.M. Verner , W.M. Evanco , N. Cerpa (2007) State of the practice: An exploratory analysis of schedule estimation and software project success prediction., Inf. Softw. Technol., Vol.49 (2) ; pp.181-193
    28. L. Wallace , M. Keil (2004) Software project risk and their effect on outcomes., Commun. ACM, Vol.47 (4) ; pp.68-73