ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.20 No.1 pp.69-81

Exploring Predictive Factors of Academic Probation Using Data Mining Approach

Hunhee Lee, Namhyoung Kim*
Department of Applied Statistics, Gachon University, Seongnam, Republic of Korea
*Corresponding Author, E-mail:
Received December 9, 2020; Revised February 24, 2021; Accepted March 8, 2021


The purpose of this study is to develop an early prediction model for academic probation to encourage the retention of students at universities. For this study, various data from the administration system and learning environment of G University in South Korea were collected. We constructed the predictive model by applying logistic regression to the collected data, using new variables related to campus activities. To solve the class-imbalance problem, we applied data mining techniques. This study is significant in that the model is based on structured real data from the academic administrative system we can access, analysed with an educational data mining approach. Predictive factors of academic probation were revealed, and educational implications of the developed predictive model were discussed.



    The goal of higher education is to deepen the knowledge that students acquired in high school and to maximize their aptitude and potential. It also functions as a source of good-quality human resources and as a gateway to the labour market. The belief that desires for social betterment, such as securing a good job, can be realized through higher education has increased participation in higher education, which has become a worldwide trend (Marginson, 2016). Korea is one of the countries that have experienced a rapid increase in higher education participation (OECD, 2016).

    However, not all students enrolled at a university earn a degree. The dropout rate is a main concern for universities seeking to maintain student numbers. Unlike compulsory education such as primary and secondary school, higher education operates on the principle of supply and demand. Universities are as keen to maintain the number of students as they are to select good students. The dropout rate at Korea’s top 19 universities increased from 1.17% in 2010 to 2.099% in 2015, nearly doubling. Some of these dropouts are due to low academic achievement. More than 10% of the students who dropped out of prestigious universities in Korea in 2015 did so because of academic probation.

    Academic probation could be viewed as a form of dropping out. It refers to a warning issued to students who have not achieved adequate credits in a given semester. Students under probation face penalties such as temporary suspension or restriction from participation in courses; sometimes they are suspended from university after accumulating a certain number of incidences of probation. Reducing the number of students on academic probation is an important task for universities, both to maintain the number of students and to empower students to succeed in their academic life.

    Many studies have been conducted on dropouts or academic probation, and most of them have investigated the characteristics of students who drop out or are placed on academic probation by conducting surveys or interviews. Recently, the growing quantity of information called ‘Big Data’ enables us to detect patterns in large collections of daily data about students without a purposeful survey or interview. Educational Data Mining (EDM) is gaining attention in that it is highly exploratory, whereas other analyses are typically problem-driven or confirmatory (Berson et al., 2000).

    The purpose of this study was to develop a predictive model of academic probation. Predicting academic probation is challenging for any university because class-imbalanced data are common. To address this challenge, we used data mining techniques. For this study, we constructed an optimal classification model to perform early detection of academic probation based on logistic regression; we used data from the academic administration and learning management system of G University, in the metropolitan area of South Korea. The study sought to predict whether it would be necessary to warn current-semester students, using log data recorded in the first two months of the semester, the previous semester’s grade data, and basic information. This study is significant in that the model is based on structured real data from the academic administrative system rather than unstructured information such as interviews and questionnaires. This means that it is possible to construct a highly accurate model suitable for many other countries by using structured real data. The following research questions were used to guide our discussion:

    • 1. What are the academic probation factors and their impact?

    • 2. What is the effectiveness of the predictive model of academic probation?

    • 3. What is the educational implication of the predictive model of academic probation developed in this study?


    2.1 Previous Research Trend on Academic Probation

    Previous research dealing with academic probation can be divided into four main categories: exploring causal factors of academic probation, experiences of overcoming academic probation, developing programs for students on academic probation, and developing predictive models of academic probation.

    Most studies on the cause of academic probation tried to identify predictive factors by investigating the characteristics of students on academic probation. These studies examined cognitive, affective, behavioural, and environmental characteristics. Students on academic probation are reported to be likely to lack basic learning ability or the prior knowledge prerequisite for learning new concepts (Casey et al., 2015; Tinnesz et al., 2006). They are likely to have low self-esteem (Ju et al., 2012), and they also manifest passive stress coping strategies (Lee et al., 2017). Besides the personal factors listed above, environmental characteristics such as low socio-economic status were revealed (Mavis and Doig, 1998; Sage, 2010). We could also identify differences between the results of Korean and international studies on the cause of academic probation. International studies have demonstrated that students received academic probation for primary reasons such as academic ability, race, socio-economic status, psychological problems, and unsatisfied expectations of college, whereas studies in Korea have revealed primary reasons such as lack of motivation to learn, maladaptation to the new environment, and lack of self-control and time management skills (Kim et al., 2014).

    There is also research addressing the experiences of students who overcame academic probation. These studies revealed how students used internal and external resources to overcome academic probation (Tedeschi and Calhoun, 1995; Calhoun and Tedeschi, 2014). For example, Jang and Yang (2013) emphasized the importance of self-confidence and resilience in overcoming academic probation, and Kwon and Song (2016) and Ju et al. (2012) reported that social support from friends, family, and professors was significant.

    In addition to research on the causes of and recovery from academic probation, studies have developed programs for students on academic probation and investigated their effectiveness. These programs can be categorized by whether they are compulsory, whether they are delivered in groups, and by their content (Sage, 2010). Considering the characteristics of students on academic probation, a compulsory program is reported to be more likely to be effective than a voluntary one (Sage, 2010). However, the programs in previous Korean studies were mostly non-compulsory. They consist of consultation programs with a professor or consultant, or coaching programs addressing time management and learning strategies, and were delivered in an individual format, in small groups, or in both formats.

    Last, there is research addressing predictive models of academic probation. Although these studies are important for preventing academic probation and assisting students at risk, research on building a predictive model has received scant attention. Recently, this type of research has been growing fast in Korea. Kwon (2016) reported predictive factors using decision tree analysis, one of the data mining approaches. It was revealed that the experience of leaving school increased the probability of academic probation and that counselling with professors decreased it. However, this study, like previous studies that sought to identify factors, is limited in that it analysed data from a survey designed for the study; such data are not raw data that a university can easily acquire or access without conducting a direct survey.

    2.2 Educational Data Mining in Higher Education

    There are two major fields that use big data in education: ‘educational data mining’ and ‘learning analytics’. It is difficult to make a clear distinction between them. However, educational data mining is developing research areas distinct from learning analytics. The report titled ‘Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics’, published by the U.S. Department of Education in 2012, defined educational data mining as a method of establishing a new model or an algorithm with a new pattern from data, whereas learning analytics was defined as a broader field that applies existing predictive models within the educational system. Learning analytics can use the results of educational data mining and offer instruction or prescriptions for the management of academic performance (Avella et al., 2016). In this study, we use the term ‘educational data mining’ rather than ‘learning analytics’ to focus on the predictive model that we developed.

    The educational data mining approach has been used in higher education for purposes such as modelling students’ behaviour, improving feedback and assessment services, predicting performance, and predicting dropout or retention (Papamitsiou and Economides, 2014). Among these themes, prediction of performance and of dropout or retention are key issues. For example, Abdous et al. (2012) used educational data mining to understand students’ online learning behaviour and their performance. Lykourentzou et al. (2009) applied data mining techniques to detailed student information from the Learning Management System (LMS) to predict students’ dropout at an early stage.


    For this study, we developed a predictive model using a real-world dataset. First, we collected and refined the data. Through preliminary analysis, we selected meaningful variables; we generated the final data to be used in the model through pre-processing. Then, we selected the optimal model by applying a classification technique to the learning data. Finally, we applied the final model to the test data to examine its performance. The details of each step follow.

    In the first step, we collected raw data by referring to existing studies. In this case, given that undergraduate information was divided among various departments within the university, close collaboration was required. We integrated data from various places based on the students’ ID values. Then we undertook the detection of missing values and outlier processing. We defined the independent and dependent variables by selecting meaningful variables through preliminary analysis. Subsequently, we analysed the distributional characteristics and correlations of the variables and changed the variables to fit the analysis.

    In the second step, we applied various classification techniques to the refined data. We used the logistic regression, support vector machine (SVM) model and artificial neural network (ANN) model to compare prediction performances. A brief description of each model follows below.

    In logistic regression, the following function represents the relationship between the dependent variable (Y) and the independent variable (X).

    $$\operatorname{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \alpha + \beta x$$

    where π(x) is E(Y) and α, β are model parameters. The structural form of the model describes the patterns of association and interaction between the response and independent variables. The sizes and signs of the model parameters determine the strength, importance, and direction of the effects. In this study, the response variable Y is binary (1 = academic probation, 0 = otherwise). Thus, π(x) is the probability of receiving an academic probation.
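
    As a minimal sketch of how the logit relation above turns a linear score into a probability, the following Python snippet inverts the link function. The function names and the parameter values for α and β are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def probation_probability(x, alpha, beta):
    """Invert the logit link: pi(x) = 1 / (1 + exp(-(alpha + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

# Hypothetical parameters, for illustration only (not from the paper).
alpha, beta = -3.5, 0.8

for x in (0.0, 1.0, 2.0):
    p = probation_probability(x, alpha, beta)
    # logit(p) recovers the linear predictor alpha + beta * x.
    print(x, round(p, 4), round(logit(p), 4))
```

    With a positive β, the predicted probability rises as x increases, which is the direction-of-effect reading described above.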

    SVM is a supervised learning model for pattern recognition and data analysis and is mainly used for classification and regression analysis. It classifies data based on the maximum margin hyperplane, which maximizes the distance between groups of data belonging to each class. A subset of the training data, called the support vectors, represents the decision boundary. The distance from this maximum margin hyperplane to the data can be expressed using probability. Given that this method is rooted in statistical learning theory, it offers promising results in many practical applications, from handwritten digit recognition to text classification. It also works well with high-dimensional data by avoiding the curse of dimensionality.

    ANN is an analysis model inspired by the human brain. It resembles the brain in structure in that it consists of an interconnected set of neuron-mimicking nodes and directional links. The nodes constitute an input layer, a hidden layer, and an output layer. The values given in the input layer are transmitted to the hidden layer and then to the output layer. Through this process, a classification model is generated, and the corresponding value is output as a probability. ANN has excellent prediction performance, and because the input values are combined in the hidden layer, it can solve nonlinear problems. However, it is difficult to understand the model intuitively and to modify it by hand. We applied both of these models to the same data to compare their predictive performance with that of the logistic regression model.

    In the process of creating the model, we trained a learning algorithm using training data. A predictive model may show high performance on the training data yet fail to generalize to new data; this is called the over-fitting problem. To avoid over-fitting and improve generalization performance, we designated some of the training data as a validation set.

    Furthermore, the problem of predicting academic probation is highly imbalanced: the proportion of students under academic probation is very small. In a dataset with this class-imbalance problem, the data belonging to the majority class greatly outnumber the data of the minority class; a large amount of majority-class data invades the minority-class region, which adversely affects the performance of the classification algorithm. To address this problem, we used sampling methods in this study (Chawla et al., 2004; Xie and Qiu, 2007; Liu et al., 2009).

    Random sampling extracts data for random addition to or removal from the training set. It is divided into oversampling, undersampling, and a hybrid of both (Kim et al., 2007; Thompson, 2012). Oversampling involves extracting data from the minority class until it matches the amount of majority-class data according to a predetermined rule. This method is advantageous in that it allows the use of all the data, but it is disadvantageous in that increasing the amount of data raises the time complexity. Chawla et al. (2004) addressed the imbalance by oversampling with artificially generated data around the minority class using the kNN (k Nearest Neighbor) technique. However, if there are outliers or noise among the data belonging to the minority class, oversampling may cause over-fitting because it replicates some of the noisy data (Chawla et al., 2004; Liu et al., 2009).

    Undersampling involves extracting data from the majority class until it matches the amount of minority-class data. The advantage of this technique is that it reduces the time complexity, while its disadvantages are that the sample may not represent the distribution and characteristics of the majority class and that it may generate a model that is not optimized. Because much of the majority-class data is discarded, useful majority-class data may not be selected (Liu et al., 2009).

    A hybrid of both forms of sampling seeks to achieve a uniform class distribution by combining undersampling of the majority class with oversampling of the minority class. In this study, we used all three methods of random sampling to determine which performed best.
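
    The three random sampling schemes described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `random_resample` is hypothetical, the minority class is enlarged by plain replication with replacement rather than by synthetic generation, and each scheme balances the classes to equal size, which is an assumption rather than the ratio used in this study.

```python
import random

def random_resample(majority, minority, method, seed=0):
    """Return (majority, minority) lists balanced by simple random sampling.

    Assumes len(majority) >= len(minority) > 0.
    """
    rng = random.Random(seed)
    if method == "over":
        # Replicate minority rows (with replacement) up to the majority size.
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        return list(majority), list(minority) + extra
    if method == "under":
        # Keep a random subset of the majority equal in size to the minority.
        return rng.sample(majority, len(minority)), list(minority)
    if method == "hybrid":
        # Meet in the middle: shrink the majority and grow the minority.
        target = (len(majority) + len(minority)) // 2
        new_majority = rng.sample(majority, target)
        extra = [rng.choice(minority) for _ in range(target - len(minority))]
        return new_majority, list(minority) + extra
    raise ValueError(f"unknown method: {method}")

# Toy data: 97 majority rows and 3 minority rows (about 3% minority).
majority = list(range(97))
minority = [100, 101, 102]
for method in ("over", "under", "hybrid"):
    maj, mino = random_resample(majority, minority, method)
    print(method, len(maj), len(mino))
```

    The printed sizes make the trade-off visible: oversampling keeps every original row but enlarges the dataset, whereas undersampling discards most of the majority class.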

    In the class-imbalance problem, the majority class usually drives overall predictive accuracy at the expense of the crucial minority class (which has a very poor performance). In our sample, about 3% of the students were on academic probation. Thus, we used another evaluation metric: the hit rate. The hit rate is a popular measure for numerically evaluating the predictive power of models for the marketing field (Rosset et al., 2001;Kim et al., 2012). The hit rate is calculated as follows:

    $$\text{Hit rate}\,(H) = \text{POD} = \frac{\text{Number of events correctly predicted}}{\text{Total number of events observed}}$$

    The hit rate (H) is also called the probability of detection (POD). The hit rate is only sensitive to missed events and not to false alarms. To be specific, the hit rate represents the percentage of correctly predicted observations regarding the academic probation candidates.

    A hit rate at a target point of x% is the hit rate when only the top x% of the total observed data, ranked by estimated probability of occurrence, is considered for evaluation. Considering hit rates with target points is important in marketing because managers have to focus solely on the top percentage of customers owing to limited budget and time. High hit rates are equally important in providing educational services: because budget and time are limited, the higher the hit rate, the more efficiently the service can be provided.
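
    The hit rate at a target point can be computed directly from predicted probabilities and observed outcomes. The sketch below assumes a hypothetical helper `hit_rate_at` and toy scores and labels; it rounds the size of the top group down, which is an assumption about how the cut-off is taken.

```python
def hit_rate_at(scores, labels, target_pct):
    """Share of all observed events found in the top target_pct% by score.

    scores: estimated probabilities of the event.
    labels: 1 if the event was observed, 0 otherwise.
    """
    n_top = int(len(scores) * target_pct / 100)
    # Rank observations from the highest to the lowest estimated probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return hits / sum(labels)

# Toy example: 10 students, 3 of whom received academic probation.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

for pct in (20, 30, 100):
    print(pct, round(hit_rate_at(scores, labels, pct), 3))
```

    In this toy ranking the top 30% contains two of the three events, so the hit rate at that target point is 2/3, while a model that ranked students at random would be expected to capture only about 30% of them.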

    In the third step, we applied the optimal model that we determined through the comparison of classification performances in step 2 to the test data (students in the first semester of 2015). Moreover, we used the logistic regression results to identify the characteristics of students under academic probation and to predict academic probation.


    The empirical study employed real-world data from a Korean university. We collected and analyzed student data from G University, located in the Seoul metropolitan area, for the prediction of academic probation. G University comprises two campuses with a total population of about 20,000 students. We created the model using data for the second semester of 2014 and assessed its performance using data for the first semester of 2015. Table 1 features information regarding the data used.

    Each university has different standards or conditions regarding the definition of academic probation (Gaudioso and Talavera, 2006). Hence, we defined the academic probation standard of our subject, G University, as follows:

    • 1. Students whose average grade was below 1.5 each semester;

    • 2. Students who got a grade of F in more than 3 subjects each semester;

    • 3. Students who got academic probation 3 or 4 consecutive times should be expelled from the university.

    We selected the variables of the raw data based on previous research results. Among them, we used quantitative indicators and excluded qualitative indicators, such as motivation, time management, and self-efficacy, that would have needed to be surveyed.

    It is possible to obtain further information regarding students from leading universities worldwide because they have well-established student management systems. However, in Korea, it is not easy to collect various variables due to problems such as unstructured student information systems and personal information protection. The students’ data were scattered among various departments, so it was only possible to gather raw data through the close cooperation of those departments. We integrated the data from various sources by student ID values. It was possible to classify the collected variables into four types, as Table 2 shows.

    Among the raw data, we collected information on participation in school life for the first two months of the semester and the remaining data at the beginning of the semester. This coincided with the purpose of learning analysis, which aimed to prevent academic probation by providing students with programs such as counselling in advance and predicting academic probation in the middle of the semester.

    We selected actual data for the empirical analysis as follows. First, as a result of searching the collected raw data, we confirmed most variable values of ‘credit exchange students’ to be missing values. Therefore, we removed all ‘credit exchange student’ data. In addition, we removed variables with many missing values and integrated variables with overlapping or embedded information. G University recently underwent university integration and academic reorganization, so there were cases where the names of affiliated colleges and departments had changed. We also revised this data. Moreover, we created new variables for processing existing variables. The changes were as follows.

    Next, we modified the variables that had too many categories (college, high school, admission type) because they could have degraded the accuracy of the model prediction. The high school variable had over 1,600 categories; therefore, we converted it into a binary variable indicating the existence of academic probation.

    There were about 70 categories of admission type; we combined those with total populations of fewer than 50 into other categories. As a result, we reduced the admission type variable to 30 categories. G University has 14 colleges. We converted the two categorical variables into percentage variables as follows.

    $$\frac{N(AP=1)+1}{N(AP=0)+N(AP=1)+1} \times 100$$

    where N is a function that counts numbers, and AP (academic probation) is a binary variable (1 if under academic probation, 0 otherwise). To prevent the value of the variable from becoming zero for a category with no academic probation, we added 1 to both the denominator and the numerator. Then, we performed min-max normalization to change the range of both variables to [0, 1].
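
    The percentage encoding and the subsequent min-max normalization can be illustrated with a short Python sketch. The function names and the per-category counts below are hypothetical; only the formula itself comes from the text above.

```python
def probation_percentage(n_probation, n_no_probation):
    """(N(AP=1) + 1) / (N(AP=0) + N(AP=1) + 1) * 100 for one category."""
    return (n_probation + 1) / (n_no_probation + n_probation + 1) * 100

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical counts per category: (students on probation, students not on probation).
counts = {"category A": (0, 99), "category B": (4, 95), "category C": (9, 90)}

percentages = [probation_percentage(ap, no_ap) for ap, no_ap in counts.values()]
scaled = min_max(percentages)

for name, pct, s in zip(counts, percentages, scaled):
    print(name, round(pct, 2), round(s, 3))
```

    Category A has no student on probation, yet the added 1 keeps its percentage above zero before normalization.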

    5. RESULT

    In this section, we explain how the predictive model was developed using the training dataset and present the results of evaluating the model on out-of-sample data. The educational discussion is presented in the following section.

    5.1 Developing a Predictive Model of Academic Probation

    As we discussed in the previous section, only 2.6% of all students received academic probation in the training dataset, causing a class-imbalance problem. This classification problem occurs when the total number of a class of data is far less than the total number of another class of data. It is common in practice and is observable in various areas, including fraud detection, anomaly detection, and churn prediction.

    We used sampling methods to address it. We adjusted the class distribution of the training dataset using three random sampling methods: oversampling, undersampling, and a hybrid of both. Among them, oversampling showed the best performance. It created a sample of synthetic data by enlarging the feature space of the minority and majority class examples. Operationally, we drew the new examples from a conditional kernel density estimate of the two classes as described in Menardi and Torelli (2014). It produced a synthetic sample of data simulated according to a smoothed-bootstrap approach (Lunardon et al., 2014).

    Oversampling replicated the minority examples while preserving the original data objects. Excessive oversampling can cause over-fitting problems; therefore, we adjusted the ratio of the class to 10:1 to generate a predictive model.

    In addition, the training dataset and validation dataset were divided in the ratio of 7:3. We used the validation dataset to adjust the classification parameters. While we employed the training set to train the candidate algorithms, we employed the validation set to compare their performances and decide which one to take.
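
    A 7:3 split of this kind can be sketched as follows. The helper name `split_train_validation`, the fixed random seed, and the use of a simple shuffle (rather than a split stratified by class) are assumptions made for illustration; the source does not state how the split was drawn.

```python
import random

def split_train_validation(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy example with 100 row identifiers.
train_rows, validation_rows = split_train_validation(range(100))
print(len(train_rows), len(validation_rows))
```

    Keeping the validation rows out of model fitting is what allows them to serve as a check against over-fitting when candidate models are compared.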

    After data pre-processing, about 30 variables remained; therefore, we used stepwise variable selection algorithms to select significant variables. Logistic regression can select or delete variables from a model in a stepwise manner using certain criteria. There are three selection approaches, namely, forward selection, backward elimination, and stepwise selection. The methods with the smallest AIC (Akaike information criterion) values were backward elimination and stepwise selection, and the two AIC values were equal to 4,839. The same 20 variables were selected. Thus, we used these 20 variables to generate classifiers.
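
    Backward elimination by AIC can be outlined generically: starting from all variables, repeatedly drop the variable whose removal lowers the AIC the most, and stop when no removal helps. In the sketch below, `backward_elimination` and `toy_aic` are hypothetical names, and the toy scoring function stands in for refitting a logistic regression on each candidate subset; the variable names and gain values are invented for illustration.

```python
def backward_elimination(variables, aic_of):
    """Greedy backward elimination: drop variables while the AIC decreases.

    aic_of: callable mapping a list of variable names to an AIC value.
    """
    current = list(variables)
    best_aic = aic_of(current)
    while len(current) > 1:
        # AIC of every model obtained by removing exactly one variable.
        candidates = [(aic_of([v for v in current if v != drop]), drop) for drop in current]
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best_aic:
            break  # no single removal improves the criterion
        best_aic = candidate_aic
        current.remove(drop)
    return current, best_aic

# Toy stand-in for model fitting: AIC = 2k - 2 * log-likelihood, where each
# variable contributes a fixed, made-up gain to the log-likelihood.
gains = {"gpa": 5.0, "library": 3.0, "noise1": 0.2, "noise2": 0.5}

def toy_aic(subset):
    log_likelihood = -50.0 + sum(gains[v] for v in subset)
    return 2 * len(subset) - 2 * log_likelihood

selected, aic = backward_elimination(list(gains), toy_aic)
print(selected, aic)
```

    A variable survives only if it adds more than one unit of log-likelihood, because each extra parameter costs 2 in the AIC penalty.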

    Table 6 shows the distribution of academic probation across the categorical variables among the 20 variables selected, for the 16,036 students in the second semester of 2014; this is the training data. The ‘gender’ of the students in the second semester of 2014 was distributed as follows: 50.2% were male and 49.9% were female. Among the male students, 4.1% were under academic probation, compared with 1.1% of the female students. With respect to the ‘transferred’ variable, 93.7% of the 16,036 students enrolled in the second semester of the 2014 academic year had entered as new students, of whom 2.7% received academic probation. Transfer students from other universities accounted for 6.3% of the total number of enrolled students, of whom 1.7% received academic probation. To determine the difference between the two categories of ‘transferred,’ we computed the sample relative risk of receiving academic probation, which was 1.59; that is, the risk was about 59% higher for new students than for transfer students.
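
    The relative risk quoted for the ‘transferred’ variable can be checked from the two probation rates given in the text. The helper name below is hypothetical, and because the inputs are the rounded percentages reported above, the result is only approximate.

```python
def relative_risk(risk_group_a, risk_group_b):
    """Ratio of the probability of the event in group A to that in group B."""
    return risk_group_a / risk_group_b

# Rounded probation rates reported for the 'transferred' variable.
new_student_rate = 0.027       # 2.7% of new students
transfer_student_rate = 0.017  # 1.7% of transfer students

rr = relative_risk(new_student_rate, transfer_student_rate)
print(round(rr, 2))  # 1.59
```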

    With regard to the ‘second entrance exam’ variable, students who took the university entrance exam once comprised 53.6% of the total, those who took it twice comprised 24.6%, and others comprised 21.8%. Among them, the ratio of academic probation was highest at 4% for students who took the entrance exam twice. The relative risk was 1.43. A total of 5.1% of all students reported having changed their departments; of them, 1.1% of students were under academic probation.

    Table 7 shows the mean and standard deviation of the continuous variables among the 20 selected variables for the 16,036 students in the second semester of the 2014 academic year. In the case of ‘using reading room’ and ‘E-class connection,’ which were variables related to school activities, the mean values for the students under academic probation were lower than those of the others. Therefore, it was presumed that the academic needs of students under academic probation were low.

    The average of the ‘cumulative academic probation’ variable for students under academic probation was higher than that for students not under academic probation. From this fact, we noted that many of the students who received academic probation in the 2014 academic year had a history of academic probation and were unable to overcome it.

    We then fitted the logistic regression models using all these variables to predict whether a student was under academic probation. We let the response variable Y = 1 if a student was under academic probation and Y = 0 otherwise. Table 8 shows the results. All predictors except ‘second entrance exam,’ ‘high school type,’ and ‘general leave (disease),’ were significant at α=0.05. A positive β value means that as the value of the explanatory variable increases, the probability of receiving academic probation increases. More specifically, the odds of academic probation are multiplied by Exp(β) for every 1-unit increase in an explanatory variable, at fixed levels of the other variables.

    The results show that male students were more likely to receive academic probation than female students. The odds for male students are 2.439 times those for female students. Students who took the entrance exam more than two times are less likely to receive academic probation than other students. Transfer students were also more likely to receive academic probation. The results indicate that the college and admission type affected the probability of receiving academic probation.
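
    The reading of Exp(β) as an odds multiplier follows directly from the logit form of the model. The short derivation below is a sketch for a single predictor; the symbol β_male and the coding of gender as 1 for male and 0 for female are assumptions introduced here for illustration.

```latex
\frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)}
  = \frac{e^{\alpha + \beta (x+1)}}{e^{\alpha + \beta x}}
  = e^{\beta},
\qquad
e^{\beta_{\text{male}}} = 2.439
\;\Longrightarrow\;
\beta_{\text{male}} = \ln 2.439 \approx 0.892 .
```

    An odds ratio of 2.439 therefore corresponds to a regression coefficient of roughly 0.89 on the logit scale.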

    Students who were withdrawn or expelled for any reason were more likely to receive academic probation. The predictors related to campus participation activity, however, had a negative regression coefficient. To be specific, the more active one was in school, the less likely he or she was to get academic probation. This phenomenon is more apparent when standardized coefficients (Std. β) are obtained. Each Std. β represents the effect of a standard deviation change in a predictor, controlling for the other variables.

    To evaluate the performance of the classifier, we used the hit rate as a metric. Table 9 shows the hit rates of the developed logistic regression model on the test dataset. The logistic regression model performed 5.5 times better in terms of hit rates than the random model at the target point of 10%. However, its performance advantage over the random model decreased as the target point increased (to about twice at the target point of 40%) because academic probation was less likely to be considered for model evaluation.

    Figure 1 graphically presents the hit rates of the proposed model on the test dataset. The performance increases sharply, but at the target point of 40%, the curve becomes gentle.

    5.2 Effectiveness of Developed Predictive Model of Academic Probation

    We applied the model we constructed from the training dataset to the student data for the first semester of 2015 and evaluated the out-of-sample performance. The results are shown in Table 10. They are similar to the in-sample results.

    Of the total of 17,669 enrolled students in the first semester of 2015, 56.7% of the students under academic probation were among the top 10% (1,766 students) who were likely to receive academic probation. The top 20% of enrolled students (3,533 people) included 66.61% of the students under academic probation. The top 30% of enrolled students (5,300 people) included 85% of the students under academic probation, and the top 40% of enrolled students (7,067 people) included 89.11% of the students under academic probation.
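
    The head counts and the gain over random selection implied by these out-of-sample figures can be reproduced with a short calculation. The sketch assumes that a random ranking is expected to capture x% of the probation cases in the top x% of students, so the lift is the hit rate divided by the target point; the function name `summarize` is hypothetical.

```python
def summarize(n_students, hit_rates):
    """Map each target point (%) to (students in the top group, lift over random)."""
    summary = {}
    for target_pct, hit_pct in hit_rates.items():
        n_top = n_students * target_pct // 100  # rounded down, as in the text
        lift = round(hit_pct / target_pct, 2)   # a random ranking expects hit_pct == target_pct
        summary[target_pct] = (n_top, lift)
    return summary

# Out-of-sample hit rates (%) reported for the first semester of 2015.
reported = {10: 56.7, 20: 66.61, 30: 85.0, 40: 89.11}

for target_pct, (n_top, lift) in summarize(17669, reported).items():
    print(target_pct, n_top, lift)
```

    The lift falls from about 5.7 at the 10% target point to about 2.2 at 40%, which is the same pattern of diminishing advantage described for the in-sample results.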

    Figure 2 shows that the proportion of accurately predicted academic probation increased sharply from the top 0% to the top 30% and that the rate of increase of the predicted academic probation slowed down from the top 30%. Therefore, it seems that it would be effective to establish a support program for academic probation for the top 30% of the students whom we predicted would be under academic probation.

    We applied the representative classifiers SVM and ANN to the same data to compare the predictive performance of the logistic regression model. We undertook the application of SVM and ANN to the training data with 30 independent variables before the application of the optimal variable selection algorithm. We used the Gaussian kernel with a hyperparameter of 0.051 to construct the SVM model. To construct the ANN model, we determined the optimal number of hidden nodes to be 9 through several experiments on the test dataset.
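
    For reference, a Gaussian (RBF) kernel measures the similarity of two feature vectors as a decaying function of their squared distance. The sketch below uses the common parameterization K(x, z) = exp(-γ‖x − z‖²); treating the reported value 0.051 as γ is an assumption, since the source does not say which kernel parameter it refers to, and the example vectors are made up.

```python
import math

def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

gamma = 0.051  # assumed to play the role of gamma; not confirmed by the source

print(gaussian_kernel([1.0, 2.0], [1.0, 2.0], gamma))            # identical points give 1.0
print(round(gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma), 4))  # similarity decays with distance
```

    A small γ makes the kernel decay slowly, so that even fairly distant students are treated as similar and the resulting decision boundary is smoother.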

    We used the two models constructed with SVM and ANN to estimate the probability of academic probation for each student in the test data. We then compared their hit rates with those of the logistic regression model, as shown in Table 11 and Figure 3.

    The logistic regression model outperformed the other models at all target points. SVM and ANN showed similar performance. Comparing the two, the SVM model was superior at the target point of the top 10%, whereas the ANN model performed better as the target point increased.

    The difference in performance between the logistic model and the other models was not large. However, we expected the logistic model to be more helpful in designing the program for students because, unlike the other two models, its explanatory variables can be interpreted directly. Although the effect of the explanatory variables in the SVM or ANN model can be examined through sensitivity analysis, these models are referred to as black boxes because they are derived from complex mathematical processes that are difficult to understand and interpret.


    The purpose of this study was to predict which students would receive academic probation by applying an educational data mining approach to structured real data on students from a university in the metropolitan area of South Korea. Based on the results of the predictive model that we developed, we discuss two educational implications.

    6.1 Predictive Factors of Academic Probation

    Previous studies reported that students placed on academic probation have cognitive characteristics such as low academic motivation and a lack of knowledge or information about academic courses or careers, as well as affective characteristics such as low self-efficacy. They also showed behavioural characteristics such as difficulty handling crises and environmental characteristics such as low socio-economic status.

    However, these previous studies do not give universities an indicator for identifying students with a high probability of being put on academic probation, because they relied on student interviews or on survey items written by the researchers. For example, they report that students on academic probation are dissatisfied with their major and that many students on academic probation have a low socio-economic status. Factors such as low satisfaction with a major and socio-economic status cannot be determined easily without conducting a direct survey. How can we identify students with these characteristics, and what indicators can serve as predictive factors of academic probation so that we can assist students on the brink of it?

    From real data, we identified a number of predictive indicators. Students had a higher probability of being under academic probation if they were male, were members of a college that often placed students on academic probation, had completed fewer semesters, had enrolled after transferring from another university, had an admission type that characterized many students under probation, were enrolled but had paused their studies, had come from another major department, had re-entered, had been absent from the university for longer periods, borrowed fewer books, rarely used the study room, hardly logged into the E-class, did not actively participate in academic programs, took fewer classes and credits, or had more accumulated probation.

    This study is significant in that we used real data about students that universities can easily access. For example, expulsion (unregistered) is one of the strong indicators. Although we cannot determine the precise reasons why a student did not register, this indicator may be a phenomenological indicator reflecting ‘low economic status’, ‘low satisfaction with the major’, and so on.

    6.2 Educational Application of Predictive Model of Academic Probation

    A predictive model of academic probation provides practical information about ‘what type of students’ and ‘how many students’ should be considered as participants in a predictive program.

    Previous studies of student programs focused almost exclusively on students already on academic probation and reported positive results. For such programs the participants are obviously the probationary students, yet whether the programs should be mandatory remains a significant issue. Kim et al. (2014) and Sage (2010) reported that programs are more effective when they are mandatory, as probationary students are less likely to seek assistance voluntarily.

    Considering that probationary students are less likely to seek assistance and that mandatory programs are more effective, ‘how many students’ should be asked to participate in the predictive program is a significant issue. The hit rates presented in Figure 3 suggest that it would be effective to operate the predictive program for the top 30% of students likely to be placed on academic probation. We evaluated the model using the hit rate as a metric: the proposed model performed approximately 5.5 times better than the random model. At the cutoff value of 0.3, the specificity of the test was 80.53% and its sensitivity was 76.96%. Considering that the data included all students with varying tendencies, the estimated rate was reasonably high. Any predicted probability carries some error. However, the hit rate can save time, resources, and money because it represents the proportion of students with a high probability of being placed on academic probation among all students at each target point.
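    Sensitivity and specificity at a given cutoff can be computed as in the following sketch. It is illustrative only: the function name and toy data are hypothetical, and classifying a student as at risk when the predicted probability is greater than or equal to the cutoff (rather than strictly greater) is an assumption, since the text does not state the rule.

```python
def sensitivity_specificity(probabilities, labels, cutoff=0.3):
    """Return (sensitivity, specificity) for binary labels at a probability cutoff."""
    tp = fp = tn = fn = 0
    for probability, label in zip(probabilities, labels):
        predicted = 1 if probability >= cutoff else 0  # assumed decision rule
        if predicted == 1 and label == 1:
            tp += 1  # probation case correctly flagged
        elif predicted == 1 and label == 0:
            fp += 1  # non-probation student flagged
        elif predicted == 0 and label == 0:
            tn += 1  # non-probation student correctly not flagged
        else:
            fn += 1  # probation case missed
    sensitivity = tp / (tp + fn)  # share of probation cases that were flagged
    specificity = tn / (tn + fp)  # share of other students left unflagged
    return sensitivity, specificity


# Toy data (hypothetical): label 1 means the student was placed on probation.
toy_probabilities = [0.9, 0.45, 0.35, 0.2, 0.31, 0.1, 0.05, 0.6]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1]

if __name__ == "__main__":
    print(sensitivity_specificity(toy_probabilities, toy_labels, cutoff=0.3))
```

    Lowering the cutoff flags more students, which raises sensitivity at the cost of specificity; the reported cutoff of 0.3 reflects one such trade-off.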

    Hunhee Lee obtained his master’s degree in Applied Statistics from Gachon University, South Korea. He holds a bachelor’s degree in Applied Statistics from Gachon University. During his degree, he conducted research on applying a data mining approach to predict academic probation and dropout using real data. His research interests are in data mining and statistical learning.

    Namhyoung Kim is an associate professor in the Department of Applied Statistics at Gachon University. She holds a bachelor’s degree in Industrial and Management Engineering from POSTECH, South Korea. In 2013, she completed a Ph.D. degree at POSTECH. Her Ph.D. dissertation was on volatility models. After her Ph.D., she worked at Seoul National University as a postdoctoral researcher. Her research interests include financial engineering, data mining, and business intelligence.


    This research was supported by the National Research Foundation (NRF) of Korea (Ministry of Science and ICT, NRF-2018R1D1A1B07047487).



    Hit rates (training dataset).


    Hit rates (test dataset).


    Performance comparison.


    Data Set

    Variables of raw dataset.

    Variable transformation

    Over-sampling results (ratio 10:1)

    Training data and validation data (ratio 7:3)

    Distribution of academic probation of selected categorical variables

    Distribution of selected continuous variables

    Logistic regression results

    In-sample performance (Hit rate)

    Out-of-sample performance (Hit rate)

    Comparison with other classifiers (Hit rates)


    1. Abdous, M. H., He, W., and Yen, C. J. (2012), Using data mining for predicting relationships between online question theme and final grade, Educational Technology & Society, 15(3), 77-88.
    2. Avella, J. T., Kebritchi, M., Nunn, S. G., and Kanai, T. (2016), Learning analytics methods, benefits, and challenges in higher education: A systematic literature review, Online Learning, 20(2), 13-29.
    3. Berson, A., Smith, S., and Thearling, K. (2000), Building Data Mining Applications for CRM, McGraw-Hill, New York, 4-14.
    4. Calhoun, L. G. and Tedeschi, R. G. (2014), The foundations of posttraumatic growth: An expanded framework, In Handbook of Posttraumatic Growth, Abingdon: Routledge, 17-37.
    5. Casey, M., Cline, J., Ost, B., and Qureshi, J. (2015), Academic Probation, Student Performance and Strategic Behavior, In Annual Meeting for the Association for Education Finance and Policy, Washington, DC.
    6. Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004), Editorial: Special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter, 6(1), 1-6.
    7. Gaudioso, E. and Talavera, L. (2006), Data mining to support tutoring in virtual learning communities: Experiences and challenges, Data Mining in E-Learning (Advances in Management Information), 4, 207-225.
    8. Jang, A. and Yang, J. (2013), A qualitative study on the experiences of student being on and overcoming academic probation, Korean Journal of Counseling, 14(2), 995-1013.
    9. Ju, Y. A., Kim, Y. H., and Won, S. K. (2012), An exploration study of the factor for understanding academic achievement failure and academic persistence on academic probation: Focus group interviews among female university students, Journal of Adolescent Welfare, 14(4), 47-69.
    10. Kim, M. S., Yang, H. J., Kim, S. H., and Cheah, W. P. (2007), Improved focused sampling for class imbalance problem, The KIPS Transactions: Part B, 14(4), 287-294.
    11. Kim, N. M., Kim, H. W., and Park, W. S. (2014), Effects of a resilience improvement program applying a peer-mentoring system on college students on academic probation, The Journal of Yeolin Education, 22(1), 391-412.
    12. Kim, N., Jung, K. H., Kim, Y. S., and Lee, J. (2012), Uniformly subsampled ensemble (USE) for churn management: Theory and implementation, Expert Systems with Applications, 39(15), 11839-11845.
    13. Kwon, H. S. (2016), An analysis to identify factors associated with academic probation: A data mining approach, Journal of Human Understanding and Counseling, 37(2), 29-46.
    14. Kwon, H. S. and Song, S. J. (2016), The concept map of man and female college students’ perceived overcoming strategies on academic probation, Journal of the Korea Institute of Youth Facility and Environment, 14(3), 63-72, Available from:
    15. Lee, Y., Yang, H. Y., and Cho, S. (2017), Issues on the educational interventions for the college students on academic probation, The Korean Journal of Educational Methodology Studies, 29(1), 161-184.
    16. Liu, X. Y., Wu, J., and Zhou, Z. H. (2009), Exploratory undersampling for class-imbalanced learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
    17. Lunardon, N., Menardi, G., and Torelli, N. (2014), ROSE: A package for binary imbalanced learning, R Journal, 6(1), 79-89.
    18. Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., and Loumos, V. (2009), Dropout prediction in e-learning courses through the combination of machine learning techniques, Computers & Education, 53(3), 950-965.
    19. Marginson, S. (2016), The worldwide trend to high participation higher education: Dynamics of social stratification in inclusive systems, Higher Education, 72(4), 413-434.
    20. Mavis, B. and Doig, K. (1998), The value of noncognitive factors in predicting students’ first-year academic probation, Academic Medicine: Journal of the Association of American Medical Colleges, 73(2), 201-203.
    21. Menardi, G. and Torelli, N. (2014), Training and assessing classification rules with imbalanced data, Data Mining and Knowledge Discovery, 28(1), 92-122.
    22. OECD (2016), The economic consequences of Brexit: A taxing decision, OECD Economic Policy Paper, 16.
    23. Papamitsiou, Z. K. and Economides, A. A. (2014), Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence, Journal of Educational Technology & Society, 17(4), 49-64.
    24. Rosset, S., Neumann, E., Eick, U., Vanik, N., and Idan, I. (2001), Evaluation of prediction models for marketing campaigns, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, San Francisco, California, 456-461.
    25. Sage, T. L. (2010), Academic probation: How students navigate and make sense of their experiences, Ph.D. diss., University of Wisconsin.
    26. Tedeschi, R. G. and Calhoun, L. G. (1995), Trauma and Transformation, Sage Publications, California.
    27. Thompson, S. K. (2012), Sampling, John Wiley & Sons, Inc., Hoboken, New Jersey.
    28. Tinnesz, C. G., Ahuna, K. H., and Kiener, M. (2006), Toward college success: Internalizing active and dynamic strategies, College Teaching, 54(4), 302-306.
    29. Xie, J. and Qiu, Z. (2007), The effect of imbalanced data sets on LDA: A theoretical and empirical analysis, Pattern Recognition, 40(2), 557-562.