A Thematic Review on Predicting Student Performance Using Data Mining

Student performance, a key criterion for academic placement and success, can be effectively predicted and potentially improved using advanced data mining techniques. However, more studies are needed that delve into patterns and methods for forecasting student performance in this context. This study presents a comprehensive thematic review of the literature published between 2016 and 2022 to highlight prevailing trends and underline the significant gaps in this rapidly growing research domain. The review shows that the highest number of related publications appeared in 2021, with a clear focus on higher educational institutions. Two major factors were recurrent in these studies: academic records and demographic information of the students, widely recognized as significant determinants of academic success. This observation reaffirms previous research asserting the substantial role of academic records in predicting student performance. Our analysis expands the scope of prior literature reviews by incorporating studies from the Mendeley and Science Direct databases. Notably, we found a clear research gap in predicting performance in Malaysian secondary schools using locally relevant datasets. Further exploration in this niche could therefore benefit this geographical context. Furthermore, our review identified the need to investigate other variables that may impact student performance. There is considerable potential for novel insights that could enhance both understanding and predictive accuracy. Consequently, this study emphasizes the importance of continuing research in this field, integrating various datasets, geographical contexts, and potential predictors to better anticipate and address student performance across various educational scenarios.


Introduction
Data mining has evolved as a critical method for analyzing massive amounts of information and generating valuable insights. The education industry recently began using data mining approaches to forecast students' academic achievement (Shahiri et al., 2015). With data mining in education, educators may now make data-driven choices that can enhance student learning outcomes (Mokhtar et al., 2019). Predicting students' academic success by data mining is a burgeoning field of study that has piqued the interest of educators, academics, and policymakers alike. This thematic review aims to investigate trends in forecasting student performance using data mining. The review will examine the literature on data mining methods to predict student performance, the data sources employed in this research, and the performance criteria used to assess the prediction models.
Despite the extensive use of data mining methods in various disciplines, there is a need for more review articles in Malaysia that explicitly concentrate on data mining for predicting students' academic success. This research gap is problematic since forecasting student success is critical for educational institutions to become more data-driven in their decision-making processes. Furthermore, although data mining has been used in numerous studies to predict student performance, there is still a need for a comprehensive examination of the approaches used. Thus, this paper aims to fill this gap in the literature by conducting a systematic review of studies published from 2016 to 2022 that discuss the trends in predicting students' performance using data mining. The review will address the following research question: What are the current trends in predicting students' performance using data mining discussed in publications from 2016 to 2022? This review aims to shed light on the state-of-the-art techniques used in predicting students' academic performance, identify potential gaps in the methodology, and suggest future research directions. This review will benefit educators, policymakers, and researchers seeking to utilize data-driven approaches to improve students' learning outcomes.
Vol 13, Issue 11, (2023) E-ISSN: 2222-6990

Materials and Methods
This study's literature review methodology followed Zairul's (2020) thematic analysis process; hence ATLAS.ti 9 was used to conduct the thematic review. Thematic analysis is used to construct patterns and themes through thorough reading on the subject (Clarke & Braun, 2013). The next stage in reviewing the use of data mining to predict student performance is to look for patterns and establish classifications that clarify prevailing tendencies. Several criteria were used in selecting the literature. Articles were considered if they met the following criteria: 1) published between 2016 and 2022; 2) include at least one topic-related term; and 3) center on utilizing data mining to predict student performance.

Science Direct search string: "predict$ student performance" AND "data mining" (110 results)
Mendeley and Science Direct were used to conduct the literature search. The initial search found 71 items in Mendeley and 110 in Science Direct, 181 in total (Table 1). However, 128 publications were excluded because they did not discuss students' performance using data mining, reported premature or anecdotal results, duplicated other articles, had broken links, or were incomplete or unavailable in their entirety. As a result, 53 articles remained for the final evaluation (see Figure 1). The articles were examined based on the publication year and the discussion style.

Results and Discussions
The study's findings are presented in two forms: quantitative and qualitative. The quantitative data provides a descriptive account of the results, emphasizing numerical values. In contrast, the qualitative data focuses on identifying and describing themes that emerge from the analysis of each article. The qualitative data provides an in-depth understanding of the data, highlighting essential insights that may not be captured through quantitative analysis alone.
Additionally, this section will provide a summary of the key findings, drawing attention to noteworthy highlights that can be used to guide future research. Overall, this comprehensive presentation of the data in both forms will provide readers with a robust and nuanced understanding of the study's results.

Quantitative Data
Figure 2 shows the total number of publications within the given period. The generally upward trend of the line graph indicates that research in this field is growing. The largest number of articles, fourteen, was published in 2021. Ten articles were published in each of 2018 and 2019, so the annual count held steady across those two years. In 2020 there was a modest decline, with just seven articles published. In 2016 and 2017, four and six papers were published, respectively. As of August 2022, just two articles had been published. Figures 3 and 4 show the countries in which the datasets examined in the articles were produced. India ranks first with eight articles, followed by the United States of America with four. Greece, Nigeria, and Spain are tied for third position with three articles each. Several countries, including Brazil, Canada, Chile, Egypt, Jordan, Mexico, Pakistan, Portugal, Saudi Arabia, and Turkey, have two articles each. The remaining countries have fewer than two articles.
Only one study has been published for each of the following countries: Australia, Bosnia, China, Colombia, Ethiopia, Malaysia, Morocco, New Zealand, Slovakia, Slovenia, Taiwan, and Vietnam.

Figure 4. The country of datasets in the article
Three main aspects are discussed: the study context in relation to previous literature reviews, the factors that affect students' performance, and the data mining algorithms used to predict students' performance. Figure 5 shows that the majority of the research, a total of forty-six articles, is centered on higher education institutions. A further five articles focused on secondary schools, while one article focused on elementary or primary education.
The results indicate that most studies involving student performance prediction utilized datasets from higher education institutions. In their comprehensive literature analysis, Roslan and Chen (2022) revealed that of the fifty-eight publications analyzed, forty-nine included studies in higher education institutions, eight in secondary schools, and one in elementary schools. According to our findings, forty-six studies covered datasets from higher education institutions, five from secondary schools, and one from elementary schools. In addition, the systematic literature review conducted by Roslan and Chen (2022) revealed that no study involving secondary schools in Malaysia utilized datasets from Malaysia. The same applies to

Factors Affecting Students' Performance
Figure 6 is a compilation of the factors that influence students' overall performance. The student's academic record is the factor most often used to forecast performance, accounting for 49% of all the factors compiled. Student demographics are the second most frequently reported factor.
Together, these two factors account for 81% of all the factors gathered, which implies that demographics account for 32%. A limited number of researchers focused on other factors, such as student environments (investigated in only 3% of the studies), course attributes (analyzed in 5% of the studies), and other elements (11% of the studies). Table 3 breaks down the papers by author and by the factors affecting students' performance. The results also indicate that researchers have examined miscellaneous aspects such as student environments, course attributes, and other elements far less frequently. This may be due to the difficulty of measuring and quantifying these aspects or to limited attention to these factors in previous research.
Overall, the findings reported in this study highlight the importance of considering both student academic records and demographics in predicting student success. However, they also suggest the need for more research into the impact of other factors, such as student environments and course attributes, on student performance.

Data Mining Algorithms
The results presented in this study reveal the most commonly used data mining algorithms in the literature, as depicted in Figure 7. The algorithms were categorized into five types: trees, functions, Bayes, rules, and lazy algorithms.
The findings suggest that the most commonly used data mining algorithms were trees and functions, each accounting for 36% of the studies and together making up 72% of all the algorithms employed. Decision trees are a powerful tool for classification and prediction tasks, and they are prevalent in data mining for several reasons. First, decision trees are easy to interpret and visualize, which makes them helpful in communicating insights to non-technical stakeholders. Second, decision trees can handle numerical and categorical data, making them versatile. Third, decision trees can be used for classification and regression tasks, making them applicable to various problems (Hastie et al., 2009; Han et al., 2011; Witten et al., 2016). Functions, in turn, are popular in data mining because they allow analysts to model complex relationships between variables and make predictions about new data. Functions are mathematical representations that describe how one variable (the dependent variable) is related to one or more other variables (the independent variables) (Hastie et al., 2009; Han et al., 2011; Witten et al., 2016).
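To make the interpretability point concrete, the following minimal sketch fits a one-level decision tree (a decision stump) by minimizing weighted Gini impurity. The data, the single feature (prior GPA), and all function names are hypothetical illustrations; they are not taken from any of the reviewed studies.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # Splitting at the maximum value would leave the right branch empty, so skip it.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: prior GPA and final outcome.
gpa = [1.8, 2.1, 2.4, 2.9, 3.2, 3.6]
outcome = ["fail", "fail", "fail", "pass", "pass", "pass"]

threshold, impurity = best_threshold(gpa, outcome)
left_label = majority([y for x, y in zip(gpa, outcome) if x <= threshold])
right_label = majority([y for x, y in zip(gpa, outcome) if x > threshold])

def predict(x):
    """Apply the learned one-level rule to a new GPA value."""
    return left_label if x <= threshold else right_label

print(f"if GPA <= {threshold}: {left_label}, else: {right_label}")
```

The learned model is a single readable rule, which is the property that makes trees easy to explain to non-technical stakeholders. Full tree learners apply the same split search recursively over many features.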
The next most commonly used category was Bayes algorithms, employed in 14% of the studies. One of the main advantages of Bayesian methods is that they can handle complex models with many parameters. Unlike traditional statistical methods that often assume that the data follow a specific distribution, Bayesian methods allow analysts to specify a prior distribution for each parameter and then update these priors based on the observed data. This flexibility makes Bayesian methods useful for a wide range of applications in data mining, such as clustering, classification, and regression (Gelman et al., 2014; Koller et al., 2009; Murphy et al., 2012).
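The prior-to-posterior update described above can be sketched with the simplest conjugate case: a Beta prior on a pass probability updated with pass/fail observations. The observations, the uniform prior, and the function names are assumptions made for illustration only.

```python
from fractions import Fraction

def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus Bernoulli observations
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact with Fraction."""
    return Fraction(alpha, alpha + beta)

# Hypothetical data: 1 = student passed, 0 = student failed.
observations = [1, 1, 0, 1, 1, 1, 0, 1]
passes = sum(observations)
fails = len(observations) - passes

prior = (1, 1)  # uniform prior over the pass probability (an assumption)
posterior = update_beta(*prior, passes, fails)

print("prior mean:", beta_mean(*prior))
print("posterior mean:", beta_mean(*posterior))
```

With six passes and two fails, the estimate moves from the prior mean of 1/2 to a posterior mean of 7/10. The Bayes classifiers used in the reviewed studies are not necessarily built this way; the sketch only shows the update step itself.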
Lazy algorithms were used in 10% of the studies. Lazy learning methods, also known as instance-based methods, are not as popular in data mining as eager learning methods because of their computational inefficiency and lack of generalization. Lazy learning methods do not build a model based on the entire training data but instead store the training data and make predictions based on the instances most similar to the new data. This approach can result in accurate predictions for small datasets but can be computationally expensive for large datasets because it requires comparing the new data to every instance in the training data (Aha et al., 1991; Hastie et al., 2009).
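A k-nearest-neighbours classifier is a common example of a lazy learner. The sketch below is a minimal version under stated assumptions: the two features (attendance rate and prior GPA), the toy records, and the function name are hypothetical, and the features are left unscaled for brevity.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs. Nothing is fitted in advance:
    each prediction measures the distance to every stored instance, which is
    the per-query cost that grows with the size of the training set.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: (attendance rate, prior GPA) -> outcome.
train = [
    ((0.95, 3.6), "pass"),
    ((0.90, 3.1), "pass"),
    ((0.85, 2.9), "pass"),
    ((0.60, 2.0), "fail"),
    ((0.55, 1.8), "fail"),
    ((0.70, 2.2), "fail"),
]

print(knn_predict(train, (0.88, 3.0)))
print(knn_predict(train, (0.58, 1.9)))
```

In practice the features would usually be rescaled first, since an unscaled GPA range dominates the distance over an attendance rate between 0 and 1.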
Finally, the least commonly used algorithm was rules, which were only employed in 4% of the studies. Rules-based methods involve extracting rules from the data that describe relationships between variables. These rules can be used to make predictions about new data or to identify patterns in the data. However, rules-based methods can be prone to overfitting, meaning they may perform well on the training data but poorly on new data (Han et al., 2011; Witten et al., 2016; Tan et al., 2019).
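The overfitting risk can be illustrated with a one-attribute rule learner, similar in spirit to the OneR baseline. In this sketch, rules built on a student identifier fit the training data perfectly but fail on new students, while rules built on attendance generalize. All records, attribute names, and the fallback class are hypothetical.

```python
from collections import Counter, defaultdict

def learn_rules(rows, attribute, target):
    """Learn one rule per attribute value: value -> majority target class."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def error_rate(rules, rows, attribute, target, default):
    """Fraction of rows misclassified; unseen attribute values get `default`."""
    wrong = sum(rules.get(row[attribute], default) != row[target] for row in rows)
    return wrong / len(rows)

# Hypothetical training and held-out records.
train = [
    {"id": "s1", "attendance": "high", "result": "pass"},
    {"id": "s2", "attendance": "high", "result": "pass"},
    {"id": "s3", "attendance": "high", "result": "fail"},
    {"id": "s4", "attendance": "low", "result": "fail"},
    {"id": "s5", "attendance": "low", "result": "fail"},
    {"id": "s6", "attendance": "low", "result": "pass"},
]
held_out = [
    {"id": "s7", "attendance": "high", "result": "pass"},
    {"id": "s8", "attendance": "low", "result": "fail"},
    {"id": "s9", "attendance": "high", "result": "pass"},
    {"id": "s10", "attendance": "low", "result": "fail"},
]

id_rules = learn_rules(train, "id", "result")
attendance_rules = learn_rules(train, "attendance", "result")

for name, rules, attr in [("id", id_rules, "id"),
                          ("attendance", attendance_rules, "attendance")]:
    train_err = error_rate(rules, train, attr, "result", default="fail")
    test_err = error_rate(rules, held_out, attr, "result", default="fail")
    print(f"{name}: train error {train_err:.2f}, held-out error {test_err:.2f}")
```

The identifier rules memorize each training record, so their zero training error says nothing about new students. Comparing training and held-out error, as done here, is one simple way to detect this.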
Table 3 shows the breakdown of papers according to authors and data mining algorithms. This table provides a more detailed analysis of the data mining algorithms employed in each study, along with the authors of the papers.

Conclusion and Future Studies
The study included quantitative and qualitative data to shed light on studies on forecasting student performance using data mining methods. According to the quantitative data, the number of publications in this field has increased over time, with the bulk of research focused on higher education institutions. According to the qualitative data, the student's academic record is the most frequently mentioned factor influencing student achievement.
The outcomes of the study have implications for future research. One suggestion is to concentrate more on forecasting secondary school student performance, given that there is a need for more research in this area utilizing Malaysian datasets. In addition, since only one study has addressed elementary schools, future studies might use data mining methods to forecast student performance at that level.
Another suggestion is to investigate other aspects that may impact student performance, such as student environments, course attributes, and other elements. Further study might examine the links between these variables and student performance and how they can be incorporated into data mining systems to increase forecast accuracy. Future research might also evaluate various data mining techniques to see which are the most successful in predicting student achievement. This might include running several algorithms on the same dataset and evaluating performance indicators like accuracy, precision, recall, and F1-score.
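As a minimal sketch of such a comparison, the code below computes accuracy, precision, recall, and F1-score for two sets of predictions on the same labels. The labels, the two prediction lists, and the function name are hypothetical; real studies would obtain predictions from trained models on held-out data.

```python
def binary_metrics(y_true, y_pred, positive="pass"):
    """Accuracy, precision, recall, and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions from two models.
y_true = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]
model_a = ["pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass"]
model_b = ["pass", "pass", "pass", "pass", "pass", "pass", "fail", "fail"]

metrics_a = binary_metrics(y_true, model_a)
metrics_b = binary_metrics(y_true, model_b)

for name, m in [("model A", metrics_a), ("model B", metrics_b)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both hypothetical models reach the same accuracy, yet model B trades precision for recall. This is why reporting all four indicators, rather than accuracy alone, gives a fairer comparison.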
The research offers valuable insights by reviewing the literature on student performance prediction between 2016 and 2022. It identifies the growing interest in this field, particularly in 2021, with a primary focus on higher education. By reiterating the importance of academic records and demographic data as key factors in academic success, it strengthens existing knowledge. The study's methodology, which includes data from the Mendeley and Science Direct databases, broadens the scope of the review.
Additionally, the research uncovers a significant research gap in predicting student performance in Malaysian secondary schools, emphasizing the need for context-specific investigations. It also encourages exploring new variables that can influence student performance, potentially leading to more accurate predictions. The research underscores the importance of ongoing work in this area, promoting the integration of diverse data sources, consideration of various contexts, and the exploration of a wider range of predictors to enhance our understanding of and support for student success in different educational settings. Overall, the report presents a summary of studies on data mining strategies for forecasting student success. The results emphasize the need to consider many aspects that may impact student performance and to investigate new areas of study, such as forecasting student success in secondary and elementary schools.

Figure 6. Factors affecting students' performance discussed in the article

Figure 7. Data mining algorithms discussed in the article

Figure 2. Number of papers published from 2016 to 2022

Table 2. Publication breakdown according to the year
Author(s) | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022
Table 2 displays the split of publications by authors and years.

Study context of the article

Qualitative Data
This study extends the systematic literature reviews reported by Shahiri et al. (2015) and Abu Saa et al. (2019). Abu Saa et al. (2019) assessed publications published between 2009 and 2018, while Shahiri et al. (2015) examined research published between 2002 and 2015. This research also complements the systematic literature review by Roslan and Chen (2022). Roslan and Chen investigated articles published between 2015 and 2021 in Lens and Scopus, while our analysis covered articles published between 2016 and 2022 in Mendeley and Science Direct. This study's findings were consistent with the majority of the literature evaluated.

Table 3. Paper breakdown according to authors and factors affecting students' performance
Abu Saa et al. (2019) discovered, based on a survey of 36 research articles published between 2009 and 2018, that students' prior grades and internal ratings are the most prevalent predictors of student achievement. Asif et al. (2014) discovered that it is feasible to predict student performance based only on academic results, regardless of other determinants.
illustrates that a student's academic record (GPA, past results, test scores, grades, marks, and attendance) is the most relevant element in determining their academic achievement. This is consistent with the findings of Shahiri et al. (2015) and Roslan and Chen (2022), who reported that nearly one-third of previous studies employed academic records such as CGPA. In a similar line,