Department of Mathematical and Computational Sciences

Permanent URI for this community

https://univendspace.univen.ac.za/handle/11602/1927

Embargo
A Comparative Analysis of Machine Learning Models and Traditional Statistical Models for Continuous-Time Survival Analysis
(2026-05-19) Tshisikule, Ompha; Mulaudzi, T. B.; Bere, A.
Survival analysis is a statistical technique used to model time-to-event data, commonly applied in fields such as healthcare, engineering, and finance. Traditional approaches, including the Cox Proportional Hazards (CoxPH) model, have long been dominant due to their interpretability and theoretical foundation. However, recent advances in machine learning have shown promise in handling complex, high-dimensional datasets with nonlinear relationships. Despite this, there remains a gap in systematic comparative studies between traditional survival models and modern approaches such as regularized regression, ensemble methods, and deep learning architectures, particularly across diverse datasets with varying characteristics. This study conducts a comparative analysis of traditional, machine learning, and deep learning-based survival models, evaluating their predictive performance and computational efficiency for continuous-time survival data. The models considered include LASSO-regularized Cox regression, CoxPH, Random Survival Forest (RSF), and Long Short-Term Memory (LSTM) algorithms. Model performance was assessed using the concordance index (C-index), integrated Brier score (IBS), and Time-dependent Area Under the Curve (AUC) across three secondary datasets with different characteristics: a breast cancer dataset obtained from the SEER Program of the National Cancer Institute (2017 November update), the North Carolina Recidivism dataset (ICPSR 8987) obtained from ICPSR, and a heart failure clinical records dataset obtained from Kaggle. A rigorous statistical framework was employed, utilizing 100 iterations of stratified train-test splits to generate robust performance distributions. 
Distributional assumptions were systematically tested using Shapiro-Wilk and Levene’s tests to determine appropriate statistical tests, followed by omnibus tests (ANOVA, Welch’s ANOVA, or Kruskal-Wallis) and post-hoc pairwise comparisons with Bonferroni correction to control family-wise error rates. The analysis revealed that traditional survival models consistently outperformed deep learning-based approaches across all datasets. Random Survival Forest achieved the highest predictive accuracy, followed closely by CoxPH, with C-index values ranging from 0.66 to 0.73 and lower IBS scores indicating better calibration. In contrast, LSTM models performed poorly, often near random prediction levels (C-index 0.3–0.42), despite extensive optimization efforts including hyperparameter tuning, class balancing, and architectural modifications. Statistical testing confirmed that performance differences were highly significant across models and datasets (all p < 0.001), and post-hoc analyses demonstrated that RSF and CoxPH consistently outperformed LSTM for both discrimination and calibration metrics. These results suggest that traditional survival models remain the most reliable choice for moderate-sized datasets with censored observations and weak predictive signals, while LSTM networks are limited by dataset size, high censoring, and architectural mismatch with static survival data.
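As a reading aid for the concordance index (C-index) reported above, the following is a minimal sketch of Harrell's C for right-censored data. It is not the study's code: the function name and the toy data are hypothetical, and tied event times are simply skipped, which is a simplifying assumption.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if the record is censored
    risk_scores -- higher score means the model expects an earlier event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Comparable only if the times differ and the earlier time is an event.
        # (Skipping tied times is a simplification made for this sketch.)
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: the third subject is censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
print(concordance_index(toy_times, toy_events, [0.9, 0.7, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ordering, which is why the C-index range of 0.3–0.42 reported for the LSTM models is described as near or below random prediction levels.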
Embargo
A comparative evaluation of machine learning models for stock price prediction and uncertainty estimation
(2026-05-19) Nengovhela, Vhukhudo; Ravele, T.; Sigauke, C.; Ndogmo, J. C.
This study compares machine learning models for stock price prediction and uncertainty estimation using high-frequency one-minute stock data. The research looks at how different models perform across developed and emerging markets, which helps with model selection for practical financial forecasting. Four models were tested for point forecasting: Random Forest (RF), Gradient Boosting (GB), Multi-Layer Perceptron (MLP), and a hybrid stacking ensemble composed of multiple base learners. For uncertainty quantification, three interval prediction methods were used: Bootstrap Residuals, Quantile Regression Forests (QRF), and Conformalised Quantile Regression (CQR). The analysis used one-minute stock price data from Microsoft Corporation (MSFT) as a developed market example and Standard Bank Group (SBK.JO) as an emerging market example, covering the period from 3rd to 26th September 2025. The results show that GB performed best for point forecasts in both markets. For MSFT, GB had RMSE of 0.2875 and MAE of 0.1869, while for SBK.JO it achieved RMSE of 25.9248 and MAE of 14.3638. Statistical tests using the Diebold-Mariano and Giacomini-White frameworks confirmed that GB significantly outperformed the other models. For interval prediction, QRF gave sharper intervals in the relatively stable developed market, while CQR achieved better coverage in the more volatile emerging market. The Hybrid Stacking model showed some advantages in volatile conditions but didn’t consistently beat well-tuned individual models. These findings suggest that ensemble methods like GB are still very effective for financial forecasting, and that uncertainty quantification methods should be chosen based on market volatility. The study provides practical guidance for selecting forecasting methods depending on market conditions and data characteristics, which should help both researchers and practitioners working in financial risk management.
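To illustrate the Conformalised Quantile Regression (CQR) step mentioned above, here is a minimal sketch of the split-conformal calibration that widens or narrows a pair of predicted quantiles. It assumes lower and upper quantile predictions are already available from some model; the function names and numbers are hypothetical and do not come from the study.

```python
import math


def cqr_correction(y_cal, lo_cal, hi_cal, alpha=0.1):
    """Return the conformal correction Q for a target coverage of 1 - alpha.

    y_cal          -- observed values on a held-out calibration set
    lo_cal, hi_cal -- predicted lower and upper quantiles for the same points
    """
    # Conformity score: positive when the observation falls outside the interval.
    scores = sorted(max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lo_cal, hi_cal))
    n = len(scores)
    # Finite-sample rank ceil((n + 1)(1 - alpha)) among the sorted scores.
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the guarantee needs an unbounded interval.
    return math.inf if k > n else scores[k - 1]


def conformalise(lo, hi, q):
    """Apply the correction to a new predicted interval."""
    return lo - q, hi + q


# Hypothetical calibration data.
q = cqr_correction([1.0, 2.0, 3.0, 4.0], [0.5, 2.5, 2.0, 3.0], [1.5, 3.5, 2.5, 5.0], alpha=0.5)
print(conformalise(10.0, 12.0, q))
```

The correction can be negative, in which case the calibrated interval is tighter than the raw quantile interval.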
Embargo
A Multi-level Model for a Vector-Borne Organ to Tissue life Cycle Dynamics
(2025-09-05) Mahada, Awelani Sydney; Netshikweta, R.; Garira, W.
Introduction: Malaria is among the world’s most lethal infectious diseases. It is caused by a parasitic pathogen transmitted by the Anopheles mosquito, which inoculates sporozoites into the human host during a blood meal. The population dynamics of malaria are well known for their complexity, stemming not only from the parasite’s life cycle, which involves two hosts (humans and mosquitoes), but also from the intricate replication and transmission cycles across different levels of organization of the infectious disease system. Like other infectious disease systems, malaria infections are inherently multilevel and multiscale, which poses significant challenges to efforts aimed at eliminating and ultimately eradicating the infection in a malaria-endemic population. Methodology: Mathematical modelling of complex systems has proven to be an invaluable tool for understanding and predicting the behaviour and dynamics of such systems within the domain of complexity science. Thus, in this study, we propose a multiscale modelling framework that captures the dynamics of malaria across three organizational levels within infectious disease systems implicated in the spread of malaria in a community. We begin by formulating a mathematical model to describe the development and progression of malaria parasites within the liver and tissue (blood) stages of an infected human host. This is followed by the formulation of a multiscale model that integrates both the inside-host (i.e., the organ-tissue level) and the outside-host (i.e., the host level) malaria dynamics. Results: Mathematical analysis of both malaria models presented in this study was carried out and proved that the models are mathematically and epidemiologically well-posed. We also compute the basic reproduction number R0 for both models and use R0 to determine the local and global stability of the disease-free equilibrium as well as the local stability of the endemic equilibrium of both models.
We demonstrate that if R0 < 1, then the disease-free equilibrium point of both models is locally and globally asymptotically stable. However, if R0 > 1, the endemic equilibrium point of both models is locally asymptotically stable. The numerical results for both models demonstrate that the goal of intervention during malaria infection should be to reduce the rates at which merozoites and gametocytes invade healthy liver tissue as well as the blood cells. Hence it is recommended that interventions during malaria infection be directed at reducing the pace at which merozoites infect healthy blood cells and the density of merozoites in circulation. Conclusion: The study presents a method that incorporates the complexity of malaria pathogens, which is significant not only for malaria treatment but also for the control and treatment strategies of other vector-borne disease systems.
Open Access
Alternative methods for solving nonlinear two-point boundary value problems
(2018-03-18) Ghomanjani, Fateme; Shateyi, Stanford
In this sequel, the numerical solution of nonlinear two-point boundary value problems (NTBVPs) for ordinary differential equations (ODEs) is found by the Bezier curve method (BCM) and orthonormal Bernstein polynomials (OBPs). The OBPs are constructed by the Gram-Schmidt technique. The stated methods are easier to apply and are applicable to both linear and nonlinear problems. Some numerical examples are solved, and they show accurate results.
Open Access
An Intelligent Surveillance System Using Deep Facial Expression Recognition
(2026-05-19) Mutshafa, Livhuwani; Moyo, B.
Surveillance systems are critical tools for maintaining security, enhancing public safety, and safeguarding assets in diverse settings, from public spaces to private facilities. Despite their importance, these systems often face challenges that require human oversight. Recent studies have explored deep learning techniques to address such challenges, primarily focusing on face recognition and anomaly detection in static images. This study proposes a deep learning approach for detecting and interpreting facial expressions in dynamic images to enhance surveillance applications. The methodology involved a comprehensive literature review, dataset preprocessing, development of deep learning models, and rigorous model evaluation. A fine-tuned MobileNetV2 and a hybrid MobileNetV2–LSTM models were designed to capture both spatial and temporal features of facial expressions. The models were trained on benchmark datasets, including the Amsterdam Dynamic Facial Expression Set (ADFES) and the Chinese Face Dataset with Dynamic Expressions, and evaluated using accuracy, precision, recall, and F1-score metrics. Results demonstrated that the MobileNetV2–LSTM model significantly outperformed the standard MobileNetV2, achieving 95% accuracy, 95% precision, 95% recall, and 95% F1-score, highlighting the advantages of temporal modeling. The models maintained high computational efficiency, achieving 43.09 frames per second and a per-frame inference time of 0.0232 seconds, indicating strong real-time feasibility. This study contributes to intelligent surveillance by providing a highly reliable facial expression recognition framework for dynamic scenarios, with future work focusing on real-time deployment, expanded datasets with diverse ethnicities, and enhanced robustness under challenging surveillance conditions.
Open Access
Analysis of a boundary value problem for a system on non-homogeneous ordinary differential equations (ODE), with variable coefficients
(2015-01-16) Makhabane, Paul Suunyboy; Hlomuka, V. J.; Garira, W.
In this study we present a condition for the existence and uniqueness of the solution y(x) for a system of nonhomogeneous linear first order Ordinary Differential Equations (ODE). The existence and uniqueness of the solution of y(x) was confirmed through the Picard Lindelof Theorem. We then study the stability of matrix A(x) using its spectrum, moreover, A(x) is symmetric. This is a pre-condition for the application of Lefschetz direct stability method. We then modify the given Lefschetz system (Meyer, 1964) to suit the problem at hand. The direct method requires the construction of a suitable Lyapunov function; not easy for a time-independent (non-dynamic) problem. For a time-dependent problem the energy thereof becomes a suitable candidate for a Lyapunov function. For a non-dynamic problem it is harder to construct a Lyapunov function as there are no rules for that purpose. In our study we modified the Lefschetz system for the direct stability method and applied it to confirm the Lefschetz stability criterion using the modified systems of linear first order ODEs with variable coefficients. The Lefschetz method afforded us the construction of a credible Lyapunov function which enabled us to confirm the stability of the null solution to our problem. From our modified Lefschetz direct stability system, we solved the Makhabane / Hlomuka equation (5) for B(x) (7) which we later confirmed as both symmetric and positive definite.
Embargo
Application of explainable AI and uncertainty quantification in credit risk assessment
(2026-05-19) Rambauli, Mulavhelesi; Ravele, T.; Sigauke, C.
Credit risk modelling is essential for assessing the likelihood of borrower default and supporting informed lending decisions. Despite advances in predictive algorithms, challenges remain in ensuring model transparency, reliability, and robustness to uncertain inputs. This study investigates the integration of explainable AI (XAI) and uncertainty quantification (UQ) to enhance both interpretability and confidence in credit risk predictions. Three modelling approaches—Logistic Regression, Random Forest, and XGBoost—were evaluated using the Home Equity (HMEQ) dataset, with performance assessed on predictive accuracy, probability calibration, interpretability, and uncertainty handling. Ensemble methods achieved superior predictive performance, exceeding 98% accuracy and yielding near-perfect AUC scores above 0.999, whereas Logistic Regression exhibited substantially lower performance. Calibration analysis revealed a discrepancy between accuracy and probabilistic reliability: Random Forest, despite high accuracy, produced less well-calibrated predictions (ECE = 0.0475), while XGBoost achieved both strong predictive performance and reliable confidence estimates (ECE = 0.0117). Entropy-based uncertainty quantification identified instances where the model’s predictions carried high doubt, effectively highlighting challenging cases. SHAP and LIME consistently identified DELINQ, DEROG, and DEBTINC as primary drivers of default risk, aligning with established financial risk logic. By combining SHAP, LIME, and entropy-based UQ, this study proposes a unified framework that enhances interpretability, supports regulatory compliance, and increases trust in automated lending systems, emphasising the importance of reliable confidence alongside predictive accuracy.
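The calibration and uncertainty measures named above can be sketched briefly. The following assumes a binary default/no-default setting, equal-width probability bins for the expected calibration error (ECE), and Shannon entropy in bits; the study's exact binning and entropy definition are not given in the abstract, so these choices and all names are illustrative assumptions.

```python
import math


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted gap between mean predicted probability and observed rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(frac_pos - mean_prob)
    return ece


def binary_entropy(p):
    """Shannon entropy (bits) of a predicted default probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Hypothetical predictions: confident cases have low entropy, borderline ones high.
for p in (0.99, 0.60, 0.50):
    print(p, round(binary_entropy(p), 3))
```

Under this definition a model can be highly accurate yet poorly calibrated, which is the discrepancy the abstract reports between Random Forest and XGBoost.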
Open Access
Assessing models for de-identification of Electronic Discharge Summary Using Machine Learning tools
(2024-09-06) Mudau, Tshilisanani; Garira, Winston; Netshikweta, Rendani
Background: De-identification is a technique that eliminates identifying information from clinical records in order to protect individual privacy. This procedure decreases the chance that personal information which is collected, processed, distributed, and published can be used to identify the person. Including Machine Learning techniques in the de-identification process substantially improved it over the previous method. Research Problem: The Electronic Discharge Summary (EDS) has evolved into a significantly improved technique of providing discharge summaries, though this information contains Protected Health Information (PHI), which poses a risk to patients’ privacy. This makes de-identification mandatory. There have lately been several Machine Learning approaches to de-identifying data. This study focuses on applying Machine Learning techniques to determine which model can best de-identify a data set. Methods: The open source data set from Harvard Medical School was used. This data set contains 889 Electronic Health Records (EHR), 669 for training and 220 for testing. The Conditional Random Fields (CRF), Long Short Term Memory (LSTM) and Random Forest models were used, and the performance of each model was assessed. Findings: To assess each model’s performance, the F-measure, Recall and Precision were compared at token level to determine which Machine Learning model performed best. The Long Short Term Memory model was found to outperform both Conditional Random Fields and Random Forest, with micro-average F-measure, Recall and Precision of 99%, and macro-average F-measure of 77%, Recall of 73% and Precision of 90%.
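The gap between the micro-average and macro-average scores reported above is typical of token-level tagging, where most tokens carry no PHI. The sketch below shows how the two averages are formed from per-class counts; the class names and counts are hypothetical and are not taken from the study.

```python
def f1(tp, fp, fn):
    """F-measure from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each class label to a (tp, fp, fn) tuple.

    Micro-average pools the counts over all classes before scoring, so large
    classes dominate; macro-average scores each class and takes a plain mean.
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical token counts: a dominant non-PHI class and two small PHI classes.
toy_counts = {"O": (9900, 20, 10), "NAME": (40, 10, 20), "DATE": (30, 0, 0)}
micro, macro = micro_macro_f1(toy_counts)
print(round(micro, 3), round(macro, 3))
```

Because the dominant class is usually easy to tag, the micro-average can stay near 99% while weaker performance on rare PHI classes pulls the macro-average down.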
Open Access
A Bayesian multilevel model for women unemployment in South Africa
(2021-08) Ramarumo, V. P.; Bere, A.; Sigauke, Caston
The study is aimed at investigating and explaining the demographic and socio-economic determinants affecting women unemployment in South Africa. The classical and the Bayesian estimation approaches were applied to a multilevel logistic regression (MLR) model. Secondary data acquired from the Demographic and Health Survey (DHS) held in South Africa in 2016 was used in the study. Information criteria revealed that the random intercept model outperformed the null and random coefficient MLR models. The Intraclass Correlation Coefficient (ICC) suggests that there is a discernible difference in women unemployment levels across the provinces of South Africa. The results of the classical MLR and the Bayesian MLR indicate inflated commonness for women unemployment, and the chance of being without employment for women was found to decrease with an increase in age, wealth index, and educational attainment.
Open Access
A class of efficient iterative solvers for the steady state incompressible fluid flow : a unified approach
(2016-02-01) Muzhinji, Kizito
Open Access
Commodity Futures Market Prices: Decomposition Approach
(2023-10-05) Antwi, Emmanuel
Financial investments in commodity markets have attracted many investigations due to their importance to the global economy and to worldwide trade as a whole. The radical price changes in commodity markets, especially for agricultural, energy and industrial metal products, have significant consequences for consumers and producers. It is crucial to accurately estimate and predict volatility in commodity futures market prices, since continuous price fluctuations have dire consequences for investors, portfolio managers, dealers and policymakers in taking prudent and sustainable decisions. Commodity price component determination and forecasting are challenging due to remarkable price volatility, uncertainty, and complexity in the futures market. As a result, commodity futures price series are nonlinear and nonstationary. Various studies reported in the literature have attempted to develop models to study the persistent changes in commodity futures price series, but these models have failed to account for the inherent complexity of the series. This study aims to use decomposition techniques, combined with back-propagation neural network (BPNN) and autoregressive integrated moving average (ARIMA) models, to address difficulties in studying commodity futures market prices. The study utilized two decomposition methods, Empirical Mode Decomposition (EMD) and Variational Mode Decomposition (VMD), to analyze the daily real price series of three commodity futures: corn from agricultural products, crude oil from energy, and gold from industrial metals, using data from 4th May 2016 to 30th April 2021. In the first part of the study, we explored the descriptive and statistical properties of the data. It was found that the three commodity futures price series were nonstationary and nonlinear.
Subsequently, we performed an EMD-Granger causality test to establish the spillover effects among the three commodity markets. It was revealed that there exists a strong mutual relationship among the three commodity market price series, which implies that the price movement of one market can be used to explain the price fluctuations of the other markets. In the second part, the EMD and VMD methods were applied to decompose the daily data of each commodity price from different periods and frequencies into their respective individual intrinsic mode functions (IMFs). First, we used the Hierarchical Clustering Method and Euclidean Distance Approach to classify the IMFs, residue, and modes into high-frequency, low-frequency, and trend components. Next, applying statistical measures, particularly the Pearson product-moment correlation coefficient, Kendall rank correlation, and Spearman rank correlation coefficient, we observed that the trend and low-frequency parts of the market prices are the main drivers of commodity futures price fluctuations and that special events caused the low-frequency component. In essence, commodity futures prices are affected by economic development rather than short-lived market variations caused by ordinary supply-demand disequilibrium. The third part compared the EMD-based and VMD-based models using three forecasting performance evaluation criteria: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). We also introduced the Diebold-Mariano (DM) test for selecting the optimal model for each commodity, since MAE, RMSE and MAPE have some shortcomings. The combined models outperformed the individual back-propagation neural network (BPNN) and autoregressive integrated moving average (ARIMA) models in forecasting the corn and crude oil futures price series.
At the same time, BPNN emerged as the optimal model for predicting the gold futures price series. In addition, variational mode decomposition emerged as the ideal data pre-treatment method and contributed to enhancing the predictive ability of the BPNN and ARIMA models. The empirical results showed that models combined with decomposition methods predict commodity futures prices accurately and can easily capture the volatility in commodity futures prices. By utilizing decomposition-based models in studying commodity market prices, the study fills the following gaps in the existing literature. The pre-treatment effects of EMD and VMD can be compared horizontally, and decomposing commodity market price series to study the underlying components that cause the above-mentioned price fluctuations is a novel approach to studying commodity market prices. In addition, utilizing the Hierarchical Clustering and Euclidean Distance Approaches, the IMFs, residue and modes were classified into their distinctive frequencies, namely high-frequency, low-frequency, and trend units. The analysis of the effect of these frequencies and trends on commodity market price fluctuation is the first of its kind in the literature. Furthermore, applying statistical measures such as the Pearson product-moment correlation coefficient, Kendall rank correlation, and Spearman rank correlation coefficient to evaluate the contribution of the IMFs, residue, and modes to the net variance of the volatility of crude oil, corn, and gold market price fluctuations is an innovative approach to studying financial time series. The EMD-Causality technique proposed to study the causal relationship of corn, crude oil, and gold futures price movements is novel in the financial market. This new approach to studying the price movement of commodity markets will provide vital information about one commodity market to explain price fluctuations in other commodity markets.
Also, decomposing financial data before forecasting yields high forecasting accuracy in commodity futures price prediction. Additionally, using decomposition techniques in agricultural, energy, and industrial metal commodity futures markets effectively minimizes the prediction complexity. Furthermore, econometric and machine learning models combined with decomposition methods can capture the price series information to an acceptable degree. Finally, decomposition-based predicting techniques can effectively raise the predictive performance of the BPNN and ARIMA models and reduce errors; thus, the proposed novel combination method can statistically improve forecast accuracy. This study, therefore, may assist in arresting the trends of the agricultural, energy, and industrial commodity markets and in estimating volatility risk factors accurately, consequently serving as a guide for investors, government policymakers and related sectors such as agriculture, energy, and the metal industry in making prudent and sustainable planning and investment decisions. The suggested decomposition strategy, particularly the VMD-based one, is robust in analyzing the determinants of, modeling, and forecasting commodity futures market price fluctuations, thereby improving forecasting accuracy. Notably, when the decomposition approach is used to estimate the components of commodity price series separately, different predicting strategies can be explored. For instance, based on the features of the decomposed IMFs or modes, a suitable predicting technique can be chosen to forecast each IMF or mode; for example, the residue can be estimated with a polynomial function, while a Fourier transform can be considered for predicting low-frequency IMFs or modes. Hence, it is recommended that researchers, institutions, investors, and policymakers interested in studying commodity price movements consider using this novel technique to achieve better results.
It is further suggested that the decomposition approach could be utilized in other fields of study to prove the approach’s generality. Finally, further study can extend the proposed methodology by considering decomposition techniques other than EMD and VMD and evaluating their robustness in studying financial markets, as the EMD approach has the problems of mode mixing and endpoint effects. Eventually, we propose that a new model or consolidated predicting technique should be investigated to cater for the influence of special events on commodity market prices, since no one can predict the time and the place they will occur.
Embargo
Comparative Analysis of Discrimination and Calibration Accuracy of Discrete Survival, Random Forests, and Neural Networks in Health-Related Survival Prediction Models
(2025-09-05) Ramachela, Audrey Tshepho; Bere, Alphonce; Mulaudzi, Tshilidzi; Motsuku, Lactatia
Prediction models for survival analysis are commonly used in biomedical sciences to understand the onset of certain diseases. Traditional statistical models have been employed for many years; however, their limitations and inability to handle big data sets have made way for the introduction of machine learning methods, which have gained recognition due to their ability to learn complex algorithms. However, existing literature indicates that the predictive accuracy of machine learning and statistical models for survival analysis varies significantly across different data sets. This variability underscores the need for further research utilizing data sets with diverse characteristics. Such research is essential to develop generalizable insights into the conditions under which each method performs best. In this research project, we compared the predictive performance of a traditional statistical method and machine learning algorithms in discrete survival analysis. The machine learning methods include discrete-time survival trees, discrete-time random survival forests, and discrete-time neural networks. The study uses calibration (measured by the prediction error curves) to assess model fit and discrimination (measured by the concordance index and area under the curve) to evaluate predictive accuracy. These methods were applied to three data sets: breast cancer, age at first alcohol intake, and CRASH-2. The discrete-time neural network had the best prediction performance compared to the rest of the models for survival of breast cancer. The discrete-time random forest with Hellinger distance had the best overall prediction performance on the age at first alcohol intake. The discrete-time survival model outperformed the rest of the models in predicting survival of bleeding trauma patients from the CRASH-2 data.
Open Access
Comparative analysis of Machine Learning Algorithms for Estimating Global Solar Radiation at Selected Weather Stations in Vhembe District Municipality
(2023-10-05) Marandela, Mulalo Veronica; Mulaudzi, T. S.; Maluta, N. E.
Estimating and assessing the energy falling in a particular area is essential for installers of renewable technologies. Different equations have been applied as the most reliable empirical models for estimating global solar radiation (GSR) in different climatic conditions. The main objective of this work is to estimate the global solar radiation at two stations, namely Mutale and Messina, found in Vhembe District, Limpopo Province, South Africa. Four different methods (Random Forest (RF) regression, K-nearest neighbour (K-NN), Support Vector Machines (SVM) and Extreme Gradient Boosting (XGBoost)) were used to estimate the GSR in this study. The RF model at Mutale station was found to be the best-fitting model with R² = 0.9902, MSE = 0.4085 and RMSE = 0.6391, followed by XGB with R² = 0.9898, MSE = 0.4245 and RMSE = 0.6515. RF was also found to be the best for Messina station with R² = 0.9636, MSE = 1.4138 and RMSE = 1.1890, followed by the XGB model with R² = 0.9595, MSE = 1.5723 and RMSE = 1.2539. From the results, it can be concluded that RF is a better model for estimating GSR for different stations.
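Since RMSE is the square root of MSE, the reported pairs can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the abstract and reads the Messina RF value as MSE = 1.4138, which is the value implied by its RMSE of 1.1890; the dictionary layout and function name are illustrative.

```python
import math

# (station, model) -> (reported MSE, reported RMSE), as quoted in the abstract.
reported = {
    ("Mutale", "RF"): (0.4085, 0.6391),
    ("Mutale", "XGB"): (0.4245, 0.6515),
    ("Messina", "RF"): (1.4138, 1.1890),
    ("Messina", "XGB"): (1.5723, 1.2539),
}


def rmse_from_mse(mse):
    """RMSE is defined as the square root of the mean squared error."""
    return math.sqrt(mse)


for (station, model), (mse, rmse) in reported.items():
    # Agreement to four decimal places means a gap below half a unit in the
    # fourth decimal.
    consistent = abs(rmse_from_mse(mse) - rmse) < 5e-5
    print(station, model, round(rmse_from_mse(mse), 4), consistent)
```

Both error measures are in the squared or original units of the radiation data respectively, so the larger values at Messina reflect larger absolute errors rather than a different scale of metric.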
Open Access
A comparison of some methods of modeling baseline hazard function in discrete survival models
(2019-09-20) Mashabela, Mahlageng Retang; Bere, Alphonce; Sigauke, Caston
The baseline parameter vector in a discrete-time survival model is determined by the number of time points. The larger the number of the time points, the higher the dimension of the baseline parameter vector which often leads to biased maximum likelihood estimates. One of the ways to overcome this problem is to use a simpler parametrization that contains fewer parameters. A simulation approach was used to compare the accuracy of three variants of penalised regression spline methods in smoothing the baseline hazard function. Root mean squared error (RMSE) analysis suggests that generally all the smoothing methods performed better than the model with a discrete baseline hazard function. No single smoothing method outperformed the other smoothing methods. These methods were also applied to data on age at first alcohol intake in Thohoyandou. The results from real data application suggest that there were no significant differences amongst the estimated models. Consumption of other drugs, having a parent who drinks, being a male and having been abused in life are associated with high chances of drinking alcohol very early in life.
Open Access
Comparison of Some Statistical and Machine Learning Models for Continuous Survival Analysis
(2024-09-06) Ndou, Sedzani Emanuel; Mulaudzi, T. B.; Bere, A.
While statistical models have traditionally been utilized, there is a growing interest in exploring the potential of machine learning techniques. Existing literature shows varying results on their performance, depending on the dataset employed. This study conducts a comparative evaluation of the predictive accuracy of both statistical and machine learning models for continuous survival analysis utilizing two distinct datasets: time to first alcohol intake and North Carolina recidivism data. LassoCV was used to select variables for both datasets by encouraging sparse coefficient estimates. Kaplan-Meier survival curves were utilized to compare the survival distributions among groups of variables incorporated in the model, alongside the logrank test. The proposed methods include the Cox Proportional Hazards, Lasso-regularized Cox, Survival Trees, Random Survival Forest, and Neural Networks. Model performance was evaluated using Integrated Brier score (IBS), Area Under the Curve and Concordance index. Our findings show consistent dominance of Neural Network (NN) and Random Survival Forest (RSF) models across multiple metrics for both datasets. Specifically, Neural Network demonstrates remarkable performance, closely followed by RSF, with the CoxPH and CoxLasso models showing slightly lower performance, while Survival Tree (ST) consistently lags behind. This study can contribute to advancing knowledge and provides practical guidance for improving survival in recidivism and alcohol intake.
Open Access
Computational analysis of magnetohydrodynamics boundary layer flow of nanofluid over a stretching sheet in the presence of heat generation or absorption and chemical reaction
(2022-07-15) Molaudzi, Vhutshilo; Shateyi, S.; Muzhinji, K.
In this study, we present the effect of two-dimensional magnetohydrodynamics of a nanofluid over a stretching sheet in the presence of chemical reaction, as well as heat generation or absorption. The partial differential equations are reduced to coupled nonlinear ordinary differential equations using similarity transformations, which are then solved numerically using spectral local linearization and spectral relaxation methods. The effects of different parameters (Lewis number, Eckert number, stretching, chemical reaction, local Reynolds number, Prandtl number, constant, heat source, Brownian motion, and thermophoresis) are analysed and compared. The numerical results for velocity, temperature, skin friction coefficient, concentration, Sherwood number, and Nusselt number are presented in tabular form and visualized graphically. The results of the spectral local linearization and spectral relaxation methods are very similar to those of the bvp4c method. The spectral local linearization method proved more effective than the spectral relaxation method. We found that the velocity profile increases with increasing values of the Grashof number (Gr). Since the Grashof number is the ratio of buoyancy to viscous forces in the boundary layer, increasing it strengthens the buoyancy forces relative to the viscous forces, which influences the velocity in the boundary layer region. An increase in the heat source/sink parameter (S) results in an increase in velocity and temperature, but a decrease in concentration. The concentration of the diffusing species was reduced by the heat source/sink parameter (S). The results also show that heat generation increases the momentum and thermal boundary layer thickness while decreasing the nanofluid concentration boundary layer thickness.
Open Access
Credit Card Fraud Detection using Boosted Random Forest Algorithm
(2023-10-05) Mashamba, Thanganedzo Beverly; Chagwiza, W.; Garira, W.
Financial fraud is a growing concern with far-reaching consequences for financial institutions, government, and corporate organizations, leading to substantial monetary losses. Credit card fraud is the primary cause of financial loss; it affects both issuers and clients and poses a significant threat to the business, as clients may move to competitors where they feel more secure. Solving fraud problems manually is beyond human capability, so financial institutions can use machine learning algorithms to detect fraudulent behaviour by learning from credit card transactions. This thesis develops the boosted random forest, which integrates an adaptive boosting algorithm into a random forest algorithm so that the performance of the model in predicting fraudulent credit card transactions is improved. The confusion matrix is used to evaluate the performance of the models, with random forest, adaptive boosting and boosted random forest being compared. The results indicated that the boosted random forest outperformed the individual models with an accuracy of 99.9%, which corresponded with the results from the confusion matrix. However, random forest and adaptive boosting had accuracies of 100% and 99% respectively, which did not correspond to the results from the confusion matrix, meaning the individual models need to be more accurate. Thus, by implementing the proposed approach in a credit card management system, financial loss will be reduced to a great extent.
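To illustrate why a headline accuracy figure may disagree with the confusion matrix on imbalanced fraud data, the following is a minimal sketch in plain Python. It is not the thesis's code; the helper names `confusion_counts` and `summarize` and the 1,000-transaction toy data are hypothetical and chosen only for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 marks fraud."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Accuracy, precision and recall derived from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when nothing is predicted
        # or nothing is truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical data: 10 fraudulent transactions among 1,000, scored by a
# classifier that labels every transaction as legitimate.
labels = [1] * 10 + [0] * 990
predictions = [0] * 1000
report = summarize(labels, predictions)
```

In this toy case the accuracy is 99% even though no fraudulent transaction is caught (recall is zero), which is the kind of mismatch between accuracy and the confusion matrix that the abstract describes.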
Open Access
Determination of factors contributing towards women's unemployment in the Capricorn and Sekhukhune districts in the Limpopo Province
(2017-09-18) Maboko, Tumisho; Kyei, K. A.
See the attached abstract below
Open Access
Determination of factors that influence digit preference: A Case study of South African Census 2011 Age-Sex data
(2020-01) Netshiozwi, Masala; Kyei, K. A.; Moyo, S.
The age distribution of a population is one of the most important demographic factors and plays a major role in describing and making projections about the population. Age distribution determines life expectancy, fertility and migration. Its accuracy suffers from age misstatement and other factors. The study sought to determine the factors that influence digit preference in age data using the South African Census 2011 age-sex data. Various methods were applied to address the objectives of the study: visual inspection methods (line graph and population pyramid), statistical methods (age ratio and sex ratio) and multivariate methods (generalized linear model, principal component analysis and regression analysis), all of which are reviewed in detail in the study. This study utilized a full age dataset in single years. The United Nations Age-Sex Accuracy Index was found to be 18.3, indicating that the data collected were of good quality. Besides the results on data quality, the study found that education level, place of residence, gender and ethnic group are the factors that influence digit preference. This was evidenced by calculated p-values < 0.05 for the generalized linear model, showing a positive relationship. Principal component analysis and regression analysis confirm the findings of the generalized linear model.