Theses and Dissertations
Recent Submissions
Item Embargo
Modelling Extreme Forecast Errors in Wind Energy Using South African Wind Farms (2026-05-19)
Mushadu, Vhonani; Ravele, T.; Sigauke, C.; Ndogmo, J. C.
Accurate wind energy forecasting has become crucial for preserving grid stability and guaranteeing a consistent power supply in light of South Africa’s expanding shift to renewable energy. Because they have a direct impact on scheduling, dispatch choices, and reserve allocation, extreme prediction errors in particular cause serious operational and financial issues. This study uses data from a collection of wind farms in South Africa to model short-term extreme forecast errors in wind energy generation. It compares the predictive accuracy of two sophisticated extreme value modelling frameworks: the blended generalised extreme value (bGEV) distribution and extremal mixture models. An additive quantile regression (AQR) model is used to derive wind energy forecast residuals. Both modelling techniques were then used to identify tail behaviour associated with extreme under- or over-prediction. The findings demonstrate that, in comparison to extremal mixture models, the bGEV model offers more accurate, dependable, and well-calibrated predictions of severe forecast errors. These results emphasise how crucial strong and adaptable extreme value models are to enhancing operational wind energy forecasting in South Africa. By showing how better modelling of extreme errors will enhance power system planning, lower uncertainty, and facilitate more effective integration of wind energy into the national grid, the study further advances the renewable energy industry. To improve prediction accuracy and deepen system-level insights, future research should take into account geographically disaggregated data from individual wind farms.

Item Embargo
Explainability of Machine Learning Models in Credit Risk Management (2026-05-19)
Dzhivhuho, Asikundwi Praise the Lord; Mukhodobwane, R. 
M.; Mphephu, N.; Netshikweta, R.
The effective management of credit risk is a critical challenge for financial institutions, with accurate assessment of loan default risk playing a central role in maintaining financial stability. Machine Learning (ML) techniques have become increasingly prevalent in credit risk assessment due to their ability to capture complex patterns in borrower behavior and improve predictive accuracy. However, the lack of interpretability of many advanced ML models, such as Random Forest, XGBoost, and Neural Networks, raises concerns regarding transparency, fairness, and accountability in decision-making, particularly in high-stakes environments where regulatory compliance and ethical considerations are paramount. This study seeks to bridge the gap between predictive accuracy and interpretability by applying two post-hoc, model-agnostic explainability techniques, Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), to evaluate five commonly used ML models: Logistic Regression, Multivariate Adaptive Regression Splines (MARS), Neural Networks, Random Forest, and XGBoost. Using an open-access Kaggle dataset, the study examines both the predictive performance and the interpretability of these models, with a particular focus on the trade-offs between high accuracy and model transparency. The results highlight a clear trade-off: while ensemble models like XGBoost and Random Forest exhibit superior accuracy, particularly in predicting low-risk borrowers, they struggle with detecting high-risk applicants and lack the interpretability required for transparent decision-making. Simpler models, such as Logistic Regression, offer greater transparency and are more effective in identifying high-risk cases but sacrifice predictive accuracy. Neural Networks strike a balance, providing better accuracy than linear models while maintaining moderate sensitivity to high-risk applicants. 
By leveraging SHAP and LIME, this research enhances model transparency, offering both global insights into risk factors and local instance-level explanations for individual predictions, which can aid stakeholders such as financial institutions, regulators, and applicants in making more informed, fair, and accountable credit decisions.

Item Embargo
Exploring the dynamics of the ZAR/USD Exchange Rate volatility using the FGARCH and FIRST-ORDER BETA-SKEW-T-EGARCH Models (2026-05-19)
Mashavhela, Dzulani; Ravele, T.; Ndogmo, J. C.; Sigauke, C.
The effect of exchange rate fluctuations on international trade, investment choices, and economic stability has long captured the attention of economists, policymakers, and market participants. This study investigates the dynamics of the ZAR/USD exchange rate volatility using advanced econometric models: the Family GARCH (fGARCH) model and the First-Order Beta-Skew-T-Exponential Generalised Autoregressive Conditional Heteroskedasticity (First-Order Beta-Skew-T-EGARCH) model. The ZAR/USD exchange rate is an important indicator for global trade, investment, and economic stability. However, traditional volatility models often struggle to fully capture its complex behaviour. This research aims to fill this gap by using the fGARCH and the First-Order Beta-Skew-T-EGARCH models to better understand volatility characteristics, including long-memory effects, asymmetry, and skewness, using daily data from 5/01/2000 to 01/10/2024. The sGARCH and fGARCH models were first compared using the following five error distributions: Student’s t, skewed Student’s t, generalised error, skewed generalised error, and generalised hyperbolic. Model selection is based on the lowest values of the AIC, BIC, Shibata, and Hannan-Quinn information criteria. The fGARCH(1,1) model has a lower AIC than the sGARCH model. The covariate effects were analysed for day, month, trend, oil, and platinum. 
The trend is statistically significant (p = 0.007) and positively influences the ZAR/USD market. Beta-Skew-T-EGARCH models with one and two components displayed a significant spike in both 2008 and 2009 due to the global financial crisis. The two-component model provides a better fit with the lowest BIC (3.242162) and a high log-likelihood of -748.464826. Volatility was analysed over seven days using one- and two-component models. The one-component level remained high, indicating persistent volatility, while the two-component model showed low conditional volatility. This suggests that the two-component model outperforms the one-component model, effectively reducing uncertainty. The outcomes of this research will contribute to the refinement of models for understanding and predicting volatility in the foreign exchange markets, providing valuable implications for financial decision-makers and policy-makers.

Item Embargo
Lie group analysis of Schrödinger equations describing optical waves in birefringent waveguides (2026-05-19)
Mbala, Emmanuel Mayombo; Ndogmo, Jean-Claude; Folly-Gbetoula, Mensah
Lie group analysis will be carried out for a system of two nonlinear Schrödinger equations characterizing the propagation of optical pulses and involving four-wave mixing terms in a birefringent medium. The ultimate goal is to find the most general symmetry transformation that preserves the solution space and generates in particular a series of solutions of the system starting from a seed solution. Such a general symmetry transformation is the symmetry group, and it will be computed from the symmetry algebra L of the system rewritten as a system of four nonlinear uncoupled equations in real form. An optimal system of one-dimensional subalgebras of L will be found and some related symmetry reductions will be given. 
Soliton or other solutions will be obtained by means of either the Hirota direct method, or some substitution methods such as the Tanh-expansion method or its variants, or else by other common methods including the direct Lie group methods.

Item Embargo
Enhancing PCE Prediction for Organic Solar Cells through the Integration of Supervised and Unsupervised Learning (2026-05-19)
Mudau, Mulweli Raymond; Maluta, N. E.; Dima, R. S.; Netshikweta, R.
Machine learning (ML) has significantly advanced solar cell research, particularly in material optimization and discovery. However, many studies rely on supervised learning models that assume consistent predictive trends across materials, potentially overlooking complex correlations affecting power conversion efficiency (PCE). Unsupervised clustering techniques offer an alternative by uncovering hidden patterns in material properties, yet their application in organic solar cell (OSC) research remains limited. This study addresses this gap by integrating clustering techniques with supervised learning to enhance PCE predictions in OSCs. The research employed K-means, DBSCAN, and hierarchical clustering to categorize OSCs based on molecular descriptors, then incorporated cluster labels as additional features in supervised models including Linear Regression, Random Forest, XGBoost, and Support Vector Regressor. Despite weak inherent cluster structure indicated by clusterability tests, the integration of cluster labels consistently improved predictive performance across all configurations. XGBoost paired with hierarchical clustering achieved the most substantial enhancement, with R² reaching 0.9640 and MAE reducing from 0.2917 to 0.2859. The findings demonstrate that (1) unsupervised learning can identify meaningful structural patterns in OSC datasets, and (2) incorporating cluster labels as engineered features improves PCE prediction accuracy compared to traditional supervised approaches alone. 
Importantly, even statistically weak clusters provided valuable predictive signals, contributing to enhanced model performance and supporting accelerated discovery of high-efficiency OSC materials.

Item Open Access
An Intelligent Surveillance System Using Deep Facial Expression Recognition (2026-05-19)
Mutshafa, Livhuwani; Moyo, B.
Surveillance systems are critical tools for maintaining security, enhancing public safety, and safeguarding assets in diverse settings, from public spaces to private facilities. Despite their importance, these systems often face challenges that require human oversight. Recent studies have explored deep learning techniques to address such challenges, primarily focusing on face recognition and anomaly detection in static images. This study proposes a deep learning approach for detecting and interpreting facial expressions in dynamic images to enhance surveillance applications. The methodology involved a comprehensive literature review, dataset preprocessing, development of deep learning models, and rigorous model evaluation. A fine-tuned MobileNetV2 model and a hybrid MobileNetV2–LSTM model were designed to capture both spatial and temporal features of facial expressions. The models were trained on benchmark datasets, including the Amsterdam Dynamic Facial Expression Set (ADFES) and the Chinese Face Dataset with Dynamic Expressions, and evaluated using accuracy, precision, recall, and F1-score metrics. Results demonstrated that the MobileNetV2–LSTM model significantly outperformed the standard MobileNetV2, achieving 95% accuracy, 95% precision, 95% recall, and 95% F1-score, highlighting the advantages of temporal modeling. The models maintained high computational efficiency, achieving 43.09 frames per second and a per-frame inference time of 0.0232 seconds, indicating strong real-time feasibility. 
This study contributes to intelligent surveillance by providing a highly reliable facial expression recognition framework for dynamic scenarios, with future work focusing on real-time deployment, expanded datasets with diverse ethnicities, and enhanced robustness under challenging surveillance conditions.

Item Embargo
A comparative evaluation of machine learning models for stock price prediction and uncertainity estimation (2026-05-19)
Nengovhela, Vhukhudo; Ravele, T.; Sigauke, C.; Ndogmo, J. C.
This study compares machine learning models for stock price prediction and uncertainty estimation using high-frequency one-minute stock data. The research looks at how different models perform across developed and emerging markets, which helps with model selection for practical financial forecasting. Four models were tested for point forecasting: Random Forest (RF), Gradient Boosting (GB), Multi-Layer Perceptron (MLP), and a hybrid stacking ensemble composed of multiple base learners. For uncertainty quantification, three interval prediction methods were used: Bootstrap Residuals, Quantile Regression Forests (QRF), and Conformalised Quantile Regression (CQR). The analysis used one-minute stock price data from Microsoft Corporation (MSFT) as a developed market example and Standard Bank Group (SBK.JO) as an emerging market example, covering the period from 3rd to 26th September 2025. The results show that GB performed best for point forecasts in both markets. For MSFT, GB had RMSE of 0.2875 and MAE of 0.1869, while for SBK.JO it achieved RMSE of 25.9248 and MAE of 14.3638. Statistical tests using the Diebold-Mariano and Giacomini-White frameworks confirmed that GB significantly outperformed the other models. For interval prediction, QRF gave sharper intervals in the relatively stable developed market, while CQR achieved better coverage in the more volatile emerging market. 
The Hybrid Stacking model showed some advantages in volatile conditions but did not consistently beat well-tuned individual models. These findings suggest that ensemble methods like GB are still very effective for financial forecasting, and that uncertainty quantification methods should be chosen based on market volatility. The study provides practical guidance for selecting forecasting methods depending on market conditions and data characteristics, which should help both researchers and practitioners working in financial risk management.

Item Embargo
Improving Computational Efficiency of MRI Brain Tumour Analysis Using Hybrid Machine Learning Models (2026-05-19)
Netshamutshedzi, Ndivhuwo; Obagbuwa, Ibidun Christiana; Ndogmo, Jean-Claude; Netshikweta, Rendani
Brain tumor is a critical challenge in medical diagnostics, worsened by the high mortality rate and worldwide prevalence of the disease. Accurate and early detection is paramount to improving patient outcomes. This study focuses on evaluating the usefulness of machine learning (ML) and deep learning (DL) models in classifying brain tumor and non-tumor cases using a dataset sourced from Kaggle. After preprocessing, the dataset was analyzed using Support Vector Machines (SVM), VGG-19, and YOLOv10 models. Metrics including accuracy, precision, recall, F1-score, and ROC-AUC were utilized to evaluate the models' effectiveness. The findings reveal that hybrid models, particularly SVM+VGG-19, excel in tumor classification, achieving an outstanding accuracy of 99.80% and a ROC-AUC of 98.01%. These models not only deliver superior accuracy but also require less training time compared to standalone models like SVM, VGG-19, or YOLOv10. Explainable AI techniques such as LIME and SHAP were employed to explain the models. 
By combining high precision with relatively low computational time, the SVM+VGG-19 hybrid model emerges as a robust way to deal with the MRI brain tumor segmentation problem, making it highly suitable for real-time image analysis.

Item Embargo
Application of explainable AI and uncertainity quantification in credit risk assessment (2026-05-19)
Rambauli, Mulavhelesi; Ravele, T.; Sigauke, C.
Credit risk modelling is essential for assessing the likelihood of borrower default and supporting informed lending decisions. Despite advances in predictive algorithms, challenges remain in ensuring model transparency, reliability, and robustness to uncertain inputs. This study investigates the integration of explainable AI (XAI) and uncertainty quantification (UQ) to enhance both interpretability and confidence in credit risk predictions. Three modelling approaches (Logistic Regression, Random Forest, and XGBoost) were evaluated using the Home Equity (HMEQ) dataset, with performance assessed on predictive accuracy, probability calibration, interpretability, and uncertainty handling. Ensemble methods achieved superior predictive performance, exceeding 98% accuracy and yielding near-perfect AUC scores above 0.999, whereas Logistic Regression exhibited substantially lower performance. Calibration analysis revealed a discrepancy between accuracy and probabilistic reliability: Random Forest, despite high accuracy, produced less well-calibrated predictions (ECE = 0.0475), while XGBoost achieved both strong predictive performance and reliable confidence estimates (ECE = 0.0117). Entropy-based uncertainty quantification identified instances where the model’s predictions carried high doubt, effectively highlighting challenging cases. SHAP and LIME consistently identified DELINQ, DEROG, and DEBTINC as primary drivers of default risk, aligning with established financial risk logic. 
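As an illustration of the expected calibration error (ECE) figures quoted above, the following is a minimal sketch of one common binary variant, which bins predicted default probabilities and averages the gap between the mean predicted probability and the observed default rate in each bin. The function name, the equal-width binning, and the choice of ten bins are assumptions made for this example; the abstract does not state how the thesis computed ECE.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE sketch (assumed variant): weighted mean over equal-width
    bins of |observed event rate - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(rate - mean_p)
    return ece


# Hypothetical toy data: predicted default probabilities and true outcomes.
toy_probs = [0.1, 0.1, 0.9, 0.9]
toy_labels = [0, 0, 1, 1]
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

A lower value indicates that predicted probabilities track observed frequencies more closely, which is the sense in which the XGBoost figure above is described as better calibrated than the Random Forest figure.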
By combining SHAP, LIME, and entropy-based UQ, this study proposes a unified framework that enhances interpretability, supports regulatory compliance, and increases trust in automated lending systems, emphasising the importance of reliable confidence alongside predictive accuracy.

Item Embargo
A Comparative Analysis of Machine Learning Models and Traditional Statistical Models for Continuous-Time Survival Analysis (2026-05-19)
Tshisikule, Ompha; Mulaudzi, T. B.; Bere, A.
Survival analysis is a statistical technique used to model time-to-event data, commonly applied in fields such as healthcare, engineering, and finance. Traditional approaches, including the Cox Proportional Hazards (CoxPH) model, have long been dominant due to their interpretability and theoretical foundation. However, recent advances in machine learning have shown promise in handling complex, high-dimensional datasets with nonlinear relationships. Despite this, there remains a gap in systematic comparative studies between traditional survival models and modern approaches such as regularized regression, ensemble methods, and deep learning architectures, particularly across diverse datasets with varying characteristics. This study conducts a comparative analysis of traditional, machine learning, and deep learning-based survival models, evaluating their predictive performance and computational efficiency for continuous-time survival data. The models considered include LASSO-regularized Cox regression, CoxPH, Random Survival Forest (RSF), and Long Short-Term Memory (LSTM) algorithms. 
Model performance was assessed using the concordance index (C-index), integrated Brier score (IBS), and Time-dependent Area Under the Curve (AUC) across three secondary datasets with different characteristics: a breast cancer dataset obtained from the SEER Program of the National Cancer Institute (2017 November update), the North Carolina Recidivism dataset (ICPSR 8987) obtained from ICPSR, and a heart failure clinical records dataset obtained from Kaggle. A rigorous statistical framework was employed, utilizing 100 iterations of stratified train-test splits to generate robust performance distributions. Distributional assumptions were systematically tested using Shapiro-Wilk and Levene’s tests to determine appropriate statistical tests, followed by omnibus tests (ANOVA, Welch’s ANOVA, or Kruskal-Wallis) and post-hoc pairwise comparisons with Bonferroni correction to control family-wise error rates. The analysis revealed that traditional survival models consistently outperformed deep learning-based approaches across all datasets. Random Survival Forest achieved the highest predictive accuracy, followed closely by CoxPH, with C-index values ranging from 0.66 to 0.73 and lower IBS scores indicating better calibration. In contrast, LSTM models performed poorly, often near random prediction levels (C-index 0.3–0.42), despite extensive optimization efforts including hyperparameter tuning, class balancing, and architectural modifications. Statistical testing confirmed that performance differences were highly significant across models and datasets (all p < 0.001), and post-hoc analyses demonstrated that RSF and CoxPH consistently outperformed LSTM for both discrimination and calibration metrics. 
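For readers unfamiliar with the concordance index used above, the following is a minimal sketch of Harrell's C-index for right-censored data. It is an illustrative O(n²) implementation, not the code used in the thesis; the function and argument names are hypothetical, and tied event times are simply treated as not comparable.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index sketch.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event had higher predicted risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: three subjects, the last one censored.
toy_times = [1, 2, 3]
toy_events = [1, 1, 0]
perfect = concordance_index(toy_times, toy_events, [3, 2, 1])
reversed_order = concordance_index(toy_times, toy_events, [1, 2, 3])
```

A value of 0.5 corresponds to random ordering, which is why the LSTM values of 0.3 to 0.42 reported above are described as near or below random prediction levels.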
These results suggest that traditional survival models remain the most reliable choice for moderate-sized datasets with censored observations and weak predictive signals, while LSTM networks are limited by dataset size, high censoring, and architectural mismatch with static survival data.

Item Embargo
Predicting price volatility crytocurrency ethereum (2025-09-05)
Rambevha, Vhukhudo Ronny; Sigauke, Caston; Ravele, Thakhani
Volatility is essential when trading or investing in the cryptocurrency Ethereum. Over the years, investors, traders and investment banks have found it difficult to predict the price volatility of Ethereum due to its rapid price fluctuation. This report focuses on forecasting the price volatility of Ethereum for the next two days using daily historical observations of the price of Ethereum obtained from Coindesk and tweets extracted from Twitter ranging from the 1st of August 2022 to the 8th of August 2022. Two models are used to compute the forecast for the next two days: support vector regression (SVR) and a recurrent neural network (RNN). The main evaluation metric used is the mean absolute error (MAE). In this study, according to MAE, the RNN forecasts without tweets outperform the SVR forecasts without tweets, with the best model, the RNN without tweets, producing an MAE of 0.0309.

Item Open Access
Short-term forecasting of global horizontal irradiance using stacked ensemble machine learning alogorithms (2025-09-05)
Mugware, Fhulufhelo Walter; Ravele, T.; Sigauke, C.
In today’s world, where sustainable energy is essential for the planet’s survival, accurate solar energy forecasting is crucial. This study focused on predicting short-term Global Horizontal Irradiance (GHI) using data from the Southern African Universities Radiometric Network (SAURAN) at the Univen Radiometric Station in South Africa. 
Various techniques were evaluated for their predictive accuracy, including Recurrent Neural Networks (RNN), Support Vector Regression (SVR), Gradient Boosting (GB), Random Forest (RF), Stacking Ensemble, and Double Nested Stacking (DNS). The results indicated that RNN performed the best in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) among the machine learning models. However, Stacking ensembles with XGBoost as the meta-model outperformed all individual models, improving accuracy by 67.06% in MAE and 22.28% in RMSE. DNS further enhanced accuracy, achieving a 93.05% reduction in MAE and an 88.54% reduction in RMSE compared to the best machine learning model, as well as a 78.89% decrease in MAE and an 85.27% decrease in RMSE compared to the best single stacking model. Furthermore, experimenting with the order of the DNS meta-model revealed that using RF as the first-level meta-model followed by XGBoost yielded the highest accuracy, showing a 47.39% decrease in MAE and a 61.35% decrease in RMSE compared to DNS with RF at both levels. These findings underscore the potential of advanced stacking techniques to significantly improve GHI forecasting.

Item Embargo
Comparative Analysis of Discrimination and Calibration Accuracy of Discrete Survival, Random Forests, and Neural Networks in Health-Related Survival Prediction Models (2025-09-05)
Ramachela, Audrey Tshepho; Bere, Alphonce; Mulaudzi, Tshilidzi; Motsuku, Lactatia
Prediction models for survival analysis are commonly used in biomedical sciences to understand the onset of certain diseases. Traditional statistical models have been employed in previous years; however, their limitations and inability to handle big data sets have made way for the introduction of machine learning methods, which have gained recognition due to their ability to learn complex algorithms. 
However, existing literature indicates that the predictive accuracy of machine learning and statistical models for survival analysis varies significantly across different data sets. This variability underscores the need for further research utilizing data sets with diverse characteristics. Such research is essential to develop generalizable insights into the conditions under which each method performs best. In this research project, we compared the predictive performance of a traditional statistical method and machine learning algorithms in discrete survival analysis. The machine learning methods include discrete-time survival trees, discrete-time random survival forests, and discrete-time neural networks. The study uses calibration (measured by the prediction error curves) to assess model fit and discrimination (measured by the concordance index and area under the curve) to evaluate predictive accuracy. These methods were applied to three data sets: breast cancer, age at first alcohol intake, and CRASH-2. The discrete-time neural network had the best prediction performance compared to the rest of the models for survival of breast cancer. The discrete-time random forest with Hellinger distance had the best overall prediction performance on the age at first alcohol intake data. The discrete-time survival model outperformed the rest of the models in predicting survival of bleeding trauma patients from the CRASH-2 data.

Item Embargo
A Multi-level Model for a Vector-Borne Organ to Tissue life Cycle Dynamics (2025-09-05)
Mahada, Awelani Sydney; Netshikweta, R.; Garira, W.
Introduction: Malaria is among the world’s most lethal infectious diseases. It is caused by a parasitic pathogen transmitted by the Anopheles mosquito, which inoculates sporozoites into the human host during a blood meal. 
The population dynamics of malaria are well known for their complexity, stemming not only from the parasite’s lifecycle, which involves two hosts (humans and mosquitoes), but also from the intricate replication and transmission cycles across different levels of the infectious disease system organization. Like other infectious disease systems, malaria infections inherit multilevel and multiscale systems, which pose significant challenges to efforts aimed at eliminating and ultimately eradicating the infection in a malaria-endemic population. Methodology: Mathematical modelling in the study of complex systems has proven to be an invaluable tool for understanding and predicting the behaviour and dynamics of a complex system within the domain of complexity science. Thus, in this study, we propose a multiscale modelling framework that captures the dynamics of malaria across three organizational levels within infectious disease systems implicated in the spread of malaria in a community. We begin by formulating a mathematical model to describe the development and progression of malaria parasites within the liver and tissue (blood) stages of an infected human host. This is followed by the formulation of a multiscale model that integrates both the inside-host (i.e., the organ-tissue level) and the outside-host (i.e., the host level) malaria dynamics. Results: Mathematical analysis of both malaria models presented in this study was carried out and proved that the models are mathematically and epidemiologically well-posed. We also compute the basic reproduction number R0 for both models and use it to determine the local and global stability of the disease-free equilibrium as well as the local stability of the endemic equilibrium of both models. We demonstrate that if R0 < 1, then the disease-free equilibrium point of both models is locally and globally asymptotically stable. 
However, if R0 > 1, the endemic equilibrium point of both models is locally asymptotically stable. The numerical results for both models have demonstrated that the goal of intervention during malaria infection should be to reduce the rates at which merozoites and gametocytes invade healthy liver tissue as well as the blood cells. Hence it is recommended that interventions during malaria infection be directed at reducing the pace at which merozoites infect healthy blood cells and the density of merozoites in circulation. Conclusion: The study presents a method that incorporates the complexity of malaria pathogens, which is significant not only for malaria treatment but also for the control and treatment strategies of other vector-borne disease systems.

Item Embargo
Predictive modelling of student progression at the University of Venda using statistical and machine learning techniques (2025-09-05)
Muthundinne, Phindulo Pretty; Bere, Alphonce; Mulaudzi, Tshilidzi B.
One of the challenges facing higher education is the steadily rising number of university dropouts. Over the years, survival analysis has been used to address the issue of student dropout. In developed countries, machine learning methods have gained more attention for solving the problem of student dropout. The main motivation is the lack of application of both discrete-time statistical and discrete-time machine learning methods when analysing student academic outcomes. This study built both the discrete-time competing risk model and discrete-time machine learning models for the time from registration until graduation or dropout for students at the University of Venda. These two approaches were compared (in terms of calibration and discrimination) to check which one works best. 
The proposed methodology applied statistical methods (discrete-time survival models for single risk and competing risks) and machine learning models (classification trees for competing risks) using the R statistical software. For the competing risk models, we considered the time intervals 3 up to 6, since the possibility of graduation starts at the third year. This study used comparison measures such as the Brier score and C-index to evaluate the models. Results show that the discrete cause-specific model and the decision tree for competing risks showed a higher discrimination ability regarding student progression. However, the decision tree model appeared to be better than the cause-specific model since its C-index is higher. The results showed that male students are more likely to drop out and less likely to graduate, whereas female students are more likely to graduate. Students with an average mark of 70+ have 48.2% higher odds of graduating compared to those with an average below 50. Students in the Faculty of Human and Social Sciences (HSS) are less likely to drop out compared to those in the Faculty of Science, Engineering and Agriculture (FSEA). However, HSS students do not significantly differ from FSEA students in graduation odds (SE = 0.073, OR = 0.904, 95% CI (0.784; 1.042), p-value = 0.165). The Faculty of Commerce, Management, and Law (FMCL) does not significantly differ from FSEA in either dropout (p-value = 0.766) or graduation (p-value = 0.072). This study found that older students are more likely to drop out than younger ones. This study suggests that using a decision tree model is more efficient than standard approaches for analyzing student dropout and academic results and recommends that it should therefore be used for analysing academic outcomes. 
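The odds ratio, standard error, and confidence interval reported above for HSS versus FSEA graduation odds can be related through the usual Wald construction on the log-odds scale. The sketch below assumes that the reported SE refers to the log odds ratio and that a normal approximation with z = 1.96 was used; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
import math


def wald_interval_for_odds_ratio(odds_ratio, se_log_or, z=1.96):
    """Wald confidence interval and two-sided p-value for an odds ratio,
    assuming se_log_or is the standard error of log(odds_ratio)."""
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    z_stat = log_or / se_log_or
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return lower, upper, p_value


# Values quoted in the abstract: OR = 0.904, SE = 0.073.
lower, upper, p_value = wald_interval_for_odds_ratio(0.904, 0.073)
```

Under these assumptions the result lands close to the reported interval (0.784; 1.042) and p-value of 0.165, with small differences that are consistent with rounding of the published OR and SE. Because the interval contains 1, the difference in graduation odds is not significant at the 5% level, as the abstract states.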
Interventions for reducing dropout rates and shortening the time from first registration to graduation should target the identified high-risk groups such as male and older students.

Item Open Access
Comparison of Some Statistical and Machine Learning Models for Continuous Survival Analysis (2024-09-06)
Ndou, Sedzani Emanuel; Mulaudzi, T. B.; Bere, A.
While statistical models have been traditionally utilized, there is a growing interest in exploring the potential of machine learning techniques. Existing literature shows varying results on their performance, depending on the dataset employed. This study conducts a comparative evaluation of the predictive accuracy of both statistical and machine learning models for continuous survival analysis utilizing two distinct datasets: time to first alcohol intake and North Carolina recidivism data. LassoCV was used to select variables for both datasets by encouraging limited coefficient estimates. Kaplan-Meier survival curves were utilized to compare the survival distributions among groups of variables incorporated in the model, alongside the log-rank test. The proposed methods include the Cox Proportional Hazards (CoxPH), Lasso-regularized Cox (CoxLasso), Survival Trees, Random Survival Forest, and Neural Networks. Model performance was evaluated using the Integrated Brier score (IBS), Area Under the Curve and Concordance index. Our findings show consistent dominance of Neural Network (NN) and Random Survival Forest (RSF) models across multiple metrics for both datasets. Specifically, the Neural Network demonstrates remarkable performance, closely followed by RSF; the CoxPH and CoxLasso models show slightly lower performance, and the Survival Tree (ST) consistently lags behind. 
This study contributes to advancing knowledge and provides practical guidance for improving survival in recidivism and alcohol intake.
Item Open Access Probabilistic renewable energy modelling in South Africa (2024-05-05) Ravele, Thakhani; Sigauke, Caston; Jhamba, Lodwell
The variability of solar power creates problems in planning and managing power system operations. It is critical to forecast accurately in order to maintain the safety and stability of large-scale integration of solar power into the grid. Accurate forecasting is vital because it prevents transmission obstruction and maintains a power equilibrium. This thesis uses robust models to solve this problem by addressing four main issues. The first issue involves the construction of quantile regression models for forecasting extreme peak electricity demand and determining the optimal number of units to commit at minimal cost for each period, using the forecasts obtained from the developed models. The bounded variable mixed-integer linear programming (MILP) model solves the unit commitment (UC) problem. This is based on priority constraints where demand is first met from renewable energy sources, followed by energy from fossil fuels. Secondly, the thesis discusses the modelling and prediction of extremely high quantiles of solar power. The methods used are a semi-parametric extremal mixture (SPEM), generalised additive extreme value (GAEV) or quantile regression via asymmetric Laplace distribution (QR-ALD), additive quantile regression with covariate t (AQR-1), additive quantile regression with temperature variable (AQR-2) and penalised cubic regression smoothing spline (benchmark) models. The predictions from this study are valuable to power utility decision-makers and system operators in knowing the maximum possible solar power that can be generated. This helps them make high-risk decisions and develop regulatory frameworks requiring high security levels.
As far as we know, this is the first study to conduct a comparative analysis of the proposed robust models using South African solar irradiance data. The interaction between global horizontal irradiance (GHI) and temperature helps determine the maximum amount of solar power generated. As temperature increases, GHI first increases, then increases at a decreasing rate, and eventually decreases. Therefore, system operators need to know the temperature range in which the maximum possible solar power can be generated. The study used multivariate adaptive regression splines and extreme value theory to determine the maximum temperature to generate the maximum GHI, ceteris paribus. Lastly, the study discusses extremal dependence modelling of GHI with temperature and relative humidity (RH) using the conditional multivariate extreme value (CMEV) and copula models. Due to the nonlinearity and different structure of the dependence of GHI on temperature and RH, unlike previous literature, we use three Archimedean copula functions (Clayton, Frank and Gumbel) to model the dependence structure. This work was then extended by constructing a mixture copula model which combined the Frank and Gumbel models. One of the contributions of this thesis is the construction of additive quantile regression models for forecasting extreme quantiles of electrical load, which are then used in solving the UC problem with bounded MILP with priority constraints. The other contribution is developing a modelling framework that shows that GHI converges to its upper limit if temperature converges to the upper bound. Another contribution is constructing a mixture of some copulas for modelling the extremal dependence of GHI with temperature and RH. This thesis reveals the following key findings: (i) the additive quantile regression model is the best-fitting model for hours 18:00 and 19:00.
In contrast, the linear quantile regression model is the best-fitting model for hours 20:00 and 21:00. The UC problem results show that using all the generating units, such as hydroelectric, wind power, concentrated solar power and solar photovoltaic, is less costly. (ii) the AQR-2 was the best-fitting model and gave the most accurate predictions of quantiles at τ = 0.95, 0.97, 0.99 and 0.999, while at the 0.9999-quantile, the GAEV model had the most accurate predictions. (iii) the marginal increases of GHI converge to 0.12 W/m² when temperature converges to 44.26 °C, and the marginal increases of GHI converge to −0.1 W/m² when RH converges to 103.26%. Conditioning on GHI, the study found that the temperature and RH variables have a negative extremal dependence on large values of GHI. (iv) the dependence structure between GHI and the variables temperature and RH is asymmetric. Furthermore, the Frank copula is the best-fitting model for the variables temperature and RH, implying the presence of extreme co-movements. The modelling framework discussed in this thesis could be useful to decision-makers in power utilities, who must optimally integrate highly intermittent renewable energies on the grid. It could be helpful to system operators who face uncertainty in GHI power production due to extreme temperatures and RH, including maintaining the minimum cost by scheduling and dispatching electricity during peak hours when the grid is constrained due to peak load demand.
Item Open Access Multiscale Modelling of Foodborne Diseases (2024-09-06) Maphiri, Azwindini Delinah; Muzhinyi, K.; Garira, W.; Mathebula, D.
Infectious disease systems are essentially multiscale complex systems wherein pathogens multiply within hosts, spread across people, and infect entire populations of hosts. The description of most biological processes involves multiple, interconnected phenomena occurring on different spatial and temporal scales in the human body. Traditional approaches for modelling infectious disease systems rely on the principles and concepts of the transmission mechanism theory that considers transmission to be the primary cause of infectious disease spread at the macroscale.
Modellers of infectious diseases are increasingly using a multiscale modelling approach in response to this challenge. Multiscale models of infectious disease systems encompass intricate structures that revolve around the interplay of three distinct subsystems: the host, the pathogen, and the environmental subsystems. The replication-transmission relativity theory is a novel theory designed for the multiscale modelling of infectious disease systems, accounting for variations in time and space by incorporating pathogen replication that leads to transmission. Replication-transmission relativity theory consists of seven distinct levels of organization within an infectious disease system, each level including the within-host scale (microscale) and between-host scale (macroscale). Five separate classifications of multiscale models can be formulated that integrate the microscale and macroscale. A research gap remains in establishing a multiscale framework for understanding the mechanisms by which foodborne pathogens cause infections in human beings and animals, as very little has been done on modelling foodborne diseases. The primary goal of this study is to create multiscale models for foodborne diseases to examine whether a mutual influence exists between the microscale and macroscale, guided by the principles of replication-transmission relativity theory. The multiscale models are developed by considering three environmentally transmitted diseases at host level, caused by the pathogens norovirus, E. coli O157:H7 and Taenia solium. We start by developing a single-scale model of foodborne diseases caused by viruses in general, which is then extended to create a multiscale model for norovirus. We formulate non-standard finite difference schemes for the single-scale, norovirus, and E. coli O157:H7 models. For Taenia solium, we use ODE solvers in Python, specifically the odeint function in scipy.integrate.
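To illustrate the idea behind a non-standard finite difference (NSFD) scheme, here is a minimal sketch on a toy decay equation dx/dt = -lam * x. It does not reproduce the thesis's norovirus or E. coli O157:H7 models. The sketch replaces the step size h in the forward-Euler update with the denominator function phi(h) = (1 - exp(-lam * h)) / lam, a common Mickens-type choice; all parameter values and function names are assumptions chosen for illustration.

```python
import math

def euler_step(x, lam, h):
    """Standard forward-Euler update for dx/dt = -lam * x."""
    return x + h * (-lam * x)

def nsfd_step(x, lam, h):
    """NSFD update: same form as Euler, but h is replaced by phi(h)."""
    phi = (1.0 - math.exp(-lam * h)) / lam
    return x + phi * (-lam * x)

def simulate(step, x0, lam, h, n_steps):
    """Iterate a one-step scheme and return the whole trajectory."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(xs[-1], lam, h))
    return xs

# Deliberately large step (lam * h = 3) to stress both schemes
lam, h, x0, n_steps = 2.0, 1.5, 10.0, 5
euler_path = simulate(euler_step, x0, lam, h, n_steps)
nsfd_path = simulate(nsfd_step, x0, lam, h, n_steps)
exact_final = x0 * math.exp(-lam * h * n_steps)

print("Euler stays non-negative:", min(euler_path) >= 0)
print("NSFD stays positive:", all(x > 0 for x in nsfd_path))
```

With this step size, forward Euler oscillates into negative values, which is meaningless for a population or pathogen load, while the NSFD update stays positive and, for this particular equation, reproduces the exact solution at the grid points. Preserving positivity for any step size is a typical motivation for using NSFD schemes in epidemic models.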
The numerical findings from the study confirm the applicability of the replication-transmission relativity theory in cases where the reciprocal impact between the within-host scale and the between-host scale involves both infection/super-infection (for the effect of the between-host scale on the within-host scale) and pathogen excretion/shedding (for the effect of the within-host scale on the between-host scale). We expect that our study will help modellers integrate microscale and macroscale dynamics across various levels of organization within infectious disease systems.
Item Open Access Long term peak electricity demand forecasting in South Africa using quantile regression (2024-09-06) Maswanganyi, Norman; Sigauke, Caston; Ranganai, Edmore
It is widely accepted that South Africa needs to maximise sustainable electricity supply growth to meet the new and growing demand for higher economic growth rates, especially in energy-intensive sectors. To diversify the energy mix, the country also needs to take urgent action to ensure the sustainability of renewable energy and energy efficiency by 2030. Hence, it is important to provide a modelling framework for forecasting long-term peak electricity demand and quantifying the uncertainty of future electricity demand for better electricity security management. In order to estimate and capture changes in long-term peak electricity demand, the study employed quantile regression (QR) based models, including hybrid models, for assessing and managing electricity demand using South African data. The changes in long-term electricity demand depend on network location areas and the uncertainties within the energy sectors. Long-term peak electricity demand forecasting using QR models seems scarce in South Africa. The current study closes this gap by developing a modelling framework that can be used for future electricity demand forecasting.
Although many studies have been done on short-, medium- and long-term peak electricity demand forecasting, the extremal quantile regression (EQR) model for forecasting electricity demand (based on combined economic and weather conditions) has, as far as we know, not yet been explored. Accurately predicting extreme electricity demand distributions would significantly mitigate load shedding and overloading and allow energy-efficient storage. This thesis identifies weather-related and non-weather-related factors using the EQR approach to modelling and estimating the error of extremely low and high quantiles of peak electricity demand. Results from the thesis show that EQR provides a higher level of detail and can model the non-central behaviour of electricity demand better than the other models used in the study. The study has also shown how the additive quantile regression (AQR) model can provide the highest predictive ability and yield more accurate forecasts. Power system reliability requires a probabilistic characterisation of extreme peak loads, which result in severe system stress and cause grid problems. Accurate predictions of long-term electricity demand are very important, as such forecasts can be used to anticipate the timing and rate of occurrence of such extreme peak loads. The study used hybrid additive quantile regression coupled with autoregressive models and variable selection using Lasso for hierarchical interactions to examine the power system's reliability under random extreme peak loads.
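The quantile regression models discussed above are fitted by minimising the pinball (check) loss rather than squared error. The sketch below shows that loss and, on a small made-up sample, how the constant that minimises it moves from the median towards the upper tail as the quantile level tau increases. The data values and function names are hypothetical, and the sketch includes no covariates, unlike the QR, EQR and AQR models in the thesis.

```python
def pinball_loss(y, q, tau):
    """Check loss for one observation y and a predicted quantile q."""
    diff = y - q
    # Under-prediction is weighted by tau, over-prediction by (1 - tau)
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def best_constant_quantile(sample, tau):
    """Sample value that minimises the total pinball loss at level tau."""
    return min(sample, key=lambda q: sum(pinball_loss(y, q, tau) for y in sample))

# Hypothetical peak loads in arbitrary units, with one extreme value
peaks = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant_quantile(peaks, 0.5)
upper_fit = best_constant_quantile(peaks, 0.75)
print(median_fit, upper_fit)
```

Because the loss penalises under-prediction more heavily as tau grows, a model trained at a high tau is pulled towards the extreme peaks, which is the behaviour that matters when planning for rare demand spikes.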