Theses and Dissertations

Browse

Recent Submissions

Now showing 1 - 20 of 70
  • ItemEmbargo
    Comparative Analysis of Discrimination and Calibration Accuracy of Discrete Survival, Random Forests, and Neural Networks in Health-Related Survival Prediction Models
    (2025-09-05) Ramachela, Audrey Tshepho; Bere, Alphonce; Mulaudzi, Tshilidzi; Motsuku, Lactatia
Prediction models for survival analysis are commonly used in the biomedical sciences to understand the onset of certain diseases. Traditional statistical models have been employed for many years; however, their limitations, including an inability to handle large data sets, opened the way for machine learning methods, which gained recognition for their ability to learn complex patterns. However, existing literature indicates that the predictive accuracy of machine learning and statistical models for survival analysis varies significantly across data sets. This variability underscores the need for further research utilizing data sets with diverse characteristics. Such research is essential for developing generalizable insights into the conditions under which each method performs best. In this research project, we compared the predictive performance of a traditional statistical method and machine learning algorithms in discrete survival analysis. The machine learning methods include discrete-time survival trees, discrete-time random survival forests, and discrete-time neural networks. The study uses calibration (measured by prediction error curves) to assess model fit and discrimination (measured by the concordance index and area under the curve) to evaluate predictive accuracy. These methods were applied to three data sets: breast cancer, age at first alcohol intake, and CRASH-2. The discrete-time neural network had the best prediction performance of all the models for breast cancer survival. The discrete-time random forest with Hellinger distance had the best overall prediction performance on age at first alcohol intake. The discrete-time survival model outperformed the other models in predicting survival of bleeding trauma patients in the CRASH-2 data.
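The concordance index used above as a discrimination measure can be sketched in a few lines. The following is an illustrative pure-Python implementation of Harrell's C-index on made-up data, not the study's code; real analyses would use packages such as lifelines or scikit-survival.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs whose predicted risk ordering
    matches the observed survival ordering (ties count 0.5)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable if subject i's event precedes subject j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical example: higher risk score should mean earlier event.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]          # 0 = censored
scores = [0.9, 0.7, 0.4, 0.2]  # perfectly ordered predictions
print(concordance_index(times, events, scores))  # 1.0
```

A C-index of 0.5 corresponds to random predictions and 1.0 to perfect ranking, which is how the models above were compared.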
  • ItemEmbargo
A Multi-level Model for a Vector-Borne Organ to Tissue Life Cycle Dynamics
    (2025-09-05) Mahada, Awelani Sydney; Netshikweta, R.; Garira, W.
Introduction: Malaria is among the world's most lethal infectious diseases. It is caused by a parasitic pathogen transmitted by the Anopheles mosquito, which inoculates sporozoites into the human host during a blood meal. The population dynamics of malaria are well known for their complexity, stemming not only from the parasite's life cycle, which involves two hosts (humans and mosquitoes), but also from the intricate replication and transmission cycles across different levels of organization of the infectious disease system. Like other infectious disease systems, malaria infections are inherently multilevel and multiscale, which poses significant challenges to efforts aimed at eliminating and ultimately eradicating the infection in a malaria-endemic population. Methodology: Mathematical modelling has proven to be an invaluable tool for understanding and predicting the behaviour and dynamics of complex systems within the domain of complexity science. Thus, in this study, we propose a multiscale modelling framework that captures the dynamics of malaria across three organizational levels of the infectious disease systems implicated in the spread of malaria in a community. We begin by formulating a mathematical model to describe the development and progression of malaria parasites within the liver and tissue (blood) stages of an infected human host. This is followed by the formulation of a multiscale model that integrates both the inside-host (i.e., organ-tissue level) and outside-host (i.e., host level) malaria dynamics. Results: Mathematical analysis of both malaria models presented in this study was carried out and proved that the models are mathematically and epidemiologically well posed.
We also compute the basic reproduction number R0 for both models and use it to determine the local and global stability of the disease-free equilibrium, as well as the local stability of the endemic equilibrium, of both models. We demonstrate that if R0 < 1, the disease-free equilibrium point of both models is locally and globally asymptotically stable. However, if R0 > 1, the endemic equilibrium point of both models is locally asymptotically stable. The numerical results for both models demonstrate that the goal of intervention during malaria infection should be to reduce the rates at which merozoites and gametocytes invade healthy liver tissue and blood cells. Hence, it is recommended that interventions during malaria infection be directed at reducing the pace at which merozoites infect healthy blood cells and the density of merozoites in circulation. Conclusion: The study presents a method that incorporates the complexity of malaria pathogens, which is significant not only for malaria treatment but also for control and treatment strategies for other vector-borne disease systems.
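The R0 threshold behaviour described above can be illustrated with the standard next-generation-matrix computation on a minimal host-vector (Ross-Macdonald-type) model. This is an illustrative sketch with hypothetical parameter values, not the models analysed in the thesis.

```python
import math

def r0_host_vector(beta_hv, beta_vh, gamma, mu):
    """R0 for a minimal host-vector model via the next-generation matrix.
    beta_hv: vector-to-host transmission rate, beta_vh: host-to-vector rate,
    gamma: host recovery rate, mu: vector death rate.
    F = [[0, beta_hv], [beta_vh, 0]], V = diag(gamma, mu);
    the spectral radius of K = F V^-1 is sqrt of the product of the
    off-diagonal entries of K."""
    return math.sqrt((beta_hv / mu) * (beta_vh / gamma))

# hypothetical rates: each transmission route contributes a factor of 2
r0 = r0_host_vector(beta_hv=0.3, beta_vh=0.2, gamma=0.1, mu=0.15)
print(r0)                                       # 2.0
print("endemic" if r0 > 1 else "fading out")    # disease persists
```

The printed threshold check mirrors the stability result stated in the abstract: the disease-free equilibrium is stable when R0 < 1 and the endemic equilibrium takes over when R0 > 1.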
  • ItemEmbargo
    Predictive modelling of student progression at the University of Venda using statistical and machine learning techniques
    (2025-09-05) Muthundinne, Phindulo Pretty; Bere, Alphonce; Mulaudzi, Tshilidzi B.
One of the challenges facing higher education is the steadily rising number of university dropouts. Over the years, survival analysis has been used to address the issue of student dropout. In developed countries, machine learning methods have gained increasing attention for solving this problem. The main motivation for this study is the lack of applications of both discrete-time statistical and discrete-time machine learning methods when analysing student academic outcomes. This study built both a discrete-time competing-risks model and discrete-time machine learning models for the time from registration until graduation or dropout for students at the University of Venda. The two approaches were compared (in terms of calibration and discrimination) to determine which works best. The proposed methodology implemented statistical methods (discrete-time survival models for single risk and competing risks) and machine learning models (classification trees for competing risks) using the R statistical software. For the competing-risks models, we considered time intervals 3 to 6, since graduation first becomes possible in the third year. This study used comparison measures such as the Brier score and C-index to evaluate the models. Results show that the discrete cause-specific model and the decision tree for competing risks both showed high discrimination ability for student progression. However, the decision tree model appeared to be the better of the two, since its C-index is higher. The results showed that male students are more likely to drop out and less likely to graduate, while female students are more likely to graduate. Students with an average mark of 70+ have 48.2% higher odds of graduating compared to those with an average below 50.
Students in the Faculty of Human and Social Sciences are less likely to drop out than those in the Faculty of Science, Engineering and Agriculture. However, HSS students do not differ significantly from FSEA students in graduation odds (SE = 0.073, OR = 0.904, 95% CI (0.784; 1.042), p-value = 0.165). The Faculty of Commerce, Management, and Law (FMCL) does not differ significantly from FSEA in either dropout (p-value = 0.766) or graduation (p-value = 0.072). This study found that older students are more likely to drop out than younger ones. This study suggests that a decision tree model is more efficient than standard approaches for analysing student dropout and academic results and recommends that it therefore be used for analysing academic outcomes. Interventions for reducing dropout rates and shortening the time from first registration to graduation should target the identified high-risk groups, such as male and older students.
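An odds ratio such as the "48.2% higher odds of graduating" reported above compares the odds of an outcome between two groups. A minimal sketch with hypothetical counts (not the study's data):

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table:
    group 1 has a events and b non-events,
    group 2 has c events and d non-events."""
    return (a / b) / (c / d)

# Hypothetical: of students averaging 70+, 60 graduated and 40 did not;
# of students averaging below 50, 50 graduated and 50 did not.
or_ = odds_ratio(60, 40, 50, 50)
print(or_)                 # 1.5
print((or_ - 1) * 100)     # 50.0 -> "50% higher odds" for the 70+ group
```

The study's OR of 1.482 for the 70+ group reads the same way: the odds of graduating are 1.482 times those of the below-50 group.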
  • ItemOpen Access
    Comparison of Some Statistical and Machine Learning Models for Continuous Survival Analysis
    (2024-09-06) Ndou, Sedzani Emanuel; Mulaudzi, T. B.; Bere, A.
While statistical models have traditionally been used for survival analysis, there is growing interest in exploring the potential of machine learning techniques. Existing literature shows varying results on their relative performance, depending on the dataset employed. This study conducted a comparative evaluation of the predictive accuracy of statistical and machine learning models for continuous survival analysis using two distinct datasets: time to first alcohol intake and North Carolina recidivism data. LassoCV was used to select variables for both datasets by shrinking coefficient estimates. Kaplan-Meier survival curves, alongside the log-rank test, were used to compare the survival distributions among groups of variables incorporated in the models. The methods compared include Cox proportional hazards, Lasso-regularized Cox, survival trees, random survival forests, and neural networks. Model performance was evaluated using the integrated Brier score (IBS), area under the curve, and concordance index. Our findings show consistent dominance of the neural network (NN) and random survival forest (RSF) models across multiple metrics for both datasets. Specifically, the neural network demonstrates remarkable performance, closely followed by RSF; the CoxPH and CoxLasso models perform slightly lower, and the survival tree (ST) consistently lags behind. This study contributes to advancing knowledge and provides practical guidance for survival modelling of recidivism and alcohol intake.
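The Kaplan-Meier estimator used above for comparing survival distributions is the product S(t) = ∏ (1 − d_i/n_i) over event times t_i ≤ t, with d_i events and n_i subjects at risk. A pure-Python sketch on toy data, not the study's code:

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.
    events: 1 = event observed, 0 = censored at that time."""
    s, curve = 1.0, []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n = sum(1 for ti in times if ti >= t)  # number still at risk at t
        if d > 0:
            s *= 1 - d / n
            curve.append((t, s))
    return curve

times  = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 1]   # one subject censored at t = 2
print(kaplan_meier(times, events))
```

Plotting such curves per covariate group, and testing the difference with the log-rank test, is the comparison the abstract describes.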
  • ItemOpen Access
    Probabilistic renewable energy modelling in South Africa
    (2024-05-05) Ravele, Thakhani; Sigauke, Caston; Jhamba, Lodwell
    The variability of solar power creates problems in planning and managing power system operations. It is critical to forecast accurately in order to maintain the safety and stability of large-scale integration of solar power into the grid. Accurate forecasting is vital because it prevents transmission obstruction and maintains a power equilibrium. This thesis uses robust models to solve this problem by addressing four main issues. The first issue involves the construction of quantile regression models for forecasting extreme peak electricity demand and determining the optimal number of units to commit at minimal costs for each period using the forecasts obtained from the developed models. The bounded variable mixed-integer linear programming (MILP) model solves the unit commitment (UC) problem. This is based on priority constraints where demand is first met from renewable energy sources followed by energy from fossil fuels. Secondly, the thesis discusses the modelling and prediction of extremely high quantiles of solar power. The methods used are a semi-parametric extremal mixture (SPEM), generalised additive extreme value (GAEV) or quantile regression via asymmetric Laplace distribution (QR-ALD), additive quantile regression with covariate t (AQR-1), additive quantile regression with temperature variable (AQR-2) and penalised cubic regression smoothing spline (benchmark) models. The predictions from this study are valuable to power utility decision-makers and system operators in knowing the maximum possible solar power which can be generated. This helps them make high-risk decisions and regulatory frameworks requiring high-security levels. As far as we know, this is the first application to conduct a comparative analysis of the proposed robust models using South African solar irradiance data. The interaction between global horizontal irradiance (GHI) and temperature helps determine the maximum amount of solar power generated. 
As temperature increases, GHI increases up to a point, beyond which it increases at a decreasing rate and then decreases. Therefore, system operators need to know the temperature range in which the maximum possible solar power can be generated. The study used multivariate adaptive regression splines and extreme value theory to determine the maximum temperature that generates the maximum GHI, ceteris paribus. Lastly, the study discusses extremal dependence modelling of GHI with temperature and relative humidity (RH) using the conditional multivariate extreme value (CMEV) and copula models. Due to the nonlinearity and differing structure of the dependence of GHI on temperature and RH, unlike previous literature, we use three Archimedean copula functions, Clayton, Frank and Gumbel, to model the dependence structure. This work was then extended by constructing a mixture copula model which combined the Frank and Gumbel models. One of the contributions of this thesis is the construction of additive quantile regression models for forecasting extreme quantiles of electrical load, which are then used in solving the UC problem with bounded MILP with priority constraints. Another contribution is the development of a modelling framework showing that GHI converges to its upper limit as temperature converges to its upper bound. A further contribution is the construction of a mixture of copulas for modelling the extremal dependence of GHI with temperature and RH. This thesis reveals the following key findings: (i) the additive quantile regression model is the best-fitting model for hours 18:00 and 19:00, while the linear quantile regression model is the best-fitting model for hours 20:00 and 21:00. The UC problem results show that using all the generating units, such as hydroelectric, wind power, concentrated solar power and solar photovoltaic, is less costly.
(ii) the AQR-2 model was the best-fitting model and gave the most accurate predictions of quantiles at τ = 0.95, 0.97, 0.99 and 0.999, while at the 0.9999-quantile the GAEV model had the most accurate predictions. (iii) the marginal increases of GHI converge to 0.12 W/m² when temperature converges to 44.26 °C, and the marginal increases of GHI converge to −0.1 W/m² when RH converges to 103.26%. Conditioning on GHI, the study found that the temperature and RH variables have a negative extremal dependence on large values of GHI. (iv) the dependence structure between GHI and the temperature and RH variables is asymmetric. Furthermore, the Frank copula is the best-fitting model for the temperature and RH variables, implying the presence of extreme co-movements. The modelling framework discussed in this thesis could be useful to decision-makers in power utilities, who must optimally integrate highly intermittent renewable energies into the grid. It could be helpful to system operators who face uncertainty in GHI power production due to extreme temperatures and RH, including maintaining minimum cost by scheduling and dispatching electricity during peak hours when the grid is constrained by peak load demand.
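For two of the Archimedean families used above, the copula parameter θ has a closed-form relation to Kendall's tau, which is a common way to fit them. An illustrative sketch (Frank's relation involves a Debye-function integral and is omitted):

```python
def clayton_theta(tau):
    """Clayton copula: tau = theta / (theta + 2), inverted for theta."""
    return 2 * tau / (1 - tau)

def gumbel_theta(tau):
    """Gumbel copula: tau = 1 - 1/theta, inverted for theta."""
    return 1 / (1 - tau)

# e.g. a rank correlation of tau = 0.5 between GHI and temperature
print(clayton_theta(0.5))  # 2.0
print(gumbel_theta(0.5))   # 2.0
```

The families differ in where they concentrate dependence (Clayton in the lower tail, Gumbel in the upper), which is why comparing them, and mixing Frank with Gumbel as the thesis does, matters for extremal dependence.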
  • ItemOpen Access
    Multiscale Modelling of Foodborne Diseases
    (2024-09-06) Maphiri, Azwindini Delinah; Muzhinyi, K.; Garira, W.; Mathebula, D.
Infectious disease systems are essentially multiscale complex systems wherein pathogens multiply within hosts, spread between people, and infect entire host populations. The description of most biological processes involves multiple, interconnected phenomena occurring on different spatial and temporal scales in the human body. Traditional approaches to modelling infectious disease systems rely on the principles and concepts of the transmission mechanism theory, which considers transmission to be the primary cause of infectious disease spread at the macroscale. Modellers of infectious diseases are increasingly using multiscale modelling approaches in response to this challenge. Multiscale models of infectious disease systems encompass intricate structures that revolve around the interplay of three distinct sub-systems: the host, the pathogen, and the environmental sub-systems. The replication-transmission relativity theory is a novel theory designed for the multiscale modelling of infectious disease systems, accounting for variations in time and space by incorporating the pathogen replication that leads to transmission. Replication-transmission relativity theory distinguishes seven levels of organization within an infectious disease system, each level comprising a within-host scale (microscale) and a between-host scale (macroscale). Five separate classes of multiscale models can be formulated that integrate the microscale and macroscale. A research gap exists in establishing a multiscale framework for understanding the mechanisms by which foodborne pathogens cause infections in humans and animals, as very little has been done in the modelling of foodborne disease. The primary goal of this study is to create multiscale models for foodborne diseases to examine whether a mutual influence exists between the microscale and macroscale, guided by the principles of replication-transmission relativity theory.
The multiscale models are developed by considering three environmentally transmitted diseases at host level caused by the pathogens norovirus, E. coli O157:H7 and Taenia solium. We start by developing a single-scale model of foodborne diseases caused by viruses in general, which is then extended to create a multiscale model for norovirus. We formulate a non-standard finite difference scheme for the single-scale model and for the norovirus and E. coli O157:H7 models. For Taenia solium, we use ODE solvers in Python, specifically the odeint function in scipy.integrate. The numerical findings from the study confirm the applicability of the replication-transmission relativity theory in cases where the reciprocal influence between the within-host scale and the between-host scale involves both infection/super-infection (the effect of the between-host scale on the within-host scale) and pathogen excretion/shedding (the effect of the within-host scale on the between-host scale). We expect that our study will help modellers integrate microscale and macroscale dynamics across the various levels of organization of infectious disease systems.
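The appeal of non-standard finite difference (NSFD) schemes like those mentioned above is that they can preserve qualitative properties such as positivity regardless of step size. A sketch on the scalar test equation du/dt = −λu (illustrative only; the thesis schemes are for full epidemic models):

```python
def nsfd_decay(u0, lam, h, steps):
    """NSFD scheme u_{n+1} = u_n / (1 + lam*h): positive for any h > 0."""
    u = u0
    for _ in range(steps):
        u = u / (1 + lam * h)
    return u

def euler_decay(u0, lam, h, steps):
    """Explicit Euler u_{n+1} = u_n * (1 - lam*h): fails when lam*h > 1."""
    u = u0
    for _ in range(steps):
        u = u * (1 - lam * h)
    return u

# with lam*h = 2 the explicit scheme oscillates, the NSFD scheme does not
print(nsfd_decay(1.0, lam=2.0, h=1.0, steps=3))   # small and positive
print(euler_decay(1.0, lam=2.0, h=1.0, steps=3))  # -1.0, nonphysical
```

Since compartmental populations cannot go negative, this positivity property is exactly why NSFD schemes suit epidemic models.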
  • ItemOpen Access
Long-term peak electricity demand forecasting in South Africa using quantile regression
(2024-09-06) Maswanganyi, Norman; Sigauke, Caston; Ranganai, Edmore
It is widely accepted that South Africa needs to maximise sustainable electricity supply growth to meet the new and growing demand for higher economic growth rates, especially in energy-intensive sectors. To diversify the energy mix, the country also needs to take urgent action to ensure the sustainability of renewable energy and energy efficiency by 2030. Hence, it is important to provide a modelling framework for forecasting long-term peak electricity demand and quantifying the uncertainty of future electricity demand for better electricity security management. To estimate and capture changes in long-term peak electricity demand, the study employed quantile regression (QR) based models, including hybrid models, for assessing and managing electricity demand using South African data. The changes in long-term electricity demand depend on network location areas and the uncertainties within the energy sectors. Long-term peak electricity demand forecasting using QR models appears scarce in South Africa. The current study closes this gap by developing a modelling framework that can be used for future electricity demand forecasting. Although many studies have been done on short-, medium- and long-term peak electricity demand forecasting, an investigation of the extremal quantile regression (EQR) model for forecasting electricity demand (based on combined economic and weather conditions) has, as far as we know, not yet been explored. Accurately predicting extreme electricity demand distributions would significantly mitigate load shedding and overloading and allow energy-efficient storage. This thesis identifies weather-related and non-weather-related factors using the EQR approach to model and estimate the error of extremely low and high quantiles of peak electricity demand. Results from the thesis show that EQR provides a higher level of detail and can model the non-central behaviour of electricity demand better than the other models used in the study.
The study has shown how the additive quantile regression (AQR) model can provide the highest predictive ability and superior accuracy of the forecast results. Power system reliability requires a probabilistic characterisation of extreme peak loads, which cause severe system stress and grid problems. Accurate predictions of long-term electricity demand are very important, as such forecasts can be used to anticipate the timing and rate of occurrence of such extreme peak loads. The study used hybrid additive quantile regression coupled with autoregressive models, with variable selection via Lasso for hierarchical interactions, to examine the power system's reliability under random extreme peak loads.
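Quantile regression models like the QR, EQR and AQR models above are fitted by minimising the pinball (quantile) loss, which penalises over- and under-prediction asymmetrically. A pure-Python sketch with hypothetical demand values:

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # under-prediction costs tau per unit, over-prediction (1 - tau)
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)

# at tau = 0.95, under-forecasting demand by 10 units costs far more
# than over-forecasting by 10 units
print(pinball_loss([100.0], [90.0], tau=0.95))   # under-prediction
print(pinball_loss([100.0], [110.0], tau=0.95))  # over-prediction
```

This asymmetry is what lets high-quantile forecasts characterise extreme peak loads rather than average demand.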
  • ItemOpen Access
    Assessing models for de-identification of Electronic Discharge Summary Using Machine Learning tools
    (2024-09-06) Mudau, Tshilisanani; Garira, Winston; Netshikweta, Rendani
Background: De-identification is a technique that eliminates identifying information from clinical records in order to protect individual privacy. This procedure decreases the chance that personal information which is collected, processed, distributed, and published can be used to identify the person. The inclusion of machine learning techniques in the de-identification process brought substantial improvement over previous methods. Research Problem: The Electronic Discharge Summary (EDS) has evolved into a significantly improved way of providing discharge summaries; however, this information contains Protected Health Information (PHI), which poses a risk to patients' privacy. This makes de-identification mandatory. Several machine learning approaches to de-identification have recently been proposed. This study focuses on applying machine learning techniques to determine which model can best de-identify a data set. Methods: An open-source data set from Harvard Medical School was used. This data set contains 899 Electronic Health Records (EHR), 669 for training and 220 for testing. The Conditional Random Fields (CRF), Long Short-Term Memory (LSTM) and Random Forest models were used, and the performance of each model was assessed. Findings: To assess each model's performance, evaluation metrics comparing F-measure, recall and precision at token level were used to determine which machine learning model performed best. The Long Short-Term Memory model was found to outperform both Conditional Random Fields and Random Forest, with micro-average F-measure, recall and precision of 99%, and macro-average F-measure of 77%, recall of 73% and precision of 90%.
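The gap between the micro- and macro-averaged scores above arises because micro-averaging pools token counts across PHI classes while macro-averaging weights every class equally. A pure-Python sketch on made-up per-class counts, not the study's results:

```python
def f1(tp, fp, fn):
    """F-measure from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_class):
    """per_class: list of (tp, fp, fn) tuples, one per PHI class."""
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    return f1(tp, fp, fn), macro

# a frequent PHI class predicted well, a rare class predicted poorly
micro, macro = micro_macro_f1([(90, 5, 5), (1, 3, 3)])
print(round(micro, 3), round(macro, 3))
```

As in the study, the micro average is dominated by the frequent class and comes out much higher than the macro average, which the rare class drags down.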
  • ItemOpen Access
    Performance Evaluation of Deep Learning Models on Brain Tumor MRI Classification and Explainability
    (2024-09-06) Nhlapo, Wandile Juddy; Ndogmo, Jean-Claude; Atemkeng, Marcellin
Deep learning models often act as black boxes, making it difficult to understand their decision-making process. To understand how these models make decisions, this paper proposes a framework involving two phases. The first phase evaluates the performance of ten deep transfer learning models—ViT Transformer, EfficientNetB0, DenseNet121, Xception, GoogleNet, Inception V3, VGG16, VGG19, ResNet50, and AlexNet—for classifying brain tumors using MRI scans. The models are assessed on metrics including accuracy, F1 score, recall, and precision, with EfficientNetB0 outperforming the other models with 98% accuracy and balanced precision and recall, resulting in an F1 score of 98%. In the second phase, we use interpretability techniques such as Grad-CAM, Grad-CAM++, Integrated Gradients, and saliency mapping to investigate what these models learn within MRI images to make classification decisions. The results show that both Grad-CAM and Grad-CAM++ effectively identify the locations of tumors in the MRI images. This result enhances our understanding of the specific image regions from which transfer learning models extract features to make classification decisions.
  • ItemOpen Access
    Comparative analysis of Machine Learning Algorithms for Estimating Global Solar Radiation at Selected Weather Stations in Vhembe District Municipality
    (2023-10-05) Marandela, Mulalo Veronica; Mulaudzi, T. S.; Maluta, N. E.
Estimating and assessing the energy falling on a particular area is essential for installers of renewable technologies. Different equations have been applied as the most reliable empirical models for estimating global solar radiation (GSR) under different climatic conditions. The main objective of this work is to estimate the global solar radiation at two stations, namely Mutale and Messina, in the Vhembe District, Limpopo Province, South Africa. Four different methods (random forest (RF) regression, K-nearest neighbours (K-NN), support vector machines (SVM) and the extreme gradient boosting mechanism (XGBoost)) are used to estimate the GSR in this study. The RF model on the Mutale station was found to be the best-fitting model with R² = 0.9902, MSE = 0.4085 and RMSE = 0.6391, followed by XGBoost with R² = 0.9898, MSE = 0.4245 and RMSE = 0.6515. RF was also found to be the best for the Messina station with R² = 0.9636, MSE = 1.4138 and RMSE = 1.1890, followed by the XGBoost model with R² = 0.9595, MSE = 1.5723 and RMSE = 1.2539. From the results, it can be concluded that RF is the better model for estimating GSR at different stations.
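The goodness-of-fit metrics reported above can be sketched in pure Python on hypothetical values; note that RMSE = √MSE, a useful consistency check on reported tables (e.g. 1.1890² ≈ 1.4138 for the Messina RF model).

```python
import math

def metrics(y_true, y_pred):
    """Return (R^2, MSE, RMSE) for paired observations and predictions."""
    n = len(y_true)
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((a - mean) ** 2 for a in y_true)  # total sum of squares
    r2 = 1 - mse * n / ss_tot
    return r2, mse, math.sqrt(mse)

# hypothetical GSR observations vs model predictions
r2, mse, rmse = metrics([3.0, 5.0, 7.0, 9.0], [2.8, 5.1, 7.2, 8.9])
print(r2, mse, rmse)
```

An R² near 1 with a small RMSE (in the units of the target) is what marks RF as the best-fitting model in the abstract.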
  • ItemOpen Access
    Credit Card Fraud Detection using Boosted Random Forest Algorithm
    (2023-10-05) Mashamba, Thanganedzo Beverly; Chagwiza, W.; Garira,W.
Financial fraud is a growing concern with far-reaching consequences for financial institutions, government, and corporate organizations, leading to substantial monetary losses. The primary cause of financial loss is credit card fraud; it affects both issuers and clients, and it is a significant threat to the business, as clients will move to competitors with whom they feel secure. Solving fraud problems is beyond human capability alone, so financial institutions can utilize machine learning algorithms to detect fraudulent behaviour by learning from credit card transactions. This thesis develops a boosted random forest, integrating an adaptive boosting algorithm into a random forest algorithm, so that the model's performance in predicting fraudulent credit card transactions is improved. The confusion matrix is used to evaluate the performance of the models, wherein random forest, adaptive boosting and the boosted random forest were compared. The results indicated that the boosted random forest outperformed the individual models with an accuracy of 99.9%, which corresponded with the results from its confusion matrix. However, random forest and adaptive boosting reported accuracies of 100% and 99% respectively, which did not correspond to the results of their confusion matrices, meaning the individual models' accuracy figures are less reliable. Thus, by implementing the proposed approach in a credit card management system, financial loss can be reduced to a great extent.
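Headline accuracy can disagree with the confusion matrix on highly imbalanced fraud data, which is why the comparison above relies on the confusion matrix itself. A sketch with hypothetical counts, not the thesis results:

```python
def accuracy(tn, fp, fn, tp):
    """Overall fraction of correct predictions from confusion-matrix cells."""
    return (tp + tn) / (tn + fp + fn + tp)

def recall(fn, tp):
    """Fraction of actual fraud cases the model catches."""
    return tp / (tp + fn) if tp + fn else 0.0

# 100,000 transactions of which only 100 are fraudulent;
# the model flags 60 transactions and catches just 40 frauds
tn, fp, fn, tp = 99880, 20, 60, 40
print(accuracy(tn, fp, fn, tp))  # 0.9992 -- looks excellent
print(recall(fn, tp))            # 0.4    -- but most fraud is missed
```

A model can thus report near-perfect accuracy while its confusion matrix reveals that the minority fraud class is poorly detected.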
  • ItemOpen Access
    Solar power forecasting using Gaussian process regression
    (2023-10-05) Chandiwana, Edina; Sigauke, Caston; Bere, Alphonce
Solar power forecasting has become an important aspect affecting crucial day-to-day activities in people's lives. Many African countries are now facing blackouts due to a shortage of energy. This has increased the urge to encourage people to use other energy sources, resulting in different energy inputs into the main electricity grid. When the number of power sources being fed into the main grid increases, so does the need for efficient methods of forecasting these inputs. Thus, there is a need to develop efficient prediction techniques in order to facilitate proper grid management. The main goal of this thesis is to explore how Gaussian process predicting frameworks can be developed and used to predict global horizontal irradiance. Data on global horizontal irradiance and some weather variables collected from various meteorological stations were made available through SAURAN (Southern African Universities Radiometric Network). The length of the datasets ranged from 496 to 17325 data points. We proposed using Gaussian process regression (GPR) to predict solar power generation. In South Africa, studies based on GPR for forecasting solar power are still very few, and more needs to be done in this area. At first, we explored covariance function selection, and a GPR model was developed using core vector regression (CVR). The predictions produced through this method were more accurate than those of the benchmark models used: gradient boosting regression (GBR) and support vector regression. Then we explored interval estimation: quantile regression and GPR were coupled in order to develop the modelling framework. This was also done to improve the accuracy of the GPR models. The results proved that the model performed better than Bayesian structural time series regression. We also explored spatial dependence; spatio-temporal regression was incorporated into the modelling framework coupled with GPR.
This was done to incorporate various weather stations' conditions into the modelling process. The spatial analysis results proved that GPR coupled with spatial analysis produced results superior to the autoregressive spatial analysis and the benchmark model used, linear spatial analysis. The GPR results had accuracy measures superior to those of the benchmark models. Various other tools were used to improve the accuracy of the GPR results, including forecast combination and standardisation of predictions. The superior results indicate a vast economic benefit because they allow those who manage the power grid to do so effectively and efficiently. Effective power grid management reduces power blackouts, thus benefitting the nation economically and socially.
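The core of the GPR framework above is the posterior mean k*ᵀ(K + σ²I)⁻¹y under a covariance function such as the RBF kernel. A minimal pure-Python sketch with two training points, so the 2×2 kernel matrix can be inverted in closed form; illustrative of the method only, not the thesis models (which also couple GPR with quantile and spatial regression).

```python
import math

def rbf(x1, x2, ell=1.0):
    """Squared-exponential (RBF) covariance with length-scale ell."""
    return math.exp(-0.5 * ((x1 - x2) / ell) ** 2)

def gp_mean(x_train, y_train, x_star, noise=1e-6):
    """GP posterior mean at x_star for exactly two training points."""
    (x1, x2), (y1, y2) = x_train, y_train
    # K + noise*I, inverted analytically for the 2x2 case
    a = rbf(x1, x1) + noise
    b = rbf(x1, x2)
    d = rbf(x2, x2) + noise
    det = a * d - b * b
    # alpha = (K + noise*I)^-1 y
    a1 = (d * y1 - b * y2) / det
    a2 = (-b * y1 + a * y2) / det
    # posterior mean = k_star . alpha
    return rbf(x_star, x1) * a1 + rbf(x_star, x2) * a2

# with tiny noise, the posterior mean nearly interpolates the training data
print(gp_mean([0.0, 2.0], [1.0, 3.0], x_star=0.0))
```

Covariance function selection, which the thesis explores first, amounts to replacing `rbf` with other kernels and tuning their hyperparameters.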
  • ItemOpen Access
    Commodity Futures Market Prices: Decomposition Approach
    (2023-10-05) Antwi, Emmanuel
    Financial investment in commodity markets has attracted many investigations due to its importance to the global economy and worldwide trade as a whole. Radical changes in commodity market prices, especially for agricultural, energy, and industrial metal products, have significant consequences for consumers and producers. It is crucial to accurately estimate and predict volatility in commodity futures market prices, since continuous price fluctuations have dire consequences for investors, portfolio managers, dealers, and policymakers who must take prudent and sustainable decisions. Determining and forecasting the components of commodity prices are challenging due to the remarkable volatility, uncertainty, and complexity of the futures market; as a result, commodity futures price series are nonlinear and nonstationary. Various models reported in the literature attempt to capture the persistent changes in commodity futures price series, but they fail to account for the inherent complexity of these series. This study uses decomposition techniques, combined with back-propagation neural network (BPNN) and autoregressive integrated moving average (ARIMA) models, to address these difficulties. Specifically, it applies the decomposition methods Empirical Mode Decomposition (EMD) and Variational Mode Decomposition (VMD) to the daily real price series of three commodity futures: corn from agricultural products, crude oil from energy, and gold from industrial metals, using data from 4th May 2016 to 30th April 2021. In the first part of the study, we explored the descriptive and statistical properties of the data and found that the three commodity futures price series were nonstationary and nonlinear.
Subsequently, we performed an EMD-Granger causality test to establish the spillover effects among the three commodity markets. It revealed a strong mutual relationship among the three price series, implying that the price movement of one market can be used to explain the price fluctuations of the others. In the second part, the EMD and VMD methods were applied to decompose the daily data of each commodity price, across different periods and frequencies, into their respective intrinsic mode functions (IMFs). First, we used the Hierarchical Clustering Method and the Euclidean Distance Approach to classify the IMFs, residue, and modes into high-frequency, low-frequency, and trend components. Next, applying statistical measures, particularly the Pearson product-moment correlation coefficient, Kendall rank correlation, and Spearman rank correlation coefficient, we observed that the trend and low-frequency parts of the market prices are the main drivers of commodity futures price fluctuations, and that the low-frequency components were caused by special events. In essence, commodity futures prices are driven by economic development rather than by short-lived market variations caused by ordinary supply-demand disequilibrium. The third part compared the EMD- and VMD-based models using three forecasting performance evaluation criteria: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). Since MAE, RMSE, and MAPE have some shortcomings, we also introduced the Diebold-Mariano (DM) test to select the optimal model for each commodity. The combined models outperformed the individual BPNN and ARIMA models in forecasting the futures price series of corn and crude oil.
At the same time, BPNN emerged as the optimal model for predicting the gold futures price series. In addition, VMD emerged as the ideal data pre-treatment method and contributed to enhancing the predictive ability of the BPNN and ARIMA models. The empirical results showed that models combined with decomposition methods predict commodity futures prices accurately and can easily capture their volatility. By utilizing decomposition-based models to study commodity market prices, the study fills the following gaps in the existing literature. First, the pre-treatment effects of EMD and VMD are compared directly in decomposing commodity price series, and studying the underlying components that cause the above-mentioned price fluctuations is a novel approach. Second, using the Hierarchical Clustering and Euclidean Distance approaches, the IMFs, residue, and modes were classified into their distinctive frequency units, namely high-frequency, low-frequency, and trend; analysing the effect of these components on commodity price fluctuations is the first of its kind in the literature. Furthermore, applying statistical measures such as the Pearson product-moment correlation coefficient, Kendall rank correlation, and Spearman rank correlation coefficient to evaluate the contribution of the IMFs, residue, and modes to the net variance of crude oil, corn, and gold price fluctuations is an innovative approach to studying financial time series. The EMD-causality technique proposed to study the causal relationships among corn, crude oil, and gold futures price movements is also novel in the financial-market literature; this new approach provides vital information from one commodity market that can explain price fluctuations in the others.
Moreover, decomposing financial data before forecasting yields high precision in commodity futures price prediction, and using decomposition techniques in agricultural, energy, and industrial metal futures markets effectively reduces prediction complexity. Econometric and machine learning models incorporating decomposition methods can capture the information in the price series to an acceptable degree. Finally, decomposition-based forecasting techniques can effectively raise the predictive capability of BPNN and ARIMA models and reduce errors; thus, the proposed combination method can statistically improve forecast accuracy. This study may therefore assist in capturing trends in the agricultural, energy, and industrial commodity markets and in estimating volatility risk factors accurately, serving as a guide for investors, government policymakers, and related sectors such as agriculture, energy, and the metal industry in taking prudent and sustainable planning and investment decisions. The suggested decomposition strategy, particularly the VMD-based one, is robust for analysing the determinants of, modelling, and forecasting commodity futures price fluctuations, thereby improving forecasting precision. Remarkably, when the decomposition approach is used to estimate the components of a commodity price series separately, different forecasting strategies can be explored: based on the features of the decomposed IMFs or modes, a suitable technique can be chosen for each one. For example, the residue can be estimated with a polynomial function, while the Fourier transform can be considered for predicting low-frequency IMFs or modes. It is therefore recommended that researchers, institutions, investors, and policymakers interested in commodity price movements consider this technique to achieve better results.
It is further suggested that the decomposition approach be applied in other fields of study to demonstrate its generality. Further work could extend the proposed methodology by considering other decomposition techniques besides EMD and VMD and evaluating their robustness in studying financial markets, since the EMD approach suffers from mode mixing and endpoint effects. Finally, we propose investigating a new or consolidated forecasting technique that caters for the influence of special events on commodity market prices, since no one can predict when and where they will occur.
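The decompose-forecast-recombine pipeline behind the EMD/VMD hybrid models above can be illustrated schematically. In this sketch a simple moving-average split stands in for EMD/VMD and a one-step AR(1) fit stands in for ARIMA/BPNN, so every function and parameter here is an illustrative assumption rather than the thesis's actual method.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(300)
# Synthetic stand-in for a commodity futures price: trend + cycle + noise.
price = 50 + 0.05 * t + 5 * np.sin(t / 10) + rng.normal(0, 1, 300)

def decompose(series, window=21):
    """Crude two-component split standing in for EMD/VMD:
    a smooth trend (low frequency) and the residual (high frequency)."""
    pad = np.pad(series, (window // 2, window // 2), mode="edge")
    trend = np.convolve(pad, np.ones(window) / window, mode="valid")
    return trend, series - trend

def ar1_forecast(series):
    """One-step AR(1) forecast, a minimal stand-in for ARIMA/BPNN."""
    x, y = series[:-1], series[1:]
    phi = np.dot(x - x.mean(), y - y.mean()) / np.dot(x - x.mean(), x - x.mean())
    c = y.mean() - phi * x.mean()
    return c + phi * series[-1]

trend, high_freq = decompose(price)
# Forecast each component separately, then recombine into one forecast.
forecast = ar1_forecast(trend) + ar1_forecast(high_freq)
```

The point of the hybrid design is visible here: each component is simpler than the raw series, so a different (and better-suited) forecaster can be assigned to each before the forecasts are summed.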
  • ItemOpen Access
    Exploring the Multi-scale character of infectious disease dynamics
    (2023-05-19) Mufoya, Blessings; Garira, W.; Mathebula, D.
    This research study characterised multiscale models of infectious disease dynamics by establishing when it is appropriate to implement particular mathematical methods for different multiscale models. The study of infectious disease systems has been greatly elucidated since the advent of mathematical modelling. Due to the vast complexities in the dynamics of infectious disease systems, modellers are increasingly gravitating towards multiscale modelling approaches as a favourable alternative. Among the diseases that have persistently plagued most developing countries are vector-borne diseases such as malaria and directly transmitted diseases such as foot-and-mouth disease (FMD). Globally, FMD has caused major losses in the economic sector (particularly agriculture) as well as tourism, while malaria remains among the most severe public health problems worldwide, with millions of people estimated to live at permanent risk of contracting the disease. We developed multiscale models that can describe both the local and the global transmission of infectious disease systems at any hierarchical level of organization, using FMD and malaria as paradigms. The first stage in formulating the multiscale models was to integrate two submodels, namely (i) the between-host submodel and (ii) the within-host submodel of an infectious disease system, using the nested approach. The outcome was a system of nonlinear ordinary differential equations describing the local transmission mechanism of the infectious disease system. The next step was to apply graph-theoretic methods to this system of differential equations. This approach enabled modelling the migration of humans/animals between communities (also called patches or geographically distant locations), thereby describing the global transmission mechanism of infectious disease systems.
At the whole-organism level we considered the organs in a host as patches, with transmission at the within-organ scale treated as direct transmission and represented by ordinary differential equations. At the between-organ scale, however, the pathogen moves between organs through the blood; this transmission mechanism, called global transmission, was represented by graph-theoretic methods. At the macrocommunity level we considered communities as patches and established that at the within-community scale there was direct transmission of the pathogen, represented by ordinary differential equations, while at the between-community scale there was movement of infected individuals. Furthermore, the systems of differential equations were extended to stochastic differential equations in order to incorporate randomness in the infectious disease dynamics. By adopting a cocktail of computational and analytical tools, we analysed the impact of the transmission mechanisms in the different multiscale models. We established that once a graph-theoretic method was used at host level it was difficult to extend it to community level, whereas using different methods at each level made the extension straightforward. This was the main aspect of the characterization of multiscale models investigated in this thesis, which has not been done before. We also established distinctions between local and global transmission mechanisms, which enable intervention strategies targeted at local transmission (such as vaccination) and at global transmission (such as travel restrictions). Although the results in this study are restricted to FMD and malaria, the multiscale modelling frameworks established are suitable for other directly transmitted and vector-borne diseases.
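A minimal sketch of the nested-plus-graph-theoretic construction described above: a representative within-host pathogen load drives the between-host transmission rate, while a migration matrix (the graph-theoretic part) couples two communities treated as patches. The equations and all parameter values are illustrative stand-ins, not the thesis's actual FMD or malaria models.

```python
import numpy as np

def step(S, I, P, dt=0.01):
    """One Euler step of a toy nested multiscale model (all values assumed)."""
    beta0, k = 0.3, 1.0          # between-host transmission scaled by load P
    r, P_max, c = 2.0, 1.0, 0.5  # within-host replication and clearance
    gamma = 0.1                  # recovery rate
    M = np.array([[-0.05, 0.05],
                  [0.05, -0.05]])  # migration between the two patches (graph)
    beta = beta0 * P / (k + P)     # local transmission depends on within-host scale
    N = S + I
    dS = -beta * S * I / N + M @ S
    dI = beta * S * I / N - gamma * I + M @ I
    dP = r * P * (1 - P / P_max) - c * P  # within-host (micro-scale) dynamics
    return S + dt * dS, I + dt * dI, P + dt * dP

# Patch 1 seeded with infection; patch 2 initially disease-free.
S = np.array([990.0, 1000.0])
I = np.array([10.0, 0.0])
P = 0.1
for _ in range(5000):
    S, I, P = step(S, I, P)
```

After integration the pathogen load settles at its within-host equilibrium, and migration carries infection into the initially disease-free patch, which is the "global transmission" effect the graph-theoretic coupling is meant to capture.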
  • ItemOpen Access
    The Development and Application of Coupled Multiscale Models of Malaria Disease System
    (2022-11-10) Maregere, Bothwell; Garira, W.; Mathebula, D.
    The purpose of this thesis is to develop coupled multi-scale models of infectious disease systems. An infectious disease system consists of three interacting subsystems: the host, the pathogen, and the environment. It is organized into hierarchical levels, from the cellular level to the macro-ecosystem level, and each level has two interaction scales (micro-scale and macro-scale). There are two main theories of infectious diseases: (i) the transmission mechanism theory and (ii) the replication-transmission relativity theory. A significant difference exists between them: the transmission mechanism theory considers transmission to be the primary cause of infectious disease spread at the macro-scale, while the replication-transmission relativity theory extends it by considering the interaction between the two scales, where pathogen replication occurs within the host (micro-scale) and transmission occurs between hosts (macro-scale). Our research primarily focuses on the replication-transmission relativity theory of pathogens. The main purpose of this study is to develop coupled multi-scale models of vector-borne diseases using malaria as a paradigm. We developed a basic coupled multi-scale model combining two other categories of multi-scale models: a nested multi-scale model in the human host and an embedded multi-scale model in the mosquito host. The developed multi-scale model consists of systems of nonlinear differential equations employed to provide mathematical results on the multi-scale cycle of pathogen replication and transmission in the malaria disease system.
Stability analyses of the models were performed to establish that the infection-free equilibrium is locally and globally asymptotically stable whenever R0 < 1, and that the endemic equilibrium exists and is globally asymptotically stable whenever R0 > 1. We applied vaccination as a control measure in the multi-scale model of malaria with the mosquito life cycle, comprising three stages of vaccines, namely pre-erythrocytic stage vaccines, blood stage vaccines, and transmission stage vaccines, and demonstrated the impact of vaccination on malaria disease. Through numerical simulation, it was found that when vaccination efficacy is high, the community pathogen load (GH and PV) decreases and the reproductive number can be reduced by 89.09%; that is, the transmission of malaria can be reduced at both the individual level and the population level. We also extended the multi-scale model with the human immune response in the within-human sub-model, which is stimulated by the malaria parasite, and investigated the effect of immune cells on reducing malaria infection at both the between-host and within-host scales. We incorporated environmental factors, such as temperature, into the multi-scale model of the malaria disease system with a mosquito life cycle and found that as temperature increases the mosquito population also increases, which increases malaria infection at the individual level and at the community scale. We also investigated the influence of the mosquito life cycle on the multi-scale model: increases in the egg, larval, and pupal stages of mosquitoes result in increased mosquito density and malaria transmission at the individual level and community scale. Therefore, the suggestion is that both immature and mature mosquitoes be controlled to lessen malaria transmission.
The results indicated that the combination of malaria health interventions with the highest efficacy has the greatest influence in reducing malaria infection at the population level. The models developed and analyzed in this study can play a significant role in preventing malaria outbreaks, and the conclusions drawn about the malaria disease system are based on the results obtained from the coupled multi-scale models. The multi-scale framework in this study can also be applied to other vector-borne diseases.
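The R0 < 1 threshold and the effect of vaccination efficacy described above can be illustrated with a classical Ross-Macdonald-style reproduction number for a vector-borne disease. The formula, the vaccination scaling, and every parameter value below are hypothetical stand-ins, not the coupled multi-scale model of the thesis.

```python
import math

def r0(a, b, c, m, mu_v, gamma_h, efficacy=0.0, coverage=0.0):
    """Ross-Macdonald-style R0 (illustrative).
    a: mosquito biting rate, b: mosquito-to-human infectivity,
    c: human-to-mosquito infectivity, m: mosquitoes per human,
    mu_v: mosquito death rate, gamma_h: human recovery rate.
    Vaccination is assumed to scale susceptibility by (1 - efficacy*coverage)."""
    b_eff = b * (1 - efficacy * coverage)
    return math.sqrt((a**2 * b_eff * c * m) / (mu_v * gamma_h))

base = r0(a=0.3, b=0.5, c=0.5, m=2.0, mu_v=0.1, gamma_h=0.05)
vacc = r0(a=0.3, b=0.5, c=0.5, m=2.0, mu_v=0.1, gamma_h=0.05,
          efficacy=0.9, coverage=0.8)
reduction = 1 - vacc / base  # fractional drop in R0 due to vaccination
```

Because R0 enters under a square root, halving effective infectivity does not halve R0; this is why interventions often need to be combined before the R0 < 1 elimination threshold is crossed, in line with the combined-intervention finding above.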
  • ItemOpen Access
    Time-frequency domain analysis of exchange rate market integration in Southern Africa Development Community: A Hilbert-Huang Transform approach
    (2022-11-10) Adam, Anokye Mohammed; Kyei, Kwabena A.; Moyo, Simiso; Gill, Ryan S.; Gyamfi, Emmanuel N.
    The desire of most African economic communities to introduce a common currency has persisted for years. As postulated by the Optimum Currency Area hypothesis, coordination of policy indicators among member countries is desirable for a stable monetary union. In this regard, the integration of exchange rate markets has been studied and cited as one of the key indicators that could signal economic integration. Therefore, analysis of similarities, interdependence, and information transfer across exchange rate markets in the Southern African Development Community (SADC) is necessary to measure the extent of integration in the region. However, the intrinsic complexity of exchange rate data generation and its stylised characteristics of non-stationarity and non-linearity affect the modelling of such data, in terms of both the accuracy of the analysis and the embedded policy direction. In response, this thesis proposes empirical mode decomposition-based market integration analysis to address the limitations of the existing literature, which fails to recognise the heterogeneity of market participants and the data generation of exchange rates in SADC. The data employed are the daily real exchange rates of 15 of the 16 SADC member countries from 3rd January 1994 to 7th January 2019. The choice of study window and countries was based on the availability of adequate and consistent data for robust analysis, and on the period after South Africa, the largest economy, joined SADC; based on these criteria, Zimbabwe was excluded from the analysis. To achieve the purpose of this thesis, a four-step approach was used. The first step reviewed and explored the non-stationarity and non-linearity stylised facts about the data and observed that exchange rate series in SADC are non-stationary and non-linear. The second stage compared the performance of two Hilbert-Huang Transform decomposition methods (EMD and EEMD) on SADC exchange rate markets, of which EEMD emerged superior.
The components of the decomposed series were examined for dominance and for their ability to define the exchange rate trajectory in SADC. The residue of every market except Angola explained over 80% of the variation of the original series. Short- and long-term comovement was analysed through the characteristics of the IMFs and residues obtained from EEMD, which showed that exchange rate markets in SADC are driven by economic fundamentals and that 12 of the 15 countries examined exhibit some similarity in their long-term trends. In the third stage, an EEMD-DCCA-based multifrequency network was introduced to study the dynamic interdependence structure of the exchange rate markets in SADC. This was done by first decomposing all series into intrinsic mode functions using EEMD and reconstructing them into three frequency modes (high, medium, and low frequency) plus the residue. The DCCA method was used to analyse the cross-correlations between the various frequencies, the residues, and the original series, addressing the non-linearity and non-stationarity in observed exchange rate data. A correlation network was formed from the cross-correlation coefficients to reveal richer information than would have been obtained from the original series. The results showed that the nature of the cross-correlation between high-frequency series mimics that of the original series, and there was also significant cross-correlation between the long-term trends of most SADC countries' exchange rate markets. The final stage proposed an EEMD-effective-transfer-entropy-based model to study exchange rate market information transmission in SADC at various frequencies.
The combination of Ensemble Empirical Mode Decomposition (EEMD) and Rényi effective transfer entropy techniques for investigating multiscale information transfer helped quantify the directional flow of information across four frequency domains: high, medium, and low frequencies, representing the short, medium, and long terms respectively, in addition to the residue (the fundamental feature). This revealed a significant positive information flow at the high frequency but negative flow at the medium and low frequencies. Based on the findings of this thesis, we recommend that EEMD-based methods be used in the analysis of financial data susceptible to non-linearity and non-stationarity, to elicit time-frequency information. In terms of monetary policy formulation, we recommend a stepwise approach to monetary integration in SADC.
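The directional information flow measured above can be illustrated with a minimal histogram-based transfer entropy estimator. This sketch uses plain Shannon transfer entropy on synthetic series; the thesis applies the Rényi *effective* variant on EEMD components, which additionally corrects small-sample bias with shuffled surrogates, so everything below is an illustrative simplification.

```python
import numpy as np

def transfer_entropy(x, y, bins=2):
    """Histogram-based transfer entropy from y to x, in bits (illustrative)."""
    # Discretize each series into quantile bins.
    xd = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    yd = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    x_next, x_now, y_now = xd[1:], xd[:-1], yd[:-1]
    te = 0.0
    for xn in range(bins):
        for xc in range(bins):
            for yc in range(bins):
                mask = (x_next == xn) & (x_now == xc) & (y_now == yc)
                p_joint = mask.mean()
                if p_joint == 0:
                    continue
                p_given_xy = mask.sum() / ((x_now == xc) & (y_now == yc)).sum()
                p_given_x = ((x_next == xn) & (x_now == xc)).sum() / (x_now == xc).sum()
                te += p_joint * np.log2(p_given_xy / p_given_x)
    return te

rng = np.random.default_rng(2)
y = rng.normal(size=5000)
x = np.roll(y, 1) + 0.1 * rng.normal(size=5000)  # x is driven by lagged y
te_y_to_x = transfer_entropy(x, y)
te_x_to_y = transfer_entropy(y, x)
```

The asymmetry (te_y_to_x much larger than te_x_to_y) is what makes transfer entropy directional, unlike the symmetric correlation measures used in the earlier stages.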
  • ItemOpen Access
    Share Price Prediction for Increasing Market Efficiency using Random Forest
    (2022-11-10) Mbedzi, Tshinanne Angel; Chagwiza, W.; Garira, W.
    A share price is the price of a single share among a company's saleable shares, options, or other financial assets. The share price is unpredictable, since it primarily depends on buyers' and sellers' expectations, and shares are equity securities traded on both primary and secondary markets. In this study we use machine learning techniques to predict the share price in order to increase market efficiency. In addition, it is important to build models with appropriate features so as to improve their performance; random forest (RF) and a recurrent neural network (RNN) are used to achieve this. We address preprocessing of the data set, including feature selection with filter and wrapper methods, and apply selected oversampling techniques to fix class imbalance. Model performance is evaluated using mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), relative MAE (rMAE), and relative RMSE (rRMSE). The performance of the RNN and RF algorithms was compared for predicting the closing price, and the RF model was found to be the best model for predicting the stock (closing) price. This research project and its findings will have an impact on increasing market efficiency and will also promote potential economic growth.
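The RF regression and error-metric evaluation described above can be sketched as follows. The engineered features and target here are synthetic stand-ins (the study uses real market data and also trains an RNN), so the shapes and values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for engineered share-price features (e.g. lags, volumes).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
close = 100 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 500)

X_train, X_test = X[:400], X[400:]
y_train, y_test = close[:400], close[400:]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
pred = rf.predict(X_test)

# The evaluation metrics named in the abstract.
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
rmae = mae / np.mean(np.abs(y_test))  # relative MAE
```

The relative metrics (rMAE, rRMSE) normalise the errors by the scale of the price series, which makes performance comparable across stocks trading at very different price levels.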
  • ItemOpen Access
    Fundamental Analysis for Stocks using Extreme Gradient Boosting
    (2022-11-10) Gumani, Thanyani Rodney; Chagwiza, Wilbert; Kubjana, Tlou
    When it comes to stock price prediction, machine learning has grown in popularity. Accurate stock prediction is a very difficult task, as financial stock markets are unpredictable and non-linear in nature. With the advent of machine learning and improved computational capabilities, programmed prediction methods have proven more effective for stock price prediction. Extreme gradient boosting (XGBoost) is a variant of the gradient boosting machine. XGBoost, an ensemble method based on classification trees, is investigated for predicting stock prices from fundamental analysis. XGBoost outperformed the competing models with higher accuracy, and the developed XGBoost model proved to be effective at accurately predicting the stock market trend, performing much better than conventional non-ensemble learning techniques.
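The tree-boosting-on-fundamentals idea can be sketched as below. scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here (the xgboost library itself may not be available), and the fundamental indicators (P/E, ROE, debt-to-equity) and the up/down labelling rule are hypothetical assumptions, not the study's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
n = 600
pe = rng.uniform(5, 40, n)       # price-to-earnings ratio (assumed feature)
roe = rng.uniform(-0.1, 0.4, n)  # return on equity (assumed feature)
dte = rng.uniform(0.0, 2.0, n)   # debt-to-equity ratio (assumed feature)
X = np.column_stack([pe, roe, dte])

# Hypothetical label: the stock trends up when ROE is high and leverage low.
up = (roe - 0.3 * dte + rng.normal(0, 0.05, n) > 0.1).astype(int)

# Gradient-boosted classification trees, the family XGBoost belongs to.
model = GradientBoostingClassifier(n_estimators=100).fit(X[:450], up[:450])
accuracy = (model.predict(X[450:]) == up[450:]).mean()
```

Boosting fits each new tree to the errors of the current ensemble, which is what lets it outperform single (non-ensemble) learners on noisy, non-linear targets like market trends.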
  • ItemOpen Access
    Predicting an Economic Recession Using Machine Learning Techniques
    (2022-11-10) Molepo, Mashaka Ruth; Chagwiza, Wilbert; Kubjana, Tlou
    Few economic downturns have been predicted months in advance. This research can provide the best-performing models to assist businesses in navigating recession periods. The study addresses the problem of identifying the most important variables to improve the overall performance of an algorithm that can effectively predict recessions. Its primary aim was to improve economic recession prediction using machine learning (ML) techniques by developing an accurate and efficient prediction model, in order to help avoid greater government deficits, growing inequality, significantly decreased income, and higher unemployment. The study objective was to establish the relevant method for addressing imbalanced data, together with a suitable feature selection strategy, to enhance the performance of the ML algorithms developed. An artificial neural network (ANN) and a random forest (RF) were used to predict economic recessions. This study would not have been possible without the publicly available data from the online open source Kaggle, which provided the ordinal categorical data utilized. The major finding was that the RF algorithm performed better at recession prediction than its rival, the ANN. Since only two ML algorithms were employed in this research, further ML tools could be used to improve the statistical components of the study.
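Because recessions are rare, the class-imbalance handling mentioned above is central. A minimal sketch, assuming random oversampling of the minority class and a random forest classifier; the features and the labelling rule below are synthetic stand-ins, not the Kaggle data used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4))
# Hypothetical rare "recession" class (~5% of observations).
y = (X[:, 0] + 0.5 * X[:, 1] > 1.8).astype(int)

# Random oversampling: duplicate minority rows until classes are balanced.
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]
resampled = rng.choice(minority, size=len(majority), replace=True)
idx = np.concatenate([majority, resampled])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[idx], y[idx])
recall = clf.predict(X[minority]).mean()  # minority recall (on training rows)
balance = len(idx) / (2 * len(majority))  # 1.0 means perfectly balanced
```

Without the resampling step, an imbalanced classifier can reach high overall accuracy while missing nearly every recession, which is why minority-class recall, not raw accuracy, is the quantity to watch.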