Data Mining on Parkinson's Disease Dataset Using Random Forest Algorithm
Overview of data and business understanding
This project performs data mining on the “Parkinson” dataset. Data plays a major role in both decision-making and prediction, so understanding what each column holds is a useful first step. The dataset contains the columns X, Y, Z, Pressure, Gripangle, Timestamp, and Test ID, which provide the input for the machine learning algorithm applied later in the project.
Descriptive statistics of the data
Summary measurements help to describe and understand the dataset. These summaries are useful to the analyst when building a predictive model that is expected to make correct decisions. The statistical description can be based on the mean, median and mode of the columns (Regin et al., 2021). The “Parkinson” dataset contains the columns ‘X’, ‘Y’, ‘Z’, ‘Pressure’, ‘Gripangle’, ‘Timestamp’, and ‘Test ID’. The ‘Test ID’ column distinguishes three tests: the “Static spiral test”, the “Dynamic spiral test”, and the “Circular motion test”. In the “Static spiral test” a spiral pattern is drawn, in the “Dynamic spiral test” the spiral pattern blinks at certain times, and in the “Circular motion test” circles are drawn around the red points (Dogan and Birant, 2021). The spiral tests can be beneficial for identifying risks, and their results can help in finding ways to address those risks.
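The mean, median and mode mentioned above can be computed for a single column as in the following minimal sketch. It uses only the Python standard library; the `pressure` values and the `describe` helper are assumptions made for illustration and are not taken from the “Parkinson” dataset or from the RapidMiner process.

```python
import statistics

# Hypothetical pressure readings, invented for illustration only.
pressure = [120, 135, 135, 150, 160]


def describe(values):
    """Return the mean, median and mode of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


summary = describe(pressure)
print(summary)
```

The same helper could be called once per numeric column (X, Y, Z, Pressure, Gripangle, Timestamp) to build a summary table.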
Data cleaning and preparation
Data cleaning is an essential part of data analysis and data mining, and it has to be done properly before any analysis is performed. A dataset can contain duplicate, null and unnecessary values. Removing those values is crucial because they can negatively affect decision-making and prediction (Wang et al., 2020). Data cleaning is the first stage of data preparation, in which the raw data is transformed to suit the given task. It is a lengthy process: the analyst checks for and identifies the errors that could affect prediction and decision-making, and then fixes them. Fixing the errors in the dataset is beneficial because it increases the speed and accuracy of the decision-making process. That is why data preparation plays an important part in data analysis and data mining. Decision-making also improves when the data comes from a valid source, because such data contains fewer errors and saves the analyst time.
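Removing null and duplicate rows, as described above, might look like the sketch below. The rows are hypothetical (the column names follow the dataset, the values are invented), and `clean` is an illustrative helper rather than part of the original workflow.

```python
# Hypothetical rows: column names follow the dataset, values are invented.
rows = [
    {"X": 200, "Y": 204, "Z": 0, "Pressure": 88, "Test ID": 0},
    {"X": 200, "Y": 204, "Z": 0, "Pressure": 88, "Test ID": 0},   # exact duplicate
    {"X": 201, "Y": None, "Z": 0, "Pressure": 90, "Test ID": 1},  # missing value
    {"X": 205, "Y": 210, "Z": 0, "Pressure": 95, "Test ID": 2},
]


def clean(records):
    """Drop rows with missing values, then drop exact duplicates (first one is kept)."""
    seen = set()
    cleaned = []
    for row in records:
        if any(value is None for value in row.values()):
            continue  # skip rows that contain a null
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # skip rows already kept once
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(rows)
print(len(rows), "rows before cleaning,", len(cleaned_rows), "after")
```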
Appropriate analysis method
In data mining and data analysis, two types of analysis method are commonly distinguished: qualitative data analysis and quantitative data analysis. This project uses quantitative data analysis (Yang et al., 2020) because the dataset contains numerical values. Inductive and deductive data analysis can also be useful in the project, and researchers can apply either technique as the task requires. Performing these analyses supports the proper implementation of the machine learning algorithms.
Observed results
Data mining in this project is carried out by implementing a machine learning algorithm in the “RapidMiner” software. Several algorithms could be used; the one applied here is “random forest”. It is a popular machine learning algorithm that can solve both classification and regression problems. A random forest classifier uses multiple decision trees, so it can generate accurate results even when one or two of its trees generate inaccurate results (Mughal, 2018). Its accuracy is generally greater than that of a single decision tree, which is why it is beneficial for this project. Another machine learning algorithm that could be used is the “decision tree”, a powerful algorithm for building predictive models. This project relies heavily on “random forest”. The accuracy score generated for this algorithm is one hundred per cent, meaning that the model predicted every evaluated example correctly.
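The claim that a forest can still be right when one or two trees are wrong can be illustrated with simple majority voting. This is a minimal sketch under the assumption of plain majority voting; the exact voting strategy used by RapidMiner may differ, and the votes below are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one sample whose true Test ID is 1.
# Two of the trees are wrong, but the forest as a whole is still correct.
votes = [1, 1, 2, 1, 0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```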
- Process diagram
This section presents the detailed process diagram. The entire procedure is performed in several steps in the “RapidMiner” environment. In the retrieval step, the respective dataset is imported; the missing values are then replaced, and in the following step the target column is chosen. After that, the dataset is split into 70% and 30% partitions. The random forest algorithm is applied in the next step. Finally, the model is applied and its performance is evaluated.
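The preparation steps of this process (replacing missing values and splitting the data 70/30) can be sketched outside RapidMiner as follows. The rows, the mean-based replacement and the helper names `replace_missing` and `split_70_30` are assumptions for illustration; the source does not state which replacement strategy or sampling type the RapidMiner operators used.

```python
import random

# Hypothetical rows: ten records with one missing Pressure value.
rows = [{"Pressure": float(p), "Test ID": p % 3} for p in range(10)]
rows[4]["Pressure"] = None


def replace_missing(records, column):
    """Replace None in `column` with the mean of the observed values (an assumed strategy)."""
    observed = [r[column] for r in records if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, column: mean if r[column] is None else r[column]} for r in records]


def split_70_30(records, seed=0):
    """Shuffle the records reproducibly and split them into 70% and 30% partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


filled = replace_missing(rows, "Pressure")
train, test = split_70_30(filled)
print(len(train), "training rows,", len(test), "test rows")
```

The larger partition would then be used to train the random forest and the smaller one to evaluate it.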
- Collected dataset
This is a description of the collected dataset. The dataset has several attribute columns, including row no, X, Y, Z, Test ID, timestamp and pressure. In the Test ID column, 0 denotes the “static spiral test”, 1 denotes the “dynamic spiral test” and 2 denotes the “circular motion test”. The respective values of the columns are also shown.
- Target column
The figure shows the chosen target column, which is the Test ID column. Its values are 0, 1 and 2. The target column is highlighted in the figure. The entire operation and the analysis are performed with respect to this column.
- Random forest model 1st tree
The random forest algorithm is applied here, and the 1st tree of the random forest is shown. The parent node is the timestamp, which is divided into several child nodes. The grip angle is a child node of the timestamp and is itself divided into several child nodes.
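How such a tree turns a row into a predicted Test ID can be sketched with a small nested structure. Only the node order (timestamp at the root, grip angle below it) follows the description above; the thresholds, leaf labels and the `predict` helper are hypothetical and are not read from the RapidMiner figure.

```python
# Hypothetical tree: thresholds and leaf labels are invented for illustration.
tree = {
    "attribute": "Timestamp",
    "threshold": 5000,
    "left": {"attribute": "Gripangle", "threshold": 900, "left": 0, "right": 1},
    "right": 2,
}


def predict(node, row):
    """Walk from the root to a leaf and return the Test ID stored in that leaf."""
    while isinstance(node, dict):
        go_left = row[node["attribute"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


sample = {"Timestamp": 1200, "Gripangle": 950}
predicted_id = predict(tree, sample)
print(predicted_id)
```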
- Random forest model 1st tree description
This is the description of the 1st tree, which lists the values of the different attributes: the grip angle, the timestamp and Y. The respective time is also shown.
- Random forest model 2nd tree
This is the 2nd tree of the random forest model. In this tree, the parent node, grip angle, is divided into the timestamp and the pressure. The timestamp also has several child nodes, and the values are shown.
- Random forest model 2nd tree description
This is the detailed description of the 2nd tree of the random forest model. The respective values of the timestamp and the pressure are shown, along with the node labels and the edge labels.
- Random forest model 3rd tree
This is the 3rd tree of the random forest. Here the child nodes of the parent node are the timestamp and 0. The timestamp is in turn split into two child nodes.
- Random forest model 3rd tree description
This is the description of the 3rd tree of the random forest. The values of the timestamp and the respective time are shown.
- Random forest model accuracy score
This figure shows the accuracy scores of the “random forest model”. The predictions are represented as pred0 to pred2, and the accuracy score for every prediction is 100%.
- Random forest model confusion matrix
This is the confusion matrix of the “random forest” model. The matrix is plotted on X, Y and Z axes, where X represents the true class, Y the predicted class and Z the count.
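A confusion matrix and the accuracy derived from it can be computed as in the sketch below. The label lists are hypothetical and simply mimic a perfect classifier; the helper names are illustrative. In this sketch the rows are the true class and the columns the predicted class, which may be transposed relative to the RapidMiner view.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count (true, predicted) pairs; rows are true classes, columns predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical Test ID labels for six examples, all predicted correctly.
true_ids = [0, 0, 1, 1, 2, 2]
pred_ids = [0, 0, 1, 1, 2, 2]

matrix = confusion_matrix(true_ids, pred_ids, classes=[0, 1, 2])
score = accuracy(true_ids, pred_ids)
for row in matrix:
    print(row)
print("accuracy:", score)
```

When every prediction is correct, all counts fall on the diagonal of the matrix and the accuracy is 1.0, which corresponds to the 100% score reported for the model.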
References
Dogan, A. and Birant, D., 2021. Machine learning and data mining in manufacturing. Expert Systems with Applications, 166, p.114060.
Mughal, M.J.H., 2018. Data mining: Web data mining techniques, tools and algorithms: An overview. International Journal of Advanced Computer Science and Applications, 9(6).
Regin, R., Rajest, S.S. and Singh, B., 2021. Spatial data mining methods databases and statistics point of views. Innovations in Information and Communication Technology Series, pp.103-109.
Wang, S., Cao, J. and Yu, P., 2020. Deep learning for spatio-temporal data mining: A survey. IEEE Transactions on Knowledge and Data Engineering.
Yang, J., Li, Y., Liu, Q., Li, L., Feng, A., Wang, T., Zheng, S., Xu, A. and Lyu, J., 2020. Brief introduction of medical database and data mining technology in big data era. Journal of Evidence-Based Medicine, 13(1), pp.57-69.