Statistic Solution for Machine Learning to Analyze Heart Disease Data
Conference Paper · February 2020
All content following this page was uploaded by Abdur Rasool on 13 July 2020.
Abdur Rasool1, RanTao1, Kaleem Kashif2, Waqas Khan2, Promise Agbedanu1 and Neeta
School of Computer Science & Technology, Donghua University, Shanghai, China
School of Information Science and Technology, Donghua University, Shanghai, China
+86-18616146162, +86-18116255945, +86-186201020127, +86-1861678490, +86-18616070405, +86-
ABSTRACT
Data crawling, collection, and analysis have become a pillar of
business intelligence for big data, which is currently a popular
topic in the research community. Tools and techniques for
analyzing structured and unstructured datasets are developing
quickly. Previous studies have taken different approaches to
identifying the strengths and weaknesses of multiple machine
learning algorithms, but most of these approaches demand expert
knowledge to understand the given data. In this paper, we
modernize machine learning methods for the effective prediction
of heart disease and describe in detail how the proposed system
is implemented. The goal of this work is to find a strong and
effective machine learning algorithm for disease prediction, so
that doctors can obtain faster and better results when diagnosing
heart disease. We design a new disease prediction system that
uses machine learning algorithms (LR, ANN, and SVC) on top of
ETL, OLAP, and data mining. The results show that the best
algorithm for the risk prediction model is SVC, with 92%
accuracy. We found that subjects aged 56-64 have a high risk of
heart disease, and that men have a higher rate of heart disease
than women. This proposed study
can be favorable for the medical practitioners in the field of
healthcare, supportive practice and precautions to the heart
CCS Concepts
• Information systems➝Information systems applications➝Mobile information processing systems
Keywords
Machine learning; data mining; heart disease analysis.
1. INTRODUCTION
Business Intelligence (BI) has become a vital field of research
over the last two decades. Google’s chief economist Hal Varian
commented in 2008 that data is becoming ubiquitous and cheap,
and he recommended taking advantage of advances in manipulating
and analyzing data. Sensor and social media data are large and
complex, so big data analytics is needed to describe such data
sets and the techniques used to analyze them. This applies
especially to industries where structured data, often stored in
Relational Database Management Systems (RDBMS), is used to
understand customer opinions and needs and to recognize new
opportunities. Data analysis becomes more complicated and
comprehensive when it deals with the human healthcare
environment. A human being is a complex biological system that
carries a great deal of information, yet the field is still
“knowledge poor”. An automated system could therefore help
medical diagnosis by detecting relationships between different
data. Heart disease is the leading cause of death for both men
and women; about 610,000 people die of it in the United States
every year. Diagnosing heart disease depends on a complex
biological system. This work therefore sets out to analyze that
complex system with regard to the prediction of heart disease.
The main contribution is to answer these questions:
(1) How can the data be analyzed to determine each attribute’s
impact on heart disease and the attributes’ relationships with
each other?
(2) Based on these analyses, which machine learning model is the
best to provide a Risk Prediction Model for heart disease?
ICMLC 2020, February 15–17, 2020, Shenzhen, China
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7642-6/20/02…$15.00

To answer these questions, we design our proposed system around
the UCI dataset, which is often far from perfect. Although some
data mining methods can tolerate a degree of deficiency in the
data, we focus on understanding and improving the data to achieve
the best quality of analysis. As the cost of storing large
amounts of data in banking and e-business has declined, the
number of products and services on offer has exploded. To
integrate enterprise-specific data, data marts and tools for
Extraction, Transformation, and Load (ETL) are needed. For
analysis, Online Analytical Processing (OLAP) and reporting tools
are used to look at the different data characteristics.
Statistical analysis and data mining are used for association
analysis, data segmentation, clustering, classification,
regression analysis, and anomaly detection. The BI platforms
offered by Microsoft, IBM, Oracle, and SAP already include these
data processing and analytic tools. Microsoft SQL Server is a BI
platform that can be used to build data integration and
transformation solutions for enterprises. This tool will be used
to solve complex business problems. Our proposed system for
addressing this problem is shown below.
(1) The dataset is collected from UCI. The data is unstructured
and in text format. The files are merged with Python and imported
into SSIS for ETL.
(2) Fact and dimension tables are created to suit our research
goals, and important features are extracted for analysis and
reporting.
(3) Python is used for statistical analyses of the factors that
influence heart disease and to find the best Machine Learning (ML)
models to predict and diagnose heart disease.
The rest of the article is organized as follows. Section 2
discusses related work. Section 3 covers the experimental
methodology for ETL, OLAP, and data mining. Section 4 presents
the heart disease analysis with ML algorithms based on person,
blood, and heart. Section 5 gives the conclusion and future work.
2. RELATED WORK
Since the start of the 21st century, the field of big data
research has progressed steadily. In recent decades, a plethora
of data has been generated by organizations in both the public
and private sectors. This data becomes more fruitful when it is
analyzed properly to obtain meaningful information and results.
Bordeleau stated that business intelligence has improved to
handle the gigantic data volumes of Industry 4.0 (I4.0)
technologies, which generate a large volume of data that needs to
be processed and used in decision-making. To generate value for
companies, the analysis of data from strategic and operational
activities still needs to be explored further. In this sense,
preparing a suitable model to analyze these datasets is becoming
difficult because numerous methods and technologies claim better
accuracy and the best results.
The extraordinary developments in biotechnology and health
sciences have led to a significant production of data, for
example, high-quality genetic data and clinical information.
To deal with such gigantic data, machine learning and data
mining (DM) approaches in the biosciences are more relevant than
ever before. T. P. Fowdur provides an in-depth overview of
several open-source tools and techniques currently used for
analyzing and learning from big data. This overview is based on
the most common offerings of IBM, Oracle, and Microsoft Azure and
introduces machine learning algorithms such as Random Forest
(RF) and Naïve Bayes (NB). Fatima M. conducted a survey that
delivers a comparative analysis of multiple machine learning
algorithms for the prediction and diagnosis of diseases such as
diabetes, heart disease, dengue, liver disease, and hepatitis.
The study draws attention to the suite of ML algorithms, tools,
and techniques that are helpful for disease prediction.
Chaurasia examined the performance of different machine learning
classification methods on breast cancer data. In the experiment,
they compare three classification methods in Weka and find that
Sequential Minimal Optimization (SMO) has a higher prediction
accuracy (96.2%) than the IBK and BF Tree methods.
Adam Felman reports statistics on heart disease which show that
around 630,000 people in the USA died of it and that one person
dies per minute due to heart disease. Min Chen studied a regional
chronic disease of the brain. They suggest a novel Convolutional
Neural Network (CNN) for multimodal disease prediction that uses
structured and unstructured datasets provided by a hospital.
They found that the prediction accuracy of the suggested
algorithm is 94.8%, which is more accurate than the conventional
unimodal disease prediction algorithm. Mehrbakhsh Nilashi
proposed an analytical method for disease prediction that uses a
fuzzy rule-based approach with EM, PCA, and CART. They used
medical datasets to extract the fuzzy rules for prediction and
found that their proposed system can be effective for disease
prediction.
Shifei Ding used the US-ELM (unsupervised Extreme Learning
Machine) technique to obtain the attributes, experimenting on UCI
data sets. They compare K-means algorithms with a spectral
clustering algorithm to achieve the best accuracy and efficiency
on the given UCI data. Adilah Sabtu implements an ETL (Extract,
Transform, and Load) system to show that it is a standard choice
for handling and supporting the processing of valuable big
datasets. They found that an ETL system is still challenging to
set up for real-time data and note that further perspectives on
data for ETL must be examined in future research. Arik Sofan
suggested that data warehouses are the best place to store
accurate data after the ETL stage. OLAP (On-Line Analytical
Processing) is another tool used to process the data so that
filtering can be more accurate, reliable, and efficient.
They suggested that it is essential to create a
model for offering validation data to create data warehouses in
3. EXPERIMENTAL METHODOLOGY
3.1 ETL Process Dataset Creation
This work uses two different heart disease datasets, one from the
UCI website and one Statlog Heart dataset. Dataset-1 contains a
heart.dat file and a heart.doc file. The heart.dat file contains
records for 270 persons, and heart.doc explains their attributes.
Dataset-2 contains 76 attributes, but all published experiments
use a subset of 14 of them. For our work, we merge the two
datasets into one dataset with 573 samples (before the cleaning
described below). The description of each attribute is given in
Table 1.
Inspecting the data shows that some fields are vacant or contain
“?”. Records with such fields are invalid for our data analysis.
After cleaning the “?” entries, we use Python’s pandas package to
read the data and add column names (attributes) to distinguish
their properties. After cleaning, 270 subjects remain in
dataset-1 and 297 subjects in dataset-2. After preprocessing, we
combine the datasets into one .csv file with 567 subjects in
total. The combined dataset contains the num attribute with the
value 1 or 2, which indicates the absence (1) or presence (2) of
heart disease. This .csv data file is loaded into SQL Server
using the import data function.
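The cleaning and merging step described above can be sketched with pandas as follows. This is a minimal illustration, not the authors' script: the two in-memory samples are made up, the column names follow the names commonly used for the UCI heart disease data, and the mapping of dataset-2's diagnosis values onto the 1/2 coding is an assumption, since the paper does not state how the two codings were harmonized.

```python
import io
import pandas as pd

# Attribute names commonly used for the 14-column UCI heart data (assumed).
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Tiny made-up samples standing in for the real files (illustrative only).
# Sample 1 is space separated; sample 2 is comma separated and uses "?".
RAW_1 = ("70 1 4 130 322 0 2 109 0 2.4 2 3 3 2\n"
         "67 0 3 115 564 0 2 160 0 1.6 2 0 7 1\n")
RAW_2 = ("63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
         "67,1,4,160,286,0,2,108,1,1.5,2,?,3,2\n"
         "41,0,2,130,204,0,2,172,0,1.4,1,0,3,0\n")


def load_and_clean(text, sep):
    """Read headerless text, name the columns, and drop rows containing '?'."""
    df = pd.read_csv(io.StringIO(text), sep=sep, header=None,
                     names=COLUMNS, na_values="?")
    return df.dropna().reset_index(drop=True)


d1 = load_and_clean(RAW_1, r"\s+")
d2 = load_and_clean(RAW_2, ",")

# Assumption: recode sample 2 so that 0 becomes 1 (absence) and any
# positive value becomes 2 (presence), matching the coding in the text.
d2["num"] = (d2["num"] > 0).astype(int) + 1

combined = pd.concat([d1, d2], ignore_index=True)
csv_text = combined.to_csv(index=False)  # text that would be saved as .csv
print(len(combined), "subjects after cleaning")
```

Reading "?" as a missing value and dropping incomplete rows keeps the remaining columns numeric, which simplifies the later import into a typed database table.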
Table 1. The description of all 14 attributes
| Attribute | Description |
| chest pain type (cp) | Value 1: typical angina; Value 2: atypical angina |
| resting blood pressure | Measured in mm Hg |
| cholesterine (chol) | Serum cholesterol |
| fasting blood sugar (fbs) | Value 0: false |
| restecg | Value 0: normal; Value 1: ST-T wave is abnormal (T wave or ST elevation is bigger than 0.5 mV); Value 2: probable or definite left ventricular hypertrophy criteria is |
| exercise induced angina | |
| Oldpeak | ST depression which is induced by exercise relative to rest |
| Slope | peak exercise ST segment |
| number of major vessels | Values from 0 to 3 |
| thalassemia (thal) | Value 3: normal; Value 6: fixed defect; Value 7: reversible defect |
| diagnosis of heart | Diagnosis of heart disease. Value 1: smaller than 50%; Value 2: over 50% diameter |
The .csv file was prepared for the ETL process. This ETL is built
as an Integration Services project in SSIS. In the next step, an
OLE DB connection is created by choosing the correct data
connection. Then an OLE DB Destination is added from the SQL
Server Integration Services (SSIS) Toolbox to specify the
destination of the data and to choose the database table to load
the data into. The final step is to test the ETL process and
check that the data arrives in the database correctly, as shown
in Figure 1.
Figure 1. Database result after ETL process.
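The load-and-verify step can be illustrated outside SSIS with a small Python sketch. It uses the standard library's sqlite3 module as a stand-in for SQL Server, so it does not reproduce the SSIS package itself; the table name heart_staging, the reduced column set, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Made-up CSV content with a reduced column set (illustrative only).
CSV_TEXT = "age,sex,cp,num\n70,1,4,2\n67,0,3,1\n57,1,2,2\n"


def load_csv(conn, csv_text, table="heart_staging"):
    """Create the destination table and load every CSV row into it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        f"CREATE TABLE {table} "
        "(age INTEGER, sex INTEGER, cp INTEGER, num INTEGER)"
    )
    conn.executemany(
        f"INSERT INTO {table} (age, sex, cp, num) "
        "VALUES (:age, :sex, :cp, :num)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the SQL Server DB
expected = load_csv(conn, CSV_TEXT)

# Final step of the ETL: check that all rows arrived in the database.
loaded = conn.execute("SELECT COUNT(*) FROM heart_staging").fetchone()[0]
print("rows in file:", expected, "rows in table:", loaded)
```

Comparing the number of rows read from the file with the number of rows found in the table is a simple form of the arrival check described in the text.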
3.2 Online Analytical Processing (OLAP)
The data preprocessing above produced the desired data. The next
step sets up the cube for this data. After preprocessing, we
obtained four dimension tables and one fact table. The Person
table contains the age and gender of 271 subjects. The Heart
table contains the specific heart attributes of these 271
subjects. The Symptoms table contains information such as
symptomID and chestpaintype (chest pain type). The HeartFact fact
table contains the final diagnosis result num and the keys
personID, heartID, symptomID, and bloodID of the four dimension
tables. After creating these five tables (4 dimension tables and
1 fact table), we build a cube as shown in Fig 2. The data is
then imported into SQL Server Integration Services (SSIS) to
check the cube configuration, measures, and dimensions.
Figure 2. Star schema for cube creation
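As a rough illustration of the star schema in Figure 2, the following pandas sketch splits one flat table into four dimension tables and a fact table. The key names personID, heartID, symptomID, and bloodID come from the text. The assignment of columns to the Heart and Blood dimensions is an assumption, and the sample rows are made up.

```python
import pandas as pd

# Made-up flat records (illustrative only).
flat = pd.DataFrame({
    "age": [70, 67, 57], "sex": [1, 0, 1],
    "cp": [4, 3, 2],
    "chol": [322, 564, 261], "fbs": [0, 0, 0],
    "trestbps": [130, 115, 124], "thalach": [109, 160, 141],
    "num": [2, 1, 2],
})


def make_dimension(df, columns, key):
    """Copy the given columns and add a surrogate key starting at 1."""
    dim = df[columns].copy()
    dim.insert(0, key, range(1, len(dim) + 1))
    return dim


# Column-to-dimension assignment below is assumed for illustration.
person = make_dimension(flat, ["age", "sex"], "personID")
heart = make_dimension(flat, ["trestbps", "thalach"], "heartID")
symptoms = make_dimension(flat, ["cp"], "symptomID")
blood = make_dimension(flat, ["chol", "fbs"], "bloodID")

# The fact table holds only the foreign keys and the diagnosis measure.
heart_fact = pd.DataFrame({
    "personID": person["personID"],
    "heartID": heart["heartID"],
    "symptomID": symptoms["symptomID"],
    "bloodID": blood["bloodID"],
    "num": flat["num"],
})

# Joining the fact table back to two dimensions, as a cube query would.
joined = heart_fact.merge(person, on="personID").merge(symptoms, on="symptomID")
print(joined[["age", "sex", "cp", "num"]])
```

Keeping only keys and the measure in the fact table means each dimension can be filtered or grouped independently before it is joined back.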
3.3 Data Mining
The cube created during the OLAP step helps to produce the
dimensions of the data. Next, we start data mining according to
the problems defined in the sections above. Four different data
mining techniques will be used in the following section. We use
association rules to find out whether there are any patterns,
correlations, or associations.
Hypothesis 1 posits that there is a relation between age and the
risk of getting heart disease. Figure 3 shows the prediction of
heart disease (num 1 = true, num 0 = false) in relation to age
and sex (women = 0, men = 1).
Figure 3. Association rules for age and sex.
Men have a probability of 0.552 of getting heart disease. People
aged 56-64 have a higher risk of heart disease, with a
probability of 0.649. Interestingly, women have a probability of
0.754 of not getting heart disease. It is clear that men tend to
have a higher risk of heart disease. Men younger than 42 years
have a probability of 1.0 of not getting heart disease.
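The probabilities attached to association rules such as "men, therefore heart disease" correspond to the rule's confidence, that is, the share of records matching the antecedent that also match the consequent. The sketch below computes support and confidence on a handful of made-up records. The helper name rule_stats and the data are hypothetical, and the results do not reproduce the figures reported above.

```python
# Made-up records (illustrative only).
RECORDS = [
    {"sex": "male", "age_group": "56-64", "disease": True},
    {"sex": "male", "age_group": "56-64", "disease": True},
    {"sex": "male", "age_group": "<42", "disease": False},
    {"sex": "male", "age_group": "42-55", "disease": True},
    {"sex": "male", "age_group": "42-55", "disease": False},
    {"sex": "female", "age_group": "56-64", "disease": True},
    {"sex": "female", "age_group": "42-55", "disease": False},
    {"sex": "female", "age_group": "<42", "disease": False},
]


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    def matches(record, condition):
        return all(record[key] == value for key, value in condition.items())

    with_antecedent = [r for r in records if matches(r, antecedent)]
    with_both = [r for r in with_antecedent if matches(r, consequent)]
    support = len(with_both) / len(records)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


male_support, male_conf = rule_stats(RECORDS, {"sex": "male"}, {"disease": True})
age_support, age_conf = rule_stats(RECORDS, {"age_group": "56-64"}, {"disease": True})
female_support, female_conf = rule_stats(RECORDS, {"sex": "female"}, {"disease": False})

print(f"male -> disease: support={male_support:.3f}, confidence={male_conf:.3f}")
print(f"56-64 -> disease: support={age_support:.3f}, confidence={age_conf:.3f}")
print(f"female -> no disease: support={female_support:.3f}, confidence={female_conf:.3f}")
```

Confidence alone can be misleading for rare antecedents, which is why support is reported alongside it.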
According to Figure 4, the attribute chestpain type (cpt=4) is
decisive for heart disease. Fig 4 also shows a relation between
the attributes thalassemia (thal) and chestpain type (cpt).
According to this association rule table, there is also a
relation between the number of vessels (ca) and chestpain type
(cpt). The same goes for ca and thal, but we cannot decide
whether these related attributes have a relation to heart disease
or not.
4. HEART DISEASE ANALYSIS BASED
ON MACHINE LEARNING
To analyze the heart disease dataset, three different machine
learning algorithms, Logistic Regression (LR), Artificial Neural
Network (ANN), and Support Vector Classifier (SVC), have been
used and compared by classification accuracy.
The logistic model is a statistical model that is often applied
to binary dependent variables. An ANN is a network of simple
elements (neurons) that receive input, change their internal
state, and produce output based on that input. In this work, the
ANN is built with the Keras framework. SVC is a supervised
learning model that is used to classify data into classes. We
used K-fold cross-validation with K=10.
The implementation of these machine learning algorithms shows
that SVC has the highest accuracy with 92%, while LR achieved
85% and ANN only 82%.
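A comparison of this kind can be sketched with scikit-learn as shown below. This is not the authors' code: the data is synthetic (generated with make_classification), scikit-learn's MLPClassifier is swapped in for the Keras-based ANN to keep the example short, and all hyperparameters are assumptions. The accuracies it prints therefore say nothing about the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 13 predictors and the binary diagnosis.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Hyperparameters are illustrative assumptions, not taken from the paper.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVC": SVC(kernel="rbf"),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=0),
}

# 10-fold cross-validation (K=10), stratified so each fold keeps the
# class balance of the full data set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    # Scaling inside the pipeline is refit on each training fold, so no
    # information from the held-out fold leaks into preprocessing.
    pipeline = make_pipeline(StandardScaler(), model)
    fold_accuracy = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_accuracy.mean()

for name, accuracy in scores.items():
    print(f"{name}: mean 10-fold accuracy = {accuracy:.3f}")
```

Reporting the mean over the ten folds, rather than a single train/test split, gives a more stable basis for ranking the three models.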
In this research, using the SVC model, we analyzed heart disease
on three different bases: person, blood, and heart.
Figure 4. Association rules of chestpaintype, thalassemia and
number of vessels
4.1 Heart Disease Analysis Based on Person
Figure 5 shows that for both males and females, the older they
are, the more likely they are to have heart disease. For women,
the line peaks at 95% at age 67; for men, it reaches 95% at age
68. This means that ages 67 and 68 are the most dangerous ages
for females and males, respectively, to present heart disease.
Among males, healthy subjects and heart disease cases each
account for 50%, while among females 70% are healthy and 30% have
heart disease. Another difference is that the age of women who
have heart disease ranges from 50 to 68, whereas that of men
ranges from 38 to 70. We conclude that, compared with women, men
are more inclined to have heart disease and show the presence of
heart disease at an earlier age.
Figure 5: The relationship between heart disease, age, and sex
4.2 Heart Disease Analysis Based on Blood
Figure 6 shows the relationship between heart disease, serum
cholesterol (chol) in mg/dl, and fasting blood glucose (fbs).
Figure 6: The relationship between heart disease, fbs, and chol.
It can clearly be seen that both the pink part, which stands for
fasting blood glucose < 120 mg/dl, and the green part, fasting
blood glucose > 120 mg/dl, have the same percentage in group 1
and group 2. Taking conclusion 4 into consideration, we see that
those who have higher serum cholesterol in mg/dl and fasting
blood glucose > 120 mg/dl are inclined to have heart disease.
4.3 Heart Disease Analysis Based on Heart
In this analysis, we use three attributes, namely resting blood
pressure, maximum heart rate, and ST-segment depression, to
analyze heart disease based on heart, as shown in Fig 7.
Figure 7: The relationship between heart disease and trestbps.
In the first chart, blood pressure ranges from 100 to 180 in the
heart disease group and from 50 to 150 in the normal group. In
the second chart, the mean, maximum, and minimum of the maximum
heart rate in the normal group are significantly higher than
those of the heart disease group. In the third chart, it can be
clearly seen that ST-segment depression in the left group is
lower than in the right group. We may conclude that people who
have higher blood pressure, higher ST-segment depression, and a
lower maximum heart rate are more likely to have heart disease.
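The group comparison above (minimum, mean, and maximum of each heart attribute per diagnosis group) can be sketched with a pandas groupby. The column names trestbps, thalach, and oldpeak are the usual UCI names and are assumed here; the values are made up for illustration.

```python
import pandas as pd

# Made-up records: num 1 = no heart disease, num 2 = heart disease.
df = pd.DataFrame({
    "num": [1, 1, 1, 2, 2, 2],
    "trestbps": [120, 110, 130, 150, 140, 160],   # resting blood pressure
    "thalach": [170, 160, 180, 120, 130, 110],    # maximum heart rate
    "oldpeak": [0.0, 0.5, 1.0, 2.0, 1.5, 3.0],    # ST-segment depression
})

# One row per diagnosis group, with min / mean / max for each attribute.
summary = (
    df.groupby("num")[["trestbps", "thalach", "oldpeak"]]
      .agg(["min", "mean", "max"])
)
print(summary)
```

The result has one row per group and a two-level column index, so a single statistic can be read with, for example, `summary.loc[2, ("thalach", "mean")]`.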
5. CONCLUSION
A strategic challenge for health care has always been how to
leverage disease prediction in a way that drives efficient
results. In this research, we considered one such case in full
and proposed and implemented a feasible solution to our problem,
namely “which machine learning model will be the best to provide
a Risk Prediction Model for heart disease”. Designing an ETL/ELT
process that fits our problem statement has been a challenge
from the first stage of this research, because the dataset is
unstructured and has to be given a structured form. Managing
external data sources and ensuring their quality was a further
difficulty.
In these analyses, it is evident that heart disease is affected
by chestpain type (cpt=4), but also by the maximum heart rate of
a subject. If the maximum heart rate is below 113, between 113
and 132, or between 132 and 151, the subject should be alert to
the possibility of heart disease. Subjects aged between 56 and 64
have a higher risk of heart disease, especially if the values of
the attributes mentioned above fall in the area shown in the
full view of the dependency network. This also confirms the
other analyses above. The results also confirm that a larger
number of vessels (ca = 1 to 3) has a significant effect on
having heart disease. Regarding gender, the results show that
men in particular have a higher risk of heart disease, as
mentioned above.
We conclude that the factors that affect heart disease
significantly are resting blood pressure, serum cholesterol, the
maximum heart rate the subject achieved, and the resting ECG
results. In addition, men are more inclined to have heart disease
and show the presence of heart disease at an earlier age than
women. Finally, the analyses found that the SVC machine learning
algorithm performs better than LR and ANN. SVC can be the best
risk prediction model for heart disease, so health care centers
and the medical community could use it as a health care tool to
monitor heart disease patients. In the future, we propose to add
more attributes of the same dataset and to analyze multiple
factors of heart disease by applying different data mining
approaches with machine learning.
REFERENCES
 Hal Varian Answers Your Questions, February 25, 2008
(http://www.freakonomics.com/2008/02/25/hal-varian-answers-your-questions/), accessed: 2018-05-20.
 EMC education services (2014) “Data Science and Big Data
Analytics: Discovering, Analyzing, Visualizing and
 Center for disease control and prevention,
https://www.cdc.gov/heartdisease/facts.htm, accessed: 2018-
 Surajit Chaudhuri, Umeshwar Dayal, Vivek Narasayya,
Communications of the ACM, Vol. 54 No. 8, Pages 88-
 Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business
intelligence and analytics: from big data to big impact. MIS
 Soni, J., Ansari, U., Sharma, D., & Soni, S. (2011).
Intelligent and effective heart disease prediction system
using weighted associative classifiers. International Journal
on Computer Science and Engineering, 3(6), 2385-2392.
 Bordeleau, Fanny-Ève, “Business Intelligence in Industry 4.0:
State of the art and research opportunities”, The Digital
Supply Chain of the Future: Technologies, Applications and
Business Models, DOI: 10.24251/HICSS.2018.495.
 Ioannis Kavakiotis, (2017), “Machine Learning and Data
Mining Methods in Diabetes Research”, Volume 15, 2017,
Pages 104-116, https://doi.org/10.1016/j.csbj.2016.12.005.
 T. P. Fowdur, 2017, “Big Data Analytics with Machine
Learning Tools”, Internet of Things and Big Data Analytics
Toward Next-Generation Intelligence pp 49-97.
 Fatima, M. and Pasha, M. (2017) Survey of Machine
Learning Algorithms for Disease Diagnostic. Journal of
Intelligent Learning Systems and Applications, 9, 1-16.
 Chaurasia, Vikas and Pal, Saurabh, A Novel Approach for
Breast Cancer Detection Using Data Mining Techniques
(June 29, 2017). International Journal of Innovative Research
in Computer & Communication Engineering, Vol. 2, Issue 1.
 Adam Felman (7 February 2018), Reviewed by Debra
Sullivan, PhD, MSN, RN, CNE, COI.
 Min Chen ; Yixue Hao, (2017), “Disease Prediction by
Machine Learning Over Big Data From Healthcare
Communities”, IEEE Access ( Volume: 5 )
 Mehrbakhsh Nilashi, 2017, “An analytical method for
diseases prediction using machine learning techniques”,
Computers & Chemical Engineering, Volume 106, 2
November 2017, Pages 212-223,
 Shifei Ding, 2015, “Unsupervised extreme learning machine
with representational features”, “International Journal of
Machine Learning and Cybernetics”, April 2017, Volume
8, Issue 2, pp 587–595.
 Adilah Sabtu , (2017) , “The challenges of Extract,
Transform and Loading (ETL) system implementation for
near real-time environment”, “2017 International Conference
on Research and Innovation in Information Systems
(ICRIIS)”, DOI: 10.1109/ICRIIS.2017.8002467.
 Arik Sofan Tohir, (2017), “On-Line Analytic Processing
(OLAP) modeling for graduation data presentation”, “ 2017
2nd International conferences on Information Technology,
Information Systems and Electrical Engineering
(ICITISEE)”, DOI: 10.1109/ICITISEE.2017.8285481.