Bad datasets can lead to heart attacks!
We discuss the pitfalls of a heart attack analysis dataset.
The heart is the symbol of love. Many impassioned poems have been written, hackneyed phrases coined and heart-shaped wooden gifts hung in homes claiming that “home is where the heart is”. The heart is often related to the way we feel. Think of “absence makes the heart grow fonder” or, a personal favourite, “you must follow your heart”. These are, of course, metaphorical uses of the word that extend the idea of the visceral within. Newsflash: it’s a muscle. And if you don’t treat it right, you’ll be at risk of a heart attack.
Your heart is crucial to your survival. It supplies every part of your body with oxygen.
Christmas is the day of the year with the most heart attacks.
Like any muscle, the heart stays in shape with a good diet and an active lifestyle, along with not smoking and managing stress. Emotional and physical health are both critical for maintaining a healthy heart. In this article, we use a dataset to analyse how age, gender, cholesterol levels and exercise can affect your chances of having a heart attack. We try to draw insights from the visualisations and comment on the reliability of our findings, given that this dataset was found on Kaggle.
In our short analysis, we consider the following factors:
- Resting blood pressure (rbp)
- Exercise-induced Angina
From this simple bar chart, we can see that males have more heart attacks than females, with 93 male and 72 female heart attacks. It is widely known that males have more heart attacks than females.
Heart attacks are twice as common in men as in women.
It is tempting to conclude from this visualisation that males have more heart attacks than females. However, this would be incorrect. Let’s see why…
There are 303 patients.
207 males, 93 of whom had a heart attack.
96 females, 72 of whom had a heart attack.
93 out of 207 males suffered a heart attack (45%), and 72 out of 96 females suffered a heart attack (75%). So this dataset completely contradicts what we already knew to be true: that men are more likely to have heart attacks.
The raw counts are also extremely misleading because men and women are not equally represented, so the bars cannot be compared directly. The probabilities are more illuminating than any visualisation, and they let us compute an odds ratio, a measure of the association between an exposure and an outcome. In this dataset, the odds of a heart attack for males relative to females are (93/114) / (72/24) ≈ 0.27.
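The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the counts quoted in this article; the helper names `rate` and `odds` are our own and are not part of the Kaggle dataset or any library.

```python
# Counts quoted in the article: (heart attacks, total patients) per sex.
counts = {"male": (93, 207), "female": (72, 96)}


def rate(events, total):
    """Proportion of patients in the group who had a heart attack."""
    return events / total


def odds(events, total):
    """Odds of a heart attack: patients with one divided by patients without."""
    return events / (total - events)


male_rate = rate(*counts["male"])
female_rate = rate(*counts["female"])

# Odds ratio, male relative to female.
odds_ratio = odds(*counts["male"]) / odds(*counts["female"])

print(f"male rate:   {male_rate:.0%}")    # 45%
print(f"female rate: {female_rate:.0%}")  # 75%
print(f"odds ratio (male:female): {odds_ratio:.2f}")  # 0.27
```

An odds ratio below 1 means that, in this sample, the odds of a heart attack are lower for males than for females, which is the opposite of what the raw bar chart suggests.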
With an ideal dataset, where both sexes are equally represented and the sample size is large, the bar chart would reflect these probabilities. However, the real world does not always give you perfect data.
Correlation does not mean causation.
The 50–59 age category has the most heart attacks. The age distribution roughly follows a normal curve with a mean of 54.4 years.
Now let us look at the following two graphs. We can see that there is far more exercise-induced angina in the 50 to 59 category. However, there is no strong correlation between exercise-induced angina and the number of heart attacks, although the visualisation suggests otherwise. If you are interested in the probabilities regarding exercise-induced angina and the number of heart attacks, please see the end of the article.
Angina is chest pain caused by reduced blood flow and oxygen to the heart muscle. It is a warning sign that you could be at risk of a heart attack. Exercise-induced angina is chest pain brought on by exercise.
This graph can tell us three things:
- Men have higher cholesterol than women.
- There are peaks in levels of cholesterol between the ages of 50 to 60.
- There is a trend between cholesterol and age with respect to gender.
What this graph does not tell us is that high cholesterol causes heart attacks. A correlation alone does not give us enough evidence from this dataset to draw that conclusion. We know from accepted knowledge in the real world that high cholesterol makes you more likely to have a heart attack. However, we cannot reach that conclusion from this graph.
Take this to heart: Visualisations are not conclusions — they do not tell the whole story.
“Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers.” — David Spiegelhalter.
Health warning: check your stats agree!
Sometimes you will be given data that is incomplete or not large enough, and companies will still want you to predict an outcome or trend reliably. Often you can blag your way through your results and talk your way through your findings, but ultimately you won’t be helping yourself or the company.
It may seem obvious, but to truly make an impact when analysing your data, you must first make sure that your data is reliable. Data is a newly found super-power that we can wield when making decisions, but it must be used responsibly and judiciously. The risks are highest when you rush your decisions and visualise data wrongly.
Remember to check the basic statistics and probabilities for your data to stop yourself from making silly mistakes. Question yourself constantly to make sure you are confident about your insights: What do the simple statistics say? What does this mean in relation to the conclusions I am drawing? Are the proportions that I am using correct? Does my data agree with or contradict previously established truths? What are the probabilities in my dataset? This can help you on the road to making correct visualisations and a real impact with your data analysis.
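One of these checks is easy to automate: does the group with the highest raw count also have the highest rate? The sketch below is only an illustration of that idea. The function name `counts_and_rates_agree` is hypothetical, and the numbers are the male and female counts quoted earlier in this article.

```python
def counts_and_rates_agree(groups):
    """Return True when the group with the most events also has the highest rate.

    `groups` maps a label to a tuple of (events, total).
    """
    top_by_count = max(groups, key=lambda g: groups[g][0])
    top_by_rate = max(groups, key=lambda g: groups[g][0] / groups[g][1])
    return top_by_count == top_by_rate


# (heart attacks, total patients) per sex, as quoted in the article.
heart_attacks = {"male": (93, 207), "female": (72, 96)}

agree = counts_and_rates_agree(heart_attacks)
print(agree)  # False: males lead on raw counts, females lead on rate
```

When the check returns `False`, a bar chart of raw counts will point the reader in the wrong direction, and the proportions should be plotted or reported instead.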
 —The “Merry Christmas Coronary” and “Happy New Year Heart Attack” Phenomenon Robert A. Kloner, 2004