Bad datasets can lead to heart attacks!

Co-authors: Kasonde Daniel Kasonde, Sally Said.

The heart is the symbol of love. Many impassioned poems have been written, hackneyed phrases coined and heart-shaped wooden gifts hung in homes claiming that “home is where the heart is”. The heart is often tied to the way we feel: “Absence makes the heart grow fonder” or, a personal favourite, “you must follow your heart”. These are, of course, metaphorical uses of the word that extend the idea of the visceral within. Newsflash: it’s a muscle. And if you don’t treat it right, you’ll be at risk of a heart attack.

Your heart is crucial to your survival. It supplies every part of your body with oxygen.

Christmas is the day of the year with the most heart attacks.[1]

As with any muscle, a good diet and an active lifestyle help keep your heart in shape, as do not smoking and managing stress. Emotional and physical health are both critical for maintaining a healthy heart. In this article, we use a dataset found on Kaggle to analyse how age, sex, cholesterol levels and exercise can affect your chances of having a heart attack. We try to draw insights from the visualisations and comment on how reliable our findings are given this dataset.

In our short analysis, we consider the following factors:

  1. Age
  2. Sex
  3. Cholesterol
  4. Resting blood pressure (rbp)
  5. Exercise-induced angina

Simple bar chart showing the number of heart attacks for males and females in this dataset.

From this simple bar chart, we can see that males have more heart attacks than females, with 93 male and 72 female heart attacks. It is widely known that males have more heart attacks than females.

Heart attacks are twice as common in men as in women.[2]

Looking at this visualisation alone, it is tempting to conclude that males in this dataset have more heart attacks than females. However, this would be incorrect. Let’s see why…

There are 303 patients.

207 males, 93 of whom had a heart attack.

96 females, 72 of whom had a heart attack.

93 out of 207 males suffered a heart attack (45%), and 72 out of 96 females suffered a heart attack (75%). So this dataset completely contradicts what we already knew to be true: that men are more likely to have heart attacks.
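The arithmetic behind these percentages is easy to check. The sketch below uses only the counts quoted above; the dictionary layout and the `attack_rate` helper are illustrative names of our own, not columns or functions from the Kaggle dataset.

```python
# Counts quoted in the article: (patients, heart attacks) per sex.
counts = {
    "male": (207, 93),
    "female": (96, 72),
}


def attack_rate(patients: int, attacks: int) -> float:
    """Share of patients in a group who had a heart attack."""
    return attacks / patients


# Rate per group, so that groups of different sizes can be compared fairly.
rates = {sex: attack_rate(*group) for sex, group in counts.items()}

for sex, rate in rates.items():
    patients, attacks = counts[sex]
    print(f"{sex}: {attacks}/{patients} = {rate:.0%}")
```

Dividing by the group size is the step the bar chart skips: it plots the raw 93 and 72 without accounting for there being more than twice as many men as women in the sample.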

The table shows the probabilities of being a given sex and getting a heart attack.

The bar chart is also extremely misleading because men and women are not equally represented, so no accurate comparison can be made from the raw counts. The probabilities are more illuminating than any visualisation, as they let us take the odds ratio into account. In this dataset, the odds ratio (a measure of the association between an exposure and an outcome) of male to female for having a heart attack is 0.27.
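For readers who want to see where the 0.27 comes from, here is a minimal sketch of the odds ratio calculation using the counts given earlier (93 of 207 males, 72 of 96 females). The `odds` helper is a name we introduce for illustration.

```python
def odds(events: int, total: int) -> float:
    """Odds of an event: cases with the event divided by cases without it."""
    return events / (total - events)


male_odds = odds(93, 207)    # 93 with a heart attack vs 114 without
female_odds = odds(72, 96)   # 72 with a heart attack vs 24 without

# Odds ratio of male to female; a value below 1 means lower odds for males.
odds_ratio = male_odds / female_odds

print(f"male odds:   {male_odds:.2f}")
print(f"female odds: {female_odds:.2f}")
print(f"odds ratio (male:female): {odds_ratio:.2f}")
```

An odds ratio well below 1 says that, in this particular sample, the odds of a heart attack are lower for males than for females, which is the opposite of what the raw bar chart suggests.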

With an ideal data set where both sexes are equally represented and the sample size is large, these probabilities should reflect what we can see in the bar chart. However, the real world does not always give you perfect data.

Correlation does not mean causation.

Here we can see the number of heart attacks per age category. Green represents female, and red represents male.

The 50–59 age category has the most heart attacks. The distribution is roughly bell-shaped, with a mean age of 54.4.
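Grouping ages into ten-year categories like these takes only a few lines. The sketch below assumes the ages arrive as a plain list; the sample values are invented purely for illustration and are not taken from the dataset, and `decade_label` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean

# Hypothetical ages of patients who had a heart attack (made up for illustration).
ages = [41, 44, 52, 54, 55, 57, 58, 59, 61, 63, 67, 71]


def decade_label(age: int) -> str:
    """Map an age to its ten-year category, e.g. 54 -> '50-59'."""
    start = (age // 10) * 10
    return f"{start}-{start + 9}"


# Count how many patients fall into each ten-year category.
by_decade = Counter(decade_label(age) for age in ages)

for label in sorted(by_decade):
    print(label, by_decade[label])
print("mean age:", round(mean(ages), 1))
```

Printing the counts and the mean alongside the chart is a cheap way to confirm that the shape you see in the bars matches the numbers underneath.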

Now let us look at the following two graphs. We can see that there is far more exercise-induced angina in the 50 to 59 category. However, there is no strong correlation between exercise-induced angina and the number of heart attacks, although the visualisation suggests otherwise. If you are interested in the probabilities regarding exercise-induced angina and the number of heart attacks, please see the end of the article.

The dark blue line shows the proportion of heart attacks and the light blue bars show exercise-induced angina.

Angina is chest pain caused by reduced blood flow and oxygen to the heart muscles. It is a warning sign that you could be at risk of a heart attack. Exercise-induced angina is chest pain brought on by exercise.

Cholesterol by age. Pink is female (0) and blue is male (1).

This graph can tell us three things:

  1. Men have higher cholesterol than women.
  2. There are peaks in levels of cholesterol between the ages of 50 and 60.
  3. There is a trend between cholesterol and age with respect to gender.

What this graph does not tell us is that high cholesterol causes heart attacks. A correlation alone does not give us enough evidence from this dataset to draw that conclusion. We know from accepted medical knowledge that high cholesterol makes you more likely to have a heart attack, but we cannot conclude it from this data.

Take this to heart: Visualisations are not conclusions — they do not tell the whole story.

“Even in an era of open data, data science and data journalism, we still need basic statistical principles in order not to be misled by apparent patterns in the numbers.” — David Spiegelhalter.

Health warning: check your stats agree!

Sometimes you will be given data that is incomplete or not large enough, and companies will still want you to predict an outcome or trend reliably. Often you can blag your way through your results and talk your way through your findings, but ultimately you won’t be helping yourself or the company.

It may seem obvious, but to truly make an impact when analysing your data, you must first make sure that your data is reliable. Data is a newly found superpower that we can wield when making decisions, but it must be used responsibly and judiciously. The risks are highest when you rush your decisions and visualise data incorrectly.

Remember: Head over your heart.

Remember to check the basic statistics and probabilities for your data to stop yourself from making silly mistakes. Question yourself constantly to make sure you are confident about your insights: What do the simple statistics say? What does this mean in relation to the conclusions I am drawing? Are the proportions that I am using correct? Does my data agree with or contradict previously established truths? What are the probabilities in my dataset? These questions can help you down the road to making the correct visualisations and having an impact with your data analysis.

[1] Robert A. Kloner, The “Merry Christmas Coronary” and “Happy New Year Heart Attack” Phenomenon, 2004.

[2] —



