When analysing data, it's good practice to check whether the data are valid. I have a dataset in CSV format, and I have to explore it to know what I have. It took long, tedious work to write a proper SQL query that selected the data I needed, and now I have to make sure the data are of good quality: not too many missing values, and not too many outliers.
What's an outlier? It's an anomalous value with respect to the rest of the data: a value too distant from the others. Any statistical analysis would be unreliable if I didn't check for outliers.
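As a quick illustration (with made-up numbers, not my real dataset), R's built-in boxplot.stats() flags values that sit far outside the bulk of the data:

# Made-up weights in kg; 670 is clearly anomalous
weights <- c(62, 70, 75, 68, 81, 59, 670)
boxplot.stats(weights)$out   # values beyond the boxplot whiskers
[1] 670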
Let's have a look at my own experience. After great effort, I finally got my CSV file ready. The first thing I do:
patients <- read.table("research.csv", header=TRUE, sep=",")  # read the CSV into a data frame
attach(patients)    # make the columns accessible by name
summary(patients)   # quick overview of each variable
What I found was this:
Some variables have wrong values. For instance, peso (weight): it's rather hard to believe that someone weighs 670 kg. My first attempt to track down these values was:
max(peso)        # it gives the maximum value, but not where it is
[1] 670
which.max(peso)  # it gives the 'index', so I can find it
[1] 12
The problem is that this way I can only find these values one by one, which is not practical. But I can find all of them at once without having to write any extra function:
which(peso > 150)
[1]   12  219  386  688 1209 1729 2254
Now I know the index of every value that meets my criterion (weight > 150 kg).
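With those indices I can also inspect the full records to get more context. A minimal sketch, assuming the same data frame patients and column peso as above:

suspects <- which(peso > 150)
patients[suspects, ]   # look at the whole rows of the suspicious records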
Finally, it's up to me to decide what to do with these values: fix them or delete them? It depends. If I decide it's only a typing mistake, I can fix it. If I can't work out the correct value, it's better to delete the whole row.
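Either option takes one line in R. This is just a sketch with made-up numbers: the "corrected" weight of 67 kg for record 12 is an assumption for illustration, not a real value from my data.

# Option 1: fix a value I'm confident about (67 kg is a hypothetical correction)
patients$peso[12] <- 67

# Option 2: drop the rows whose true value I can't recover
patients <- subset(patients, peso <= 150)   # keep only plausible weights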
When doing research, one of the most important things to remember is: be honest. Do not fake your analysis.