There are two key words in this title, data and science. Although it should be obvious, the ability to excel in this very hot and exciting field of Data Science requires a very good understanding of these terms and the thought and background behind each of them. Let’s start with science. It is the less complex but more controversial of the two concepts.
Science is a process and a methodology. It is a breath mint and a candy mint. Science is used to gather knowledge. Its principles, or The Scientific Method, were developed during the Islamic Golden Age, about a thousand years ago (circa 1020), by Ibn al-Haytham. About three hundred years later, in the 13th century, Roger Bacon expanded upon al-Haytham's process and memorialized its four key parts: observation, hypothesis development, experimentation and independent validation; or, watching, guessing, testing and getting a second opinion. This little bit of history is sort of like the song All Along the Watchtower; Dylan wrote the tune, but Jimi Hendrix owns it. Same for al-Haytham and Bacon: al-Haytham was the "thinkeruper" but Bacon got the glory.
Without understanding, accepting and using science and the scientific method, participating productively in the contemporary field of data science seems impossible. Today, however, there is an odd circumstance: too many members of our society, including the emerging data community, lack some of the scientific basics. Without those basics, how can one even begin to think about data? We may have unwittingly created part of our own problem.
Current thinking in America is casting doubt on the validity of science and on the institutions of higher learning that teach it. We see and hear this in the field. Something is missing, and suppliers and buyers each know it; they're just not sure what it is. There is a sense they've missed a chapter or two. Questions are being asked where the questioners know they should already know the answers.
The jury may still be out on exactly what accounts for the shallow thought pools associated with scientific inquiry, but increasingly dismissive views of science are likely part of the reason. Recent findings on the perceived value of a college education may help explain this. There is definitely a rigor missing, and this post focuses on three important elements:
1) Data Hygiene
2) Data Visualization
3) Distribution Trimming
Partisan attitudes toward science have changed significantly over the past couple of decades, and not to the betterment of science. The purpose of this article is to recap the very basic science process and, specifically, how it applies to data science. Because attitudes shape our frame of reference, attitudes about science are important to this discussion. Politics and media play outsized roles in our lives today, and science has been a victim. Topics such as climate change, genetically modified crops and evolution have attached themselves to the field of science. In efforts to support particular views, science has been denigrated and relegated to the role of a persuasive tactic, one described by one side as unreliable and self-serving. Making matters worse and severely polarizing debate, some of these hot-button subjects have also become moralized and quasi-religious. Once a subject takes on moral significance, objectivity is lost.
The political party that controls all three branches of the Federal government doesn't care much for college these days. Fifty-eight percent of this political group believes college is having a negative effect on our country. This does not bode well for the advancement of scientific thinking and quantitative methods in the United States. College is where science is taught, where experiments are conducted, where results are debated and challenged, where students learn how to think, and where new ideas are imagined, started, financed and developed. We would not have Google Analytics without Google, and Google started in college. So did Facebook. So did The Doors, Public Enemy and Vampire Weekend.
Because of this, the validity of the full corpus of scientific inquiry has now been put into question. Epistemology is dying. When science is out of favor, there is less science: less research, less support of science, less understanding of science, less acceptance of its findings and less work in general.
But we love science. The Scientific Method is brilliant in its simplicity. It can be reduced to two key factors: validity and reliability. Let's start with validity, one of the 5 V's of data. That is, are we seeing (observing) what we think we are seeing? Is the experience true, real, valid? For example, if the leaves turn upside down before it rains, can that observation be turned into a predictor of rain? Fuck if I know, but let's test it. Get out the notepad and start recording observations of leaves turning over; record things such as date, time, the time period between upsidedownness and rain, and so forth. If there is a pattern, it will probably show up. Next, can the observations be replicated over and over again? Is there a pattern? Is the pattern replicable? That is to say, is it reliable? This is science at its simplest: validity and reliability. Why then, the big debate?
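The watch-guess-test loop can be sketched in a few lines. This is a minimal sketch with invented field notes (the observations below are made up for illustration): validity asks whether the recorded pattern is real, and reliability asks whether it holds when you repeat the tally on a fresh season of notes.

```python
# Hypothetical field notes: each entry records whether the leaves had
# turned upside down that afternoon and whether it rained within 24 hours.
# (Invented data for illustration; a real notebook would hold your own logs.)
observations = [
    {"leaves_flipped": True,  "rained": True},
    {"leaves_flipped": True,  "rained": True},
    {"leaves_flipped": True,  "rained": False},
    {"leaves_flipped": False, "rained": False},
    {"leaves_flipped": False, "rained": True},
    {"leaves_flipped": False, "rained": False},
    {"leaves_flipped": False, "rained": False},
    {"leaves_flipped": True,  "rained": True},
]

def rain_rate(obs, flipped):
    """Share of days it rained, given the state of the leaves."""
    subset = [o for o in obs if o["leaves_flipped"] is flipped]
    return sum(o["rained"] for o in subset) / len(subset)

print(f"P(rain | leaves flipped)     = {rain_rate(observations, True):.2f}")
print(f"P(rain | leaves not flipped) = {rain_rate(observations, False):.2f}")
```

If the two rates stay far apart across repeated seasons of notes, the leaf flip starts looking like a reliable predictor; if they converge, the pattern was noise.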
Data. That’s why there’s a big debate.
Data is the subject matter, the sine qua non of data science. But whose data? Where did they get it? Who got it? When did they get it? How did they get it? How did they select what they got? What is its quality? What is its quantity relative to the universe of observations? How representative is the sample? Did they get enough of it? What did they do to the data after they got it? Was bias introduced by design? Was bias introduced by accident? Was bias introduced due to lack of forethought? Whose ax are they grinding? Whose ox is being gored? Let's leave these questions for another time and get to the data per se: what does it look like? How is the data distributed?
An important step in analyzing data is looking at its distribution. What is its picture? What is the gestalt? This was a theme and a discipline that was drilled into our heads when we were in science school. Plot the data. It is usually illuminating to see the data points laid out in front of you. The terrific thing about today's software applications is that this is now easy peasy; it takes minimal effort to use the graphing tools in Excel, data.world, SPSS, SAS, Alteryx, Lityx IQ, R, and so on. Need a fancier presentation? Use Tableau. Once data is displayed, its patterns and anomalies present themselves like mushrooms after a rain. All you have to do is look. The first step, however, is data hygiene.
Data hygiene is a set of processes for cleaning data. This includes standardizing formats, ensuring the correct content is in the correct fields, that incorrect content is corrected or deleted, that nulls are real, that 0's are real, and that the rules for including or discarding data have been developed, implemented and followed. As simple as this appears, it is a challenge in most businesses. Most organizations lack data governance practices, and without them, data collection and retention efforts are ad hoc. As a consequence, there is typically data all over the place, in varying degrees of accuracy and completeness.
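A couple of those hygiene rules can be sketched as simple checks. Everything here is a hypothetical illustration: the field names, and the rule that blank strings and "N/A" count as nulls, are assumptions for the sketch, not a standard.

```python
# Hypothetical hygiene pass: standardize formats and separate real nulls
# from placeholder junk. Field names and null tokens are illustrative
# assumptions, not a standard.
NULL_TOKENS = {"", "n/a", "na", "null", "none", "-"}

def clean_value(raw):
    """Return None for placeholder nulls, else a trimmed string."""
    if raw is None:
        return None
    text = str(raw).strip()
    return None if text.lower() in NULL_TOKENS else text

def clean_zip(raw):
    """Keep only 5-digit ZIPs; anything else is treated as missing."""
    value = clean_value(raw)
    if value is None:
        return None
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits[:5] if len(digits) >= 5 else None

record = {"company": "  Acme Corp ", "zip": "60601-4417", "phone": "N/A"}
cleaned = {"company": clean_value(record["company"]),
           "zip": clean_zip(record["zip"]),
           "phone": clean_value(record["phone"])}
print(cleaned)  # {'company': 'Acme Corp', 'zip': '60601', 'phone': None}
```

The point is that "N/A" becomes a genuine null rather than a string that would silently pass through a count or a join, and the ZIP+4 collapses to a consistent 5-digit format.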
Without data governance, there are few standards and little oversight. Each corporate or divisional silo does what it considers useful for its parochial set of needs. This is a very common practice. It also worked more or less OK until data utilization and data analytics became part of the current corporate lexicon. Consider the example records below, all from the same fictitious company. These contact records allow this organization to stay in touch with its customers. Once these records are cleaned and standardized, they can be put to use in a Customer Relationship Management platform such as Salesforce, Commence, Insightly, etc., or a Contact Management system such as Pipedrive, ActiveCampaign, etc. There are lots of excellent tools once the data is clean and ready for use.
DIRTY RECORD #1
This record has sloppy, ad hoc input, with fields containing relevant content, e.g., client name, client contact, address (all in one field), phone number (all in one field, with extension), region (e.g., SW), date of last contact, and internal rep (Jim Franks, or JF, etc.).
CLEAN OR STANDARDIZED RECORD
Now the usefulness of this data can be pursued. The various files have been consolidated. The format has been standardized. The address records have been partitioned and placed into useful, sortable, analyzable fields. The contact names have been placed into similarly sortable and actionable fields. In the course of performing very basic list hygiene, some records were discarded because the address no longer exists, or the contact person is no longer with the organization, etc. It is better to have relatively fewer good and usable records than a larger set of records containing inaccuracies.
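The dirty-to-clean step above can be sketched in code. The record, the field names, and the initials lookup table below are all invented to mirror the description of the dirty record; a real cleanup would need far more defensive parsing than this sketch.

```python
import re

# A hypothetical dirty record mirroring the description above: address and
# phone each crammed into one field, the rep sometimes entered as initials.
dirty = {
    "client": "acme corp",
    "contact": "Pat Smith",
    "address": "123 W Lake St, Chicago, IL 60601",
    "phone": "(312) 555-0147 x22",
    "region": "SW",
    "rep": "JF",
}

REP_INITIALS = {"JF": "Jim Franks"}  # assumed lookup table

def standardize(record):
    """Split combined fields into sortable, analyzable columns."""
    street, city, state_zip = [p.strip() for p in record["address"].split(",")]
    state, zip_code = state_zip.split()
    match = re.match(r"\((\d{3})\)\s*(\d{3})-(\d{4})(?:\s*x(\d+))?",
                     record["phone"])
    area, prefix, line, ext = match.groups()
    return {
        "company": record["client"].title(),
        "contact": record["contact"],
        "street": street, "city": city, "state": state, "zip": zip_code,
        "phone": f"{area}-{prefix}-{line}", "extension": ext or "",
        "region": record["region"],
        "rep": REP_INITIALS.get(record["rep"], record["rep"]),
    }

clean = standardize(dirty)
print(clean["company"], clean["zip"], clean["phone"], clean["rep"])
```

Once the address, phone and rep fields are partitioned like this, the records can be sorted, deduplicated and loaded into a CRM.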
Next, look at the data by plotting various sets. This step confirms that the detail makes sense. For example, take a sample of company names and ZIP Codes.
It is clear from the chart that all of the companies are in the Chicago area except two outliers. This is a simple and obvious illustration, but it shows the value of plotting data for the purpose of visualization. Do this exercise for as many variables as are important to your research.
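The same check can be run programmatically before (or instead of) charting. The company names and ZIP codes below are invented to mirror the chart described above; the 3-digit-prefix rule is a simplifying assumption that works because Chicago-area ZIPs share the 606 prefix.

```python
from collections import Counter

# Hypothetical sample of (company, ZIP) pairs mirroring the chart above.
companies = [
    ("Acme Corp", "60601"), ("Lakeside LLC", "60614"), ("Wabash Inc", "60605"),
    ("River North Co", "60654"), ("Loop Partners", "60603"),
    ("Sunbelt Trading", "85004"),   # Phoenix: an outlier
    ("Gulf Services", "33101"),     # Miami: another outlier
]

# Group by 3-digit ZIP prefix; Chicago-area ZIPs share the 606 prefix.
prefixes = Counter(zip_code[:3] for _, zip_code in companies)
dominant = prefixes.most_common(1)[0][0]

outliers = [name for name, zip_code in companies if zip_code[:3] != dominant]
print("Dominant prefix:", dominant)   # 606
print("Outliers:", outliers)          # ['Sunbelt Trading', 'Gulf Services']
```

A scatter of the same pairs makes the two stragglers jump out visually; the code simply formalizes what the eye does on the chart.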
The other key step in the early stage of data analytics is evaluating the three measures of central tendency: the average (the mean), the most frequently occurring value (the mode) and the value that splits the data set into two equal groups (the median). This essential step describes the data around its roots or core. It also identifies outliers. The outliers are often removed because they are not reflective of the activity as a whole. The outliers can also point research in a new and important direction. Look at the distribution of IQ scores taken in a 9th-grade class in a rural town in Mississippi. Most of the scores are what you would expect to see, except one.
Based on this anomalous record, the researcher in this case assigned a caseworker to follow up with the student who scored 160. The young girl was indeed exceptionally bright. After understanding her needs and home life, she was moved into an academic program developed to meet the needs of highly gifted teenagers.
The other key conclusion from looking at this data is that Gifted Girl should be dropped from the data set. She is not reflective of the group as a whole. By doing this, the distribution is trimmed. Arithmetically, the mean with Gifted Girl is 104 and without her, 100. The median, however, stays the same at 100, as does the mode at 100. The next step in prepping the data for utility is trimming the distribution by removing the outliers. This method produces the "poor man's median," and it's generally sufficient.
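The arithmetic above can be checked directly with the standard library. The class scores below are invented to match the numbers in the text: 14 typical scores averaging 100, plus the one 160.

```python
from statistics import mean, median, mode

# Invented 9th-grade IQ scores chosen to match the arithmetic in the text:
# 14 typical scores averaging 100, plus one 160 outlier.
typical = [85, 90, 92, 95, 98, 100, 100, 100, 100, 102, 105, 108, 110, 115]
scores = typical + [160]

print(mean(scores), median(scores), mode(scores))      # mean 104, median 100, mode 100
print(mean(typical), median(typical), mode(typical))   # mean 100, median 100, mode 100
```

The single outlier drags the mean four points, while the median and mode don't budge; that robustness is exactly why trimming mostly affects the mean.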
As a rule of thumb, toss out anything beyond three standard deviations of the mean. IQ scores have an average of 100 with a standard deviation of 15. In this example, any value > 145 or < 55 gets the boot. This is important because most predictive and correlational statistics rely on exponentiating the variances from the mean; typically, this is simply squaring the difference between the average and an individual data point.
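That rule of thumb is nearly a one-liner. The sketch below hardcodes the IQ parameters from the text (mean 100, standard deviation 15); for other data you would estimate both from the sample itself.

```python
def trim_three_sigma(values, mu=100.0, sigma=15.0):
    """Drop values more than three standard deviations from the mean.

    mu and sigma default to the known IQ parameters used in the text;
    in practice, estimate them from the data.
    """
    low, high = mu - 3 * sigma, mu + 3 * sigma   # 55 and 145 for IQ
    return [v for v in values if low <= v <= high]

scores = [85, 90, 100, 100, 102, 115, 160]
print(trim_three_sigma(scores))  # [85, 90, 100, 100, 102, 115]

# Squaring shows why outliers dominate downstream statistics:
# (160 - 100)**2 == 3600, sixteen times the pull of (115 - 100)**2 == 225.
```

The squared-deviation comment is the whole motivation: one untrimmed outlier can contribute more to a variance or correlation than the rest of the class combined.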
This idea of exponentiating variances around the mean is no mean feat. It is a significant departure from the theme of this post and the subject of another post, on the influence of Karl Pearson, the father of modern correlational techniques.
There are two points to remember from this post. One, science is good; stand up for it. Two, the missing rigors of data hygiene, data visualization and distribution trimming are important basics. Once used as part of an analytics process, or preferably as part of a data governance plan, the quality of investigative thinking will improve, throughput on the analytical process will increase, and ultimately there will be more productive utilization of information resources.