Identifying Data Bias Early
Everyone loves an employee who comes to work early, a student who arrives early to class, or a first date that shows up early.
For analytics, and especially for machine learning, early is the best time to discover data bias.
Data bias is a systematic shift in data away from what the data is meant to represent. For machine learning, that shift creates a dangerous signal that can mislead a model. Machine learning forms its judgments from data, and that data is the basis for a regression model, a predictive model, or a decision tree.
Some types of bias
What kinds of bias can occur? There are a few kinds, but two of the most common when planning data for modeling are selection bias and confirmation bias.
Selection bias occurs when data is selected subjectively rather than objectively, or when the sample is not random. Subjectively selected data yields a sample that does not represent the actual population, so the results drawn from it are skewed.
Surveys provide a good example of where selection bias can be introduced. Surveys are sometimes sent to a select group of people rather than at random. That initial selection introduces a bias before the response is even given.
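The survey case can be illustrated with a small simulation. This is only a sketch under stated assumptions: the customer population, the "frequent buyer" flag, and the satisfaction scores are all invented for illustration, and the function name simulate_survey is hypothetical. It compares the average score from a random survey with the average from a survey sent only to a select group.

```python
import random
import statistics


def simulate_survey(seed=0, population_size=10_000, sample_size=500):
    """Return (population mean, random-sample mean, selected-sample mean)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Hypothetical population of (is_frequent_buyer, satisfaction_score) pairs.
    # Assumption for illustration: frequent buyers tend to score higher.
    population = []
    for _ in range(population_size):
        frequent = rng.random() < 0.2
        center = 8.0 if frequent else 5.0
        score = min(10.0, max(1.0, rng.gauss(center, 1.5)))
        population.append((frequent, score))

    def mean_score(rows):
        return statistics.mean(score for _, score in rows)

    # Random survey: every customer has the same chance of being asked.
    random_sample = rng.sample(population, sample_size)

    # Selected survey: only frequent buyers are asked.
    frequent_only = [row for row in population if row[0]]
    selected_sample = rng.sample(frequent_only, sample_size)

    return (
        mean_score(population),
        mean_score(random_sample),
        mean_score(selected_sample),
    )


true_mean, random_mean, selected_mean = simulate_survey()
print(f"population mean:      {true_mean:.2f}")
print(f"random survey mean:   {random_mean:.2f}")
print(f"selected survey mean: {selected_mean:.2f}")
```

In this made-up setup the random survey should land close to the population average, while the selected survey overstates satisfaction before a single response has been analyzed.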
Confirmation bias is the opposite of hypothesis-based analysis. It occurs, for example, when a survey sets out to confirm a fully formed opinion rather than to test a hypothesis. The analysis looks for data to support the opinion, rather than forming a theory and planning an experiment to test whether the data supports it.
It’s not always easy to detect bias early, or even before algorithms start working on live data sets. Data drift is an example: changes in the data accumulate over time as the systems that produce the data are changed and updated. It can happen even when the sample data has no errors but fails to account for an outlier, or for system changes that occur with live data. As a result, a model that was effective during development can become inaccurate in a live application.
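One way to watch for drift is to compare the sample a model was built on with the data it later sees. The sketch below uses the population stability index (PSI), a technique chosen here for illustration rather than one named in this article. The samples are simulated, and the thresholds mentioned afterward are a commonly cited rule of thumb, not a guarantee.

```python
import math
import random


def population_stability_index(reference, live, bins=10):
    """Compare two numeric samples; larger values mean they differ more."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range go to the first or last bin.
            idx = int((v - lo) / width)
            counts[min(bins - 1, max(0, idx))] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = shares(reference)
    live_shares = shares(live)
    return sum(
        (l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares)
    )


rng = random.Random(1)
# Hypothetical data: the sample used to build the model ...
reference = [rng.gauss(50, 10) for _ in range(5000)]
# ... live data from an unchanged system ...
live_same = [rng.gauss(50, 10) for _ in range(5000)]
# ... and live data after an upstream change shifted the values.
live_shifted = [rng.gauss(58, 10) for _ in range(5000)]

psi_same = population_stability_index(reference, live_same)
psi_shifted = population_stability_index(reference, live_shifted)
print(f"PSI, unchanged system: {psi_same:.3f}")
print(f"PSI, shifted system:   {psi_shifted:.3f}")
```

A PSI below roughly 0.1 is often read as stable and a value above roughly 0.25 as a shift worth investigating. Running a check like this on a schedule gives an early warning that the live data no longer resembles the sample.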
There are a few steps that can be implemented to keep the impact of bias minimal.
- Start With Simple Prototype Models. Doing so highlights categorical problems or bad values. A good evaluation of a prototype should give some indication of what basic data errors exist. Very few models can handle empty fields. Users should ultimately understand data quality within a model, noting how it handles missing input data and how the inputs relate to the response variable (the variable that depends on the input data). One simple algorithm for prototyping with categories is K Nearest Neighbors, which labels each data point according to the labeled points closest to it. It can help in some instances to show how poorly categorical variables have been selected. There can be missing variables, bad numbers in the data, or too many levels, which introduce more variables than necessary. Discussing what a model can handle can prompt a discussion of how well the categories fit the objective of the model.
- Identify Why Outlier Data Exists. Without a solid grounding in statistics, it is easy to overlook the significance of an outlier. As a result, the outlier is included in building a predictive model. Not all outliers are bad, but their existence can be an important indicator of problems. Including an unexamined outlier can skew the data, diminishing accuracy for machine learning initiatives. Be ready to explain outlier activity, or at least hypothesize why it exists, during a data mining exercise.
- Identify How Collected Data Is Distributed. This may sound like a basic data exploration task, but it also serves as an important defense against exposure to bias. Statistical metrics help in describing how normal the data is. Take skewness, for example. Skewness describes a dataset’s departure from the symmetry of a normal bell curve, and indicates whether values trail off further on one side of the center, pulling the mean away from the median and mode. Skewness is not a difficult statistic to calculate. You can import the data into an R data frame and use a library to calculate it. But walking through an exercise in determining skewness on a given dataset can encourage a more objective viewpoint rather than overreliance on expectation.
- Confirm Your Objective With Other Professionals. Finally, the data mining community works much like an open source developer community, meaning that someone may have encountered your issues or can help with simple questions on evaluating data. Use a combination of collaborative platforms such as GitHub, Slack, and Stack Overflow to ask your questions (and be a good steward of any help you receive by contributing help back to the communities that you join).
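The first three steps above can be sketched in a few lines of code. The example uses Python's standard library instead of the R approach mentioned earlier, and everything in it is an assumption made for illustration: the records, the column meanings, and the helper names. The outlier check uses the common 1.5 × IQR fence, and skewness is computed as the third standardized moment.

```python
import statistics
from collections import Counter

# Hypothetical records: (region, order_value). None marks an empty field.
records = [
    ("north", 42.0), ("south", 38.5), ("north", 40.1), ("east", None),
    ("south", 44.3), (None, 39.9), ("west", 41.7), ("north", 43.2),
    ("east", 37.8), ("south", 250.0), ("west", 40.6), ("east", 42.9),
]


def missing_counts(rows):
    """Count empty fields per column."""
    return {
        "region": sum(1 for region, _ in rows if region is None),
        "order_value": sum(1 for _, value in rows if value is None),
    }


def category_levels(rows):
    """Count how often each category level appears, ignoring empty fields."""
    return Counter(region for region, _ in rows if region is not None)


def iqr_outliers(values, k=1.5):
    """Return values outside the fences q1 - k*IQR and q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


values = [value for _, value in records if value is not None]
outliers = iqr_outliers(values)
without_outliers = [v for v in values if v not in outliers]

print("missing fields:  ", missing_counts(records))
print("category levels: ", dict(category_levels(records)))
print("outliers:        ", outliers)
print(f"skewness with outliers:    {skewness(values):.2f}")
print(f"skewness without outliers: {skewness(without_outliers):.2f}")
```

Flagging a value this way is only a starting point. Whether the 250.0 order is a data entry error or a genuine large purchase is the question the outlier step asks you to answer before deciding to keep or drop it.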
Bias won’t go away entirely. But marketers must minimize it early so that it does not scale and cause larger problems downstream, especially when machine learning applications are to be deployed on large data sets. The best way to do this is to ensure that data exploration methodologies are clear and easily explainable, so that others can help ensure objectivity and advance analytical success.