Learning data and analytics are essential in any modern L&D department. There is more data available than ever to provide insights into what works and what doesn’t, how L&D is making an impact, and where there are opportunities to improve. Being able to correctly interpret data and gain insights through analytics is rapidly becoming a crucial skill for many L&D professionals. Not being able to do so, or interpreting data incorrectly, can be rather problematic. In a worst case scenario, it could even cause you to make the wrong decisions and continue doing things that do not work (or that are even counterproductive).

There are many challenges with correctly interpreting data. Sometimes results are so counterintuitive that they are hard to believe. A classic example is the fact that data clearly shows that increased ice-cream sales lead to more people dying from drowning. I’ll let you think about that one for a while and will come back to this example later.

More often, data is presented in such a way that it supports a pre-defined story. Data can even be presented in such a way that there is a high probability it will be interpreted incorrectly. “Lies, damned lies, and statistics” is a much-used quote in data science and statistics for a very good reason (not a quote from Mark Twain, as popular belief has it, but commonly attributed to British prime minister Benjamin Disraeli).

The biggest challenge, however, is our own ability to correctly perform analysis following the scientific method and, as part of that, to correctly analyze the data and draw the right (unbiased) conclusions. This is not just a challenge for us as L&D professionals, but even for seasoned scientists, according to a 2011 paper that actually claims most scientific research is wrong!

A very clear, although somewhat technical, video on why most scientific research is wrong! You might need to brush up on some Statistics 101 to fully understand the importance and message of the video. For more information on the misuse of the p-value, see also Wikipedia.
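To make the p-value problem a bit more tangible, here is a minimal back-of-the-envelope calculation in Python. The numbers (share of true hypotheses, statistical power) are assumptions I picked purely for illustration, not figures from the video or the paper:

```python
# Illustrative only: assumed numbers, not taken from the video or the paper.
# If only a small share of tested hypotheses are actually true, and studies
# have limited statistical power, a surprising fraction of "significant"
# (p < 0.05) findings are false positives.

prior_true = 0.10   # assumed: 1 in 10 tested hypotheses is actually true
alpha = 0.05        # significance threshold (false positive rate)
power = 0.50        # assumed: chance of detecting a true effect

true_positives = prior_true * power          # 0.050
false_positives = (1 - prior_true) * alpha   # 0.045

share_false = false_positives / (true_positives + false_positives)
print(f"Share of 'significant' findings that are false: {share_false:.0%}")  # ~47%
```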

It’s essential that we know how to read data, data visualizations and statistics so we can correctly interpret data-driven insights. Hence I’ve written a series of blog posts on typical misinterpretations and misleading data and analytics practices to enable you to build a correct data-driven story.

This first part will cover the use and misuse of assumptions.

Assumptions are the mother of all F*** Ups

I’m actually not sure who first coined the title of this section. But there is a huge number of articles and books available on why we humans make so many assumptions and why we should make fewer of them, especially in a business and analytics context. In short, the reason we make assumptions is the simple fact that we can never, ever know all relevant pieces of information. Never, ever! And because we lack information but still want to draw conclusions and take action, we compensate for the missing data by ‘filling’ the gaps using assumptions based on our own personal experiences. That sounds rather disturbing, but making assumptions is a deeply human trait and one of the reasons we still exist as a species: if we meet a dangerous animal in the wild, we do not have sufficient information on what the animal will do. Will it look at us and remain passive? Will it just walk by? Or will it attack? Our ancestors did not have the luxury to wait until they had all the missing pieces of information, but rather assumed that large, mean-looking animals in the wild were dangerous and ready to attack. So they stayed well clear of them.

There are however 2 major challenges with assumptions in the context of data analysis. First, you can have too many assumptions, and second, you can be unaware that you are making assumptions.

That is why I wanted to start this series on data interpretation in learning analytics with the topic of assumptions. In any analysis there is a real risk of both of these challenges surfacing! It is actually almost impossible to do any analysis without assumptions, hence I always stick to the following 3 rules:

  1. Limit the number of assumptions to the absolute minimum
  2. Always record every assumption you make
  3. Test your assumptions periodically to ensure they still hold up

A couple of examples from the world of L&D

1. The mother of all assumptions in L&D: Learning only takes place through training

A classic image from Jane Hart (C4LPT) on ‘the learning police’ (source)

By far the most used assumption in L&D, and arguably the most misleading, is the assumption that people only learn things when they attend activities organized by L&D, or only when they consume content created for them by L&D. This assumption is so widely present and embedded in our thinking and way of working that some people even get angry when you address it. I am always puzzled why we feel so strongly about this. Possibly we have inherited this assumption from our education system, which also proves to be very persistent with the assumption that learning only happens at school. This despite people like Albert Einstein claiming that “Education is what remains after one has forgotten everything he learned in school. It is a miracle that curiosity survives formal education.”

People learn everywhere, all the time: through work, by looking things up on Google, by asking colleagues or simply by trying and failing. So whatever analysis we do on learning using data from our own learning infrastructure, however extensive that infrastructure may be, we must always take this assumption into consideration.

I think that the 70-20-10 model was an effort to bring the idea that learning happens outside our programs more into our L&D strategies and thinking. I do not think the 70-20-10 thinking is fully embedded, but it did help (as long as we do not assume the 70-20-10 percentages are to be followed with high precision).

When we take this assumption into account in our learning analytics efforts, we should also ask ourselves how much people actually learn through L&D programs versus other means…

Survivorship Bias

There is a famous example in data science that even got its own definition and Wikipedia page, called “survivorship bias”. It is the “logical error of concentrating on the people or things that made it past a specific selection process and overlooking those that did not, typically because of their lack of visibility” (source: Wikipedia). The definition originates from a WWII study on bullet holes in Allied aircraft returning from raids over Germany. The purpose of the study was to figure out how the air force could reinforce specific parts of the aircraft to increase survival rates.

Survivorship Bias

Looking at where the aircraft were showing bullet holes, it would be perfectly natural to focus our attention on the areas that were hit most, and reinforce those. This was exactly what the US military planned to do, until the statistician Abraham Wald suggested they should instead reinforce the areas that had the least damage. His logic was that there was no information available on the aircraft NOT returning, and it was more likely that the aircraft not returning were actually hit in these areas.
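If you want to see the mechanics of this bias for yourself, here is a small, purely hypothetical simulation in Python. The aircraft sections and loss probabilities are made up for illustration; the point is only that the planes you get to inspect are not a representative sample:

```python
import random

random.seed(42)

# Hypothetical illustration of survivorship bias: planes are hit in random
# sections, but hits to some sections (engine, cockpit) are far more likely
# to bring the plane down. We then only get to inspect the planes that return.
sections = ["engine", "cockpit", "fuselage", "wings", "tail"]
loss_probability = {"engine": 0.8, "cockpit": 0.7, "fuselage": 0.1, "wings": 0.1, "tail": 0.1}

hits_on_returning = {s: 0 for s in sections}
hits_on_all = {s: 0 for s in sections}

for _ in range(10_000):
    section = random.choice(sections)          # where this plane is hit
    hits_on_all[section] += 1
    if random.random() > loss_probability[section]:
        hits_on_returning[section] += 1        # plane survived and is inspected

print("Hits we observe (returning planes):", hits_on_returning)
print("Hits that actually happened:       ", hits_on_all)
# The returning planes show few engine/cockpit hits, not because those
# sections are rarely hit, but because planes hit there rarely come back.
```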

So next time you are considering insights and conclusions on specific employees who are less engaged with your LMS or LXP, remember this bias. It does not mean that they are not active learners… They could just be learning new skills and behaviors without using your L&D programs and infrastructure!

2. Assumptions on the correctness and objectivity of data

We have a tendency to assume that data is always right. Data is data, right? Data does not lie and data is always objective. So when a thorough analysis is done on data, we should have everything covered, right?

Well…not quite.

All data that we use should be scrutinized for hidden assumptions and subjectivity. Most data we use is, after all, still generated by people. And people make assumptions. What is more, data generated by people is always subjective. Not necessarily in a negative way (although the extreme end of the subjectivity scale is when this turns into bias). We often do not even realize the level of subjectivity involved. So even if data does not lie and is neutral (which is in essence true, as data has no voice and no opinion), the source of the data, aka us, is for sure not neutral! So we should always consider possible assumptions related to the sources of data we use for our analysis. If we don’t, we run the risk that our conclusions are completely wrong.

A few examples

Example 1: Data on the Skills of the future

There is a lot going on around future skills. And the use of data, analytics and AI is ever more important to create insights into what skills people have, what skills the organization requires and what skills will be really important in the future. The CEO of Kimo, a Dutch start-up that I have been following for a while, recently shared a nice article on how data and AI can help in skills matching. There is a lot of potential, and very serious analytics and AI work is being done in this field. However, there is one big challenge with all this work: the data on which these advanced analytics are typically performed holds a lot of very crucial assumptions that are almost never mentioned, but are hugely important to take into your story:

Skills self-assessments: When we want to understand what skills we have in the organization, we still mainly use skill self-assessments, where employees identify and rate the skills they have along a proficiency scale. In other fields of data science, collecting data through surveys is avoided as much as possible. Yes, it is easy and cheap, but it also holds a lot of disadvantages. A crucial disadvantage of this method is that it includes a lot of assumptions, such as the following:

  1. That everybody equally understands the meaning of each skill in the list
  2. That everybody equally understands the meaning of each proficiency level
  3. That everybody equally understands the characteristics of every skill-proficiency combination
  4. That people have a very accurate understanding of their own skills and proficiency level
  5. That everybody is fully honest when taking the assessment

None of the above assumptions can easily be proved correct, making every conclusion drawn from data gathered through self-assessments of limited value.

Skills in demand: When we want to understand what skills our organization needs, or what skills are in high demand in the workplace, we often use 2 data sources. First, we question companies on their recruitment plans, like LinkedIn does every year (the 2022 version should come out soon!). The second method used, or more accurately the second dataset used, is processing job postings and searching for skills mentioned in these postings through natural language processing (NLP). NLP is a great way to turn text-based information into structured and informative data that can be analyzed. However, the one big assumption we make in using either method for data collection is that the recruiter and manager know exactly what skills are required to be successful in the role they post. That is a big assumption!
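As a (deliberately naive) illustration of the job-posting approach, the sketch below matches a hypothetical skill list against hypothetical posting texts with simple keyword matching. Real NLP pipelines are far more sophisticated, but they rest on the same assumption: the posting text reflects the skills the role actually requires.

```python
import re
from collections import Counter

# Naive, hypothetical sketch: extract skills from job postings by keyword
# matching against a predefined skill list. The skills and postings below
# are made up for illustration.
SKILLS = ["python", "sql", "stakeholder management", "data visualization"]

postings = [
    "We are looking for an analyst with strong SQL and Python skills.",
    "The ideal candidate excels at stakeholder management and data visualization.",
]

skill_counts = Counter()
for posting in postings:
    text = posting.lower()
    for skill in SKILLS:
        if re.search(r"\b" + re.escape(skill) + r"\b", text):
            skill_counts[skill] += 1

print(skill_counts.most_common())
# Whatever the counts say, they only measure what recruiters chose to write,
# not what the job actually requires.
```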

Future skills identification: Similar to the above example of skills needs, much of the ‘future’ skills research is done using the same data sources: interviews with business and HR leaders, and job posting data. So the same assumption holds: how well can we define the skills required to be successful in a job? But there is a second assumption that is possibly even more important to take into consideration: we assume that current skills will also be key in future jobs, but how do we deal with future skills that do not even exist today?

The troubling thing with the above examples is that sometimes people compensate for the low reliability of the data by increasing the complexity of the analytics process. This is a classic example of GIGO: garbage in, garbage out. No matter how good your analytics (or AI for that matter!) is, if the data going into it is of low quality, so are your results.

Example 2: Learning Impact based on Evaluation data

We in L&D still love our course evaluations. But it really hurts every time I see claims that learning makes an impact when that claim is based only on data taken from course evaluations. The reason this hurts so much is because that claim is false. But so many of us are buying into it. Using evaluation data alone, it is impossible to prove a direct correlation, let alone causation, between how people personally perceive the quality and relevance of a training and whether they perform better in their role.

At best, evaluation data can provide you with a ‘perceived business impact’ rating, but even that one is the shakiest of all shaky data-driven claims in L&D. The reason? Simple: people making this claim include way too many assumptions!

Key among these assumptions would be the following:

  1. Assuming that participants actually learned something
  2. Assuming that the learning sticks
  3. Assuming that participants can apply what they have learned in their role
  4. Assuming that what people apply from the training actually improves their personal performance
  5. Assuming that the improvement of the personal performance of the participant contributes to improvement of overall business performance
  6. Assuming no other major influencing factors contribute to better business performance
  7. Assuming that no barriers exist in the day-to-day work of the participant that prevent them from any of the above…

Assumptions 1-3 are actually examples of areas where we can test the assumptions: to test that participants actually learned something, we introduce tests and knowledge checks. To test if the learning sticks, we can repeat testing over time (spaced learning, for example). To test the assumption that people apply what they have learned, we can use post-training surveys with participants and managers, although ideally we start to bring in business (performance) data to confirm this, as we have learned that survey data is not the most reliable source.

However, while we are very fond of using tests and knowledge checks at the end of our learning programs (and through this are pretty safe on assumption 1), spaced testing over time and testing the application of knowledge and skills in the workplace are not (yet) really embedded, and are not used as consistently as knowledge checks.

More important and relevant for this article are assumptions 4-7, and all the relevant assumptions I have not mentioned here. They are even more crucial, and for sure much more complex to prove right. Especially number 6 (no other major influencing factors contribute to better business performance) will require serious thinking. But when you are serious about learning impact measurement, it is absolutely essential to take into consideration all other factors that influence business performance, as there are potentially so many.

Example 3: Learning (meta)data

The third and final example of data used for analytics that holds assumptions is the data we generate ourselves when we create learning experiences. All data describing these learning experiences can be called ‘metadata’. Typically we record things like title, description, type of experience, duration, training brand or provider, etc. I’ve described earlier why having accurate metadata is crucial for a positive learner experience, and crucial for analytics. But there is an additional reason why complete and accurate metadata is crucial: to minimize and control assumptions. In order for learning analytics results to be as accurate as we can get them, we must ensure that the metadata we create in our learning and knowledge systems holds a minimum of assumptions. Because more likely than not, it actually holds way too many!

A few examples:

Learning hours: Although avoided by some, I still believe that learning hours are key for many valuable learning insights, as I shared in a previous article. There is however a limitation with using and measuring learning hours: we can never determine the exact effort spent on learning. It will always be an estimate. So if we put ’30 minutes’ as the duration, we make the assumption that the average time to complete is 30 minutes. But do we ever check that, I wonder? And do we ever correct the 30 minutes if data shows people on average only take 20 minutes to complete? Or 40? I’ll bet the majority of us do not validate the assumption that “this training experience takes on average xx minutes to complete”.
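Validating that duration assumption does not have to be hard. The sketch below (in Python with pandas, using hypothetical table and column names) compares the duration recorded in the metadata with the median completion time observed in the platform data, and flags courses where the two diverge:

```python
import pandas as pd

# Minimal sketch with hypothetical data: compare the duration recorded in
# the metadata with the median completion time actually observed.
completions = pd.DataFrame({
    "course_id": ["C1", "C1", "C1", "C2", "C2"],
    "minutes_spent": [22, 18, 25, 65, 70],
})
metadata = pd.DataFrame({
    "course_id": ["C1", "C2"],
    "listed_minutes": [30, 60],
})

observed = (completions.groupby("course_id")["minutes_spent"]
            .median().rename("observed_minutes").reset_index())
check = metadata.merge(observed, on="course_id")
check["deviation_pct"] = (check["observed_minutes"] - check["listed_minutes"]) / check["listed_minutes"] * 100

# Flag courses where the listed duration is off by more than 25% (arbitrary threshold)
print(check[check["deviation_pct"].abs() > 25])
```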

Proficiency level: A few years back, research was done on the proficiency levels of the training catalogue at a large global company. The outcome of this research was that most of the learning content was at a lower proficiency level than mentioned in the learning needs analysis documents and (more importantly for learning analytics) in its metadata on the LMS. So you might be thinking that you’re doing an expert-level NLP course, while in reality it is at beginner level. This example demonstrates that we assume the content of a course is at the proficiency level we stated in the LNA and the metadata. But as with learning hours, do we ever validate this? How often does the intended content change during the design and development of a learning experience? How many of us have controls in place to safeguard that we design and build content according to what we set out to do? How many of us have clear standards to ensure we all have the same understanding of proficiency levels, to avoid that what I think is ‘beginner’ is considered ‘advanced’ by somebody else?

Building skills through training: Ideally we have a solid L&D Data Strategy in place that enables us to link learning content with skills and a proficiency level. That way people can select what skills they want to develop and either get recommendations through AI or search for relevant content themselves. It also provides a great opportunity for analytics by making it possible to clearly track whether people are consuming content in line with your skills strategy and needs. It can create interesting visuals like the one below from mylearninginsights.com. But the fact that we have linked content with a skill and proficiency level does NOT mean that the participant has gained the knowledge, or built the skill, after consuming the content. That is a big assumption that I have addressed earlier. Unfortunately, many content providers still claim that consuming content = building skills without mentioning this very significant assumption, which is a shame, as it causes many of us to oversimplify the art and science of building skills.

The “Skills” view of mylearninginsights.com, presenting learning hours by subject (skill group) and proficiency level. Notice that the total percentage does not add up to 100. This is because 23% of learning hours do not have a skill and/or proficiency level associated with them. A nice example of how you can combine 2 sets of information in one view.
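For what it is worth, the underlying aggregation is straightforward. The sketch below (hypothetical data and column names, not the actual mylearninginsights.com implementation) sums learning hours by skill group and proficiency level and explicitly reports the share of hours that carries no skill or proficiency metadata at all:

```python
import pandas as pd

# Hypothetical data: learning hours per content item, some of it untagged.
hours = pd.DataFrame({
    "skill_group": ["Data", "Data", "Leadership", None, None],
    "proficiency": ["Beginner", "Advanced", "Beginner", None, "Beginner"],
    "learning_hours": [120, 40, 80, 50, 25],
})

tagged = hours.dropna(subset=["skill_group", "proficiency"])
by_skill = tagged.groupby(["skill_group", "proficiency"])["learning_hours"].sum()

total = hours["learning_hours"].sum()
untagged_share = 1 - tagged["learning_hours"].sum() / total

print(by_skill)
print(f"Learning hours without skill/proficiency metadata: {untagged_share:.0%}")
```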

So what to do with assumptions?

As said at the beginning of this post, you cannot perform any analysis without making assumptions, as you will never, ever have all the relevant data available. However, ignoring assumptions altogether will result in incorrect and possibly harmful insights. It could also harm the credibility of L&D if stakeholders find that crucial assumptions were made without being clear about them. Especially when they do not agree with your assumptions!

My recommendation is to always apply the following 3 (or actually 4) guidelines to your analysis project or product.

  1. Limit the number of assumptions to the absolute minimum
  2. Always record every assumption you make, and ideally estimate for each assumption the level of impact it has on your analysis and outcomes (see the sketch below for a minimal example of such an assumption log)
  3. Test your assumptions periodically to ensure they still hold up. As a minimum, the high-impact assumptions should be tested through additional data analysis. If no data is available, assumptions can be tested through engagement with your key stakeholders to make sure that they accept them.
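To make guidelines 2 and 3 concrete, here is a minimal sketch of what an assumption log could look like. The structure and field names are purely illustrative; a simple spreadsheet works just as well:

```python
from dataclasses import dataclass

# Illustrative assumption log supporting guidelines 2 and 3: record each
# assumption, estimate its impact, and note how and when it was last tested.
@dataclass
class Assumption:
    description: str
    impact: str          # "high" / "medium" / "low"
    last_tested: str     # date or "never"
    test_method: str     # e.g. "data analysis", "stakeholder review"

assumption_log = [
    Assumption("Listed course duration reflects average completion time",
               impact="high", last_tested="never", test_method="data analysis"),
    Assumption("Evaluation scores reflect perceived (not actual) impact",
               impact="medium", last_tested="2022-01", test_method="stakeholder review"),
]

# Periodically review the high-impact assumptions first
for a in sorted(assumption_log, key=lambda a: a.impact != "high"):
    print(f"[{a.impact.upper()}] {a.description} (last tested: {a.last_tested})")
```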

The fourth guideline is to be transparent about assumptions and humble in your claims. Transparency will help others understand what assumptions you have made and why, so they can take that information and those insights into their own context.

Being humble means being realistic about how well your insights would hold up in the real world. Use ‘perceived impact’ rather than ‘impact’ when analyzing course evaluation data alone. Using statements like “people are building skills in these areas through L&D”, rather than “people have built skills”, will help.

Part 2 of the series will address the very frequent mistake made in statistics and analytics (on purpose or by accident) of mixing up correlation and causation.

