On filling missing data of collected feedback
This post was originally published on the Namaste Tech Blog.
The Namaste product Uppy collects anonymous information about patients’ symptoms, side effects, and the desired effects of a medicinal cannabis intake. Analysis of the collected data shows that some data points are missing. In the machine learning literature, such missing data points are called item nonresponses. They happen when a person provides incomplete information about their experiences. An experience is composed of three steps:
- The patient gives their symptoms and desired effects.
- The patient vapes their dose of cannabis.
- 20 minutes later, they report on the session.
Incomplete information may happen when the third step is skipped, i.e. when the patient forgets to provide feedback. In order to make better predictions, our algorithms need to make informed guesses about what the missing values might be, based on the data we have from the overall respondents and strains. In this article, several approaches to solve this problem are explained.
All the data related to a cannabis intake experience can be described in a matrix. Rows are strains and columns are properties from the reported experiences.
In Uppy, an experience can be categorized among 14 effects, 12 side effects, and 21 symptoms to treat. 47 columns are present in the matrix:
- Symptoms: anxiety, convulsions, excessive appetite, dizziness, vertigo, agitation/irritability, seizures, spasticity, eye pressure, cramps, inflammation, muscle spasms, nausea, fatigue, pain, insomnia, depression, stress, headaches, lack of appetite, impulse.
- Effects: relaxed, dreamy, high, focused, sleepy, uplifted, creative, hungry, giggly, aroused, talkative, energetic, tingly, happy.
- Side Effects: dry eyes, foggy, distracted, anxious, dizzy, dry mouth, silly, headache, unmotivated, paranoid, red eyes.
Each strain is described by several known characteristics like its category (hybrid, indica, or sativa), its flavours (sweet, citrus etc.) and its chemical properties (THC, CBD). Our recommendation engine has more than 2400 strains.
This matrix is called a utility matrix. In this matrix, all values are positive and lie between 0 and 100. A value of 0 means that no report shows that a specific strain helped to achieve a desired state. A value of 100 means that all users reported a positive effect.
Some cells in the matrix are not filled with values; these are the item nonresponses.
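As a minimal sketch of this structure, the snippet below builds a tiny utility matrix from made-up report counts for three strains and four properties. It assumes that each cell is the percentage of reports in which the property was observed (the intermediate values between 0 and 100 are not spelled out above), and it marks cells without any report as NaN.

```python
import numpy as np

# Hypothetical report counts for 3 strains x 4 properties (not real Uppy data).
positive = np.array([[8, 0, 0, 0],
                     [0, 5, 1, 2],
                     [6, 0, 0, 0]], dtype=float)
total = np.array([[10, 4, 0, 5],
                  [0, 5, 2, 4],
                  [8, 0, 0, 3]], dtype=float)

# Assumed definition: share of positive reports in percent.
# Cells without any report keep their NaN value: these are the item nonresponses.
utility = np.full(total.shape, np.nan)
np.divide(100.0 * positive, total, out=utility, where=total > 0)

missing = np.isnan(utility)
print(utility)
print("item nonresponses per strain:", missing.sum(axis=1))
```

Keeping NaN for missing cells, rather than 0, matters here: a 0 already has a meaning in the matrix (no report shows that the strain helped).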
The goal of this article is to show how we fill the empty cells in the intake experience matrix. Filling them lets us make better recommendations for customers with specific targets to treat, and make an informed guess about how a strain is likely to influence effects, side effects, and symptoms.
Approach 1 — Aggregated value from the utility matrix values
As values in the utility matrix lie in the range from 0 to 100, the simplest way to fill the missing values is to calculate an aggregated value from the rows or columns:
- The average value over the rows or columns (mean)
- The middle value over the rows or columns (median)
- The most common number over the rows or columns (mode)
The idea behind this method is to find one value that represents the full set of reported experiences. Aggregating across the columns of a row estimates an unreported experience as a single summary value of the experiences reported for that particular strain. Aggregating across the rows of a column takes the full set of strains with reports for that property into account. This approach helped us include strains with no reported experiences in the recommendation list right after their introduction. However, the quality of the recommendations is limited by the errors of such a rough estimate.
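The aggregation can be sketched in a few lines of NumPy. The function name `fill_per_property` and the sample values are hypothetical. The function fills each property (column) with an aggregate over all strains that reported it; applying it to the transposed matrix gives the per-strain variant.

```python
import numpy as np

def _mode(values):
    # Most frequent observed value; ties resolve to the smallest value.
    uniq, counts = np.unique(values, return_counts=True)
    return uniq[np.argmax(counts)]

AGGREGATES = {"mean": np.mean, "median": np.median, "mode": _mode}

def fill_per_property(utility, how="mean"):
    """Fill NaN cells of each column with an aggregate of its observed cells."""
    filled = utility.copy()
    for j in range(utility.shape[1]):
        is_missing = np.isnan(utility[:, j])
        observed = utility[~is_missing, j]
        if observed.size:  # a column without any report stays empty
            filled[is_missing, j] = AGGREGATES[how](observed)
    return filled

# Made-up example: 4 strains x 3 properties.
utility = np.array([[80.0, 0.0, np.nan],
                    [np.nan, 100.0, 50.0],
                    [60.0, np.nan, 50.0],
                    [80.0, 20.0, np.nan]])

by_property = fill_per_property(utility, how="median")
by_strain = fill_per_property(utility.T, how="mean").T
```

Note that a strain with no reports at all can only be filled with the per-property variant, since it has no observed values of its own to aggregate.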
Approach 2 — k-nearest neighbors based method
The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that can be used to solve regression problems. KNN works by computing the distances between the item with a nonresponse and all the examples in the data, selecting the K examples closest to it, and then averaging their labels. Several distance metrics are available: Euclidean, Manhattan, or Minkowski.
The algorithm is simple to understand. However, several downsides appear:
- When adding completely new strains to the utility matrix, filling is impossible as we do not know where to place the strain in the experiences space.
- When many values are averaged (when K is large), KNN performance approaches that of Approach 1, because the average then covers almost a complete dimension. In a high-dimensional feature space, as in our case, large errors appeared, and their magnitude depends on the number of neighbours. With a large number of experiences, the distances to the individual neighbours vary only slightly. As a result, the strain appears good for all properties, which is unrealistic and yields the same performance as Approach 1.
- The KNN algorithm becomes slow when the number of dimensions is large.
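A minimal sketch of KNN-based filling is shown below. It is an illustration under stated assumptions, not the implementation used in Uppy: the function name `knn_fill` is hypothetical, the distance between two strains is the root mean squared difference over the properties both have reported, and the neighbours are averaged with equal weights.

```python
import numpy as np

def knn_fill(utility, k=2):
    """Fill each NaN cell with the mean of that property over the k nearest
    strains that reported it."""
    filled = utility.copy()
    observed = ~np.isnan(utility)
    for i, j in zip(*np.where(~observed)):
        candidates = []
        for other in np.where(observed[:, j])[0]:
            shared = observed[i] & observed[other]
            if shared.any():
                diff = utility[i, shared] - utility[other, shared]
                # Root mean squared difference, so that strains sharing many
                # properties are not penalised for it.
                distance = np.sqrt(np.mean(diff ** 2))
                candidates.append((distance, utility[other, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled

# Made-up example: the last row is a brand-new strain without any report.
utility = np.array([[90.0, 10.0, np.nan],
                    [85.0, 15.0, 70.0],
                    [20.0, 80.0, 10.0],
                    [88.0, 12.0, 60.0],
                    [np.nan, np.nan, np.nan]])

filled = knn_fill(utility, k=2)
```

The last row illustrates the first downside: a strain without any reported value has no distance to any other strain, so its cells stay empty.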
Approach 3 — Decomposition of the utility matrix
KNN finds connections between strains in the properties space. However, it does not take into account some hidden connections, like the strain’s characteristics such as the category (indica, sativa, hybrid). This approach takes them into account.
A smart way to solve the downsides introduced by KNN is to reduce the number of dimensions. This can be done as a separate preprocessing step from Approach 2, with the decomposition of the utility matrix. The decomposition of the utility matrix is then an extraction of latent correlations in the strain experiences.
Let’s take our initial utility matrix containing strains and properties and apply Singular Value Decomposition (SVD). Our matrix of size 2400x47 can be represented as 3 matrices: a 2400xK strain factors matrix U, the diagonal matrix 𝞢 of size KxK, and a Kx47 experience factors matrix V*, where K is the number of components.
The matrix 𝞢 is diagonal and consists of singular values, each of which measures the contribution of one component to the variability of the data. Intuitively, each value is the information gain of the corresponding component. The number of components K can be found with the Guttman–Kaiser criterion: find K so that matrix 𝞢 does not contain values less than 1.0.
Applying SVD reduces the feature space and groups strains and experiences. The effects of a strain are then represented through the K factors and the related group information, which works better than the KNN approach. Conversely, starting from a specified property, the best matching strain can be found.
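The decomposition and the choice of K can be sketched with NumPy on a fully filled matrix. The data below is synthetic (200 strains, 47 properties, three hidden factors plus a little noise), and the threshold of 1.0 simply follows the rule as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, fully filled 200 x 47 matrix built from 3 hidden factors plus noise.
true_k = 3
complete = rng.normal(size=(200, true_k)) @ rng.normal(size=(true_k, 47))
complete += 0.01 * rng.normal(size=complete.shape)

# Thin SVD: U is 200 x 47, s holds the 47 singular values, Vt is 47 x 47.
U, s, Vt = np.linalg.svd(complete, full_matrices=False)

# Keep the components whose singular value is at least 1.0.
k = int(np.sum(s >= 1.0))

strain_factors = U[:, :k]       # one K-dimensional vector per strain
experience_factors = Vt[:k]     # one K-dimensional vector per property
approx = (strain_factors * s[:k]) @ experience_factors

rel_error = np.linalg.norm(complete - approx) / np.linalg.norm(complete)
print("components kept:", k, "relative error:", rel_error)
```

Because `np.linalg.svd` does not accept NaN values, this only works once every cell is filled.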
The biggest problem here is that to perform a matrix decomposition, all the cells must be filled: back to square one! To avoid it, we use the matrix factorization algorithms from the “recommendation problem”. The experiences matrix in this case can be represented as a product of strain latent factors matrix H and experience latent factors matrix W:
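One common way to write this factorization, assuming H has one row per strain and W one row per property (the shapes are an assumption, chosen to match the 2400x47 matrix above):

$$
R \approx H W^{\top}, \qquad \hat{r}_{ij} = h_i^{\top} w_j = \sum_{k=1}^{K} h_{ik}\, w_{jk}
$$

Here R is the 2400x47 experiences matrix, H is 2400xK, W is 47xK, and h_i and w_j denote the i-th row of H and the j-th row of W.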
For the case of reconstruction of the experiences matrix, we use the FunkSVD algorithm (initially proposed by Simon Funk in 2006), where the predicted value can be optimized thanks to the objective function:
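One common form of this objective, given here as an assumption rather than the exact formula used in Uppy, is the regularized squared error over the set Ω of observed cells, where r_ij is the reported value for strain i and property j, h_i and w_j are the corresponding rows of H and W, and λ is a regularization weight:

$$
\min_{H, W} \sum_{(i,j) \in \Omega} \left[ \left( r_{ij} - h_i^{\top} w_j \right)^2 + \lambda \left( \lVert h_i \rVert^2 + \lVert w_j \rVert^2 \right) \right]
$$

A minimal sketch of minimizing it with stochastic gradient descent follows. The function name `funk_svd`, the hyperparameters, and the sample matrix are all hypothetical, and the values are scaled to the range 0 to 1 before training to keep the updates stable.

```python
import numpy as np

def funk_svd(utility, k=2, lr=0.05, reg=0.02, epochs=1000, seed=0):
    """Factorize the observed cells of `utility` as H @ W.T with SGD."""
    rng = np.random.default_rng(seed)
    n_strains, n_props = utility.shape
    H = 0.1 * rng.standard_normal((n_strains, k))
    W = 0.1 * rng.standard_normal((n_props, k))
    cells = np.argwhere(~np.isnan(utility))  # only observed cells are used
    for _ in range(epochs):
        rng.shuffle(cells)  # visit the observed cells in random order
        for i, j in cells:
            err = utility[i, j] - H[i] @ W[j]
            h_old = H[i].copy()
            H[i] += lr * (err * W[j] - reg * H[i])
            W[j] += lr * (err * h_old - reg * W[j])
    return H, W

# Made-up 5 x 4 utility matrix with item nonresponses.
utility = np.array([[90.0, 80.0, 10.0, np.nan],
                    [85.0, np.nan, 15.0, 20.0],
                    [10.0, 20.0, 90.0, 80.0],
                    [np.nan, 15.0, 85.0, 90.0],
                    [88.0, 82.0, np.nan, 15.0]])

H, W = funk_svd(utility / 100.0)
predicted = np.clip(100.0 * H @ W.T, 0.0, 100.0)

observed = ~np.isnan(utility)
rmse = np.sqrt(np.mean((predicted[observed] - utility[observed]) ** 2))
print("RMSE on observed cells:", rmse)
```

The loop never touches the empty cells, which is what allows the factorization to run on an incomplete matrix. The product of the learned factors then provides a value for every cell, including the missing ones.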
The main disadvantage is still present: for a completely new strain, we cannot calculate latent vectors until enough data is gathered and the model is retrained. This brings us to the next approach.
Approach 4 — Regression problem over strain characteristics
The idea is to solve a regression problem after the calculation of the strain and experiences latent matrices. For each new strain, laboratory specialists provide a specification of its characteristics: concentration of CBD and THC, information on genetics (kind: indica, hybrid, sativa, dominant ancestor), terpenes, taste, type (dried flower, milled, pre-rolls) etc.
A question is then raised: Can the latent vector for the new strain be predicted based on the attributes of a strain? Our experience shows that the answer is yes.
We are solving a multi-output regression problem with gradient-boosting trees. The attributes of the strains become an input matrix X, and the known latent vectors become an output matrix Y. The trained regressor can predict the latent vector of any new strain from its characteristics. In simple terms, item nonresponses can be guessed from the characteristics of similar strains, even if no feedback has been collected. This can be interesting for new producers, whose strains can be recommended even though customers have not yet reported any experience with them.
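The setup can be sketched with scikit-learn, which is an assumption about tooling: the article only states that gradient-boosting trees are used. The attributes and latent vectors below are synthetic. Since `GradientBoostingRegressor` predicts a single output, `MultiOutputRegressor` fits one model per latent dimension.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Hypothetical strain attributes X: THC %, CBD %, and a one-hot category
# (indica, sativa, hybrid). Real attributes would come from lab specifications.
n_strains, k = 300, 3
thc = rng.uniform(0, 30, n_strains)
cbd = rng.uniform(0, 20, n_strains)
category = np.eye(3)[rng.integers(0, 3, n_strains)]
X = np.column_stack([thc, cbd, category])

# Synthetic latent vectors Y, standing in for the strain factors learned by
# the matrix factorization.
Y = np.column_stack([
    0.05 * thc - 0.02 * cbd,
    category @ np.array([1.0, -1.0, 0.0]),
    0.03 * cbd,
]) + 0.05 * rng.normal(size=(n_strains, k))

model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=100, random_state=0)
)
model.fit(X[:250], Y[:250])

# Predict latent vectors for "new" strains that have attributes but no feedback.
new_latent = model.predict(X[250:])
mae = np.mean(np.abs(new_latent - Y[250:]))
print("mean absolute error on held-out strains:", mae)
```

Multiplying a predicted latent vector by the experience latent factors matrix then yields an estimate for every property of the new strain.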
Ongoing and future work
Our next steps:
- The list of attributes can be extended with additional data coming from the laboratory tests: additional chemical components, images of the flowers, reviews from the users, etc. This additional data will help to improve the performance of the strain latent vector prediction model.
- Active Learning is a case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user to obtain the desired outputs at new data points. In our case, the algorithm will propose completely new strains, or strains with a lot of item nonresponses, to our users, with the goal of receiving feedback that will benefit the whole recommendation system. In order to reduce churn while recommending these new strains, only experienced users will receive these recommendations. Newcomers will be recommended the best matching strains, creating trust in the system.
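The selection policy described in the last item could look like the sketch below. All names are hypothetical, and both the ranking by number of item nonresponses and the boolean `experienced` flag are simplifying assumptions, since this work is still ongoing.

```python
import numpy as np

def strains_to_probe(utility, n=2):
    """Indices of the n strains with the most item nonresponses."""
    missing = np.isnan(utility).sum(axis=1)
    return np.argsort(-missing, kind="stable")[:n]

def recommend(utility, target_property, experienced, n=2):
    """Experienced users get strains that need feedback; newcomers get the
    best known matches for the property they want to treat."""
    if experienced:
        return strains_to_probe(utility, n)
    # Unknown cells are ranked last for newcomers.
    scores = np.nan_to_num(utility[:, target_property], nan=-1.0)
    return np.argsort(-scores, kind="stable")[:n]

# Made-up example: 4 strains x 3 properties.
utility = np.array([[80.0, np.nan, 10.0],
                    [np.nan, np.nan, np.nan],
                    [60.0, 40.0, 20.0],
                    [np.nan, 90.0, np.nan]])

for_experienced = recommend(utility, target_property=0, experienced=True)
for_newcomer = recommend(utility, target_property=0, experienced=False)
```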
Filling missing data of collected feedback gives us the ability to considerably extend our list of recommended strains to our customers in the Namaste Technologies products. We currently recommend more than 2400 strains, and each new strain is being immediately added to the recommendation list, even without prior feedback collected.