If you ever tried to learn a new language, building up sufficient vocabulary to have basic conversations is one of the biggest hurdles. You can memorize basic sentences but will not know what to say when the conversation goes off script. Even more challenging, what the heck is the other person saying? I have ended many a conversation with 我不知道 (I don't know) or 我不明白你说什么 (I don't understand what you're saying) because my Mandarin Chinese vocabulary is so limited.
Given the huge amount of time required to study languages, what is the most efficient way to memorize and recall a long list of facts? The answer is spaced repetition. Spaced repetition is a technique where the time intervals between active recall get increasingly longer as your memory strengthens. This technique applies to broader topics such as studying for exams, quotes you want to remember1, and even learning coding languages as long as the information can be broken down into bite-sized chunks. For reasons that I will not dive into here, active recall is more effective at strengthening memory compared to passive review.
Ebbinghaus' forgetting curve
The foundation of spaced repetition is often attributed to the German psychologist Hermann Ebbinghaus2, who ran a N=1 study on himself in the 1880s to remember nonsense syllables. Focusing on nonsense consonant-vowel-consonant syllables such as KOJ and YAT instead of words people can reference such as DOT or LAW was Ebbinghaus' attempt at removing any existing knowledge of language from the equation. This led to him publishing the forgetting curve3 in 1885 which is simplified in a recent paper (Reddy et al., 2016)4. The formula is fairly simple: \[P[ \textrm{recall} ] = \exp (-\theta \cdot \frac{d}{s})\] where \(\theta \in \mathbb{R}^+ \) is the item difficulty, \(d \in \mathbb{R}^+ \) is the delay or time elapsed since previous review, and \(s \in \mathbb{R}^+ \) is the memory strength.
By Icez at English Wikipedia. - Originally from en.wikipedia; description page is/was here., Public Domain, Link
As you can see above, the curve has an exponential decay and will shift depending on item difficulty, time since last review, and memory strength. If the item was just learned (\(d=0\)), then \(P[\mathrm{recall}] = 1\). On the other hand, if the item was learned a long time ago, then \(P[\mathrm{recall}]\) will approach 0.
Leitner system
Fast forward nearly a century, the German science journalist Sebastian Leitner proposed a method based on spaced repetition in his book So lernt man Lernen (How to learn to learn) in 19725. This method uses 5 boxes of increasing size (1, 2, 5, 8 and 14 cm) where the flashcards within each box were reviewed only if it was full. During review, correctly answered cards will graduate to the next box while incorrectly answered cards will go back to the first box.
This system ensures that the more difficult flashcards are reviewed more frequently which results in efficient time use. However, it may take a long time to fill up the last box at which point the first card in the box may be significantly older than the last card. How can you space out the reviews at an individual card level to take advantage of the optimal repetition spacing?
SuperMemo
In 1985, Polish researcher Piotr Wozniak developed the first iteration of the SuperMemo algorithm SM-0 that loosely followed Leitner’s methods6. By 1987, this evolved into SM-2 - a computer algorithm at the individual card level with the following interval algorithm7.
\[ I(1) \rightarrow 1 \]
\[ I(2) \rightarrow 6 \]
\(I(n) \rightarrow I(n-1) \cdot \textrm{EF} \) for \( n>2 \)
where \( I(n) \) is the inter-repetition interval after the \( n \)-th repetition
and \( \textrm{EF} \) is the easiness factor reflecting the easiness of memorizing and retaining a given item in memory.
While the latest version of the SuperMemo algorithm is up to SM-18, a variant of SM-2 is still used in the popular open-source flashcard program Anki. These algorithms may be crude but there is no disputing the success people have had with spaced repetition software. Spaced repetition is the backbone of many language acquisition programs such as Pimsleur and Duolingo.
Challenges
Tools like SuperMemo and Anki are a significant step up from the traditional methods but they are heuristic based approaches that use hand-picked weights. Looking back to the evolution of various disciplines within machine learning (e.g. natural language processing, computer vision), they started with a lot of structures and assumptions, including hand-picked features. As computing power, available data, and our understanding of these disciplines improved, the models got rid of more assumptions.9
With the massive amount of data that are being collected by spaced repetition software and technological progress since the 80s, leveraging machine learning approaches for optimizing to the best policy starts make sense. We explore two different ML-based approaches: half-life regression and Leitner queue network.
Half-life regression
Settles et al. (2016)8 goes back to basics with Ebbinghaus’ forgetting curve. \[ p = 2 ^{-\Delta/h} \] In this equation, \( p \) denotes the probability of correctly recalling an item which is a function of \( \Delta \), the lag time since the item was last practiced, and \( h \) is the half-life or measure of strength in the learner's long-term memory
We can then estimate the half-life with
\( \hat{h}_\Theta = 2^{\Theta \cdot x} \)
where \( x \) is the student's feature vector and \( \Theta \) is the set of weights that correspond to the features.
Features broadly fell into two categories:
- Interaction features that count the student's history with a certain word
- Lexeme tag features which is a sparse matrix of roughly twenty thousand lexeme tags to capture the inherent difficulty of the word
The model then predicts \( \hat{p}_\Theta = 2^{- \Delta / \hat{h}_{\Theta}} \) with loss function \( \ell (X; \Theta) = (p - \hat{p}_\Theta)^2 + \alpha (h - \hat{h}_\Theta)^2 + \lambda \| \Theta \|^2_2 \) where \( X = \langle p, \Delta, x \rangle \) and \( \lambda \) is the regularization term.
Results of the model was positive with the half-life (HLR) model significantly improving upon the mean absolute error (MAE) and getting very close in the area under the curve (AUC) compared to the baseline. The authors observed two insights:
- Including the lexeme tag features did not make a significant difference to the results
- Optimizing for the half-life (\( \hat{h} \)) is clearly important to the model
When the model was implemented into the production system of Duolingo where Settles worked, they found that the lexeme tag features were possibly overfitting for difficult words. To skirt around this as well as the cold-start problem, Duolingo removed the lexeme tag features and saw statistically significant improvement in retention metrics over the original Leitner system.
Leitner queue network
In Reddy et al. (2016) on the other hand, the Leitner system is taken to the next level. For deck \( k \) at time \( t \), let \( D_k=t-T_{k,1} \) denote the delay since the last review of that item. Then we have the transition probability \[P [k \rightarrow k + 1] = \exp{ (- \theta \cdot D_k / k) } \] \[P [k \rightarrow \max \{k - 1, 1 \}] = 1 - \exp{(- \theta \cdot D_k / k)} \] Note that the term \(\theta \cdot D_k / k \) is equivalent to \( \Delta / h \) in half-life regression. They both describe how quickly the probability of recall degrades over time.
We can then define the problem as finding the scheduling policy that maximizes the learning rate \[ \lambda_{\textrm{out}} = \lim_{T\to\infty} \frac{1}{T} \cdot | \{ \textrm{Items mastered in interval [0,T]} \} | \] given a static arrival rate \( \lambda_{\textrm{ext}} \) that follows a Poisson distribution, a service rate \( \mu_k \) representing the rate at which items from deck \( k \) come up for review, and the user’s review frequency constraint \( \lambda_{\textrm{ext}} + \sum_k \mu_k \le U \).
From both the theory as well the experiments, a few interesting observations emerge:
- Given a fixed review frequency budget \( U \), there is a phase transition in learning where your throughput (\( \lambda_{\mathrm{out}} \)) initially increases fairly linearly with arrival rate (\( \lambda_{\mathrm{ext}} \)) but starts to nose-dive as the arrival rate overwhelms the review frequency budget. As such, there is an optimal arrival rate for ever throughput.
- As a corollary, as your review frequency budget increases, your optimal arrival rate will increase fairly linearly.
- Item difficulty has a big impact on optimal review schedules. The more difficult the material, you should introduce new material less often and spend more time on newer material.
Conclusion
These newer algorithms are certainly an improvement over their 3-4 decade-old parents but not without challenges. In half-life regression, the measured probability of recall is at the session level. This drops the temporal aspect (forgetting at the beginning is better than forgetting at the end) and the magnitude (getting 1/1 vs. 10/10) which are arguably useful information to bake in. On the other hand, Leitner queue network uses approximate methods to make the problem tractable which may impact accuracy and the FIFO structure does not allow for cards to reshuffle within a deck or move between decks without review.
Despite their challenges, any version of these spaced repetition algorithms is a significant step up from the other approaches such as random review or focusing on recent cards10. That being said, these approaches make you more efficient with your time but they are not a substitute for hard work. You need to put in the effort and create habits to make the effort sustainable. Following the theme of quotes I want to remember, I will end with this:
Nothing in the world is worth having or worth doing unless it means effort, pain, difficulty… I have never in my life envied a human being who led an easy life. I have envied a great many people who led difficult lives and led them well.
-Theodore Roosevelt
1I recently learned the full quote of Oscar Wilde's oft-quoted "Imitation is the sincerest form of flattery" (full quote here)
2Herman Ebbinghaus. (n.d). In Wikipedia. https://en.wikipedia.org/wiki/Hermann_Ebbinghaus, accessed December 16, 2020
3Forgetting Curve. (n.d). In Wikipedia. https://en.wikipedia.org/wiki/Forgetting_curve, accessed December 16, 2020
4Reddy, S. et al. 2016. Unbounded Human Learning: Optimal Scheduling for Spaced Repetition. https://dl.acm.org/doi/pdf/10.1145/2939672.2939850
5Leitner, S. (1972). So lernt man lernen. Herder.
6Wozniak, P. 1990. https://www.supermemo.com/en/archives1990-2015/english/ol/beginning#Algorithm, accessed December 17, 2020
7Wozniak, P. 1990. https://www.supermemo.com/en/archives1990-2015/english/ol/sm2, accessed December 18, 2020
8Settles, B., Meeder, B. 2016. A Trainable Spaced Repetition Model for Language Learning. https://www.aclweb.org/anthology/P16-1174.pdf
9For a more recent example, look at the various iterations of AlphaGo. It first learned from human games, then learned to play from scratch (AlphaZero), then learned without even knowing the rules of the game (MuZero) - with each version surpassing its predecessor.
10AndrewFMs. (November 19, 2011). Simulation of SRS vs. Traditional Review [Video] YouTube. https://www.youtube.com/watch?v=ai2K3qHpC7c&ab_channel=AndrewFMs