The SARS-CoV-2 lineage dynamics analysis highlights lineages that may have a selective advantage for spread in the host population. It can also flag lineages that increase in frequency through other processes, such as unrepresentative sampling (e.g. data biased towards certain areas), large clonal outbreaks or population bottlenecks. The method currently uses all available sequences for a particular geographic region, so geographic and local sampling biases are likely (e.g. from studies of individual large outbreaks). Analyses for the most recent time windows are also affected by delayed data deposition in GISAID. The results should therefore be interpreted together with further epidemiological information and with experimental evidence that the amino acid changes observed in a lineage are likely to confer a selective advantage.
Starting from January 2020, we calculate monthly lineage frequencies from Pangolin lineage assignments for all countries with sufficient sequences available. We then visualize these dynamics and identify significant changes in lineage frequencies according to the method described in Klingen et al. (2020) and Klingen et al. (2018). For this, we test the significance of frequency changes from one time window to the next using Fisher's exact test, and apply the Benjamini-Hochberg procedure to control the false discovery rate under multiple testing. We then identify lineages of interest as those that increase significantly in frequency and rise above a predominance threshold for the first time in a particular season. Note that, to detect lineages early, we set the predominance threshold for the SARS-CoV-2 analysis to 0.1. The frequencies of the selected lineages are plotted in colour over time, with the name of each lineage written above the plot in the season in which it was predominant. The frequencies of all other (non-significant) lineages are displayed as grey lines.
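The testing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the analysis: the text does not say whether the test is one-sided or two-sided, so a one-sided test for an increase is assumed; the significance level of 0.05, the function names and the example counts are hypothetical; and the bookkeeping for "first time in a particular season" is left out.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: lineage / other sequences in the current window
    c, d: lineage / other sequences in the previous window
    Returns P(X >= a) under the hypergeometric null, i.e. the probability
    of seeing at least this many lineage sequences in the current window.
    """
    row1 = a + b          # size of the current window
    col1 = a + c          # lineage sequences in both windows
    n = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return min(1.0, tail / comb(n, row1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def lineages_of_interest(prev_counts, curr_counts, alpha=0.05, threshold=0.1):
    """Lineages that increase significantly and exceed the predominance threshold.

    alpha is an assumed significance level; threshold=0.1 follows the text.
    """
    prev_total = sum(prev_counts.values())
    curr_total = sum(curr_counts.values())
    if curr_total == 0:
        return []
    names = sorted(curr_counts)
    pvals = []
    for name in names:
        a = curr_counts[name]
        c = prev_counts.get(name, 0)
        pvals.append(fisher_exact_greater(a, curr_total - a, c, prev_total - c))
    adjusted = benjamini_hochberg(pvals)
    return [name for name, q in zip(names, adjusted)
            if q < alpha and curr_counts[name] / curr_total > threshold]


# Hypothetical counts for two consecutive months in one country.
previous_month = {"B.1": 90, "B.1.1.7": 10}
current_month = {"B.1": 50, "B.1.1.7": 50}
selected = lineages_of_interest(previous_month, current_month)
```

With these made-up counts, only B.1.1.7 is selected: it rises from 10% to 50% of sequences, while B.1 declines.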
Sliding window method
Instead of comparing sequences between consecutive months, the sliding window approach compares the w sequences in the current window to the w sequences in the previous window. The data are ordered by sampling date, so the analysis picks up significant changes over time as the monthly analysis does, but in a more sensitive manner. The window is moved over the data with step size s. In each step, the significance of frequency changes is evaluated using Fisher's exact test, in line with the standard analysis. To control the false discovery rate (FDR) under multiple testing, the Benjamini-Yekutieli procedure for dependent tests is used. For the analysis of countries, the window size w is set to 1000 and the step size s to 100. For the analysis of German states, the window size w is set to 200 and the step size s to 10.
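The sliding window procedure can be illustrated with the following Python sketch. Several details are assumptions rather than statements from the text: the previous window is taken to be the w sequences immediately before the current window, the test is one-sided (increase only), the Benjamini-Yekutieli correction is applied across all steps for a single lineage, and the function names and toy data are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value, P(X >= a), for [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return min(1.0, tail / comb(n, row1))


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values (valid under dependence)."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic correction factor that distinguishes BY from BH.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


def sliding_window_tests(lineages, target, w, s):
    """Test one lineage for an increase at every window position.

    lineages: lineage labels, already ordered by sampling date
    target:   lineage to test
    w, s:     window size and step size
    Returns (start index of current window, adjusted p-value) pairs.
    """
    starts, pvals = [], []
    for start in range(w, len(lineages) - w + 1, s):
        previous = lineages[start - w:start]
        current = lineages[start:start + w]
        a = current.count(target)
        c = previous.count(target)
        pvals.append(fisher_exact_greater(a, w - a, c, w - c))
        starts.append(start)
    return list(zip(starts, benjamini_yekutieli(pvals)))


# Toy data: lineage "A" is completely replaced by lineage "B".
ordered_labels = ["A"] * 40 + ["B"] * 40
results = sliding_window_tests(ordered_labels, "B", w=20, s=10)
significant_starts = [start for start, q in results if q < 0.05]
```

For the country analysis described above, the call would use w=1000 and s=100; for German states, w=200 and s=10. Because consecutive windows overlap when s is smaller than w, the tests are not independent, which is why a procedure for dependent tests is appropriate here.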
Heatmap showing the frequencies of the selected potential variants of interest (pVOIs) for the previous month that have an assigned antigenic score above the predetermined threshold (3.85). The lineages are ordered from highest antigenic score (top of the heatmap) to lowest (bottom of the heatmap). Dark red indicates a frequency of 1, while lighter shades of red represent lower frequencies, closer to 0.
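The selection and ordering of heatmap rows can be sketched as follows. Only the threshold of 3.85 comes from the text; the function name, the data layout (one score and one frequency per lineage) and the lineage names and values are hypothetical placeholders.

```python
ANTIGENIC_SCORE_THRESHOLD = 3.85  # threshold stated in the text


def heatmap_rows(scores, frequencies, threshold=ANTIGENIC_SCORE_THRESHOLD):
    """Return (lineage, score, frequency) rows ordered top to bottom.

    scores:      lineage -> antigenic score
    frequencies: lineage -> frequency in the previous month (0 to 1)
    Only lineages scoring strictly above the threshold are kept, and the
    highest score comes first, i.e. at the top of the heatmap.
    """
    selected = [name for name, score in scores.items() if score > threshold]
    selected.sort(key=lambda name: scores[name], reverse=True)
    return [(name, scores[name], frequencies.get(name, 0.0))
            for name in selected]


# Made-up lineages and values, for illustration only.
example_scores = {"X.1": 4.2, "X.2": 3.0, "X.3": 5.1}
example_frequencies = {"X.1": 0.3, "X.3": 0.05}
rows = heatmap_rows(example_scores, example_frequencies)
```

Here X.2 is dropped because its score is below the threshold, and X.3 is placed above X.1 because it has the higher score.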