Example: Dentate Gyrus

The dataset used in here for velocity analysis is from the dentate gyrus, a part of the hippocampus which is involved in learning, episodic memory formation and spatial coding. It is measured using 10X Genomics Chromium and described in Hochgerner et al. (2018). The data consists of 25,919 genes across 3,396 cells and provides several interesting characteristics.

We use a stochastic version of the model for transcriptional dynamics used in velocyto (developed by the Linnarsson lab and Kharchenko Lab).

RNA velocity reference https://www.nature.com/articles/s41586-018-0414-6 This notebook is based on http://pklab.med.harvard.edu/velocyto/notebooks/R/DG1.nb.html

Files needed: 10X43_1.loom (will be downloaded automatically) Optional files: DG_umap.npy, DG_clusters.npy

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
%matplotlib inline

import scvelo as scv
Running scvelo 0.1.14.dev35+c86e453 on 2019-01-21 09:26.

We change the matplotlib defaults for beautified visualization.

In [2]:

Load and cleanup the data

Read your data file (loom, h5ad, xlsx, csv, tab, txt …) to an AnnData object.

If you want to merge your loom file into an already existing AnnData object, use scv.utils.merge(adata, adata_loom).

In [3]:
adata = scv.read("data/DentateGyrus/10X43_1.loom", sparse=True, cache=True,

We have a first look into the proportions of spliced/unspliced abundances in our data.

Cleaning our data, i.e. throwing away everything not needed for velocity estimation, saves memory. This step is optional.

In [4]:
scv.utils.cleanup(adata, clean='all')

Abundance of ['spliced', 'unspliced', 'ambiguous']: [0.68 0.08 0.24]
AnnData object with n_obs × n_vars = 3396 × 25919
    layers: 'spliced', 'unspliced'

Preprocess the data

This selects genes by detection (detected with a minimum number of counts and expressed in a minimum number of cells) and high variability (dispersion).

It further normalizes every cell by its initial size and logarithmizes X.

X and the spliced/unspliced count data are preprocessed nearly in the same way (i.e. filtering and normalization), except that X is further logarithmized. Don’t worry if X is already processed; it will detect that automatically and not touch X in that case.

If you want to run this step by step, you can do so with the following commands:

scv.pp.filter_genes(adata, min_counts=20, min_counts_u=10)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
In [5]:
scv.pp.filter_and_normalize(adata, min_counts=20, min_counts_u=10, n_top_genes=3000)
Filtered out 13864 genes that are detected in less than 20 counts (spliced).
Filtered out 6520 genes that are detected in less than 10 counts (unspliced).
Normalized count data: X, spliced, unspliced.
Logarithmized X.

The first and second order moments (basically mean and uncentered variance) are computed among nearest neighbors in PCA space. The first order moments are needed for deterministic velocity estimation, stochastic estimation additionally requires the second order moments.

In [6]:
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
computing neighbors
    finished (0:00:05.35) --> added to `.uns['neighbors']`
    'distances', weighted adjacency matrix
    'connectivities', weighted adjacency matrix
computing moments based on connectivities
    finished (0:00:00.62) --> added
    'Ms' and 'Mu', moments of spliced/unspliced abundances (adata.layers)

Compute velocity and velocity graph

The gene-specific velocities are obtained by fitting a ratio between precursor (unspliced) and mature (spliced) mRNA abundances that well explains the steady states (constant transcriptional state) and then computing the difference of the actual abundances with the abundances expected in the steady state.

By default the steady state is fitted using a simple linear (for deterministic approach) or generalized (for stochastic approach) regression. By setting the percentile perc=95 you can perform an extreme quantile fit, that might better represent the steady states. Further you can set fit_offset=True to account for ambiguous reads that have not been assigned to unspliced or spliced.

Every tool has its plotting counterpart. The results from scv.tl.velocity, for instance, can be visualized using scv.pl.velocity.

In [7]:
computing velocities
    finished (0:00:00.45) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)

This computes the (cosine) correlation of potential cell transitions with the velocity vector in high dimensional space. The resulting velocity graph has dimension \(n_{obs} \times n_{obs}\) and summarizes the possible cell state changes (given by a transition from one cell to another) that are well explained through the velocity vectors. If you set approx=True it is computed on a reduced PCA space with 50 components.

The velocity graph can then be converted to a transition matrix by applying a Gaussian kernel on the cosine correlation which assigns high probabilities to cell state changes that correlate with the velocity vector. You can access the Markov transition matrix via scv.tl.transition_matrix. The resulting transition matrix can be used for a variety of applications shown hereinafter. For instance, it is used to place the velocities into a low-dimensional embedding by simply applying the mean transition with respect to the transition probabilities, i.e. scv.tl.velocity_embedding. Further, we can trace cells back along the Markov chain to their origins and potential fates, thus obtaining root cells and end points within a trajectory; via scv.tl.terminal_states.

In [8]:
computing velocity graph
    finished (0:00:02.46) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

Project the velocity graph onto an embedding

This simply loads a UMAP embedding and annoted clusters that we have prepared for you. Alternatively, you can obtain an embedding and clustering yourself with scv.tl.louvain(adata) and scv.tl.umap(adata).

In [9]:
url_data = 'https://github.com/theislab/scvelo_notebooks/raw/master/data/DentateGyrus/'

adata.obs['clusters'] = scv.load('./data/DentateGyrus/DG_clusters.npy', backup_url=url_data + 'DG_clusters.npy')
adata.obsm['X_umap'] = scv.load('./data/DentateGyrus/DG_umap.npy', backup_url=url_data + 'DG_umap.npy')
In [10]:
scv.tl.velocity_embedding(adata, basis='umap')
computing velocity embedding
    finished (0:00:00.77) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
<Figure size 700x500 with 0 Axes>

Plot results

In [12]:
scv.pl.velocity_embedding_stream(adata, legend_loc='on data', alpha=.05)
In [13]:
scv.pl.velocity_embedding(adata, basis='umap', dpi=200)
In [16]:
scv.pl.velocity_embedding_grid(adata, color='Tmsb10', layer=['velocity', 'spliced'], colorbar=True)
<Figure size 700x500 with 0 Axes>
In [21]:
scv.tl.cell_fate(adata, n_neighbors=20)
computing cell fates
    finished (0:00:02.64) --> added
    'cell_fate', most likely cell fate (adata.obs)
    'cell_fate_confidence', confidence of transitioning to the assigned fate (adata.obs)
computing terminal states
    identified 0 root cells and 1 end points (Astrocytes)
    identified 0 root cells and 0 end points (Cajal Retzius)
    identified 1 root cells and 1 end points (Endothelial)
/Users/volker.bergen/anaconda3/envs/py36/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py:1254: RuntimeWarning: k >= N - 1 for N * N square matrix. Attempting to use scipy.linalg.eig instead.
/Users/volker.bergen/anaconda3/envs/py36/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py:1254: RuntimeWarning: k >= N - 1 for N * N square matrix. Attempting to use scipy.linalg.eig instead.
    identified 2 root cells and 1 end points (Granule mature)
    identified 0 root cells and 1 end points (Microglia)
    identified 0 root cells and 0 end points (OL)
    identified 1 root cells and 1 end points (OPC)
    finished (0:00:00.71) --> added
    'root_cells', root cells of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)
/Users/volker.bergen/anaconda3/envs/py36/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py:1254: RuntimeWarning: k >= N - 1 for N * N square matrix. Attempting to use scipy.linalg.eig instead.
/Users/volker.bergen/anaconda3/envs/py36/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py:1254: RuntimeWarning: k >= N - 1 for N * N square matrix. Attempting to use scipy.linalg.eig instead.
In [22]:
scv.pl.velocity_embedding_grid(adata, color=['cell_fate', 'root_cells', 'end_points'], legend_loc='on data')
<Figure size 700x500 with 0 Axes>
In [15]:
scv.pl.velocity_embedding_grid(adata, color=['cell_fate', 'root_cells', 'end_points'], legend_loc='on data')
<Figure size 700x500 with 0 Axes>
In [23]:
scv.tl.rank_velocity_genes(adata, match_with='clusters', resolution=.8)
computing velocity clusters
    finished (0:00:01.68) --> added
    'velocity_clusters', clusters based on modularity on velocity field (adata.obs)
ranking velocity genes
    finished (0:00:00.54) --> added
    'rank_velocity_genes', sorted scores by group ids (adata.uns)
rec.array([('2010300C02Rik', 'Schip1', 'Slc17a7', 'Ano3', 'Dlgap1', 'Ppp3ca', 'Slc38a2', 'Myl6', 'Clic4', 'Serpine2', 'Schip1', 'A830018L16Rik', 'Hmgcs1', 'Myl6', 'Dpysl3', 'Rab3c', 'Add3'),
           ('Slc7a5', 'Smoc2', 'Fam105a', 'Ryr2', 'Gria1', 'Dlgap1', 'Tubb4a', 'Qk', 'Srgap2', 'Myo1d', 'Chgb', 'Vsnl1', 'Ppp1r14c', 'Gm11659', 'Tmem47', 'Golga7b', 'Snca'),
           ('Dmxl2', 'Adarb2', 'Stmn2', 'Rpa3', 'Tenm2', 'Gria1', 'Mt3', 'Nfib', 'Hn1', 'Ptprg', 'Fxyd1', 'Bsn', 'Elavl3', 'Arrdc3', 'Utrn', 'Smarcc1', 'Schip1'),
           ('Gpc4', 'Sparcl1', 'Slc7a5', 'Unc80', 'Gabrb3', 'Arsb', 'Tspan7', 'Klrg2', 'Orai2', 'Nfkbia', 'Mapk6', 'Dlgap1', 'Snca', 'Gm12089', 'Dip2b', 'Irf9', 'Smoc2'),
           ('Lsamp', 'A330050F15Rik', 'Prex2', 'Grid2', 'Mt3', 'Nell2', '2410089E03Rik', 'Fdps', 'Sirt2', 'Gab1', 'Osbpl6', 'Lancl1', 'Syn2', 'Gatm', 'Gm10010', 'Dtna', 'Jph1'),
           ('P2ry12', 'Fam105a', 'Mpzl1', 'Ptprd', 'Epha4', 'Shisa9', 'Pkia', 'Ctnnd2', 'Qk', 'Arhgap31', 'Ano3', 'Schip1', 'S100b', 'Hepacam', 'Pgp', 'Atp1b3', 'Bzw2'),
           ('Gprc5b', 'Cpe', 'Fabp5', 'Lancl1', 'Nsf', 'Gabrb3', 'Gria2', 'Cspg5', 'Reep5', 'Zdhhc20', 'Mtus2', 'Stmn2', 'Nfib', 'Sept7', 'Stmn4', 'Cplx2', 'Efcab14'),
           ('Sgsm2', 'Josd2', 'Ryr2', 'Smoc2', 'Spock2', 'Syn2', 'Pcdh9', 'Aplp1', 'Ybx1', 'Prex2', 'Atp1b3', 'Man2b1', 'Cpox', 'Gprc5b', 'Scg3', 'Rps4x', 'Ppapdc1a'),
           ('Fabp5', 'Rpa3', 'Rpa3', 'Slit1', 'Gria2', 'Ptprd', 'Tmem47', 'Ptchd4', 'Rab31', 'Efr3b', 'Pcdh9', 'Fam49b', 'Dtna', 'Brinp3', 'Sh3glb1', 'Nfkbia', 'Gtf2h2'),
           ('Stmn4', 'Fabp5', 'Cpe', 'Shisa9', 'Luzp2', 'Pacsin1', 'Tmsb10', 'Ptn', 'Mt3', 'Tmsb10', 'Tnfrsf19', 'Grid2', 'Fabp5', 'Grasp', 'Klrg2', 'Kcnc4', 'Smarcc1')],
          dtype=[('Granule mature 0', '<U50'), ('Granule immature 1', '<U50'), ('Granule mature 2', '<U50'), ('Granule immature 3', '<U50'), ('Neuroblast 4', '<U50'), ('Granule immature 5', '<U50'), ('Neuroblast 6', '<U50'), ('Astrocytes 7', '<U50'), ('Microglia 8', '<U50'), ('Endothelial 9', '<U50'), ('Mossy 10', '<U50'), ('GABA 11', '<U50'), ('OPC 12', '<U50'), ('OL 13', '<U50'), ('Cajal Retzius 14', '<U50'), ('Cck-Tox 15', '<U50'), ('Granule immature 16', '<U50')])
In [24]:
scv.pl.scatter(adata, color='velocity_clusters', legend_loc='on data')
In [25]:
scv.pl.velocity(adata, var_names=['Tmsb10', 'Camk2a', 'Ppp3ca', 'Lsamp', 'Gabrb3', 'Igfbpl1'], ncols=2)
In [26]:
In [27]:
scv.tl.paga(adata, groups='clusters')
scv.tl.paga(adata, groups='clusters', use_rna_velocity=True)
In [28]:
scv.pl.paga_compare(adata, basis='umap', threshold=.15, arrowsize=10, edge_width_scale=.5,
                    transitions='transitions_confidence', dashed_edges='connectivities')
/Users/volker.bergen/anaconda3/envs/py36/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:611: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  if cb.is_numlike(alpha):
In [ ]: