Visualizing single cell data using Seurat – a beginner’s guide
In the single cell field, large amounts of data are produced but bioinformaticians are scarce. Therefore, it is an important (and much sought-after) skill for biologists who are able take data into their own hands.
Luckily, there have been a range of tools developed that allow even data analysis noobs to get to grips with their single cell data. For a lot of us the obvious (and easiest) answer will be to use some form of guide user interface (GUI), such as those provided by companies like Partek (watch this webinar to learn more) that enables us to go from raw data all the way to visualization.
Although convenient, options offered for customization of analysis tools and plot appearance in GUI are somewhat limited. Combining dropSeqPipe (dSP) for pre-processing with Seurat for post-processing offers full control over data analysis and visualization.
The dSP pipeline with all its tools is designed to provide a reproducible, almost automatic, workflow that goes from raw reads (FASQ files) to basic data visualization. This is the point at which a specific experimental design requires manual intervention, for instance, when generating graphs. This is usually the exciting bit and it cannot be automated as requirements are often specific to a researcher’s needs.
dSP produces output that is tailored for a quasi-standard data visualization software in the single-cell world called Seurat and Scater. Seurat and Scater are package that can be used with the programming language R (learn some basic R here) enabling QC, analysis, and exploration of single-cell RNA-seq data. To help you get started with your very own dive into single cell and single nuclei RNA-Seq data analysis we compiled a tutorial on post-processing of data with R using Seurat tools from the famous Satija lab.
1. The tools you need to follow the tutorial
2. An Introduction to R studio and its features
3. Step 1: Downloading R and R studio
4. Step 2: Defining the working directory
5. Step 3: Extracting the meta data from the Seurat object
6. Step 4: Data QC
7. Step 5 – Normalizing the data
7. Step 6: Visualization of marker gene expression
8. Step 7: Dimension reduction plots
What tools will you need?
- An installed R and R studio package
- A Seurat object from one of your scRNA-Seq or sNuc-Seq projects.
- A computer…but that probably goes without saying
Don’t have any of this? (Well hopefully you’ll have the computer…we can’t help very much with that) but otherwise don’t you worry, you can find a detailed step by step introduction below on how to install R and R studio and we have placed a Seurat object here ready for you to download and play with.
Ticking all the boxes? Great! You can go straight to Step 1: Downloading R and R studio.
Disclaimer: This is for absolute beginners, if you are comfortable working with R and Seurat objects, I would suggest going to the Satija lab webpage straight away.
An Introduction to R studio and its features
If you have never used R, have a quick read of this introduction which familiarizes you with the most basic features of the program. There is plethora of analysis types that can be done with R and it is a very good skill to have! Intrigued? To learn more about R read this in depth guide to R by Nathaniel D. Phillips.
Of course, you could write all your code in the console, however. none of that would be saved. Best practice is to save it in a script that will allow you to access it again once a new data set comes your way. Generally speaking, an R script is just a bunch of R code in a single file.
Writing a new script
To start writing a new R script in RStudio, click File – New File – R Script.
Execution of commands
If you would like to execute one of the commands in the script, just highlight the command and press Ctrl + Enter. You will see it appearing in the Console window.
Anything starting with a # is a comment, meaning that even if executed in the command line it won’t be read by R. It is simply helpful for the user to explain the purpose of the command that is written below.
2. Global environment
This is where R stores all the objects and variables created during a session. In the same location you can also find “History”, which lists all the commands executed during a session.
This is the window in which you can type R commands, execute them and view the results (except plots). You can find some information on how to make your work with R more productive here.
4. Plots/ Help
This is the window in which R will print the plots generated and open the help tab if in the console
?function is executed.
Step 1: Downloading R and R studio
Start with installing R and R-Studio on your computer. Downloads for Windows and macOS can be found in the links below, install both files and run R studio. R will provide you with the necessary software to write and execute R commands, R studio is helpful as it provides a nice graphical interface for the daily use of R.
R Studio download
Downloading a Seurat object
You can find a Seurat object here, which is some mouse lung scRNA-Seq from Nadia data for you to play with.
Installing relevant packages
When you first open R Studio it will pretty much be a blank page. Below are some packages that you will need to install to be able to use the code presented in this tutorial. This only needs to be done once after R is installed.
Note!: All code must be entered in the window labelled
Console. Copy past the code at the
> prompt and press enter, this will start the installation of the packages below. You will know that the script is completed if R displays a fresh
> prompt in the console.
Note: After installing
BiocManager::install('multtest') R will ask to
Update all/some/none? [a/s/n]: enter
n to not update other packages.
Step 2: Defining the working directory
In order for R to find your Seurat object you will need to tell the program where it is saved, this location is called your working directory. This step will show you how to set this directory.
# Note you can copy the path from windows however you will have to change all to /setwd("insert path")
Loading the relevant libraries
This step will install required packages and load relevant libraries for data analysis and visualization.
Note!: Libraries need to be loaded every time R is started
Loading the Seurat object into R global environment
Its good practice to save every data set that is uploaded into R under a specific name (variable) in the global environment in R. This will allow you to transform or visualize that data simply by calling its variable. This is also true for the Seurat object when it is first loaded into R.
Note!: The Seurat object file must be saved in the working directory defined above, or else R won’t be able to find it.
#This loads the Seurat object into R and saves it in a variable called ‘seuratobj’ in the global environmentseuratobj <- readRDS("R_Seurat_objects_umap.rds")
Step 3: Extracting the meta data from the Seurat object
A Seurat object contains a lot of information including the count data and experimental meta data. The count data is saved as a so-called matrix within the seurat object, whereas, the meta data is saved as a data frame (something like a table). Meta data stores values such as numbers of genes and UMIs and cluster numbers for each cell (barcode). Just like with the Seurat object itself we can extract and save this data frame under a variable in the global environment.
#Saves the data frame meta data in a variable called ‘meta.data’ in the global environmentmeta.data <- email@example.com
Data frames are standard data types in R and there is a lot we can do with it. To learn more on what to do with data frames, have look here
#This will show you the first 7 lines of your data framehead(meta.data)
orig.ident nCount_RNA nFeature_RNA cellNames samples barcode expected_cells read_lengthCT2_S249_AAAACTCACC CT2_S249 784 628 CT2_S249_AAAACTCACC CT2_S249 AAAACTCACC 600 150CT2_S249_AAAAGGTGTG CT2_S249 1747 1114 CT2_S249_AAAAGGTGTG CT2_S249 AAAAGGTGTG 600 150CT2_S249_AAAATCAATC CT2_S249 3906 2134 CT2_S249_AAAATCAATC CT2_S249 AAAATCAATC 600 150CT2_S249_AAAATTCTTA CT2_S249 3344 1799 CT2_S249_AAAATTCTTA CT2_S249 AAAATTCTTA 600 150CT2_S249_AAACAAAGGC CT2_S249 5323 2732 CT2_S249_AAACAAAGGC CT2_S249 AAACAAAGGC 600 150CT2_S249_AAACGACCGA CT2_S249 906 658 CT2_S249_AAACGACCGA CT2_S249 AAACGACCGA 600 150batch nCounts pct.sribo pct.lribo pct.Ribo pct.mito top50 umi.per.gene RNA_snn_res.0.8CT2_S249_AAAACTCACC CT2 24665 0.02933673 0.01658163 0.04591837 0.005102041 0.1964286 1.248408 9CT2_S249_AAAAGGTGTG CT2 85066 0.05266171 0.04121351 0.09387521 0.024613623 0.2329708 1.568223 5CT2_S249_AAAATCAATC CT2 208598 0.03763441 0.04121864 0.07885305 0.012032770 0.1738351 1.830366 5CT2_S249_AAAATTCTTA CT2 187974 0.02601675 0.03887560 0.06489234 0.018540670 0.2221890 1.858810 5CT2_S249_AAACAAAGGC CT2 132907 0.02423445 0.02141649 0.04565095 0.009956791 0.1666354 1.948389 9CT2_S249_AAACGACCGA CT2 61304 0.02759382 0.04856512 0.07615894 0.014348786 0.2240618 1.376900 3seurat_clustersCT2_S249_AAAACTCACC 9CT2_S249_AAAAGGTGTG 5CT2_S249_AAAATCAATC 5CT2_S249_AAAATTCTTA 5CT2_S249_AAACAAAGGC 9CT2_S249_AAACGACCGA 3
It is usually a good idea to play around and inspect the data, you can for example try
View(meta.data). The example below allows you to check which samples are stored in the Seurat object.
# Show samples stored in the data frame# orig.ident contains sample namesunique(meta.data$orig.ident)
 CT2_S249Levels: CT2_S249
Step 4: Data QC
Before starting to dive deeper into your data its beneficial to take some time for selection and filtration of cells based on some QC metrics. This can be easily done with Seurat looking at common QC metrics such as:
- The number of unique genes/ UMIs detected in each cell.
- The percentage mitochondrial/ ribosomal reads per cell
Read more to this topic here under “Standard pre-processing workflow”.
#Creates a violin plot for the number of UMIs ('nFeature_RNA'), the number of genes ('nCount_RNA'), % ribosomal RNA (‘pct.Ribo’) and % mitochondrial RNA (’pct.mito’) for each sampleVlnPlot(object = seuratobj, features = c('nFeature_RNA','nCount_RNA','pct.Ribo','pct.mito'), group.by = 'orig.ident',pt.size = 0,ncol=2)
# FeatureScatter can be used to visualize feature-feature relationships such as number of genes ("nFeature_RNA") vs number of UMIs ("nCount_RNA") FeatureScatter(seuratobj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",group.by = 'orig.ident')
Step 5 – Normalizing the data
#LogNormalize (default) normalizes measurements for each cell by the total expression, multiplies by a scale factor of 10,000 by default seuratobj <- NormalizeData(seuratobj, normalization.method = “LogNormalize”,scale.factor = 10000)```
Step 6 – Visualization of marker gene expression
Finding marker genes
# Finds all markers of cluster 1cluster1.markers <- FindMarkers(seuratobj, ident.1 = 1, min.pct = 0.25)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=07s
# Shows top 5 markers genes of cluster 1head(cluster1.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adjCntn1 1.918870e-07 2.3485859 0.3 0.026 0.004927659Sftpa1 2.927622e-04 1.7821664 0.5 0.141 1.000000000Tmsb10 3.797597e-04 -2.9132579 0.0 0.637 1.000000000B2m 3.827411e-04 -0.9140352 0.1 0.831 1.000000000Tpm3 5.451834e-04 -2.3310006 0.0 0.617 1.000000000
Define set of marker genes
In order to create dot plots, heat maps or feature plots a list of genes of interests (features) need to be defined.
features <- c("Scgb1a1","Cd79b","Mzb1","Plac8","Trbc2","Cd8b1","Cd4","Cd3e","Cd3d","Tmem100","Vwf","Ednrb","Pecam1","Col1a2","Inmt","Dcn","Itgam","Itgax","C1qb","Sftpd","Sftpb","Sftpc","Foxj1","Scgb3a2","Ear2","Ager","Hopx","Pdpn","Rtkn2","Igfbp2")
Scaling, or linear transformation, is a standard pre-processing step prior to dimension reduction techniques.
all.genes <- rownames(seuratobj) seuratobj <- ScaleData(seuratobj, features = all.genes)
Note: Scaling also helps make sure that highly-expressed genes don’t dominate the heat map.
DoHeatmap(object = seuratobj, features = features)
DotPlot(seuratobj, features = features) + RotatedAxis()
Highlight marker gene expression in dimension reduction plot such as UMAP or tSNE.
To reduce computing time we only select a few features
#selected marker genes for cell typefeatures <- c("Cd8b1","Trbc2","Ly6c2","Cd4")#UMAP feature plot colour coded by defined featureFeaturePlot(seuratobj, features = features,reduction = "umap")
Step 7: Dimension reduction plots
The goal of dimension reduction plots is to visualize single cell data by placing similar cells in close proximity in a low-dimensional space. Seurat offers non-linear dimension reduction techniques such as UMAP and tSNE.
seuratobj <- RunPCA(seuratobj, features = VariableFeatures(object = seuratobj)) DimPlot(seuratobj, reduction = “pca”) ElbowPlot(seuratobj) #to determine the dimensionality of the dataset
seuratobj <- RunPCA(seuratobj, features = VariableFeatures(object = seuratobj)) DimPlot(seuratobj, reduction = “pca”) ElbowPlot(seuratobj) #to determine the dimensionality of the dataset If you haven’t installed UMAP, type reticulate::py_install(packages=’umap-’learn’) seuratobj <- RunUMAP(seuratobj, dims = 1:10)DimPlot(seuratobj, dims=c(1,2), reduction = "umap",label=TRUE) SaveRDS(seuratobj, file=”../output/filename.rds”)
#similar to the PCA and UMAP method, run RunTSNE seuratobj <- RunTSNE(seuratobj) DimPlot(seuratobj, reduction = "tsne")
Many more visualization option for your data can be found under vignettes on the Satija lab website.
This vignette is very useful if you are trying to compare two conditions.
We hope this tutorial was useful to you and that it will enable to you to take data into your own hands. Should you have any questions you can contact us here .