--- title: "Analyze Silac data from MaxQuant" author: "Roman Mylonas" date: "2/7/2018" output: html_document --- ## Prerequisites It is recommended to execute the following exercises using RStudio (even though any R console will work). For the plots you will need the package “ggplot2”. In case it’s not installed yet, execute `install.packages(‘ggplot2’)`. Load "ggplot2": ```{r load_packages} library('ggplot2') ``` ## 1. Load MaxQuant (MQ) results Load the results from MQ together with it's headers. You will have to adapt `result_dir`: ```{r load_results} result_dir <- "/Users/admin/Documents/presentations/proteomics_SIB_2018/Course_II_advanced/R_exercise" result_file_name <- "proteinGroups_R_exp14_M1.txt" result_file <- paste(result_dir, result_file_name, sep="/") res <- read.csv(result_file, sep="\t") ``` ## 2. Filter Remove all proteins marked as `+` in the column *Reverse*: ```{r filter_reverse} res_2 <- subset(res, Reverse != "+") ``` Remove all contaminants. The table *proteinGroups.txt* from MQ has a column *Contaminant*. In this example we added an additional column named *Contaminant_2* to mark further proteins we manually identified as contaminants: ```{r filter_contaminants} res_3 <- subset(res_2, Contaminant != "+" & Contaminant_2 != "+") ``` Now we should have `4989` protein groups left. Check the number of proteins with `nrow(res_3)`. ## 3. Analysis 1 ### 3.1 Histogram of H/L ratios There are 6 different columns containing H/L ratios. Plot a histogram for each of them individually by adapting the variable `H_L_Ratio_column`: ```{r ratio_hist_1, echo=TRUE, eval=FALSE} H_L_Ratio_column <- "Ratio.H.L.exp14.rep1.20h" p <- ggplot(res_3, aes_string(x=H_L_Ratio_column)) + geom_histogram() + xlim(0,5) print(p) ``` You can also generate all plots automatically by first selecting the 6 columns containing the term "Ratio.H.L" and then loop through each of them: ```{r ratio_hist_2, echo=TRUE, eval=FALSE} H_L_Ratio_columns <- grep("Ratio.H.L", colnames(res_3), value=TRUE) for(column_name in H_L_Ratio_columns){ p <- ggplot(res_3, aes_string(x=column_name)) + geom_histogram() + xlim(0,5) print(p) } ``` ### 3.2 Summary statistics To get a basic summary statistics of our results table `res_3` we can simply call: ```{r sum_stat_1, echo=TRUE, eval=FALSE} summary(res_3) ``` But we might want to call it only on the columns containing H/L Ratios: ```{r sum_stat_2, echo=TRUE, eval=FALSE} H_L_Ratio_columns <- grep("Ratio.H.L", colnames(res_3), value=TRUE) summary(res_3[,H_L_Ratio_columns]) ``` ### 3.3 Scatterplot of H/L ratios Let's create a scatterplot of H/L ratios for `Ratio.H.L.Normalized.exp14.rep1.20h` and `Ratio.H.L.Normalized.exp14.rep1.20h`: ```{r scatterplot_1, echo=TRUE, eval=FALSE} H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h" # adapt this variable to plot the second replicate p <- ggplot(res_3, aes_string(x="id", y=H_L_Ratio_column)) + geom_point(alpha=0.3) print(p) ``` Note the distribution of ratios. Is it easy to evaluete the data? ## 4. LOG transformation Make log2 transformation of the normalized H/L Ratios columns: ```{r log_transform, echo=TRUE, eval=TRUE} H_L_Ratio_columns <- grep('Ratio.H.L.Normalized', colnames(res_3), value=TRUE) log2_H_L_Ratios <- log2(res_3[,H_L_Ratio_columns]) # add a "log2" to the column names and add them back to the table colnames(log2_H_L_Ratios) <- paste0(colnames(log2_H_L_Ratios), ".log2") res_4 <- cbind(res_3, log2_H_L_Ratios) ``` ## 5. Analysis 2 ### 5.1 Scatterplot of log2 Create the same scatterplot as before, but this time on the log2 data: ```{r scatterplot_2, echo=TRUE, eval=FALSE} H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h.log2" p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) + geom_hline(yintercept = 0) + geom_point(alpha=0.3) print(p) ``` Select proteins above a certain threshold (e.g. 1) in replicate 1 and look if they're also high in the other replicates: ```{r select_high_1, echo=TRUE, eval=FALSE} res_above_threshold <- subset(res_4, Ratio.H.L.Normalized.exp14.rep1.20h.log2 > 1) H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h.log2" p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) + geom_hline(yintercept = 0) + geom_point(alpha=0.3) + geom_point(data=res_above_threshold, aes_string(x="id", y=H_L_Ratio_column), col="red") print(p) ``` Now let's plot the corresponding H/L ratios for the second replicate: ```{r select_high_2, echo=TRUE, eval=FALSE} H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep2.20h.log2" p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) + geom_hline(yintercept = 0) + geom_point(alpha=0.3) + geom_point(data=res_above_threshold, aes_string(x="id", y=H_L_Ratio_column), col="red") print(p) ``` ### 5.2 Multiscatterplot and Pearson correlation We can look at the correlations between the different replicates using a multiscatterplot: ```{r multiscatter_1, echo=TRUE, eval=FALSE} H_L_Ratio_columns <- grep("Ratio.H.L.*.log2", colnames(res_4), value=TRUE) pairs(res_4[,H_L_Ratio_columns]) ``` The Pearson correlation can be computed with the following command: ```{r correlation_1, echo=TRUE, eval=FALSE} cor(res_4[,H_L_Ratio_columns], method="pearson", use="pairwise.complete.obs") ``` ## 6. Filter more We filter on the number of peptides observed per protein groups. We're keeping only proteins which have > 2 proteins: ```{r filter_on_peptides, echo=TRUE, eval=TRUE} res_5 <- subset(res_4, Peptides..seq. > 2) ``` And we make again a multi scatterplot and compute the Pearson correlation: ```{r multiscatter_2, echo=TRUE, eval=FALSE} # multi scatterplot pairs(res_5[,H_L_Ratio_columns]) # Pearson correlation cor(res_5[,H_L_Ratio_columns], method="pearson", use="pairwise.complete.obs") ``` Is the correlation better now? ## 7. Statistical tests Compute the t-test for the normalized log2 ratios: ```{r stat_test_1, echo=TRUE, eval=FALSE} p_vals <- apply(res_5[,H_L_Ratio_columns], 1, function(x) { # we have to wrap the test into a tryCatch, since the t-test is not working when there are too many NA's tryCatch({ one_t_test <- t.test(x, mu=0, var.equal = FALSE) one_t_test$p.value }, error=function(cond){ NA }) }) # number of significant p-values sum(p_vals <= 0.05, na.rm=TRUE) ``` But we have to adjust for multiple testing (e.g. with Benjamini-Hochberg): ```{r stat_test_2, echo=TRUE, eval=FALSE} p_vals_adj <- p.adjust(p_vals, method="BH") ``` How many proteins pass t-test with p-value < 0.05 now? ## *Generation of this document* *This file was generated from "analyze_silac_MQ.Rmd" using *knitr*. From RStudio you can generate it by clicking on "Knit". If you don't have the package "knitr" installed, you will have to execute the following command first* `install.packages('knitr')`.