Included initial exploratory analysis notebook #794

danhtruong · 2024-10-07T17:51:44Z

Hi there, thanks for your contribution!
Please fill out this template to help us review your code.

Remember, the more context you provide, the faster the review will go!
For more information about filling out this template, please see the OpenScPCA documentation on filing pull requests:
https://openscpca.readthedocs.io/en/latest/contributing-to-analyses/creating-pull-requests/

Before filing the pull request, you can also feel free to delete the italicized instruction lines in this template.

Purpose/implementation Section

In this section, tell reviewers about the scope and purpose of the code in the pull request.

Please link to the GitHub issue that this pull request addresses.

#701
If applicable, you can also link to the associated GitHub Discussion.

What is the goal of this pull request?

Assess the quality of the data set
Curate a list of marker genes associated with DSRCT cells
Detect the expression levels of DSRCT marker genes in the data set
Identify clusters in the DSRCT samples

Briefly describe the general approach you took to achieve this goal.

I analyzed the DSRCT marker expression in the dataset. I attempted to define a delineation between tumor and normal cells within the data.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes. The initial method has some flaws. Some of the data only contains DSRCT cells so there are no normal cells to compare to. The data quality is not as high as other data sets that I have analyzed. We will need to re-think how to use this dataset but I suspect we have enough DSRCT markers to identify dsrct cells. Unfortunately, the approach may be dataset dependent due to the lower quality of this dataset.

Results

Delete this section if no results are associated with your pull request.

In this section, tell reviewers about what kinds of results (if any) your code produces.

The results include plots for analyzing the data set as well as a processed RDS file containing a list of the SingleCellExperiment objects.

What is the name of your results bucket on S3?

researcher-905418339657-us-east-2

What types of results does your code produce (e.g., table, figure)?

RDS of processed data

What is your summary of the results?

Provide directions for reviewers

In this section, tell reviewers what kind of feedback you are looking for.
This information will help guide their review.

What are the software and computational requirements needed to be able to run the code in this PR?

R
ggplot2
SingleCellExperiment

This information will help reviewers run the code during review, if applicable.
For software, how should reviewers set up their environment (e.g., renv or conda) to run this code?
For compute, can reviewers run this code on their laptop, or do they need additional computational resources such as RAM or storage?
Please make sure this information, if applicable, is documented in the README.md file.

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

[x ] This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
[ x] The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jashapiro

Hello @danhtruong, and thank you for your submission!

I am going to split my comments into two parts: first there are a few technical issues that I want to address, and then I will talk a bit more about some of the more scientific questions.

Technical points

When you set up this module, it was quite minimal which was fine, but now that there is some code in here, we want to start adding a bit more of the reproducibility structure that we will need.

First, we will want to include the necessary renv setup. If you have not already done this, you can run renv::init() in the module directory and it should discover all the necessary packages, add an .Rprofile file, and create the local library. You will then need to add .Rprofile, renv.lock, and the renv directory to the branch with git add (most of the renv directory will be ignored).
The .Rprofile file that renv creates is not exactly what we will want to have in the repo when we create a docker image, so after running renv::init() you will want to manually update that file just a bit to have the following content:

# Don't activate renv in an OpenScPCA docker image
if(Sys.getenv('OPENSCPCA_DOCKER') != 'TRUE'){
  source('renv/activate.R')
}

It is also usually helpful to include a .Rproj file in the module directory which will make it easier to automatically use the renv library in subdirectories (if you use RStudio). In this case the file would be cell-type-dsrct.Rproj and should have the following content (other options may be fine too):

Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

If you have any questions about these steps, don't hesitate to ask. I can also make these changes and push them to the branch if that would be easier for you.

Please include the output of rendering your notebook: that can make review much smoother, as it may save reviewers from having to rerun the analysis. In this case, the html file may be quite large due to the number of plots. Were you maybe running into problems with pre-commit not allowing the files to be included? If so, we might want to save out the plots separately.

Scientific content

Can you provide more detail about the markers that you are using? You have listed the source for many, and the ones you find the most reliable, as "internal", but it would be very helpful if there were a somewhat more traceable source there. I recognize that this may be unpublished results, so we may have to discuss what the best way to describe that is, but we would at least want to include a name/lab so someone coming in would know where to go for more information.
Please use the library ids as the primary identifier as you work through the project. While in this case the sample and library have a 1:1 correspondence, that is not always the case, so we are trying to be consistent with the primary id.
Removing the library with low counts seems just fine to me for the final analysis, but I might have liked to see the distributions for that library as well in this exploratory notebook, just for reference
For your plots: I wonder if it might be worth having all libraries in a single plot rather than looping through and plotting them separately? You could combine all the data frames with something like the following:

combined_umap_df <- a <- umap_df_list |> dplyr::bind_rows(.id = "library_id")

then plot with

ggplot(combined_umap_df, aes(x = UMAP1, y = UMAP2, color = gene_expression)) +
  facet_grid(rows = vars(library_id), cols = vars(gene_symbol)) +
  ...

It may be that this ends up looking bad, but worth a try? I think it may actually be most useful for the expression distribution plots.

It also can simplify a lot of your lapply expressions later into single operations, with an extra group_by term for the library, etc.

Speaking of the expression distributions: It looks to me like you do have pretty good and comparable distributions without using z-scaling. In almost all cases you have a large number of zero or near-zero values (a histogram might make this more clear than a density plot) with a separate group that has positive values. For now, you could find the minimum of the distribution between the peaks and use that, and we could probably find something more sophisticated later.
The analysis of clusters does look potentially promising, but I think you will definitely want to explore the effects of clustering on that analysis. We are in the process of making some functions to make that easier, but you might start just by trying some different clustering algorithms/parameters: while we have a default clustering that we provided, we do not think these are the best clusters possible; just a starting point. I might put this part of the exploration into a second notebook, as I expect it to take a bit more space.

Other notes

I also made a number of line-level comments throughout. Some of these may be redundant with the above, but most are more fine-grained code suggestions.

Thanks again for the contribution! If you have any questions, please don't hesitate to reply to my comments here, or send a note in Slack. I look forward to seeing the next steps in your analysis!

jashapiro · 2024-10-07T20:16:58Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+repository_base <- rprojroot::find_root(rprojroot::is_git_root)
+
+# The current data directory, found within the repository base directory
+data_dir <- file.path(repository_base, "data", "2024-07-08")


We want to use the current directory where possible (a link to the latest data), so that the most current data is always used

Suggested change

data_dir <- file.path(repository_base, "data", "2024-07-08")

data_dir <- file.path(repository_base, "data", "current")

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

jashapiro · 2024-10-07T21:11:41Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+
+```{r}
+# source in helper functions: make_jaccard_matrix() and jaccard() 
+jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
+source(jaccard_functions)
+```


It looks like you are not using these at the moment, and the file is not present in this PR, so I am removing this for the moment.

Suggested change

```{r}

# source in helper functions: make_jaccard_matrix() and jaccard()

jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")

source(jaccard_functions)

```

```{r}

# source in helper functions: make_jaccard_matrix() and jaccard()

jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")

source(jaccard_functions)

```

jashapiro · 2024-10-07T21:36:23Z

analyses/cell-type-dsrct/references/tumor-marker-genes.tsv

+tumor	ST6GALNAC5	ENSG00000117069	internal
+tumor	CACNA2D2	ENSG00000007402	internal
+tumor	IQCJ-SCHIP1	ENSG00000283154	internal
+tumor	PTPRQ	ENSG00000139304	Internal


Is there any reference that you can point to that describes why these genes were chosen? If not, we would probably want to have something referring to the lab/PI so this can be traced past this repository.

Two articles describe them but its buried in the supplementary. Other than that, we are seeing it clearly in our datasets. We have one publicly available DSRCT PDX single-cell data set that I can compare against control samples where we'd see these same genes too.

It is probably worth including the references even if they are buried in the supplement. You can also include your lab as an unreferenced source, if you think that would be more useful... I think we just want to avoid "internal" as a reference as that makes it harder to trace.

jashapiro · 2024-10-08T14:53:36Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+# get an RDS of the processed data
+saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')


This seems to just be the file list? Did you want to also save the extracted data tables?

Should we save the extracted data not jus the file names?

Suggested change

# get an RDS of the processed data

saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')

# save an RDS of the processed data

saveRDS(umap_df_list, file.path(module_base, 'results/SCPCP000013_umap_df_list.rds')

Co-authored-by: Joshua Shapiro <[email protected]>

updated the notebook to include plot outputs updated the reference to include papers and internal reference

jashapiro · 2024-10-22T19:30:52Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+  tiff(filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.tiff'), width = 6, height = 6, units = 'in', res = 150)
+  print(p)
+  dev.off()


I'd suggest using ggsave to allow one-command saving, and png should be smaller than tiff (though this may be something to test).

Suggested change

tiff(filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.tiff'), width = 6, height = 6, units = 'in', res = 150)

print(p)

dev.off()

ggsave(

filename = paste0('../plots/marker_expression_', names(umap_df_list)[i], '.png'),

plot = p,

width = 6,

height = 6,

units = 'in',

dpi = 150

)

Please also include the plot files (and the rendered html if possible) in the submission.

Thanks. Will add.

updated the notebook to include plot outputs updated the reference to include papers and internal reference added plot outputs

jashapiro

Hi Danh,

Thank you for the updates; having the plots included is quite helpful.

As this is an exploratory notebook, I don't want to dwell on most of my suggestions below; they are designed mostly for future work, and I would not expect the ones about scaling to be incorporated.

The small changes of paths etc. should be easy to include though, and if there are any additional comments you would like to include, that would be good to do now as well. Then I would appreciate if you could run the notebook one more time to be sure that the html file is also up to date, and then we can get this PR merged.

jashapiro · 2024-10-23T15:42:46Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+Now we will transform each of the gene expression vectors by generating z-scores. 
+Then we might be able to find a cut off we can use across samples for if marker genes are present in a cell or not. 


I think this is fine for a rough look, but it is quite brittle. As you note later, some samples may contain only tumor cells. In that case, using the zscore may not be a good measure; what we really want is to find the division between the bimodal peaks. This is beyond the scope of this PR, but for your final classifications I would not expect you to want to use z-scoring.

jashapiro · 2024-10-23T15:48:28Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+
+We could either count any cell that expresses at least one marker gene > 0 as a tumor cell, or look at the combined expression. 
+Let's start with classifying tumor cells as tumor if any marker gene is present (z-score > 0). 
+This also means the sum of all z-scores > 0. 


I am confused by this statement. Why would having one marker gene > 0 mean the sum of all is > 0?

Again, I'd probably not want to use the z scores for this, but maybe instead scaling just by range (maybe the 90% quartile or something like that?), but leaving zero as the lower bound? The cutoff may still be hard to draw, but z scoring often breaks when there are not two relatively equal classes. (I should say, one of my current projects is to find a robust way to automatically segment these kinds of distributions.)

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

jashapiro · 2024-10-23T15:53:43Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+# get an RDS of the processed data
+saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')


Should we save the extracted data not jus the file names?

Suggested change

# get an RDS of the processed data

saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')

# save an RDS of the processed data

saveRDS(umap_df_list, file.path(module_base, 'results/SCPCP000013_umap_df_list.rds')

modified the saves Co-authored-by: Joshua Shapiro <[email protected]>

jashapiro · 2024-10-24T20:31:40Z

analyses/cell-type-dsrct/exploratory_analysis/01-marker-gene-tumor-classification.Rmd

+<<<<<<< HEAD
+      '/plots/UMAP_classification_cluster_',
+=======
+      '../plots/UMAP_classification_cluster_',
+>>>>>>> danhtruong/danhtruong/cell-type-dsrct


Danh: there are a number of places where it seems like your latest changes have introduced merge artifacts like this one. Can you please fix these and rerender the notebook? I would expect to see a newer version of the html that matches the latest updates.

Apologies... on closer inspection it looks like you did rerun the notebook, but then a later merge brought in the above error (which should have been caught by our pre-commit tests, but wasn't for some reason!). I have reverted the offending commit and merged in the changes from the main repository, so this should be ready to go, but you will need to pull down the changes from this branch (or start a new branch from main after this is merged in) before proceeding with your next steps.

This reverts commit 58eda51, reversing changes made to b16d3d4.

…e-dsrct

jashapiro

I am approving this PR to get this exploratory notebook in. I do think there may be more work to do about finding the cutoff points for marker gene expression, but it is useful to have these explorations recorded in the repository in their current state.

Included initial exploratory analysis notebook

f693993

danhtruong requested a review from jaclyn-taroni as a code owner October 7, 2024 17:51

jaclyn-taroni requested review from jashapiro and removed request for jaclyn-taroni October 7, 2024 18:59

jashapiro reviewed Oct 8, 2024

View reviewed changes

danhtruong and others added 3 commits October 15, 2024 10:55

Apply suggestions from code review

d969213

Co-authored-by: Joshua Shapiro <[email protected]>

Adding the renv and ancillary files

05c92ac

updated the notebooks

b0fac94

updated the notebook to include plot outputs updated the reference to include papers and internal reference

danhtruong requested a review from jashapiro October 22, 2024 15:59

jashapiro reviewed Oct 22, 2024

View reviewed changes

danhtruong added 2 commits October 22, 2024 15:15

updated the notebooks

907bca9

updated the notebook to include plot outputs updated the reference to include papers and internal reference added plot outputs

updated the notebooks

69d505c

updated the notebook to include plot outputs updated the reference to include papers and internal reference added plot outputs

jashapiro reviewed Oct 23, 2024

View reviewed changes

danhtruong and others added 3 commits October 23, 2024 11:28

Apply suggestions from code review

cbf9fa2

modified the saves Co-authored-by: Joshua Shapiro <[email protected]>

Apply suggestions from code review

b16d3d4

modified the saves Co-authored-by: Joshua Shapiro <[email protected]>

Apply suggestions from code review

58eda51

modified the saves Co-authored-by: Joshua Shapiro <[email protected]>

danhtruong requested a review from jashapiro October 23, 2024 18:59

jashapiro reviewed Oct 24, 2024

View reviewed changes

jashapiro added 2 commits October 24, 2024 15:37

Revert "Apply suggestions from code review "

da3771b

This reverts commit 58eda51, reversing changes made to b16d3d4.

Merge remote-tracking branch 'upstream/main' into danhtruong/cell-typ…

85ac5e1

…e-dsrct

jashapiro approved these changes Oct 24, 2024

View reviewed changes

jashapiro merged commit bb58c2d into AlexsLemonade:main Oct 24, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Included initial exploratory analysis notebook #794

Included initial exploratory analysis notebook #794

danhtruong commented Oct 7, 2024

jashapiro left a comment

jashapiro Oct 7, 2024

jashapiro Oct 7, 2024

jashapiro Oct 7, 2024

danhtruong Oct 11, 2024

jashapiro Oct 11, 2024

jashapiro Oct 8, 2024

jashapiro Oct 23, 2024

jashapiro Oct 22, 2024

jashapiro Oct 22, 2024

danhtruong Oct 22, 2024

jashapiro left a comment

jashapiro Oct 23, 2024

jashapiro Oct 23, 2024

jashapiro Oct 23, 2024

jashapiro Oct 24, 2024

jashapiro Oct 24, 2024

jashapiro left a comment

	data_dir <- file.path(repository_base, "data", "2024-07-08")
	data_dir <- file.path(repository_base, "data", "current")

		# get an RDS of the processed data
		saveRDS(sce_file_list, '../results/SCPCP000013_sce_file_list.rds')

		Now we will transform each of the gene expression vectors by generating z-scores.
		Then we might be able to find a cut off we can use across samples for if marker genes are present in a cell or not.

Included initial exploratory analysis notebook #794

Included initial exploratory analysis notebook #794

Conversation

danhtruong commented Oct 7, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

Reproducibility checklist

jashapiro left a comment

Choose a reason for hiding this comment

Technical points

Scientific content

Other notes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment