Consider removing <4 gene limit #79

Open
bschilder opened this issue Apr 3, 2023 · 2 comments

@bschilder
Collaborator

bschilder commented Apr 3, 2023

Currently EWCE::bootstrap_enrichment_test doesn't let you run tests where the number of hit genes is <4. @NathanSkene has noted this cutoff is arbitrary and could be removed. But we should first consider the potential statistical ramifications of small gene lists within the EWCE framework.

@bschilder

What are the dangers of reducing the number of genes, from a stats standpoint?

@Al-Murphy

The way I understand it, the bootstrapping works well because you are looking at the specificity averaged over a gene list. For example, suppose you are looking at the specificity of just one gene. This changes the question: you are now basically asking whether that gene has a higher specificity than the average specificity across all genes (due to the random sampling of the background gene list). So 49% of genes tested would then be specific. I think when the number of genes you test is large, the chance of seeing a false positive (FP) drops. Does that make sense? It's hard to articulate.
I just think you shouldn't run EWCE on such small gene lists, as the probability of getting an enrichment in a cell type is much higher. I think this is a bit of an issue with EWCE in general, since people can just reduce the size of their gene lists to get significant results, like a form of p-value hacking. Ideally, I guess you would add some penalisation weight for smaller gene lists to avoid the issue, but that would require some testing or theoretical statistical calculations (where you keep the probability of finding enrichment equal regardless of gene list length).
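
One part of this argument that can be checked directly is how the bootstrap null distribution changes with list size. The following is a minimal sketch, not EWCE's implementation: `bootstrap_null_means` is a hypothetical helper, and the background specificity values are made-up uniform random numbers standing in for one cell type's specificity scores. It only shows that the mean specificity of a random gene list varies less as the list gets longer.

```python
import random
import statistics


def bootstrap_null_means(specificity, list_size, n_reps, rng):
    """Mean specificity of n_reps random gene lists of the given size."""
    return [
        statistics.fmean(rng.sample(specificity, list_size))
        for _ in range(n_reps)
    ]


rng = random.Random(0)
# Hypothetical background: specificity of 2000 genes in one cell type.
specificity = [rng.random() for _ in range(2000)]

null_sd = {}
for list_size in (1, 4, 16, 64):
    means = bootstrap_null_means(specificity, list_size, 2000, rng)
    # Spread of the null distribution of mean specificity for this list size.
    null_sd[list_size] = statistics.stdev(means)
    print(list_size, round(null_sd[list_size], 3))
```

With a single gene, the null is simply the spread of individual gene specificities; with longer lists the null narrows, so a list-level shift in mean specificity is easier to separate from noise. Whether this translates into a different false positive rate is a separate question that needs its own test.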

We should:

  1. Test the effect of hit gene list size on EWCE p-values.
  2. Test the effect of hit gene list size on Fisher's exact test p-values.
  3. Compare the distributions of p-values in both cases.
  4. Perhaps look at some of the benchmarking results that Shuhan performed, or use her framework for testing these potential biases @ss8518
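
A rough sketch of how step 1 could be set up is below. It is a simplified stand-in for the bootstrap test, not the EWCE code: the function names, the uniform random background, and the choice of list sizes are all assumptions made for illustration. "Hit" lists are drawn at random from the background, so any list called significant is a false positive by construction.

```python
import random
import statistics


def bootstrap_p_value(hit_mean, specificity, list_size, n_boot, rng):
    """Fraction of random gene lists whose mean specificity >= the hit mean."""
    exceed = sum(
        statistics.fmean(rng.sample(specificity, list_size)) >= hit_mean
        for _ in range(n_boot)
    )
    return exceed / n_boot


def null_false_positive_rate(specificity, list_size, n_lists, n_boot, alpha, rng):
    """Share of randomly drawn 'hit' lists that are called significant."""
    significant = 0
    for _ in range(n_lists):
        hits = rng.sample(specificity, list_size)
        p = bootstrap_p_value(
            statistics.fmean(hits), specificity, list_size, n_boot, rng
        )
        significant += p <= alpha
    return significant / n_lists


rng = random.Random(1)
# Hypothetical background: specificity of 500 genes in one cell type.
specificity = [rng.random() for _ in range(500)]

fp_rates = {
    size: null_false_positive_rate(specificity, size, 200, 200, 0.05, rng)
    for size in (1, 2, 4, 16)
}
for size, rate in fp_rates.items():
    print(size, rate)
```

Comparing the printed rates across list sizes gives a first indication of whether small lists inflate false positives in this simplified setting. The same loop could be repeated with Fisher's exact test in place of `bootstrap_p_value` for steps 2 and 3, and with real specificity matrices rather than uniform noise.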
@Al-Murphy
Collaborator

I think we should be able to calculate the probability of enrichment for a gene list of length M theoretically, although I would need to think about how. For example, with an infinite number of bootstrap tests (N) and M=1, Prob(enrich) would be given by the rank of the specificity of the one gene in the list. For M>1, it gets a little more complex, since it depends on the mean specificity of the gene list and of the bootstrap background gene lists.
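
The M=1 case can be illustrated with a small sketch. This is an assumption-laden toy, not EWCE code: `rank_p_value` and `bootstrap_p_value_single` are hypothetical helpers, and the background is uniform random noise. It compares the rank-based value (the share of background genes at least as specific as the gene of interest) with what a large but finite number of single-gene bootstrap draws gives.

```python
import random


def rank_p_value(gene_specificity, background):
    """Share of background genes at least as specific as the given gene."""
    return sum(s >= gene_specificity for s in background) / len(background)


def bootstrap_p_value_single(gene_specificity, background, n_boot, rng):
    """Bootstrap estimate for M=1: each replicate draws one background gene."""
    draws = (rng.choice(background) for _ in range(n_boot))
    return sum(s >= gene_specificity for s in draws) / n_boot


rng = random.Random(2)
# Hypothetical background: specificity of 1000 genes in one cell type.
background = [rng.random() for _ in range(1000)]
gene = background[0]  # the single "hit" gene

exact = rank_p_value(gene, background)
approx = bootstrap_p_value_single(gene, background, 20000, rng)
print(exact, approx)
```

As N grows, the bootstrap estimate approaches the rank-based value, which is the limiting behaviour described above for M=1. For M>1 the equivalent closed form would need the distribution of the mean of M sampled specificities, which this sketch does not attempt.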

@NathanSkene
Owner

NathanSkene commented Apr 3, 2023 via email

@bschilder bschilder added benchmarking EWCE benchmarking analyses enhancement and removed help wanted labels May 11, 2023