Consider removing <4 gene limit #79

bschilder · 2023-04-03T11:25:48Z

Currently EWCE::bootstrap_enrichment_test doesn't let you run tests where the number of hit genes is <4. @NathanSkene has noted this cutoff is arbitrary and could be removed. But we should first consider the potential statistical ramifications of small gene lists within the EWCE framework.

@bschilder

What are the dangers of reducing the number of genes, from a stats standpoint?

@Al-Murphy

The way I understand it, the bootstrapping works well because you are looking at the specificity averaged over a gene list. For example, consider looking at the specificity of just one gene. This changes the question: you are now basically asking whether that gene has a higher specificity than the average specificity across all genes (due to the random sampling of the background gene list). So 49% of genes tested would then be specific. I think when the number of genes you test is large, the chance of seeing a false positive (FP) drops. Does that make sense? It's hard to articulate.
I just think you shouldn't run EWCE for a single gene, as the probability of getting an enrichment in a cell type is much higher. I think this is a bit of an issue with EWCE in general, since people can just reduce the size of their gene lists to get significant results, like a form of p-hacking. Ideally, I guess you would add some penalisation weight for smaller gene lists to avoid the issue, but that would require some testing or theoretical statistical background calculations (where you keep the probability of finding enrichment equal regardless of gene list length).
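To make the single-gene case concrete, here is a minimal sketch of a bootstrap test on mean specificity. It is not EWCE's implementation: the function name `bootstrap_p_value`, the toy gene names, and the evenly spaced specificity scores are all assumptions made for illustration, and only one cell type is considered.

```python
import random

def bootstrap_p_value(specificity, hits, n_reps=10_000, seed=0):
    """Empirical p-value: fraction of random background gene lists (same
    size as `hits`) whose mean specificity is >= the observed mean.
    Hypothetical helper, not part of EWCE."""
    rng = random.Random(seed)
    background = list(specificity)
    m = len(hits)
    observed = sum(specificity[g] for g in hits) / m
    exceed = 0
    for _ in range(n_reps):
        sample = rng.sample(background, m)  # draw m genes without replacement
        if sum(specificity[g] for g in sample) / m >= observed:
            exceed += 1
    return exceed / n_reps

# Toy background (assumption): 100 genes with evenly spaced specificity scores.
toy = {f"gene{i}": i / 100 for i in range(100)}

# With a list of length one, the p-value is just the share of background
# genes scoring at least as high as the chosen gene, i.e. its rank.
p_top = bootstrap_p_value(toy, ["gene99"])  # top-ranked gene
p_mid = bootstrap_p_value(toy, ["gene50"])  # mid-ranked gene
print(p_top, p_mid)
```

In this toy setup the single-gene p-value only reflects where the gene sits in the background ranking, which is the change of question described above.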

We should:

1. Test the effect of hit gene list size on EWCE p-values.
2. Test the effect of hit gene list size on Fisher's exact test p-values.
3. Compare the distributions of p-values in both cases.
4. Perhaps look at some of the benchmarking results that Shuhan performed, or use her framework for testing these potential biases @ss8518
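As a starting point for the first item, a rough simulation could look something like the sketch below. It draws hit lists at random from the background (so there is no true enrichment) and records how often the bootstrap p-value falls at or below a threshold, for several list sizes. Everything here is an assumption for illustration: the helper names, the uniformly distributed toy specificity scores, the single cell type, and the small replicate counts. It does not call EWCE and does not cover the Fisher's exact test comparison.

```python
import random

def empirical_p(spec, hits, n_reps, rng):
    """Fraction of random same-size gene lists with mean specificity
    >= that of `hits`. Hypothetical helper, not EWCE code."""
    m = len(hits)
    observed = sum(spec[g] for g in hits) / m
    exceed = 0
    for _ in range(n_reps):
        sample = rng.sample(range(len(spec)), m)
        if sum(spec[g] for g in sample) / m >= observed:
            exceed += 1
    return exceed / n_reps

def null_false_positive_rate(list_size, n_genes=500, n_trials=200,
                             n_reps=200, alpha=0.05, seed=0):
    """Share of randomly drawn hit lists called significant at `alpha`."""
    rng = random.Random(seed)
    # Assumption: uniform toy scores; real specificity values are skewed.
    spec = [rng.random() for _ in range(n_genes)]
    significant = 0
    for _ in range(n_trials):
        hits = rng.sample(range(n_genes), list_size)  # null: random hit list
        if empirical_p(spec, hits, n_reps, rng) <= alpha:
            significant += 1
    return significant / n_trials

if __name__ == "__main__":
    for m in (1, 2, 4, 8, 16):
        print(m, null_false_positive_rate(m))
```

Replacing the toy scores with a real specificity matrix, and running the same hit lists through Fisher's exact test, would give the two p-value distributions to compare.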

Al-Murphy · 2023-04-03T11:42:23Z

I think we should be able to calculate the probability of enrichment for a gene list of length M theoretically, although I would need to think about how. For example, with an infinite number of bootstrap tests (N) and M=1, it would be Prob(enrich) = rank of the specificity of the single gene in M. For M>1, it gets a little more complex, since it depends on the mean specificity of the gene list and of the bootstrap background gene lists.
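One way to write this down, as a sketch under assumptions rather than a worked-out result: let $s_g$ be the specificity of gene $g$ in the cell type of interest, let the background contain $G$ genes, and assume bootstrap lists are drawn uniformly at random from the background with a "greater than or equal" comparison. The symbols $s_g$, $G$, $H$ and $B$ are introduced here for illustration.

```latex
% M = 1, infinitely many bootstrap replicates: the p-value depends only on rank.
p(g) \;=\; \frac{1}{G}\,\bigl|\{\, j : s_j \ge s_g \,\}\bigr|

% M > 1: compare the mean specificity of the hit list H with that of a
% random background list B of the same size M.
\bar{s}_H \;=\; \frac{1}{M}\sum_{g \in H} s_g ,
\qquad
p(H) \;=\; \Pr_{B}\!\left( \bar{s}_B \ge \bar{s}_H \right)
```

For $M=1$ the first expression is the gene's rank divided by $G$; for $M>1$ the probability depends on the sampling distribution of the mean of $M$ background scores, which is the more complex part.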

NathanSkene · 2023-04-03T14:41:10Z

I think the probability of finding significant hits with gene lists of length one is very low. It’s bootstrapping, so there are not really statistical ramifications: it measures the distribution empirically.

bschilder self-assigned this Apr 3, 2023

bschilder added the "help wanted" and "good first issue" labels Apr 3, 2023

bschilder added the "benchmarking" (EWCE benchmarking analyses) and "enhancement" labels and removed the "help wanted" label May 11, 2023
