Merge pull request #94 from KevinMenden/development

Bug fix release v1.1.1
KevinMenden · May 22, 2021 · 03bb0a6
2 parents 3028486 + 0b1993c
commit 03bb0a6
Showing 11 changed files with 130 additions and 99 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,11 @@
 # Scaden Changelog
 
+## Version 1.1.1
+
+* Fixed bugs in scaden model definition [[#88](https://github.com/KevinMenden/scaden/issues/88)]
+* removed installation instructions for bioconda as not functional at the moment [[#86](https://github.com/KevinMenden/scaden/issues/86)]
+* Fixed bug in `scaden example` [[#85](https://github.com/KevinMenden/scaden/issues/85)]
+
 ## Version 1.1.0
 
 * Reduced memory usage of `scaden simulate` significantly by performing simulation for one dataset at a time.
@@ -25,13 +31,13 @@ of simulated datasets
 * Rebuild Scaden model and training to use TF2 Keras API instead of the old compatibility functions 
 * added `scaden example` command which allows to generate example data for test-running scaden and to inspect the expected file format
 * added more tests and checks input reading function in `scaden simulate`
-* fixed bug in reading input data 
+* fixed bug in reading input data
 
 ### Version 0.9.6
 
-+ fixed Dockerfile (switched to pip installation)
-+ added better error messages to `simulate` command
-+ cleaned up dependencies
+* fixed Dockerfile (switched to pip installation)
+* added better error messages to `simulate` command
+* cleaned up dependencies
 
 ### v0.9.5
 

diff --git a/README.md b/README.md
@@ -1,48 +1,40 @@
-![Scaden](docs/img/scaden_logo.png)
+# Single-cell assisted deconvolutional network
 
+![Scaden](docs/img/scaden_logo.png)
 
-![Scaden version](https://img.shields.io/badge/scaden-v1.1.0-cyan)
+![Scaden version](https://img.shields.io/badge/scaden-v1.1.1-cyan)
 ![MIT](https://img.shields.io/badge/License-MIT-black)
 ![Install with pip](https://img.shields.io/badge/Install%20with-pip-blue)
 [![Downloads](https://pepy.tech/badge/scaden)](https://pepy.tech/project/scaden)
 ![Docker](https://github.com/kevinmenden/scaden/workflows/Docker/badge.svg)
 ![Scaden CI](https://github.com/kevinmenden/scaden/workflows/Scaden%20CI/badge.svg)
 
-## Single-cell assisted deconvolutional network
-
 Scaden is a deep-learning based algorithm for cell type deconvolution of bulk RNA-seq samples. It was developed 
-at the DZNE Tübingen and the ZMNH in Hamburg. 
+at the DZNE Tübingen and the ZMNH in Hamburg.
 The method is published in Science Advances:
  [Deep-learning based cell composition analysis from tissue expression profiles](https://advances.sciencemag.org/content/6/30/eaba2619)
 
 A complete documentation is available [here](https://scaden.readthedocs.io)
 
-
 ![Figure1](docs/img/figure1.png)
 
 Scaden overview. a) Generation of artificial bulk samples with known cell type composition from scRNA-seq data. b) Training 
 of Scaden model ensemble on simulated training data. c) Scaden ensemble architecture. d) A trained Scaden model can be used
 to deconvolve complex bulk mixtures.
 
-
-
 ## Installation guide
-Scaden can be easily installed on a Linux system, and should also work on Mac. 
+
+Scaden can be easily installed on a Linux system, and should also work on Mac.
 There are currently two options for installing Scaden, either using [Bioconda](https://bioconda.github.io/) or via [pip](https://pypi.org/).
 
 ### pip
+
 To install Scaden via pip, simply run the following command:
 
 `pip install scaden`
 
-
-### Bioconda
-Bioconda installation is currently not supported for the newest Scaden versions, but this will hopefully change soon.
-It is therefore highly recommended to install via pip.
-
-`conda install -c bioconda scaden`
-
 ### GPU
+
 If you want to make use of your GPU, you will have to additionally install `tensorflow-gpu`.
 
 For pip:
@@ -54,6 +46,7 @@ For conda:
 `conda install tensorflow-gpu`
 
 ### Docker
+
 If you don't want to install Scaden at all, but rather use a Docker container, we provide that as well.
 For every release, we provide two versions - one for CPU and one for GPU usage.
 To pull the CPU container, use this command:
@@ -65,16 +58,19 @@ For the GPU container:
 `docker pull ghcr.io/kevinmenden/scaden/scaden-gpu`
 
 ### Webtool (beta)
+
 Additionally, we now provide a web tool:
 
 [https://scaden.ims.bio](https://scaden.ims.bio)
 
 It contains pre-generated training datasets for several tissues, and all you need to do is to upload your expression data. Please note that this is still in preview.
 
 ## Usage
+
 We provide detailed instructions for how to use Scaden at our [Documentation page](https://scaden.readthedocs.io/en/latest/usage/)
 
 A deconvolution workflow with Scaden consists of four major steps:
+
 * data simulation
 * data processing
 * training
@@ -83,10 +79,12 @@ A deconvolution workflow with Scaden consists of four major steps:
 If training data is already available, you can start at the data processing step. Otherwise you will first have to process scRNA-seq datasets and perform data simulation to generate a training dataset. As an example workflow, you can use Scaden's function `scaden example` to generate example data and go through the whole pipeline.
 
 First, make an example data directory and generate the example data:
+
 ```bash
 mkdir example_data
 scaden example --out example_data/
 ```
+
 This generates the files "example_counts.txt", "example_celltypes.txt" and "example_bulk_data.txt" in the "example_data" directory. Next, you can generate training data:
 
 ```bash
@@ -113,10 +111,8 @@ scaden predict --model_dir model example_data/example_bulk_data.txt
 
 Now you should have a file called "scaden_predictions.txt" in your working directory, which contains your estimated cell compositions.
 
-
-
-
 ### 1. System requirements
+
 Scaden was developed and tested on Linux (Ubuntu 16.04 and 18.04). It was not tested on Windows or Mac, but should
 also be usable on these systems when installing with Pip or Bioconda. Scaden does not require any special
 hardware (e.g. GPU), however we recommend to have at least 16 GB of memory.

diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,5 +1,11 @@
 # Scaden Changelog
 
+## Version 1.1.1
+
+* Fixed bugs in scaden model definition [[#88](https://github.com/KevinMenden/scaden/issues/88)]
+* removed installation instructions for bioconda as not functional at the moment [[#86](https://github.com/KevinMenden/scaden/issues/86)]
+* Fixed bug in `scaden example` [[#85](https://github.com/KevinMenden/scaden/issues/85)]
+
 ## Version 1.1.0
 
 * Reduced memory usage of `scaden simulate` significantly by performing simulation for one dataset at a time.
@@ -10,7 +16,6 @@
 of simulated datasets
 * Added `scaden merge` command which allows merging of previously created datasets  
 
-
 ### Version 1.0.2
 
 * General improvement of logging using the 'rich' library for colorized output
@@ -26,13 +31,13 @@ of simulated datasets
 * Rebuild Scaden model and training to use TF2 Keras API instead of the old compatibility functions 
 * added `scaden example` command which allows to generate example data for test-running scaden and to inspect the expected file format
 * added more tests and checks input reading function in `scaden simulate`
-* fixed bug in reading input data 
+* fixed bug in reading input data
 
 ### Version 0.9.6
 
-+ fixed Dockerfile (switched to pip installation)
-+ added better error messages to `simulate` command
-+ cleaned up dependencies
+* fixed Dockerfile (switched to pip installation)
+* added better error messages to `simulate` command
+* cleaned up dependencies
 
 ### v0.9.5
 

diff --git a/docs/installation.md b/docs/installation.md
@@ -1,22 +1,16 @@
 # Installation
+
 Scaden can be easily installed on a Linux system, and should also work on Mac.
 There are currently two options for installing Scaden, either using [Bioconda](https://bioconda.github.io/) or via [pip](https://pypi.org/).
 
-
 ## pip
+
 To install Scaden via pip, simply run the following command:
 
 `pip install scaden`
 
-
-## Bioconda
-Bioconda installation is currently not supported for the newest Scaden versions, but this will hopefully change soon.
-It is therefore highly recommended to install via pip.
-
-`conda install -c bioconda scaden`
-
-
 ## Docker
+
 If you don't want to install Scaden at all, but rather use a Docker container, we provide that as well.
 For every release, we provide two versions - one for CPU and one for GPU usage.
 To pull the CPU container, use this command:
@@ -28,6 +22,7 @@ For the GPU container:
 `docker pull ghcr.io/kevinmenden/scaden/scaden-gpu`
 
 ## Webtool (beta)
+
 We now also provide a webtool for you:
 
 [https://scaden.ims.bio](https://scaden.ims.bio)

diff --git a/scaden/__main__.py b/scaden/__main__.py
@@ -5,13 +5,16 @@
 import rich.logging
 import logging
 import os
+
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
 import tensorflow as tf
 from scaden.train import training
 from scaden.predict import prediction
 from scaden.process import processing
 from scaden.simulate import simulation
 from scaden.example import exampleData
 from scaden.merge import merge_datasets
+
 """
 
 author: Kevin Menden
@@ -31,8 +34,6 @@
     )
 )
 
-os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
-
 
 def main():
     text = """
@@ -147,7 +148,7 @@ def predict(data_path, model_dir, outname, seed):
     "--var_cutoff",
     default=0.1,
     help="Filter out genes with a variance less than the specified cutoff. A low cutoff is recommended,"
-         "this should only remove genes that are obviously uninformative.",
+    "this should only remove genes that are obviously uninformative.",
 )
 def process(data_path, prediction_data, processed_path, var_cutoff):
     """ Process a dataset for training """
@@ -187,7 +188,7 @@ def process(data_path, prediction_data, processed_path, var_cutoff):
     multiple=True,
     default=["unknown"],
     help="Specify cell types to merge into the unknown category. Specify this flag for every cell type you want to "
-         "merge in unknown. [default: unknown]",
+    "merge in unknown. [default: unknown]",
 )
 @click.option(
     "--prefix",
@@ -211,7 +212,7 @@ def simulate(out, data, cells, n_samples, pattern, unknown, prefix, data_format)
         pattern=pattern,
         unknown_celltypes=unknown,
         out_prefix=prefix,
-        fmt=data_format
+        fmt=data_format,
     )
 
 
@@ -221,9 +222,18 @@ def simulate(out, data, cells, n_samples, pattern, unknown, prefix, data_format)
 
 
 @cli.command()
-@click.option("--data", "-d", default=".", help="Directory containing simulated datasets (in .h5ad format)")
-@click.option("--prefix", "-p", default="data", help="Prefix of output file [default: data]")
-@click.option("--files", "-f", default=None, help="Comma-separated list of filenames to merge")
+@click.option(
+    "--data",
+    "-d",
+    default=".",
+    help="Directory containing simulated datasets (in .h5ad format)",
+)
+@click.option(
+    "--prefix", "-p", default="data", help="Prefix of output file [default: data]"
+)
+@click.option(
+    "--files", "-f", default=None, help="Comma-separated list of filenames to merge"
+)
 def merge(data, prefix, files):
     """ Merge simulated datasets into one training dataset """
     merge_datasets(data_dir=data, prefix=prefix, files=files)
@@ -244,4 +254,6 @@ def merge(data, prefix, files):
 )
 def example(cells, genes, samples, out, types):
     """ Generate an example dataset """
-    exampleData(n_cells=cells, n_genes=genes, n_samples=samples, out_dir=out, n_types=types)
+    exampleData(
+        n_cells=cells, n_genes=genes, n_samples=samples, out_dir=out, n_types=types
+    )
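
Review note on the `TF_CPP_MIN_LOG_LEVEL` move in `scaden/__main__.py`: the assignment now sits above `import tensorflow as tf`. As far as I understand, TensorFlow's native logging reads this variable when the library is loaded, so setting it after the import may come too late to suppress the startup messages; the value `"2"` filters INFO and WARNING output. The sketch below only illustrates the ordering. `silence_tf_cpp_logs` is a hypothetical helper, not part of Scaden, and the TensorFlow import is left commented out so the snippet runs without it.

```python
import os


def silence_tf_cpp_logs(level: str = "2") -> str:
    """Set TF_CPP_MIN_LOG_LEVEL and return the value now in effect.

    Hypothetical helper; call it before the first `import tensorflow`.
    """
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return os.environ["TF_CPP_MIN_LOG_LEVEL"]


# 1. Configure the environment first.
silence_tf_cpp_logs()

# 2. Only then import TensorFlow (commented out so the sketch has no dependency):
# import tensorflow as tf
```
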
diff --git a/scaden/example.py b/scaden/example.py
@@ -24,17 +24,17 @@ def exampleData(n_cells=10, n_genes=100, n_samples=10, n_types=5, out_dir="./"):
         sys.exit(1)
 
     # Generate example scRNA-seq data
-    counts = np.random.randint(low=0, high=1000, size=(n_cells, n_genes))
+    counts = np.random.randint(low=1, high=10, size=(n_cells, n_genes))
     gene_names = ["gene"] * n_genes
     for i in range(len(gene_names)):
         gene_names[i] = gene_names[i] + str(i)
     df = pd.DataFrame(counts, columns=gene_names)
 
     # Generate example celltype labels
-    celltypes = ["celltype"] * np.random.randint(n_types)
+    celltypes = ["celltype"] * n_types
     for i in range(len(celltypes)):
         celltypes[i] = celltypes[i] + str(i)
-    celltype_list = random.choices(celltypes, k=n_cells)
+    celltype_list = np.random.choice(celltypes, size=n_cells)
     ct_df = pd.DataFrame(celltype_list, columns=["Celltype"])
 
     # Generate example bulk RNA-seq data
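
Review note on the `scaden example` fix: in the removed line, `np.random.randint(n_types)` draws from the half-open range [0, n_types), so the list of label names could be shorter than requested or even empty, and sampling from an empty list fails. A minimal sketch of the corrected logic follows. It uses the standard-library `random` module instead of the `np.random.choice` call in the commit, and `make_celltype_labels` is a hypothetical name, not a Scaden function.

```python
import random


def make_celltype_labels(n_cells=10, n_types=5, seed=0):
    """Return (label names, one sampled label per cell).

    Hypothetical helper mirroring the fixed logic: always build exactly
    n_types names, then sample from them with replacement.
    """
    rng = random.Random(seed)
    celltypes = [f"celltype{i}" for i in range(n_types)]
    labels = rng.choices(celltypes, k=n_cells)
    return celltypes, labels


celltypes, labels = make_celltype_labels()

# Pre-fix failure mode: a random count of 0 leaves nothing to sample from.
try:
    random.choices([], k=3)
    empty_population_fails = False
except IndexError:
    empty_population_fails = True
```

Building the name list from `n_types` directly also makes the number of label names deterministic; only the per-cell assignment stays random, so individual types may still go unsampled when `n_cells` is small.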

diff --git a/scaden/model/scaden.py b/scaden/model/scaden.py
@@ -14,8 +14,7 @@
 from rich.progress import Progress, BarColumn
 
 logger = logging.getLogger(__name__)
-tf.get_logger().setLevel('ERROR')
-os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
+
 
 class Scaden(object):
     """
@@ -304,7 +303,9 @@ def train(self, input_path, train_datasets):
             BarColumn(bar_width=None),
         )
 
-        training_progress = progress_bar.add_task(self.model_name, total=self.num_steps, step=0, loss=1)
+        training_progress = progress_bar.add_task(
+            self.model_name, total=self.num_steps, step=0, loss=1
+        )
         with progress_bar:
 
             for step in range(self.num_steps):
@@ -319,13 +320,14 @@ def train(self, input_path, train_datasets):
 
                 optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
 
-                progress_bar.update(training_progress, advance=1, step=step, loss=f"{loss:.4f}")
+                progress_bar.update(
+                    training_progress, advance=1, step=step, loss=f"{loss:.4f}"
+                )
 
                 # Collect garbage after 100 steps - otherwise runs out of memory
                 if step % 100 == 0:
                     gc.collect()
 
-
         # Save the trained model
         self.model.save(self.model_dir)
         pd.DataFrame(self.labels).to_csv(

diff --git a/scaden/predict.py b/scaden/predict.py
@@ -51,9 +51,7 @@ def prediction(model_dir, data_path, out_name, seed=0):
         do_rates=M256_DO_RATES,
     )
     # Predict ratios
-    preds_256 = cdn256.predict(
-        input_path=data_path
-    )
+    preds_256 = cdn256.predict(input_path=data_path)
 
     # Mid model predictions
     cdn512 = Scaden(
@@ -64,22 +62,18 @@ def prediction(model_dir, data_path, out_name, seed=0):
         do_rates=M512_DO_RATES,
     )
     # Predict ratios
-    preds_512 = cdn512.predict(
-        input_path=data_path
-    )
+    preds_512 = cdn512.predict(input_path=data_path)
 
     # Large model predictions
     cdn1024 = Scaden(
         model_dir=model_dir + "/m1024",
         model_name="m1024",
         seed=seed,
         hidden_units=M1024_HIDDEN_UNITS,
-        do_rates=M256_DO_RATES,
+        do_rates=M1024_DO_RATES,
     )
     # Predict ratios
-    preds_1024 = cdn1024.predict(
-        input_path=data_path
-    )
+    preds_1024 = cdn1024.predict(input_path=data_path)
 
     # Average predictions
     preds = (preds_256 + preds_512 + preds_1024) / 3
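
Review note on the `do_rates` fix in `scaden/predict.py`: the large model was being built with `M256_DO_RATES` instead of `M1024_DO_RATES`, a pairing mistake that is easy to make when each model block is written out by hand. One possible way to make the pairing harder to get wrong is to look up both settings under a single key, as sketched below. This is an illustration, not the project's code: `MODEL_CONFIGS` and `average_predictions` are hypothetical names, and the numbers are placeholders rather than Scaden's real constants.

```python
# Placeholder values for illustration only; the real constants live in Scaden.
MODEL_CONFIGS = {
    "m256": {"hidden_units": [8, 4], "do_rates": [0.0, 0.0]},
    "m512": {"hidden_units": [16, 8], "do_rates": [0.0, 0.1]},
    "m1024": {"hidden_units": [32, 16], "do_rates": [0.0, 0.2]},
}


def average_predictions(*tables):
    """Element-wise mean of equally shaped tables.

    Each table is a list of rows (samples); each row holds one value
    per cell type.
    """
    return [
        [sum(values) / len(values) for values in zip(*rows)]
        for rows in zip(*tables)
    ]


# Stand-ins for the three per-model prediction tables (one sample, two cell types).
preds_by_model = {
    "m256": [[0.2, 0.8]],
    "m512": [[0.4, 0.6]],
    "m1024": [[0.6, 0.4]],
}

# Both settings for a model come from the same entry, so they cannot be
# cross-paired by a copy-paste slip.
settings = {name: (cfg["hidden_units"], cfg["do_rates"]) for name, cfg in MODEL_CONFIGS.items()}

ensemble = average_predictions(*preds_by_model.values())
```

The averaging helper does the same arithmetic as `(preds_256 + preds_512 + preds_1024) / 3` in the diff, written for plain lists so the sketch has no dependencies.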