From 027185d55d18b4d8b704fbb3d504f9d8e43f4359 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Fri, 17 Feb 2023 10:34:46 +0100 Subject: [PATCH 01/15] Add interoperability notebook Replace Markdown file --- .../introduction/interoperability.ipynb | 19 +++++++++++++++++++ jupyter-book/introduction/interoperability.md | 1 - 2 files changed, 19 insertions(+), 1 deletion(-) create mode 100644 jupyter-book/introduction/interoperability.ipynb delete mode 100644 jupyter-book/introduction/interoperability.md diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb new file mode 100644 index 00000000..a6100842 --- /dev/null +++ b/jupyter-book/introduction/interoperability.ipynb @@ -0,0 +1,19 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/jupyter-book/introduction/interoperability.md b/jupyter-book/introduction/interoperability.md deleted file mode 100644 index 2f83d383..00000000 --- a/jupyter-book/introduction/interoperability.md +++ /dev/null @@ -1 +0,0 @@ -# Interoperability From becae676b21ea82c7a129713d41cfc7365e9684d Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Fri, 17 Feb 2023 11:13:36 +0100 Subject: [PATCH 02/15] Add interoperability text --- .../introduction/interoperability.ipynb | 587 +++++++++++++++++- 1 file changed, 584 insertions(+), 3 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index a6100842..d32c3524 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -1,18 +1,599 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Interoperability" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "## Motivation" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we have discussed in the [analysis frameworks and tools chapter] there are three main ecosystems for single-cell analysis: the [Bioconductor] and [Seurat] ecosystems in R and the Python-based [scverse] ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using. While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when those packages have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems also allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures, developers can be confident that analysts can access their package and can focus on using the most appropriate platform for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access them. 
A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", + "\n", + "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully, a lot of work has been done in this area and, in most cases, moving between ecosystems is now relatively simple using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in memory, the differences between these approaches and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases, but we also touch on multimodal data and other languages.\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Disk-based interoperability" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and more scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [NextFlow] or [snakemake]). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages."
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Simple formats" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before discussing file formats specifically developed for single-cell data, we want to briefly mention that common simple text file formats (such as CSV, TSV, JSON etc.) can often be the answer to transferring data between languages. They work well in cases where some analysis has been performed and what you want to transfer is a subset of the information about an experiment. For example, you may want to transfer only the cell metadata but not the feature metadata, expression matrices etc. The advantage of using simple text formats is that they are well supported by almost any language and do not require single-cell-specific packages. However, they can quickly become impractical as what you want to transfer becomes more complex." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### HDF5-based formats" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The most common disk formats for single-cell data are based on [Hierarchical Data Format version 5] or HDF5. This is an open-source file format designed for storing large, complex and heterogeneous data. It has a directory-like structure (similar to how files and folders are organised on your computer) which allows many different kinds of data to be stored in a single file with an arbitrarily complex hierarchy. While this format is very flexible, to properly interact with it you need to know where and how the different information is stored. For this reason, standard specifications for storing single-cell data in HDF5 files have been developed." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "!!! HDF5 OVERVIEW IMAGE !!!"
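The directory-like HDF5 layout described above can be explored directly with the **h5py** Python library. The sketch below is illustrative only: the group and dataset names are our own invention, not part of any single-cell specification.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Writing: groups act like directories, datasets like files
with h5py.File(path, "w") as f:
    f.create_dataset("X", data=np.random.poisson(1, size=(100, 50)))
    obs = f.create_group("obs")
    obs.create_dataset("total_counts", data=np.random.rand(100))

# Reading: navigate the hierarchy by path, like a file system
with h5py.File(path, "r") as f:
    x_shape = f["X"].shape
    obs_keys = list(f["obs"].keys())
```

Specifications such as H5AD build on exactly this mechanism by fixing where each part of an object is stored within the hierarchy.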
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### H5AD" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the **anndata** package." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# READING/WRITING H5AD WITH ANNDATA" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Several packages exist for reading and writing H5AD files from R. While they result in a file on disk, these packages usually rely on wrapping the Python **anndata** package to handle the actual reading and writing of files, with an in-memory conversion step to convert between R and Python." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Reading/writing H5AD with Bioconductor" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Bioconductor **{zellkonverter}** package] helps make this easier by using the [**{basilisk}** package] to create and manage an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like those below without requiring any knowledge of Python."
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# READING/WRITING H5AD WITH ZELLKONVERTER" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**{zellkonverter}** has additional options, such as allowing you to selectively read or write parts of an object; please refer to the documentation for more details. Similar functionality for writing a `SingleCellExperiment` object to an H5AD file can be found in the [**{sceasy}** package]. While these packages are effective, wrapping Python requires some overhead, which we hope will be addressed by native R H5AD writers/readers in the future." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Reading/writing H5AD with Seurat" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Converting between a `Seurat` object and an H5AD file is a two-step process [as suggested by this tutorial]. First, the object is written to disk as a `.h5Seurat` file (a custom HDF5 format for `Seurat` objects) using the [**{SeuratDisk}** package] and then this file is converted to an H5AD file." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# READING/WRITING H5AD WITH SEURAT" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects, the conversion process is more complex. See the [documentation of the conversion function] for more details on how this is done." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Loom" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Loom file format] is an HDF5 specification for omics data. 
Unlike H5AD, it is not linked to a specific analysis package, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R] and [Python]. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from shared datasets, another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocyto] for [RNA velocity analysis]." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### RDS files" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another file format you may see used to share single-cell datasets is the RDS format. This is a binary format used to serialise arbitrary R objects (similar to Python pickle files). As `SingleCellExperiment` and `Seurat` objects did not always have matching on-disk representations, RDS files are sometimes used to share the results from R analyses. While this is fine within an analysis project, we discourage its use for sharing data publicly or with collaborators due to the lack of interoperability with other ecosystems. Instead, we recommend using one of the HDF5 formats mentioned above that can be read from multiple languages." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### New on-disk formats" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While HDF5-based formats are currently the standard for on-disk representations of single-cell data, other newer technologies such as [Zarr] and [TileDB] have some advantages, particularly for very large datasets and other modalities. We expect specifications for these formats to be developed in the future and adopted by the community (**anndata** already provides support for Zarr files)."
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## In-memory interoperability" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The second approach to interoperability is to work on in-memory representations of an object. This approach involves active sessions in two programming languages running at the same time, either accessing the same object from both or converting between them as needed. Usually, one language acts as the main environment and there is an interface to the other language. This can be very useful for interactive analysis as it allows an analyst to work in two languages simultaneously. It is also often used when creating documents that use multiple languages (such as this book). However, in-memory interoperability has some drawbacks: it requires the analyst to be familiar with setting up and using both environments; more complex objects are often not supported by both languages; and there is a greater memory overhead as data can easily become duplicated (making it difficult to use for larger datasets)." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Interoperability between R ecosystems" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we look at in-memory interoperability between R and Python, let’s first consider the simpler case of converting between the two R ecosystems. The **{Seurat}** package provides functions for performing this conversion [as described in this vignette]." + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "# CONVERTING TO/FROM SINGLECELLEXPERIMENT/SEURAT" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The difficult part here is handling the differences between the structures of the two objects. 
It is important to make sure the arguments are set correctly so that the conversion functions know which information to convert and where to place it.\n", + "\n", + "In many cases, it may not be necessary to convert a `Seurat` object in order to use Bioconductor packages. This is because many of the most commonly used Bioconductor functions for single-cell analysis have been written to accept raw matrices as well as more complex objects. This means you can often provide the necessary part of a `Seurat` object directly to a Bioconductor function.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# USING A BIOCONDUCTOR FUNCTION ON A SEURAT OBJECT" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, it is important to be sure you are accessing the right information and storing any results in the correct place if needed." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accessing R from Python" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python interface to R is provided by the [**rpy2** package]. This allows you to access R functions and objects from Python. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# SIMPLE RPY2 USAGE" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are using a Jupyter notebook (as we are for this book) you can use the IPython magic interface to create cells with native R code (passing objects as required)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# SIMPLE MAGIC CELL" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is the approach you will most commonly see in later chapters. For more information about using **rpy2**, please refer to [the documentation].\n", + "\n", + "To work with single-cell data in this way, the [**anndata2ri**] package is especially useful. This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# USING AN R FUNCTION ON AN ANNDATA" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that you will still run into issues if an object (or part of it) cannot be interfaced correctly (for example if there is an unsupported data type). In that case, you may need to modify your object before it can be accessed." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accessing Python from R" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Accessing Python from an R session is similar to accessing R from Python, but here the interface is provided by the [**{reticulate}** package]. Once it is loaded, we can access Python functions and objects from R." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# SIMPLE RETICULATE USAGE" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are working in an [RMarkdown] or [Quarto] document you can also write native Python chunks using the **{reticulate}** Python engine. 
When we do this, we can use the magic `r` and `py` variables to access objects in the other language." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# SIMPLE PYTHON CHUNK" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unlike **anndata2ri**, there are no R packages that provide a direct interface for Python to view `SingleCellExperiment` or `Seurat` objects as `AnnData` objects. However, we can still access most parts of an `AnnData` using **{reticulate}**." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# ACCESSING ANNDATA" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As mentioned above, the R **{anndata}** package provides an R interface for `AnnData` objects but it is not currently used by many analysis packages.\n", + "\n", + "For more complex analyses that require a whole object to work on, it may be necessary to completely convert an object from R to Python. This is not memory efficient, as it creates a duplicate of the data, but it does provide access to a greater range of packages. The **{zellkonverter}** package provides a function for doing this conversion (note that, unlike the function for reading H5AD files, this uses the normal Python environment rather than a specially created one). \n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "# SCE2ANNDATA" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The created object can then be used by Python functions and the results converted back to R."
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# ANNDATA2SCE" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The **{sceasy}** package provides similar functionality and can also convert between `Seurat` and `AnnData` objects." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# SCEASY ANNDATA <-> SEURAT" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interoperability for multimodal data" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The developers of the `MuData` object, which we introduced in the [analysis frameworks and tools chapter] as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package] for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package] for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation]."
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interoperability with other languages" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we briefly list some resources and tools for the interoperability of single-cell data with languages other than R and Python.\n", + "\n", + "### Julia\n", + "\n", + "- [Muon.jl] provides Julia implementations of AnnData and MuData objects, as well as IO for the H5AD and H5MU formats\n", + "- [scVI.jl] provides a Julia implementation of AnnData as well as IO for the H5AD format\n", + "\n", + "### JavaScript\n", + "\n", + "- [Vitessce] contains loaders from `AnnData` objects stored using the Zarr format\n", + "- The [kana family] supports reading H5AD files and `SingleCellExperiment` objects saved as RDS files\n", + "\n", + "### Rust\n", + "\n", + "- [anndata-rs] provides a Rust implementation of AnnData as well as advanced IO support for the H5AD format\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Session information" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{bibliography}\n", + ":filter: docname in docnames\n", + ":labelprefix: int\n", + "```" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Contributors\n", + "\n", + "We gratefully acknowledge the contributions of:\n", + "\n", + "### Authors\n", + "\n", + "* Luke Zappia\n", + "\n", + "### Reviewers\n", + "\n", + "* Lukas Heumos\n", + "* Isaac Virshup" + ] } ], "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": 
"text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" }, - "orig_nbformat": 4 + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "9a4b8eb2e3b598ccf69997907eca0492d4b3d651094270ed44ac0cb064672ed3" + } + } }, "nbformat": 4, "nbformat_minor": 2 From b97934d1e5ee7b3e1facaed2bc9115d4ac5f1eea Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Fri, 17 Feb 2023 12:19:20 +0100 Subject: [PATCH 03/15] Add interoperability environment --- .../interoperability_environment.yml | 26 +++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 jupyter-book/introduction/interoperability_environment.yml diff --git a/jupyter-book/introduction/interoperability_environment.yml b/jupyter-book/introduction/interoperability_environment.yml new file mode 100644 index 00000000..90a287f1 --- /dev/null +++ b/jupyter-book/introduction/interoperability_environment.yml @@ -0,0 +1,26 @@ +name: interoperability +channels: + - conda-forge + - bioconda + - defaults +dependencies: + - conda-forge::python=3.9.15 + - conda-forge::jupyterlab==3.6.1 + - conda-forge::scanpy=1.9.1 + - anndata==0.8.0 + - anndata2ri==1.1 + - bioconductor-basilisk==1.9.12 + - bioconductor-singlecellexperiment==1.20.0 + - bioconductor-zellkonverter==1.8.0 + - igraph==0.10.3 + - matplotlib<3.7 + - numpy==1.23.5 + - pip==23.0 + - python-igraph=0.10.4 + - conda-forge::r-base=4.2.2 + - r-reticulate=1.26 + - r-sessioninfo=1.2.2 + - r-seurat=4.3.0 + - rpy2=3.5.8 + - scipy=1.10.0 + - session-info=1.0.0 From b80d6993b875ecb28297ab51fdf8aa83e47e870f Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Mon, 3 Apr 2023 17:10:46 +0200 Subject: [PATCH 04/15] Add most example code Except the bits that require rpy2 because I can't get it to work... 
--- .../introduction/interoperability.ipynb | 972 +++++++++++++++--- .../interoperability_environment.yml | 34 +- 2 files changed, 869 insertions(+), 137 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index d32c3524..05f8221b 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -8,7 +8,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -16,7 +15,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -26,7 +24,23 @@ ] }, { - "attachments": {}, + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import anndata\n", + "import numpy\n", + "import scanpy\n", + "import tempfile\n", + "import os\n", + "import rpy2.robjects\n", + "from scipy.sparse import csr_matrix\n", + "\n", + "%load_ext rpy2.ipython" + ] + }, + { "cell_type": "markdown", "metadata": {}, "source": [ @@ -34,7 +48,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -42,7 +55,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -50,7 +62,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -58,7 +69,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -66,7 +76,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -74,7 +83,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -82,7 +90,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -90,24 +97,81 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. 
As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the **anndata** package." + "The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the **anndata** package. To demonstrate interoperability we will use a small randomly generated dataset that has gone through some of the steps of a standard analysis workflow to populate the various slots." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + }, + { + "data": { + "text/plain": [ + "AnnData object with n_obs × n_vars = 100 × 2000\n", + " obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'\n", + " var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n", + " uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap'\n", + " obsm: 'X_pca', 'X_umap'\n", + " varm: 'PCs'\n", + " layers: 'counts'\n", + " obsp: 'distances', 'connectivities'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create a randomly generated AnnData object to use as an example\n", + "counts = csr_matrix(numpy.random.poisson(1, size=(100, 2000)), dtype=numpy.float32)\n", + "adata = anndata.AnnData(counts)\n", + "adata.obs_names = [f\"Cell_{i:d}\" for i in range(adata.n_obs)]\n", + "adata.var_names = [f\"Gene_{i:d}\" for i in range(adata.n_vars)]\n", + "# Do some standard processing to populate the object\n", + "scanpy.pp.calculate_qc_metrics(adata, inplace=True)\n", + "adata.layers[\"counts\"] = adata.X.copy()\n", + "scanpy.pp.normalize_total(adata, inplace=True)\n", + "scanpy.pp.log1p(adata)\n", + "scanpy.pp.highly_variable_genes(adata, inplace=True)\n", + "scanpy.tl.pca(adata)\n", + "scanpy.pp.neighbors(adata)\n", + "scanpy.tl.umap(adata)\n", + "adata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will write this mock object to disk as an H5AD file to demonstrate how those files can be read from R."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ - "# READING/WRITING H5AD WITH ANNDATA" + "temp_dir = tempfile.TemporaryDirectory()\n", + "h5ad_file = os.path.join(temp_dir.name, \"example.h5ad\")\n", + "\n", + "adata.write_h5ad(h5ad_file)" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -115,7 +179,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -123,7 +186,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -131,24 +193,126 @@ ] }, { - "cell_type": "code", - "execution_count": 2, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# READING/WRITING H5AD WITH ZELLKONVERTER" + "Unfortunately, because of the way this book is made we are unable to run the code directly here. Instead we will show the code and what the output looks like when run in an R session:\n", + "\n", + "```r\n", + "sce <- zellkonverter::readH5AD(h5ad_file, verbose = TRUE)\n", + "```\n", + "\n", + "```\n", + "ℹ Using the Python reader\n", + "ℹ Using anndata version 0.8.0\n", + "✔ Read /.../luke.zappia/Downloads/example.h5ad [113ms]\n", + "✔ uns$hvg$flavor converted [17ms]\n", + "✔ uns$hvg converted [50ms]\n", + "✔ uns$log1p converted [25ms]\n", + "✔ uns$neighbors converted [18ms]\n", + "✔ uns$pca$params$use_highly_variable converted [16ms]\n", + "✔ uns$pca$params$zero_center converted [16ms]\n", + "✔ uns$pca$params converted [80ms]\n", + "✔ uns$pca$variance converted [17ms]\n", + "✔ uns$pca$variance_ratio converted [16ms]\n", + "✔ uns$pca converted [184ms]\n", + "✔ uns$umap$params$a converted [16ms]\n", + "✔ uns$umap$params$b converted [16ms]\n", + "✔ uns$umap$params converted [80ms]\n", + "✔ uns$umap converted [112ms]\n", + "✔ uns converted [490ms]\n", + "✔ Converting uns to metadata ... 
done\n", + "✔ X matrix converted to assay [29ms]\n", + "✔ layers$counts converted [27ms]\n", + "✔ Converting layers to assays ... done\n", + "✔ var converted to rowData [25ms]\n", + "✔ obs converted to colData [24ms]\n", + "✔ varm$PCs converted [18ms]\n", + "✔ varm converted [47ms]\n", + "✔ Converting varm to rowData$varm ... done\n", + "✔ obsm$X_pca converted [15ms]\n", + "✔ obsm$X_umap converted [16ms]\n", + "✔ obsm converted [80ms]\n", + "✔ Converting obsm to reducedDims ... done\n", + "ℹ varp is empty and was skipped\n", + "✔ obsp$connectivities converted [22ms]\n", + "✔ obsp$distances converted [23ms]\n", + "✔ obsp converted [92ms]\n", + "✔ Converting obsp to colPairs ... done\n", + "✔ SingleCellExperiment constructed [164ms]\n", + "ℹ Skipping conversion of raw\n", + "✔ Converting AnnData to SingleCellExperiment ... done\n", + "```\n", + "\n", + "Because we have turned on the verbose output you can see how **{zellkonverter}** reads the file using Python and converts each part of the `AnnData` object to a Bioconductor `SingleCellExperiment` object. We can see what the result looks like:\n", + "\n", + "```r\n", + "sce\n", + "```\n", + "\n", + "```\n", + "class: SingleCellExperiment\n", + "dim: 2000 100\n", + "metadata(5): hvg log1p neighbors pca umap\n", + "assays(2): X counts\n", + "rownames(2000): Gene_0 Gene_1 ... Gene_1998 Gene_1999\n", + "rowData names(11): n_cells_by_counts mean_counts ... dispersions_norm\n", + " varm\n", + "colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99\n", + "colData names(8): n_genes_by_counts log1p_n_genes_by_counts ...\n", + " pct_counts_in_top_200_genes pct_counts_in_top_500_genes\n", + "reducedDimNames(2): X_pca X_umap\n", + "mainExpName: NULL\n", + "altExpNames(0):\n", + "```\n", + "\n", + "This object can then be used as normal by any Bioconductor package. 
If we want to write a new H5AD file we can use the `writeH5AD()` function:\n", + "\n", + "```r\n", + "zellkonverter_h5ad_file <- tempfile(fileext = \".h5ad\")\n", + "zellkonverter::writeH5AD(sce, zellkonverter_h5ad_file, verbose = TRUE)\n", + "```\n", + "\n", + "```\n", + "ℹ Using anndata version 0.8.0\n", + "ℹ Using the 'X' assay as the X matrix\n", + "✔ Selected X matrix [29ms]\n", + "✔ assays$X converted to X matrix [50ms]\n", + "✔ additional assays converted to layers [30ms]\n", + "✔ rowData$varm converted to varm [28ms]\n", + "✔ reducedDims converted to obsm [68ms]\n", + "✔ metadata converted to uns [24ms]\n", + "ℹ rowPairs is empty and was skipped\n", + "✔ Converting AnnData to SingleCellExperiment ... done\n", + "✔ Wrote '/.../.../rj/.../T/.../file102cfa97cc51.h5ad ' [133ms]\n", + "```\n", + "\n", + "We can then read this file in Python:\n", + "\n", + "```python\n", + "scanpy.read_h5ad(zellkonverter_h5ad_file)\n", + "```\n", + "\n", + "```\n", + "AnnData object with n_obs × n_vars = 100 × 2000\n", + " obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'\n", + " var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n", + " uns: 'X_name', 'hvg', 'log1p', 'neighbors', 'pca', 'umap'\n", + " obsm: 'X_pca', 'X_umap'\n", + " varm: 'PCs'\n", + " layers: 'counts'\n", + " obsp: 'connectivities', 'distances'\n", + "```" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "**{zellkonverter}** has additional options such as allowing you to selectively read or write parts for an object, please refer to the documentation for more details. 
Similar functionality for writing a `SingleCellExperimentObject` to an H5AD file can be found in the [**{sceasy}** package]. While these packages are effective, wrapping Python requires some overhead which we hope will be addressed by native R H5AD writers/readers in the future." + "If this is the first time that you run a **{zellkonverter}** function you will see that it first creates a special conda environment to use (which can take a while). Once that environment exists it will be re-used by subsequent function calls. **{zellkonverter}** has additional options such as allowing you to selectively read or write parts of an object; please refer to the documentation for more details. Similar functionality for writing a `SingleCellExperiment` object to an H5AD file can be found in the [**{sceasy}** package]. While these packages are effective, wrapping Python requires some overhead which we hope will be addressed by native R H5AD writers/readers in the future." ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -160,16 +324,133 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Converting between a `Seurat` object and an H5AD file is a two-step process [as suggested by this tutorial]. Firstly the object is written to disk as a `.h5Seurat` file (a custom HDF5 format for `Seurat` objects) using the [**{SeuratObject}** package] and then this file is converted to an H5AD file." + "Converting between a `Seurat` object and an H5AD file is a two-step process [as suggested by this tutorial]. Firstly the `.h5ad` file is converted to a `.h5Seurat` file (a custom HDF5 format for `Seurat` objects) using the [**{SeuratObject}** package] and then this file is read as a `Seurat` object."
] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": {}, - "outputs": [], - "source": [ - "# READING/WRITING H5AD WITH SEURAT" + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: Converting H5AD to H5Seurat...\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + " WARNING: The R package \"reticulate\" only fixed recently\n", + " an issue that caused a segfault when used with rpy2:\n", + " https://github.com/rstudio/reticulate/pull/1188\n", + " Make sure that you use a version of that package that includes\n", + " the fix.\n", + " " + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: Registered S3 method overwritten by 'SeuratDisk':\n", + " method from \n", + " as.sparse.H5Group Seurat\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: Unknown file type: h5ad\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: 'assay' not set, setting to 'RNA'\n", + "\n", + "R[write to console]: Creating h5Seurat file for version 3.1.5.9900\n", + "\n", + "R[write to console]: Adding X as data\n", + "\n", + "R[write to console]: Adding X as counts\n", + "\n", + "R[write to console]: Adding meta.features from var\n", + "\n", + "R[write to console]: Adding X_pca as cell embeddings for pca\n", + "\n", + "R[write to console]: Adding X_umap as cell embeddings for umap\n", + "\n", + "R[write to console]: Adding PCs as feature loadings fpr pca\n", + "\n", + "R[write to console]: Adding miscellaneous information for pca\n", + "\n", + "R[write to console]: Adding standard deviations for pca\n", + "\n", + "R[write to console]: Adding miscellaneous information for umap\n", + "\n", + "R[write to console]: Adding hvg to miscellaneous data\n", + "\n", + "R[write to console]: Adding log1p to miscellaneous data\n", + "\n", + "R[write to console]: Adding layer counts as data in assay 
counts\n", + "\n", + "R[write to console]: Adding layer counts as counts in assay counts\n", + "\n", + "R[write to console]: Reading H5Seurat...\n", + "\n", + "R[write to console]: Validating h5Seurat file\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: Feature names cannot have underscores ('_'), replacing with dashes ('-')\n", + "\n", + "R[write to console]: Initializing RNA with data\n", + "\n", + "R[write to console]: Adding counts for RNA\n", + "\n", + "R[write to console]: Adding feature-level metadata for RNA\n", + "\n", + "R[write to console]: Adding reduction pca\n", + "\n", + "R[write to console]: Adding cell embeddings for pca\n", + "\n", + "R[write to console]: Adding feature loadings for pca\n", + "\n", + "R[write to console]: Adding miscellaneous information for pca\n", + "\n", + "R[write to console]: Adding reduction umap\n", + "\n", + "R[write to console]: Adding cell embeddings for umap\n", + "\n", + "R[write to console]: Adding miscellaneous information for umap\n", + "\n", + "R[write to console]: Adding command information\n", + "\n", + "R[write to console]: Adding cell-level metadata\n", + "\n", + "R[write to console]: Read Seurat object:\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "An object of class Seurat \n", + "2000 features across 100 samples within 1 assay \n", + "Active assay: RNA (2000 features, 0 variable features)\n", + " 2 dimensional reductions calculated: pca, umap\n" + ] + } + ], + "source": [ + "%%R -i h5ad_file\n", + "\n", + "message(\"Converting H5AD to H5Seurat...\")\n", + "SeuratDisk::Convert(h5ad_file, dest = \"h5seurat\", overwrite = TRUE)\n", + "message(\"Reading H5Seurat...\")\n", + "h5seurat_file <- gsub(\".h5ad\", \".h5seurat\", h5ad_file)\n", + "seurat <- SeuratDisk::LoadH5Seurat(h5seurat_file, assays = \"RNA\")\n", + "message(\"Read Seurat object:\")\n", + "seurat" ] }, { @@ -177,11 +458,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note 
that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects the conversion process is more complex. See the [documentation of the conversion function] for more details on how this is done." + "Note that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects the conversion process is more complex. See the [documentation of the conversion function] for more details on how this is done.\n", + "\n", + "The **{sceasy}** package also provides a function for reading `.h5ad` files as `Seurat` or `SingleCellExperiment` objects in a single step. **{sceasy}** also does this by wrapping Python but unlike **{zellkonverter}** it doesn't use a special Python environment. This means you need to be responsible for setting up the environment, making sure that R can find it and that the correct packages are installed (again, this code is not run here)." ] }, { "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```r\n", + "sceasy_seurat <- sceasy::convertFormat(h5ad_file, from=\"anndata\", to=\"seurat\")\n", + "sceasy_seurat\n", + "```\n", + "```\n", + "Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')\n", + "X -> counts\n", + "An object of class Seurat\n", + "2000 features across 100 samples within 1 assay\n", + "Active assay: RNA (2000 features, 0 variable features)\n", + " 2 dimensional reductions calculated: pca, umap\n", + "```" + ] + }, + { "cell_type": "markdown", "metadata": {}, "source": [ @@ -193,11 +494,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [Loom file format] is an HDF5 specification for omics data. Unlike H5AD it is not linked to a specific analysis package, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R] and [Python]. 
However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocycto] for [RNA velocity analysis]." + "The [Loom file format] is an older HDF5 specification for omics data. Unlike H5AD it is not linked to a specific analysis package, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R] and [Python]. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocycto] for [RNA velocity analysis]." ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -205,7 +505,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -213,7 +512,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -221,7 +519,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -229,7 +526,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -237,7 +533,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -245,7 +540,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -253,7 +547,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -262,11 +555,54 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "class: SingleCellExperiment \n", + "dim: 2000 100 \n", + "metadata(0):\n", + "assays(2): counts logcounts\n", + "rownames(2000): Gene-0 Gene-1 
... Gene-1998 Gene-1999\n", + "rowData names(0):\n", + "colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99\n", + "colData names(9): n_genes_by_counts log1p_n_genes_by_counts ...\n", + " pct_counts_in_top_500_genes ident\n", + "reducedDimNames(2): PCA UMAP\n", + "mainExpName: RNA\n", + "altExpNames(0):\n" + ] + } + ], + "source": [ + "%%R\n", + "sce_from_seurat <- Seurat::as.SingleCellExperiment(seurat)\n", + "sce_from_seurat" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "An object of class Seurat \n", + "2000 features across 100 samples within 1 assay \n", + "Active assay: RNA (2000 features, 0 variable features)\n", + " 2 dimensional reductions calculated: PCA, UMAP\n" + ] + } + ], "source": [ - "# CONVERTING TO/FROM SINGLECELLEXPERIMENT/SEURAT" + "%%R\n", + "seurat_from_sce <- Seurat::as.Seurat(sce_from_seurat)\n", + "seurat_from_sce" ] }, { @@ -276,20 +612,67 @@ "source": [ "The difficult part here is due to the differences between the structures of the two objects. It is important to make sure the arguments are set correctly so that the conversion functions know which information to convert and where to place it.\n", "\n", - "In many cases, it may not be necessary to convert a `Seurat` object in order to use Bioconductor packages. This is because many of the most commonly used Bioconductor functions for single-cell analysis have been written to accept raw matrices as well as more complex objects. This means you can often provide the necessary part of a `Seurat` object directly to a Bioconductor function.\n" + "In many cases it may not be necessary to convert a `Seurat` object to a `SingleCellExperiment`. This is because many of the core Bioconductor packages for single-cell analysis have been designed to also accept a matrix as input." 
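For intuition about what the counts-per-million calculation below computes, here is a small numpy sketch. This is not the **{scuttle}** implementation (the function name here is ours); it simply scales each column of a genes × cells count matrix to sum to one million.

```python
import numpy as np

def cpm(counts: np.ndarray) -> np.ndarray:
    """Scale each column (cell) of a genes x cells count matrix to sum to 1e6."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

mat = np.array([[10.0, 0.0],
                [30.0, 50.0],
                [60.0, 50.0]])  # 3 genes x 2 cells

print(cpm(mat)[:, 0])  # first cell scales to [100000. 300000. 600000.]
```

Every column of the result sums to one million, which is what makes CPM values comparable between cells with different sequencing depths.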
] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, - "outputs": [], - "source": [ - "# USING A BIOCONDUCTOR FUNCTION ON A SEURAT OBJECT" + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10 x 10 sparse Matrix of class \"dgCMatrix\"\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: [[ suppressing 10 column names ‘Cell_0’, ‘Cell_1’, ‘Cell_2’ ... ]]\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \n", + "Gene-0 930.9480 601.4938 . 960.8865 . . . \n", + "Gene-1 584.8096 . 979.4957 . . 602.7839 1181.5488\n", + "Gene-2 584.8096 601.4938 . 607.1453 604.1277 1206.2747 934.9052\n", + "Gene-3 584.8096 . 979.4957 607.1453 . 955.7456 934.9052\n", + "Gene-4 584.8096 953.0988 619.8831 607.1453 957.9856 . 588.2744\n", + "Gene-5 584.8096 953.0988 . . 604.1277 . 588.2744\n", + "Gene-6 930.9480 953.0988 . . . 602.7839 . \n", + "Gene-7 . . 619.8831 607.1453 957.9856 602.7839 588.2744\n", + "Gene-8 584.8096 953.0988 1233.8239 . 1404.0783 1400.6417 588.2744\n", + "Gene-9 . 953.0988 979.4957 960.8865 . 602.7839 934.9052\n", + " \n", + "Gene-0 596.6198 . 610.5472\n", + "Gene-1 . 595.8238 610.5472\n", + "Gene-2 . 595.8238 . \n", + "Gene-3 596.6198 . . \n", + "Gene-4 947.1021 . 610.5472\n", + "Gene-5 947.1021 . 1717.0308\n", + "Gene-6 596.6198 595.8238 . \n", + "Gene-7 596.6198 . . \n", + "Gene-8 947.1021 595.8238 610.5472\n", + "Gene-9 . . 
968.3861\n" + ] + } + ], + "source": [ + "%%R\n", + "# Calculate Counts Per Million using the Bioconductor scuttle package\n", + "# with a matrix in a Seurat object\n", + "cpm <- scuttle::calculateCPM(Seurat::GetAssayData(seurat, slot = \"counts\"))\n", + "cpm[1:10, 1:10]" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -297,7 +680,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -305,7 +687,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -314,11 +695,27 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 8, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "<2000x100 sparse matrix of type ''\n", + "\twith 126283 stored elements in Compressed Sparse Column format>" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# SIMPLE RPY2 USAGE" + "counts_mat = adata.layers[\"counts\"].T\n", + "# rpy2.robjects.globalenv[\"counts_mat\"] = counts_mat\n", + "# cpm = rpy2.robjects.r(\"scuttle::calculateCPM(counts_mat)\")\n", + "# cpm\n", + "counts_mat" ] }, { @@ -326,20 +723,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If you are using a Jupyter notebook (as we are for this book) you can use the IPython magic interface to create cells with native R code (passing objects as required)." + "If you are using a Jupyter notebook (as we are for this book) you can use the IPython magic interface to create cells with native R code (passing objects as required). For example, starting a cell with `%%R -i input -o output` says to take `input` as input, run R code and then return `output` as output." 
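One detail worth noting when passing matrices between the ecosystems is orientation: `AnnData` stores matrices as cells × genes, while most R single-cell packages expect genes × cells, which is why the code above transposes with `.T` before handing the counts to R. A small **scipy** sketch (the shapes are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

# AnnData convention: cells x genes
cells_by_genes = csr_matrix(np.random.poisson(1, size=(100, 2000)).astype(np.float32))

# R/Bioconductor convention: genes x cells, so transpose before passing to R
genes_by_cells = cells_by_genes.T

print(cells_by_genes.shape, genes_by_cells.shape)  # (100, 2000) (2000, 100)
```

Forgetting this transpose is a common source of silent errors, since a function expecting genes × cells will happily run on a cells × genes matrix and simply normalise over the wrong dimension.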
] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NULL\n" + ] + } + ], + "source": [ + "%%R\n", + "# R code running using IPython magic\n", + "# magic_cpm <- scuttle::calculateCPM(counts_mat)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ - "# SIMPLE MAGIC CELL" + "# Python code accessing the results\n", + "# magic_cpm" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -350,7 +766,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -358,7 +774,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -366,7 +781,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -374,7 +788,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -383,11 +796,24 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "List (26 items)\n", + "\n" + ] + } + ], "source": [ - "# SIMPLE RETICULATE USAGE" + "%%R\n", + "reticulate_list <- reticulate::r_to_py(LETTERS)\n", + "print(reticulate_list)\n", + "py_builtins <- reticulate::import_builtins()\n", + "py_builtins$zip(letters, LETTERS)" ] }, { @@ -395,16 +821,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If you are working in an [RMarkdown] or [Quarto] document you can also write native Python chunks using the **{reticulate}** Python engine. When we do this we can use the magic `r` and `py` variables to access objects in the other language." 
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "# SIMPLE PYTHON CHUNK" + "If you are working in an [RMarkdown] or [Quarto] document you can also write native Python chunks using the **{reticulate}** Python engine. When we do this we can use the magic `r` and `py` variables to access objects in the other language (the following code is an example that is not run)." ] }, { @@ -412,16 +829,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Unlike **anndata2ri**, there are no R packages that provide a direct interface for Python to view `SingleCellExperiment` or `Seurat` objects as `AnnData` objects. However, we can still access most parts of an `AnnData` using **{reticulate}**." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "# ACCESSING ANNDATA" + "````\n", + "```{r}\n", + "# An R chunk that accesses a Python object\n", + "print(py$py_object)\n", + "```\n", + "\n", + "```{python}\n", + "# A Python chunk that accesses an R object\n", + "print(r$r_object)\n", + "```\n", + "````" ] }, { @@ -429,18 +847,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As mentioned above the R **{anndata}** package provides an R interface for `AnnData` objects but it is not currently used by many analysis packages.\n", - "\n", - "For more complex analysis that requires a whole object to work on it may be necessary to completely convert an object from R to Python. This is not memory efficient as it creates a duplicate of the data but it does provide access to a greater range of packages. The **{zellkonverter}** package provides a function for doing this conversion (note that, unlike the function for reading H5AD files, this uses the normal Python environment rather than a specially created one). 
\n" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "# SCE2ANNDATA" + "Unlike **anndata2ri**, there are no R packages that provide a direct interface for Python to view `SingleCellExperiment` or `Seurat` objects as `AnnData` objects. However, we can still access most parts of an `AnnData` using **{reticulate}** (this code is not run)." ] }, { @@ -448,16 +855,69 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The created object can then be used by Python functions and the results converted back to R." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "# ANNDATA2SCE" + "```r\n", + "# Print an AnnData object in a Python environment\n", + "py$adata\n", + "```\n", + "```\n", + "AnnData object with n_obs × n_vars = 100 × 2000\n", + " obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'\n", + " var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n", + " uns: 'hvg', 'log1p', 'neighbors', 'pca', 'umap'\n", + " obsm: 'X_pca', 'X_umap'\n", + " varm: 'PCs'\n", + " layers: 'counts'\n", + " obsp: 'connectivities', 'distances'\n", + "```\n", + "```r\n", + "# Alternatively use the Python anndata package to read a H5AD file\n", + "anndata <- reticulate::import(\"anndata\")\n", + "anndata$read_h5ad(h5ad_file)\n", + "```\n", + "```\n", + "AnnData object with n_obs × n_vars = 100 × 2000\n", + " obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'\n", + " var: 'n_cells_by_counts', 'mean_counts', 
'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n", + " uns: 'hvg', 'log1p', 'neighbors', 'pca', 'umap'\n", + " obsm: 'X_pca', 'X_umap'\n", + " varm: 'PCs'\n", + " layers: 'counts'\n", + " obsp: 'connectivities', 'distances'\n", + "```\n", + "```r\n", + "# Access the obs slot, pandas DataFrames are automatically converted to R data.frames\n", + "head(adata$obs)\n", + "```\n", + "```\n", + " n_genes_by_counts log1p_n_genes_by_counts total_counts\n", + "Cell_0 1246 7.128496 1965\n", + "Cell_1 1262 7.141245 2006\n", + "Cell_2 1262 7.141245 1958\n", + "Cell_3 1240 7.123673 1960\n", + "Cell_4 1296 7.167809 2027\n", + "Cell_5 1231 7.116394 1898\n", + " log1p_total_counts pct_counts_in_top_50_genes\n", + "Cell_0 7.583756 10.025445\n", + "Cell_1 7.604396 9.521436\n", + "Cell_2 7.580189 9.959142\n", + "Cell_3 7.581210 9.183673\n", + "Cell_4 7.614805 9.718796\n", + "Cell_5 7.549083 10.168599\n", + " pct_counts_in_top_100_genes pct_counts_in_top_200_genes\n", + "Cell_0 17.65903 30.89059\n", + "Cell_1 16.99900 29.71087\n", + "Cell_2 17.62002 30.28601\n", + "Cell_3 16.83673 30.45918\n", + "Cell_4 17.11889 30.04440\n", + "Cell_5 18.07165 30.29505\n", + " pct_counts_in_top_500_genes\n", + "Cell_0 61.42494\n", + "Cell_1 59.62114\n", + "Cell_2 60.92952\n", + "Cell_3 61.07143\n", + "Cell_4 59.64480\n", + "Cell_5 61.48577\n", + "```" ] }, { @@ -465,20 +925,101 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The **{sceasy}** package also provides this functionality but can also convert between `Seurat` and `AnnData`." + "As mentioned above the R **{anndata}** package provides an R interface for `AnnData` objects but it is not currently used by many analysis packages.\n", + "\n", + "For more complex analysis that requires a whole object to work on it may be necessary to completely convert an object from R to Python (or the opposite). 
This is not memory efficient as it creates a duplicate of the data but it does provide access to a greater range of packages. The **{zellkonverter}** package provides a function for doing this conversion (note that, unlike the function for reading H5AD files, this uses the normal Python environment rather than a specially created one) (code not run). \n" ] }, { - "cell_type": "code", - "execution_count": 13, + "attachments": {}, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# SCEASY ANNDATA <-> SEURAT" + "```r\n", + "# Convert an AnnData to a SingleCellExperiment\n", + "sce <- zellkonverter::AnnData2SCE(adata, verbose = TRUE)\n", + "sce\n", + "```\n", + "```\n", + "✔ uns$hvg$flavor converted [21ms]\n", + "✔ uns$hvg converted [62ms]\n", + "✔ uns$log1p converted [22ms]\n", + "✔ uns$neighbors converted [21ms]\n", + "✔ uns$pca$params$use_highly_variable converted [22ms]\n", + "✔ uns$pca$params$zero_center converted [31ms]\n", + "✔ uns$pca$params converted [118ms]\n", + "✔ uns$pca$variance converted [17ms]\n", + "✔ uns$pca$variance_ratio converted [17ms]\n", + "✔ uns$pca converted [224ms]\n", + "✔ uns$umap$params$a converted [15ms]\n", + "✔ uns$umap$params$b converted [17ms]\n", + "✔ uns$umap$params converted [80ms]\n", + "✔ uns$umap converted [115ms]\n", + "✔ uns converted [582ms]\n", + "✔ Converting uns to metadata ... done\n", + "✔ X matrix converted to assay [44ms]\n", + "✔ layers$counts converted [29ms]\n", + "✔ Converting layers to assays ... done\n", + "✔ var converted to rowData [37ms]\n", + "✔ obs converted to colData [23ms]\n", + "✔ varm$PCs converted [18ms]\n", + "✔ varm converted [49ms]\n", + "✔ Converting varm to rowData$varm ... done\n", + "✔ obsm$X_pca converted [17ms]\n", + "✔ obsm$X_umap converted [17ms]\n", + "✔ obsm converted [80ms]\n", + "✔ Converting obsm to reducedDims ... 
done\n", + "ℹ varp is empty and was skipped\n", + "✔ obsp$connectivities converted [21ms]\n", + "✔ obsp$distances converted [22ms]\n", + "✔ obsp converted [89ms]\n", + "✔ Converting obsp to colPairs ... done\n", + "✔ SingleCellExperiment constructed [241ms]\n", + "ℹ Skipping conversion of raw\n", + "✔ Converting AnnData to SingleCellExperiment ... done\n", + "class: SingleCellExperiment\n", + "dim: 2000 100\n", + "metadata(5): hvg log1p neighbors pca umap\n", + "assays(2): X counts\n", + "rownames(2000): Gene_0 Gene_1 ... Gene_1998 Gene_1999\n", + "rowData names(11): n_cells_by_counts mean_counts ... dispersions_norm\n", + " varm\n", + "colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99\n", + "colData names(8): n_genes_by_counts log1p_n_genes_by_counts ...\n", + " pct_counts_in_top_200_genes pct_counts_in_top_500_genes\n", + "reducedDimNames(2): X_pca X_umap\n", + "mainExpName: NULL\n", + "altExpNames(0):\n", + "```\n", + "\n", + "The same can also be done in reverse:\n", + "\n", + "```r\n", + "adata2 <- zellkonverter::SCE2AnnData(sce, verbose = TRUE)\n", + "adata2\n", + "```\n", + "```\n", + "ℹ Using the 'X' assay as the X matrix\n", + "✔ Selected X matrix [27ms]\n", + "✔ assays$X converted to X matrix [38ms]\n", + "✔ additional assays converted to layers [31ms]\n", + "✔ rowData$varm converted to varm [15ms]\n", + "✔ reducedDims converted to obsm [63ms]\n", + "✔ metadata converted to uns [23ms]\n", + "ℹ rowPairs is empty and was skipped\n", + "✔ Converting AnnData to SingleCellExperiment ... 
done\n", + "AnnData object with n_obs × n_vars = 100 × 2000\n", + " obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'\n", + " var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'\n", + " uns: 'X_name', 'hvg', 'log1p', 'neighbors', 'pca', 'umap'\n", + " obsm: 'X_pca', 'X_umap'\n", + " varm: 'PCs'\n", + " layers: 'counts'\n", + " obsp: 'connectivities', 'distances'\n", + "```" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -486,7 +1027,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -494,7 +1034,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -502,7 +1041,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -524,15 +1062,206 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Session information" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Python" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "<summary>Click to view session information</summary>\n", + "<pre>
\n",
+       "-----\n",
+       "anndata             0.8.0\n",
+       "numpy               1.23.5\n",
+       "rpy2                3.5.10\n",
+       "scanpy              1.9.1\n",
+       "scipy               1.10.1\n",
+       "session_info        1.0.0\n",
+       "-----\n",
+       "</pre>\n", + "<details>\n", + "<summary>Click to view modules imported as dependencies</summary>\n", + "<pre>
\n",
+       "PIL                         9.4.0\n",
+       "anyio                       NA\n",
+       "appnope                     0.1.3\n",
+       "argcomplete                 NA\n",
+       "asttokens                   NA\n",
+       "attr                        22.2.0\n",
+       "babel                       2.12.1\n",
+       "backcall                    0.2.0\n",
+       "beta_ufunc                  NA\n",
+       "binom_ufunc                 NA\n",
+       "brotli                      NA\n",
+       "certifi                     2022.12.07\n",
+       "cffi                        1.15.1\n",
+       "charset_normalizer          3.1.0\n",
+       "colorama                    0.4.6\n",
+       "comm                        0.1.3\n",
+       "cycler                      0.10.0\n",
+       "cython_runtime              NA\n",
+       "dateutil                    2.8.2\n",
+       "debugpy                     1.6.6\n",
+       "decorator                   5.1.1\n",
+       "defusedxml                  0.7.1\n",
+       "executing                   1.2.0\n",
+       "fastjsonschema              NA\n",
+       "h5py                        3.8.0\n",
+       "hypergeom_ufunc             NA\n",
+       "idna                        3.4\n",
+       "importlib_metadata          NA\n",
+       "invgauss_ufunc              NA\n",
+       "ipykernel                   6.22.0\n",
+       "ipython_genutils            0.2.0\n",
+       "jedi                        0.18.2\n",
+       "jinja2                      3.1.2\n",
+       "joblib                      1.2.0\n",
+       "json5                       NA\n",
+       "jsonschema                  4.17.3\n",
+       "jupyter_events              0.6.3\n",
+       "jupyter_server              2.5.0\n",
+       "jupyterlab_server           2.22.0\n",
+       "kiwisolver                  1.4.4\n",
+       "llvmlite                    0.39.1\n",
+       "markupsafe                  2.1.2\n",
+       "matplotlib                  3.6.3\n",
+       "mpl_toolkits                NA\n",
+       "natsort                     8.3.1\n",
+       "nbformat                    5.8.0\n",
+       "nbinom_ufunc                NA\n",
+       "ncf_ufunc                   NA\n",
+       "nct_ufunc                   NA\n",
+       "ncx2_ufunc                  NA\n",
+       "numba                       0.56.4\n",
+       "packaging                   23.0\n",
+       "pandas                      1.5.3\n",
+       "parso                       0.8.3\n",
+       "pexpect                     4.8.0\n",
+       "pickleshare                 0.7.5\n",
+       "pkg_resources               NA\n",
+       "platformdirs                3.2.0\n",
+       "prometheus_client           NA\n",
+       "prompt_toolkit              3.0.38\n",
+       "psutil                      5.9.4\n",
+       "ptyprocess                  0.7.0\n",
+       "pure_eval                   0.2.2\n",
+       "pvectorc                    NA\n",
+       "pycparser                   2.21\n",
+       "pydev_ipython               NA\n",
+       "pydevconsole                NA\n",
+       "pydevd                      2.9.5\n",
+       "pydevd_file_utils           NA\n",
+       "pydevd_plugins              NA\n",
+       "pydevd_tracing              NA\n",
+       "pygments                    2.14.0\n",
+       "pynndescent                 0.5.8\n",
+       "pyparsing                   3.0.9\n",
+       "pyrsistent                  NA\n",
+       "pythonjsonlogger            NA\n",
+       "pytz                        2023.3\n",
+       "pytz_deprecation_shim       NA\n",
+       "requests                    2.28.2\n",
+       "rfc3339_validator           0.1.4\n",
+       "rfc3986_validator           0.1.1\n",
+       "rpycall                     NA\n",
+       "rpytools                    NA\n",
+       "send2trash                  NA\n",
+       "setuptools                  67.6.1\n",
+       "six                         1.16.0\n",
+       "skewnorm_ufunc              NA\n",
+       "sklearn                     1.2.2\n",
+       "sniffio                     1.3.0\n",
+       "socks                       1.7.1\n",
+       "stack_data                  0.6.2\n",
+       "threadpoolctl               3.1.0\n",
+       "tornado                     6.2\n",
+       "tqdm                        4.65.0\n",
+       "traitlets                   5.9.0\n",
+       "typing_extensions           NA\n",
+       "tzlocal                     NA\n",
+       "umap                        0.5.3\n",
+       "urllib3                     1.26.15\n",
+       "wcwidth                     0.2.6\n",
+       "websocket                   1.5.1\n",
+       "yaml                        6.0\n",
+       "zipp                        NA\n",
+       "zmq                         25.0.2\n",
+       "zoneinfo                    NA\n",
+       "
\n", + "
\n", + "
\n",
+       "-----\n",
+       "IPython             8.12.0\n",
+       "jupyter_client      8.1.0\n",
+       "jupyter_core        5.3.0\n",
+       "jupyterlab          3.6.3\n",
+       "notebook            6.5.3\n",
+       "-----\n",
+       "Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20) [Clang 14.0.6 ]\n",
+       "macOS-12.6.3-x86_64-i386-64bit\n",
+       "-----\n",
+       "Session information updated at 2023-04-03 17:10\n",
+       "
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import session_info\n", + "\n", + "session_info.show()" + ] + }, { "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## R" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NULL\n" + ] + } + ], + "source": [ + "%%R\n", + "# sessioninfo::session_info()" + ] + }, + { "cell_type": "markdown", "metadata": {}, "source": [ @@ -540,7 +1269,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -551,7 +1279,6 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -572,7 +1299,7 @@ ], "metadata": { "kernelspec": { - "display_name": "base", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -586,15 +1313,14 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.12" + "version": "3.9.16" }, - "orig_nbformat": 4, "vscode": { "interpreter": { - "hash": "9a4b8eb2e3b598ccf69997907eca0492d4b3d651094270ed44ac0cb064672ed3" + "hash": "8d105ab5f2493203880d03226e61049059fafcfb5de79a5a432ed345ee74ab31" } } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/jupyter-book/introduction/interoperability_environment.yml b/jupyter-book/introduction/interoperability_environment.yml index 90a287f1..3a4b61df 100644 --- a/jupyter-book/introduction/interoperability_environment.yml +++ b/jupyter-book/introduction/interoperability_environment.yml @@ -4,23 +4,29 @@ channels: - bioconda - defaults dependencies: - - conda-forge::python=3.9.15 - - conda-forge::jupyterlab==3.6.1 + - conda-forge::python=3.9.16 + - conda-forge::jupyterlab=3.6.3 - conda-forge::scanpy=1.9.1 - - anndata==0.8.0 - - anndata2ri==1.1 - - bioconductor-basilisk==1.9.12 - - 
bioconductor-singlecellexperiment==1.20.0 - - bioconductor-zellkonverter==1.8.0 - - igraph==0.10.3 - - matplotlib<3.7 - - numpy==1.23.5 - - pip==23.0 + - anndata=0.8.0 + - anndata2ri=1.1 + - bioconductor-basilisk=1.9.12 + - bioconductor-scuttle=1.8.0 + - bioconductor-singlecellexperiment=1.20.0 + - bioconductor-zellkonverter=1.8.0 + - igraph=0.10.3 + - matplotlib=3.6.3 + - numpy=1.23.5 + - pip=23.0.1 - python-igraph=0.10.4 - - conda-forge::r-base=4.2.2 + - r-base=4.2 + - r-remotes=2.4.2 - r-reticulate=1.26 - r-sessioninfo=1.2.2 - r-seurat=4.3.0 - - rpy2=3.5.8 - - scipy=1.10.0 + - r-seuratobject=4.1.3 + - rpy2=3.5.10 + - scipy=1.10.1 - session-info=1.0.0 + # Additional R packages installed manually using {remotes} + # remotes::install_github("mojaveazure/seurat-disk@9b89970eac2a3bd770e744f63c7763419486b14c") + # remotes::install_github("cellgeni/sceasy@v0.0.7") From 004f93bed5506b5c27b676e2473bb2809a5cd4ad Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Mon, 3 Apr 2023 18:10:34 +0200 Subject: [PATCH 05/15] Add links --- .../introduction/analysis_tools.ipynb | 5 +- .../introduction/interoperability.ipynb | 129 ++++++++++-------- jupyter-book/trajectories/rna_velocity.ipynb | 3 + 3 files changed, 81 insertions(+), 56 deletions(-) diff --git a/jupyter-book/introduction/analysis_tools.ipynb b/jupyter-book/introduction/analysis_tools.ipynb index 1a08a142..7d4ddbde 100644 --- a/jupyter-book/introduction/analysis_tools.ipynb +++ b/jupyter-book/introduction/analysis_tools.ipynb @@ -1,9 +1,12 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ + "(introduction:analysis-frameworks)=\n", + "\n", "# Analysis frameworks and tools" ] }, @@ -3282,7 +3285,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.13" + "version": "3.8.12" }, "vscode": { "interpreter": { diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 
05f8221b..453fb70c 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -15,10 +15,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "As we have discussed in the [analysis frameworks and tools chapter] there are three main ecosystems for single-cell analysis, the [Bioconductor] and [Seurat] ecosystems in R and the Python-based [scverse] ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and focus on using the most appropriate platform for their method. Another motivation for being comfortable with multiple is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. 
A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", + "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using. While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and focus on using the most appropriate platform for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. 
A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", "\n", "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages.\n" ] @@ -48,10 +49,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [NextFlow] or [snakemake]). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages." + "The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. 
In many cases, this approach is simpler, more reliable and scalable than in-memory interoperability (which we discuss below) but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [NextFlow](https://www.nextflow.io/index.html) or [snakemake](https://snakemake.readthedocs.io/en/stable/)). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods as you need to write a new file whenever you want to move between languages." ] }, { @@ -76,17 +78,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The most common disk formats for single-cell data are based on [Hierarchical Data Format version 5] or HDF5. This is an open-source file format designed for storing large, complex and heterogeneous data. It has a file directory type structure (similar to how files and folders are organised on your computer) which allows many different kinds of data to be stored in a single file with an arbitrarily complex hierarchy. While this format is very flexible, to properly interact with it you need to know where and how the different information is stored. For this reason, standard specifications for storing single-cell data in HDF5 files have been developed." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "!!! HDF5 OVERVIEW IMAGE !!!" + "The most common disk formats for single-cell data are based on [Hierarchical Data Format version 5](https://www.hdfgroup.org/solutions/hdf5/) or HDF5. This is an open-source file format designed for storing large, complex and heterogeneous data. 
It has a file directory type structure (similar to how files and folders are organised on your computer) which allows many different kinds of data to be stored in a single file with an arbitrarily complex hierarchy. While this format is very flexible, to properly interact with it you need to know where and how the different information is stored. For this reason, standard specifications for storing single-cell data in HDF5 files have been developed." ] }, { @@ -97,10 +93,13 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the **anndata** package. To demonstrate interoperability we will use a small randomly generated dataset that has gone through some of the steps of a standard analysis workflow to populate the various slots." + "The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the [**anndata** package](https://anndata.readthedocs.io/en/latest/index.html) (read more about the format [here](https://anndata.readthedocs.io/en/latest/fileformat-prose.html)).\n", + "\n", + "To demonstrate interoperability we will use a small randomly generated dataset that has gone through some of the steps of a standard analysis workflow to populate the various slots." 
] }, { @@ -186,10 +185,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The [Bioconductor **{zellkonverter}** package] helps makes this easier by using the [**{basilisk}** package] to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like below without requiring any knowledge of Python." + "The [Bioconductor **{zellkonverter}** package](https://bioconductor.org/packages/zellkonverter/) helps make this easier by using the [**{basilisk}** package](https://bioconductor.org/packages/basilisk/) to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like below without requiring any knowledge of Python." ] }, { @@ -306,17 +306,19 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "If this the first time that you run a **{zellkonverter}** you will see that it first creates a special conda environment to use (which can take a while). Once that environment exists it will be re-used by following function calls. **{zellkonverter}** has additional options such as allowing you to selectively read or write parts for an object, please refer to the documentation for more details. Similar functionality for writing a `SingleCellExperimentObject` to an H5AD file can be found in the [**{sceasy}** package]. While these packages are effective, wrapping Python requires some overhead which we hope will be addressed by native R H5AD writers/readers in the future." + "If this is the first time that you have run a **{zellkonverter}** function you will see that it first creates a special conda environment to use (which can take a while). Once that environment exists it will be re-used by following function calls. 
**{zellkonverter}** has additional options such as allowing you to selectively read or write parts of an object; please refer to the documentation for more details. Similar functionality for writing a `SingleCellExperiment` object to an H5AD file can be found in the [**{sceasy}** package](https://github.com/cellgeni/sceasy). While these packages are effective, wrapping Python requires some overhead which we hope will be addressed by native R H5AD writers/readers in the future." ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "##### Reading/writing H5AD with Seurat" + "##### Reading/writing H5AD with **{Seurat}**" ] }, { @@ -324,7 +326,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Converting between a `Seurat` object and an H5AD file is a two-step process [as suggested by this tutorial]. Firstly `.h5ad` file is converted to a `.h5Seurat` file (a custom HDF5 format for `Seurat` objects) using the [**{SeuratObject}** package] and then this file is read as a `Seurat` object." + "Converting between a `Seurat` object and an H5AD file is a two-step process [as suggested by this tutorial](https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html). Firstly, the H5AD file is converted to an H5Seurat file (a custom HDF5 format for `Seurat` objects) using the [**{SeuratDisk}** package](https://mojaveazure.github.io/seurat-disk/) and then this file is read as a `Seurat` object." ] }, { @@ -460,7 +462,7 @@ "source": [ "Note that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects the conversion process is more complex. See the [documentation of the conversion function] for more details on how this is done.\n", "\n", - "The **{sceasy}** package also provides a function for reading `.h5ad` files as `Seurat` or `SingleCellExperiment` objects in a single step. 
**{sceasy}** also does this by wrapping Python but unlike **{zellkonverter}** it doesn't use a special Python environment. This means you need to be responsible for setting up the environment, making sure that R can find it and that the correct packages are installed (again, this code is not run here)." + "The **{sceasy}** package also provides a function for reading H5AD files as `Seurat` or `SingleCellExperiment` objects in a single step. **{sceasy}** also does this by wrapping Python but unlike **{zellkonverter}** it doesn't use a special Python environment. This means you need to be responsible for setting up the environment, making sure that R can find it and that the correct packages are installed (again, this code is not run here)." ] }, { @@ -482,6 +484,16 @@ "```" ] }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Reading/writing H5AD with **{anndata}**\n", + "\n", + "The R [**{anndata}** package](https://anndata.dynverse.org/index.html) can also be used to read H5AD files. However, unlike the packages above it does not convert to a native R object. Instead it provides an R interface to the Python object. This is useful for accessing the data but few analysis packages will accept this as input so further in-memory conversion is usually required." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -494,7 +506,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [Loom file format] is an older HDF5 specification for omics data. Unlike H5AD it is not linked to a specific analysis package, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R] and [Python]. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. 
Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocycto] for [RNA velocity analysis]." + "The [Loom file format](http://loompy.org/) is an older HDF5 specification for omics data. Unlike H5AD it is not linked to a specific analysis ecosystem, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R](https://github.com/mojaveazure/loomR) and [Python](https://pypi.org/project/loompy/) as well as a [Bioconductor package](https://bioconductor.org/packages/LoomExperiment/) for writing Loom files. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocycto](http://velocyto.org/) for {ref}`RNA velocity analysis `." ] }, { @@ -519,10 +531,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "While HDF5-based formats are currently the standard for on-disk representations of single-cell data other newer technologies such as [Zarr] and [TileDB] have some advantages, particularly for very large datasets and other modalities. We expect specifications to be developed for these formats in the future which may be adopted by the community (**anndata** already provides support for Zarr files)." + "While HDF5-based formats are currently the standard for on-disk representations of single-cell data other newer technologies such as [Zarr](https://zarr.dev/) and [TileDB](https://tiledb.com/) have some advantages, particularly for very large datasets and other modalities. We expect specifications to be developed for these formats in the future which may be adopted by the community (**anndata** already provides support for Zarr files)." 
] }, { @@ -547,10 +560,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Before we look at in-memory interoperability between R and Python first let’s consider the simpler case of converting between the two R ecosystems. The **{Seurat}** package provides functions for performing this conversion [as described in this vignette]." + "Before we look at in-memory interoperability between R and Python first let’s consider the simpler case of converting between the two R ecosystems. The **{Seurat}** package provides functions for performing this conversion [as described in this vignette](https://satijalab.org/seurat/articles/conversion_vignette.html)." ] }, { @@ -639,28 +653,28 @@ "name": "stdout", "output_type": "stream", "text": [ - " \n", - "Gene-0 930.9480 601.4938 . 960.8865 . . . \n", - "Gene-1 584.8096 . 979.4957 . . 602.7839 1181.5488\n", - "Gene-2 584.8096 601.4938 . 607.1453 604.1277 1206.2747 934.9052\n", - "Gene-3 584.8096 . 979.4957 607.1453 . 955.7456 934.9052\n", - "Gene-4 584.8096 953.0988 619.8831 607.1453 957.9856 . 588.2744\n", - "Gene-5 584.8096 953.0988 . . 604.1277 . 588.2744\n", - "Gene-6 930.9480 953.0988 . . . 602.7839 . \n", - "Gene-7 . . 619.8831 607.1453 957.9856 602.7839 588.2744\n", - "Gene-8 584.8096 953.0988 1233.8239 . 1404.0783 1400.6417 588.2744\n", - "Gene-9 . 953.0988 979.4957 960.8865 . 602.7839 934.9052\n", - " \n", - "Gene-0 596.6198 . 610.5472\n", - "Gene-1 . 595.8238 610.5472\n", - "Gene-2 . 595.8238 . \n", - "Gene-3 596.6198 . . \n", - "Gene-4 947.1021 . 610.5472\n", - "Gene-5 947.1021 . 1717.0308\n", - "Gene-6 596.6198 595.8238 . \n", - "Gene-7 596.6198 . . \n", - "Gene-8 947.1021 595.8238 610.5472\n", - "Gene-9 . . 968.3861\n" + " \n", + "Gene-0 618.1942 . . . 949.6983 . 608.6559\n", + "Gene-1 . 594.1473 575.2669 . 1392.9609 595.4419 1214.5455\n", + "Gene-2 1231.7654 594.1473 . 617.6376 . 595.4419 1214.5455\n", + "Gene-3 618.1942 . 575.2669 617.6376 598.4406 . 608.6559\n", + "Gene-4 . 
594.1473 575.2669 . . 595.4419 963.3034\n", + "Gene-5 618.1942 594.1473 . . 598.4406 . . \n", + "Gene-6 977.4848 . 575.2669 1230.5427 598.4406 . 608.6559\n", + "Gene-7 . . . 617.6376 . . . \n", + "Gene-8 . . 918.1847 . 598.4406 945.0467 . \n", + "Gene-9 977.4848 1191.4993 918.1847 617.6376 949.6983 945.0467 . \n", + " \n", + "Gene-0 956.9127 . . \n", + "Gene-1 956.9127 1008.0197 942.5265\n", + "Gene-2 1208.2042 640.4155 . \n", + "Gene-3 956.9127 1467.0044 593.2510\n", + "Gene-4 . 640.4155 . \n", + "Gene-5 956.9127 640.4155 . \n", + "Gene-6 956.9127 1008.0197 942.5265\n", + "Gene-7 603.1238 640.4155 . \n", + "Gene-8 603.1238 . 942.5265\n", + "Gene-9 1208.2042 . . \n" ] } ], @@ -687,10 +701,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The Python interface to R is provided by the [**rpy2** package]. This allows you to access R functions and objects from Python. For example:" + "The Python interface to R is provided by the [**rpy2** package](https://rpy2.github.io/doc/latest/html/index.html). This allows you to access R functions and objects from Python. For example:" ] }, { @@ -702,7 +717,7 @@ "data": { "text/plain": [ "<2000x100 sparse matrix of type ''\n", - "\twith 126283 stored elements in Compressed Sparse Column format>" + "\twith 126271 stored elements in Compressed Sparse Column format>" ] }, "execution_count": 8, @@ -756,12 +771,13 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "This is the approach you will most commonly see in later chapters. For more information about using **rpy2** please refer to [the documentation].\n", + "This is the approach you will most commonly see in later chapters. For more information about using **rpy2** please refer to [the documentation](https://rpy2.github.io/doc/latest/html/index.html).\n", "\n", - "To work with single-cell data in this way the [**anndata2ri**] package is especially useful. 
This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object." + "To work with single-cell data in this way the [**anndata2ri** package](https://icb-anndata2ri.readthedocs-hosted.com/en/latest/) is especially useful. This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object." ] }, { @@ -788,10 +804,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Accessing Python from an R session is similar to accessing R from Python but here the interface is provided by the [**{reticulate}** package]. Once it is loaded we can access Python functions and objects from R." + "Accessing Python from an R session is similar to accessing R from Python but here the interface is provided by the [**{reticulate}** package](https://rstudio.github.io/reticulate/). Once it is loaded we can access Python functions and objects from R." ] }, { @@ -804,7 +821,7 @@ "output_type": "stream", "text": [ "List (26 items)\n", - "\n" + "\n" ] } ], @@ -821,7 +838,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If you are working in an [RMarkdown] or [Quarto] document you can also write native Python chunks using the **{reticulate}** Python engine. When we do this we can use the magic `r` and `py` variables to access objects in the other language (the following code is an example that is not run)." + "If you are working in an [RMarkdown](https://rmarkdown.rstudio.com/) or [Quarto](https://quarto.org/) document you can also write native Python chunks using the **{reticulate}** Python engine. When we do this we can use the magic `r` and `py` variables to access objects in the other language (the following code is an example that is not run)." 
] }, { @@ -1027,10 +1044,11 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "The developers of the `MuData` object, which we introduced in the [analysis frameworks and tools chapter] as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package] for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package] for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation]." + "The developers of the `MuData` object, which we introduced in the {ref}`analysis frameworks and tools chapter <introduction:analysis-frameworks>` as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`."
] }, { @@ -1041,6 +1059,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ @@ -1048,17 +1067,17 @@ "\n", "### Julia\n", "\n", - "- [Muon.jl] provides Julia implementations of AnnData and MuData objects, as well as IO for the H5AD and H5MU formats\n", - "- [scVI.jl] provides a Julia implementation of AnnData as well as IO for the H5AD format\n", + "- [Muon.jl](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) provides Julia implementations of `AnnData` and `MuData` objects, as well as IO for the H5AD and H5MU formats\n", + "- [scVI.jl](https://github.com/maren-ha/scVI.jl) provides a Julia implementation of `AnnData` as well as IO for the H5AD format\n", "\n", "### JavaScript\n", "\n", - "- [Vitessce] contains loaders from `AnnData` objects stored using the Zarr format\n", - "- The [kana family] supports reading H5AD files and `SingleCellExperiment` objects saved as RDS files\n", + "- [Vitessce](http://vitessce.io/) contains loaders from `AnnData` objects stored using the Zarr format\n", + "- The [kana family](https://github.com/jkanche/kana) supports reading H5AD files and `SingleCellExperiment` objects saved as RDS files\n", "\n", "### Rust\n", "\n", - "- [anndata-rs] provides a Rust implementation of AnnData as well as advanced IO support for the H5AD format\n" + "- [anndata-rs](https://github.com/kaizhang/anndata-rs) provides a Rust implementation of AnnData as well as advanced IO support for the H5AD format\n" ] }, { @@ -1216,7 +1235,7 @@ "Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20) [Clang 14.0.6 ]\n", "macOS-12.6.3-x86_64-i386-64bit\n", "-----\n", - "Session information updated at 2023-04-03 17:10\n", + "Session information updated at 2023-04-03 18:08\n", "\n", "" ], diff --git a/jupyter-book/trajectories/rna_velocity.ipynb b/jupyter-book/trajectories/rna_velocity.ipynb index 392a9877..b2080d22 100644 --- a/jupyter-book/trajectories/rna_velocity.ipynb +++ 
b/jupyter-book/trajectories/rna_velocity.ipynb @@ -1,9 +1,12 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ + "(trajectories:rna-velocity)=\n", + "\n", "# RNA velocity" ] }, From a192208bfd83d26c1242cd9d159bbc8a208d315d Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Mon, 3 Apr 2023 18:13:42 +0200 Subject: [PATCH 06/15] Add conventions --- jupyter-book/introduction/interoperability.ipynb | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 453fb70c..b0de1d65 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -21,7 +21,16 @@ "source": [ "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using. While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. 
For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and focus on using the most appropriate platform for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", "\n", - "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. 
We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages.\n", + "\n", + "Because talking about different languages can get confusing we try to use the following conventions:\n", + "\n", + "- **{package}** - An R package\n", + "- `package::function()` - A function in an R package\n", + "- **package** - A Python package\n", + "- `package.function()` - A function in a Python package\n", + "- **Emphasised** - Some other important concept\n", + "- `code` - Other parts of code including objects, variables etc. This is also used for files or directories." ] }, { From 61551aa8c35ad433f7442c5ad2d685553442dace Mon Sep 17 00:00:00 2001 From: zethson Date: Thu, 1 Jun 2023 23:24:45 +0200 Subject: [PATCH 07/15] Update environment Signed-off-by: zethson --- .../introduction/interoperability_environment.yml | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/jupyter-book/introduction/interoperability_environment.yml b/jupyter-book/introduction/interoperability_environment.yml index 3a4b61df..e09c5dcb 100644 --- a/jupyter-book/introduction/interoperability_environment.yml +++ b/jupyter-book/introduction/interoperability_environment.yml @@ -6,18 +6,12 @@ channels: dependencies: - conda-forge::python=3.9.16 - conda-forge::jupyterlab=3.6.3 - - conda-forge::scanpy=1.9.1 - - anndata=0.8.0 + - conda-forge::scanpy=1.9.3 - anndata2ri=1.1 - bioconductor-basilisk=1.9.12 - bioconductor-scuttle=1.8.0 - bioconductor-singlecellexperiment=1.20.0 - bioconductor-zellkonverter=1.8.0 - - igraph=0.10.3 - - matplotlib=3.6.3 - - numpy=1.23.5 - - pip=23.0.1 - - python-igraph=0.10.4 - r-base=4.2 - r-remotes=2.4.2 - r-reticulate=1.26 @@ -25,7 +19,6 @@ dependencies: - r-seurat=4.3.0 - r-seuratobject=4.1.3 - rpy2=3.5.10 - - scipy=1.10.1 - session-info=1.0.0 # Additional R packages installed manually using {remotes} # 
remotes::install_github("mojaveazure/seurat-disk@9b89970eac2a3bd770e744f63c7763419486b14c") From 6c023d7c1833aa03422c5da5ba72562688eb5973 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Tue, 8 Aug 2023 12:08:13 +0200 Subject: [PATCH 08/15] Update interoperability rpy2/anndata2ri versions --- .../introduction/interoperability.ipynb | 276 ++++++++++-------- .../interoperability_environment.yml | 8 +- 2 files changed, 163 insertions(+), 121 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index b0de1d65..1bdf712a 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -45,8 +45,10 @@ "import tempfile\n", "import os\n", "import rpy2.robjects\n", + "import anndata2ri\n", "from scipy.sparse import csr_matrix\n", "\n", + "anndata2ri.activate()\n", "%load_ext rpy2.ipython" ] }, @@ -120,8 +122,14 @@ "name": "stderr", "output_type": "stream", "text": [ - "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" + "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. 
See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", + " @numba.jit()\n", + "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", + " @numba.jit()\n", + "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", + " @numba.jit()\n", + "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. 
See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", + " @numba.jit()\n" ] }, { @@ -348,6 +356,15 @@ "output_type": "stream", "text": [ "R[write to console]: Converting H5AD to H5Seurat...\n", + "\n", + "R[write to console]: The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,\n", + "which was just loaded, will retire in October 2023.\n", + "Please refer to R-spatial evolution reports for details, especially\n", + "https://r-spatial.org/r/2023/05/15/evolution4.html.\n", + "It may be desirable to make the sf package available;\n", + "package maintainers should consider adding sf to Suggests:.\n", + "The sp package is now running under evolution status 2\n", + " (status 2 uses the sf package in place of rgdal)\n", "\n" ] }, @@ -588,14 +605,14 @@ "class: SingleCellExperiment \n", "dim: 2000 100 \n", "metadata(0):\n", - "assays(2): counts logcounts\n", + "assays(2): X logcounts\n", "rownames(2000): Gene-0 Gene-1 ... Gene-1998 Gene-1999\n", "rowData names(0):\n", "colnames(100): Cell_0 Cell_1 ... Cell_98 Cell_99\n", "colData names(9): n_genes_by_counts log1p_n_genes_by_counts ...\n", " pct_counts_in_top_500_genes ident\n", "reducedDimNames(2): PCA UMAP\n", - "mainExpName: RNA\n", + "mainExpName: NULL\n", "altExpNames(0):\n" ] } @@ -647,43 +664,29 @@ "name": "stdout", "output_type": "stream", "text": [ - "10 x 10 sparse Matrix of class \"dgCMatrix\"\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "R[write to console]: [[ suppressing 10 column names ‘Cell_0’, ‘Cell_1’, ‘Cell_2’ ... ]]\n", - "\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - " \n", - "Gene-0 618.1942 . . . 949.6983 . 608.6559\n", - "Gene-1 . 594.1473 575.2669 . 1392.9609 595.4419 1214.5455\n", - "Gene-2 1231.7654 594.1473 . 617.6376 . 595.4419 1214.5455\n", - "Gene-3 618.1942 . 575.2669 617.6376 598.4406 . 
608.6559\n", - "Gene-4 . 594.1473 575.2669 . . 595.4419 963.3034\n", - "Gene-5 618.1942 594.1473 . . 598.4406 . . \n", - "Gene-6 977.4848 . 575.2669 1230.5427 598.4406 . 608.6559\n", - "Gene-7 . . . 617.6376 . . . \n", - "Gene-8 . . 918.1847 . 598.4406 945.0467 . \n", - "Gene-9 977.4848 1191.4993 918.1847 617.6376 949.6983 945.0467 . \n", - " \n", - "Gene-0 956.9127 . . \n", - "Gene-1 956.9127 1008.0197 942.5265\n", - "Gene-2 1208.2042 640.4155 . \n", - "Gene-3 956.9127 1467.0044 593.2510\n", - "Gene-4 . 640.4155 . \n", - "Gene-5 956.9127 640.4155 . \n", - "Gene-6 956.9127 1008.0197 942.5265\n", - "Gene-7 603.1238 640.4155 . \n", - "Gene-8 603.1238 . 942.5265\n", - "Gene-9 1208.2042 . . \n" + "10 x 10 sparse Matrix of class \"dgCMatrix\"\n", + " \n", + " [1,] . 625.3202 . 593.0955 601.1306 628.6514 1230.1163\n", + " [2,] 1196.7954 985.8732 . 593.0955 . 628.6514 976.7754\n", + " [3,] . . . 593.0955 . . 618.4087\n", + " [4,] 597.5137 625.3202 608.1254 . 601.1306 . 618.4087\n", + " [5,] . 625.3202 963.3317 . . . 618.4087\n", + " [6,] . 625.3202 608.1254 593.0955 601.1306 . 976.7754\n", + " [7,] 597.5137 . 1215.2093 . . 1247.1100 618.4087\n", + " [8,] . 625.3202 . 593.0955 . . 618.4087\n", + " [9,] 597.5137 985.8732 963.3317 . . . 976.7754\n", + "[10,] . 985.8732 . 593.0955 1202.7475 991.2440 . \n", + " \n", + " [1,] 614.4039 1405.3720 612.9151\n", + " [2,] 614.4039 . . \n", + " [3,] 614.4039 959.8685 612.9151\n", + " [4,] 971.4187 605.9740 612.9151\n", + " [5,] . 605.9740 . \n", + " [6,] 614.4039 605.9740 . \n", + " [7,] 614.4039 . . \n", + " [8,] 971.4187 605.9740 . \n", + " [9,] . . 612.9151\n", + "[10,] . . 
612.9151\n" ] } ], @@ -725,8 +728,8 @@ { "data": { "text/plain": [ - "<2000x100 sparse matrix of type ''\n", - "\twith 126271 stored elements in Compressed Sparse Column format>" + "<2000x100 sparse matrix of type ''\n", + "\twith 126516 stored elements in Compressed Sparse Column format>" ] }, "execution_count": 8, @@ -736,10 +739,9 @@ ], "source": [ "counts_mat = adata.layers[\"counts\"].T\n", - "# rpy2.robjects.globalenv[\"counts_mat\"] = counts_mat\n", - "# cpm = rpy2.robjects.r(\"scuttle::calculateCPM(counts_mat)\")\n", - "# cpm\n", - "counts_mat" + "rpy2.robjects.globalenv[\"counts_mat\"] = counts_mat\n", + "cpm = rpy2.robjects.r(\"scuttle::calculateCPM(counts_mat)\")\n", + "cpm" ] }, { @@ -754,29 +756,33 @@ "cell_type": "code", "execution_count": 9, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NULL\n" - ] - } - ], + "outputs": [], "source": [ - "%%R\n", + "%%R -i counts_mat -o magic_cpm\n", "# R code running using IPython magic\n", - "# magic_cpm <- scuttle::calculateCPM(counts_mat)" + "magic_cpm <- scuttle::calculateCPM(counts_mat)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "<2000x100 sparse matrix of type ''\n", + "\twith 126516 stored elements in Compressed Sparse Column format>" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Python code accessing the results\n", - "# magic_cpm" + "magic_cpm" ] }, { @@ -786,16 +792,40 @@ "source": [ "This is the approach you will most commonly see in later chapters. For more information about using **rpy2** please refer to [the documentation](https://rpy2.github.io/doc/latest/html/index.html).\n", "\n", - "To work with single-cell data in this way the [**anndata2ri** package](https://icb-anndata2ri.readthedocs-hosted.com/en/latest/) is especially useful. 
This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object." + "To work with single-cell data in this way the [**anndata2ri** package](https://icb-anndata2ri.readthedocs-hosted.com/en/latest/) is especially useful. This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object. It also enables the conversion of sparse **scipy** matrices like we saw above." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/functools.py:888: NotConvertedWarning: Conversion 'py2rpy' not defined for objects of type ''\n", + " return dispatch(args[0].__class__)(*args, **kw)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " sum detected total\n", + "Cell_0 2021 1279 2021\n", + "Cell_1 1914 1242 1914\n", + "Cell_2 1995 1248 1995\n", + "Cell_3 2044 1285 2044\n", + "Cell_4 2009 1277 2009\n", + "Cell_5 1916 1237 1916\n" + ] + } + ], "source": [ - "# USING AN R FUNCTION ON AN ANNDATA" + "%%R -i adata\n", + "qc <- scuttle::perCellQCMetrics(adata)\n", + "head(qc)" ] }, { @@ -830,7 +860,7 @@ "output_type": "stream", "text": [ "List (26 items)\n", - "\n" + "\n" ] } ], @@ -1115,81 +1145,91 @@ "Click to view session information\n", "
\n",
        "-----\n",
-       "anndata             0.8.0\n",
-       "numpy               1.23.5\n",
-       "rpy2                3.5.10\n",
-       "scanpy              1.9.1\n",
-       "scipy               1.10.1\n",
+       "anndata             0.9.2\n",
+       "anndata2ri          1.2\n",
+       "numpy               1.24.4\n",
+       "rpy2                3.5.11\n",
+       "scanpy              1.9.3\n",
+       "scipy               1.9.3\n",
        "session_info        1.0.0\n",
        "-----\n",
        "
\n", "
\n", "Click to view modules imported as dependencies\n", "
\n",
-       "PIL                         9.4.0\n",
+       "CoreFoundation              NA\n",
+       "Foundation                  NA\n",
+       "PIL                         10.0.0\n",
+       "PyObjCTools                 NA\n",
        "anyio                       NA\n",
        "appnope                     0.1.3\n",
        "argcomplete                 NA\n",
+       "arrow                       1.2.3\n",
        "asttokens                   NA\n",
-       "attr                        22.2.0\n",
+       "attr                        23.1.0\n",
+       "attrs                       23.1.0\n",
        "babel                       2.12.1\n",
        "backcall                    0.2.0\n",
        "beta_ufunc                  NA\n",
        "binom_ufunc                 NA\n",
-       "brotli                      NA\n",
-       "certifi                     2022.12.07\n",
+       "brotli                      1.0.9\n",
+       "certifi                     2023.07.22\n",
        "cffi                        1.15.1\n",
-       "charset_normalizer          3.1.0\n",
+       "charset_normalizer          3.2.0\n",
        "colorama                    0.4.6\n",
-       "comm                        0.1.3\n",
+       "comm                        0.1.4\n",
        "cycler                      0.10.0\n",
        "cython_runtime              NA\n",
        "dateutil                    2.8.2\n",
-       "debugpy                     1.6.6\n",
+       "debugpy                     1.6.8\n",
        "decorator                   5.1.1\n",
        "defusedxml                  0.7.1\n",
        "executing                   1.2.0\n",
        "fastjsonschema              NA\n",
-       "h5py                        3.8.0\n",
+       "fqdn                        NA\n",
+       "h5py                        3.9.0\n",
        "hypergeom_ufunc             NA\n",
        "idna                        3.4\n",
        "importlib_metadata          NA\n",
-       "invgauss_ufunc              NA\n",
-       "ipykernel                   6.22.0\n",
+       "importlib_resources         NA\n",
+       "ipykernel                   6.25.0\n",
        "ipython_genutils            0.2.0\n",
-       "jedi                        0.18.2\n",
+       "ipywidgets                  8.1.0\n",
+       "isoduration                 NA\n",
+       "jedi                        0.19.0\n",
        "jinja2                      3.1.2\n",
-       "joblib                      1.2.0\n",
+       "joblib                      1.3.0\n",
        "json5                       NA\n",
-       "jsonschema                  4.17.3\n",
-       "jupyter_events              0.6.3\n",
-       "jupyter_server              2.5.0\n",
-       "jupyterlab_server           2.22.0\n",
+       "jsonpointer                 2.0\n",
+       "jsonschema                  4.18.6\n",
+       "jsonschema_specifications   NA\n",
+       "jupyter_events              0.7.0\n",
+       "jupyter_server              2.7.0\n",
+       "jupyterlab_server           2.24.0\n",
        "kiwisolver                  1.4.4\n",
-       "llvmlite                    0.39.1\n",
-       "markupsafe                  2.1.2\n",
-       "matplotlib                  3.6.3\n",
+       "llvmlite                    0.40.1\n",
+       "markupsafe                  2.1.3\n",
+       "matplotlib                  3.7.2\n",
        "mpl_toolkits                NA\n",
-       "natsort                     8.3.1\n",
-       "nbformat                    5.8.0\n",
+       "natsort                     8.4.0\n",
+       "nbformat                    5.9.2\n",
        "nbinom_ufunc                NA\n",
        "ncf_ufunc                   NA\n",
-       "nct_ufunc                   NA\n",
-       "ncx2_ufunc                  NA\n",
-       "numba                       0.56.4\n",
-       "packaging                   23.0\n",
-       "pandas                      1.5.3\n",
+       "numba                       0.57.1\n",
+       "objc                        9.2\n",
+       "overrides                   NA\n",
+       "packaging                   23.1\n",
+       "pandas                      2.0.3\n",
        "parso                       0.8.3\n",
        "pexpect                     4.8.0\n",
        "pickleshare                 0.7.5\n",
        "pkg_resources               NA\n",
-       "platformdirs                3.2.0\n",
+       "platformdirs                3.10.0\n",
        "prometheus_client           NA\n",
-       "prompt_toolkit              3.0.38\n",
-       "psutil                      5.9.4\n",
+       "prompt_toolkit              3.0.39\n",
+       "psutil                      5.9.5\n",
        "ptyprocess                  0.7.0\n",
        "pure_eval                   0.2.2\n",
-       "pvectorc                    NA\n",
        "pycparser                   2.21\n",
        "pydev_ipython               NA\n",
        "pydevconsole                NA\n",
@@ -1197,54 +1237,54 @@
        "pydevd_file_utils           NA\n",
        "pydevd_plugins              NA\n",
        "pydevd_tracing              NA\n",
-       "pygments                    2.14.0\n",
-       "pynndescent                 0.5.8\n",
+       "pygments                    2.15.1\n",
+       "pynndescent                 0.5.10\n",
        "pyparsing                   3.0.9\n",
-       "pyrsistent                  NA\n",
        "pythonjsonlogger            NA\n",
        "pytz                        2023.3\n",
-       "pytz_deprecation_shim       NA\n",
-       "requests                    2.28.2\n",
+       "referencing                 NA\n",
+       "requests                    2.31.0\n",
        "rfc3339_validator           0.1.4\n",
        "rfc3986_validator           0.1.1\n",
+       "rpds                        NA\n",
        "rpycall                     NA\n",
        "rpytools                    NA\n",
        "send2trash                  NA\n",
-       "setuptools                  67.6.1\n",
        "six                         1.16.0\n",
-       "skewnorm_ufunc              NA\n",
-       "sklearn                     1.2.2\n",
+       "sklearn                     1.3.0\n",
        "sniffio                     1.3.0\n",
        "socks                       1.7.1\n",
        "stack_data                  0.6.2\n",
-       "threadpoolctl               3.1.0\n",
-       "tornado                     6.2\n",
+       "threadpoolctl               3.2.0\n",
+       "tornado                     6.3.2\n",
        "tqdm                        4.65.0\n",
        "traitlets                   5.9.0\n",
        "typing_extensions           NA\n",
        "tzlocal                     NA\n",
        "umap                        0.5.3\n",
-       "urllib3                     1.26.15\n",
+       "uri_template                NA\n",
+       "urllib3                     2.0.4\n",
        "wcwidth                     0.2.6\n",
-       "websocket                   1.5.1\n",
+       "webcolors                   1.13\n",
+       "websocket                   1.6.1\n",
        "yaml                        6.0\n",
        "zipp                        NA\n",
-       "zmq                         25.0.2\n",
+       "zmq                         25.1.0\n",
        "zoneinfo                    NA\n",
        "
\n", "
\n", "
\n",
        "-----\n",
-       "IPython             8.12.0\n",
-       "jupyter_client      8.1.0\n",
-       "jupyter_core        5.3.0\n",
+       "IPython             8.14.0\n",
+       "jupyter_client      8.3.0\n",
+       "jupyter_core        5.3.1\n",
        "jupyterlab          3.6.3\n",
-       "notebook            6.5.3\n",
+       "notebook            6.5.4\n",
        "-----\n",
        "Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20) [Clang 14.0.6 ]\n",
-       "macOS-12.6.3-x86_64-i386-64bit\n",
+       "macOS-13.4.1-x86_64-i386-64bit\n",
        "-----\n",
-       "Session information updated at 2023-04-03 18:08\n",
+       "Session information updated at 2023-08-08 12:03\n",
        "
\n", "" ], diff --git a/jupyter-book/introduction/interoperability_environment.yml b/jupyter-book/introduction/interoperability_environment.yml index e09c5dcb..af16e215 100644 --- a/jupyter-book/introduction/interoperability_environment.yml +++ b/jupyter-book/introduction/interoperability_environment.yml @@ -7,18 +7,20 @@ dependencies: - conda-forge::python=3.9.16 - conda-forge::jupyterlab=3.6.3 - conda-forge::scanpy=1.9.3 - - anndata2ri=1.1 + - anndata2ri=1.2 - bioconductor-basilisk=1.9.12 - bioconductor-scuttle=1.8.0 - bioconductor-singlecellexperiment=1.20.0 - bioconductor-zellkonverter=1.8.0 + - ipywidgets=8.1.0 - r-base=4.2 + - r-hdf5r=1.3.8 - r-remotes=2.4.2 - - r-reticulate=1.26 + - r-reticulate=1.30 - r-sessioninfo=1.2.2 - r-seurat=4.3.0 - r-seuratobject=4.1.3 - - rpy2=3.5.10 + - rpy2=3.5.11 - session-info=1.0.0 # Additional R packages installed manually using {remotes} # remotes::install_github("mojaveazure/seurat-disk@9b89970eac2a3bd770e744f63c7763419486b14c") From 3b2ea91fd19496b047427f9cacf780c2ae745394 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Wed, 15 Nov 2023 15:30:19 +0100 Subject: [PATCH 09/15] Add mudata to interoperability environment --- jupyter-book/introduction/interoperability_environment.yml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/jupyter-book/introduction/interoperability_environment.yml b/jupyter-book/introduction/interoperability_environment.yml index af16e215..330cfe4d 100644 --- a/jupyter-book/introduction/interoperability_environment.yml +++ b/jupyter-book/introduction/interoperability_environment.yml @@ -9,10 +9,12 @@ dependencies: - conda-forge::scanpy=1.9.3 - anndata2ri=1.2 - bioconductor-basilisk=1.9.12 + - bioconductor-mudata=1.2.0 - bioconductor-scuttle=1.8.0 - bioconductor-singlecellexperiment=1.20.0 - bioconductor-zellkonverter=1.8.0 - ipywidgets=8.1.0 + - mudata=0.2.3 - r-base=4.2 - r-hdf5r=1.3.8 - r-remotes=2.4.2 @@ -25,3 +27,4 @@ dependencies: # Additional R packages installed manually using {remotes} # 
remotes::install_github("mojaveazure/seurat-disk@9b89970eac2a3bd770e744f63c7763419486b14c") # remotes::install_github("cellgeni/sceasy@v0.0.7") + # remotes::install_github("PMBio/MuDataSeurat@e34e908") From e4afe3d181c21caa70b051a5d4b70d7b901d4d48 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Wed, 15 Nov 2023 15:48:41 +0100 Subject: [PATCH 10/15] Add mudata interoperability example --- .../introduction/interoperability.ipynb | 534 +++++++++++++++--- 1 file changed, 447 insertions(+), 87 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 1bdf712a..f9c63eac 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -42,6 +42,7 @@ "import anndata\n", "import numpy\n", "import scanpy\n", + "import mudata\n", "import tempfile\n", "import os\n", "import rpy2.robjects\n", @@ -49,7 +50,7 @@ "from scipy.sparse import csr_matrix\n", "\n", "anndata2ri.activate()\n", - "%load_ext rpy2.ipython" + "%load_ext rpy2.ipython\n" ] }, { @@ -115,23 +116,9 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", - " @numba.jit()\n", - "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. 
The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", - " @numba.jit()\n", - "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", - " @numba.jit()\n", - "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: \u001b[1mThe 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. 
See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\u001b[0m\n", - " @numba.jit()\n" - ] - }, { "data": { "text/plain": [ @@ -145,7 +132,7 @@ " obsp: 'distances', 'connectivities'" ] }, - "execution_count": 2, + "execution_count": 3, "metadata": {}, "output_type": "execute_result" } @@ -165,7 +152,7 @@ "scanpy.tl.pca(adata)\n", "scanpy.pp.neighbors(adata)\n", "scanpy.tl.umap(adata)\n", - "adata" + "adata\n" ] }, { @@ -177,14 +164,14 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "temp_dir = tempfile.TemporaryDirectory()\n", "h5ad_file = os.path.join(temp_dir.name, \"example.h5ad\")\n", "\n", - "adata.write_h5ad(h5ad_file)" + "adata.write_h5ad(h5ad_file)\n" ] }, { @@ -348,7 +335,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -356,7 +343,13 @@ "output_type": "stream", "text": [ "R[write to console]: Converting H5AD to H5Seurat...\n", - "\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ "R[write to console]: The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,\n", "which was just loaded, will retire in October 2023.\n", "Please refer to R-spatial evolution reports for details, especially\n", @@ -478,7 +471,7 @@ "h5seurat_file <- gsub(\".h5ad\", \".h5seurat\", h5ad_file)\n", "seurat <- SeuratDisk::LoadH5Seurat(h5seurat_file, assays = \"RNA\")\n", "message(\"Read Seurat object:\")\n", - "seurat" + "seurat\n" ] }, { @@ -595,7 +588,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -620,12 +613,12 @@ "source": [ "%%R\n", "sce_from_seurat <- Seurat::as.SingleCellExperiment(seurat)\n", - "sce_from_seurat" + "sce_from_seurat\n" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": 
[ { @@ -642,7 +635,7 @@ "source": [ "%%R\n", "seurat_from_sce <- Seurat::as.Seurat(sce_from_seurat)\n", - "seurat_from_sce" + "seurat_from_sce\n" ] }, { @@ -657,7 +650,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -665,28 +658,28 @@ "output_type": "stream", "text": [ "10 x 10 sparse Matrix of class \"dgCMatrix\"\n", - " \n", - " [1,] . 625.3202 . 593.0955 601.1306 628.6514 1230.1163\n", - " [2,] 1196.7954 985.8732 . 593.0955 . 628.6514 976.7754\n", - " [3,] . . . 593.0955 . . 618.4087\n", - " [4,] 597.5137 625.3202 608.1254 . 601.1306 . 618.4087\n", - " [5,] . 625.3202 963.3317 . . . 618.4087\n", - " [6,] . 625.3202 608.1254 593.0955 601.1306 . 976.7754\n", - " [7,] 597.5137 . 1215.2093 . . 1247.1100 618.4087\n", - " [8,] . 625.3202 . 593.0955 . . 618.4087\n", - " [9,] 597.5137 985.8732 963.3317 . . . 976.7754\n", - "[10,] . 985.8732 . 593.0955 1202.7475 991.2440 . \n", - " \n", - " [1,] 614.4039 1405.3720 612.9151\n", - " [2,] 614.4039 . . \n", - " [3,] 614.4039 959.8685 612.9151\n", - " [4,] 971.4187 605.9740 612.9151\n", - " [5,] . 605.9740 . \n", - " [6,] 614.4039 605.9740 . \n", - " [7,] 614.4039 . . \n", - " [8,] 971.4187 605.9740 . \n", - " [9,] . . 612.9151\n", - "[10,] . . 612.9151\n" + " \n", + " [1,] 602.1263 622.1264 600.5326 1416.439 . . 965.8600\n", + " [2,] 602.1263 . . 613.435 618.2562 943.8910 609.1107\n", + " [3,] 602.1263 982.8946 1202.6879 969.506 618.2562 943.8910 1219.1005\n", + " [4,] . . 600.5326 613.435 976.3175 594.3451 . \n", + " [5,] 602.1263 622.1264 . 1221.384 618.2562 594.3451 609.1107\n", + " [6,] 954.8394 982.8946 1202.6879 613.435 618.2562 943.8910 1219.1005\n", + " [7,] 954.8394 622.1264 952.6379 613.435 618.2562 . 965.8600\n", + " [8,] . 982.8946 . . 618.2562 594.3451 . \n", + " [9,] 954.8394 . 600.5326 613.435 . 
1192.4227 1219.1005\n", + "[10,] 954.8394 622.1264 952.6379 613.435 618.2562 943.8910 609.1107\n", + " \n", + " [1,] 958.4081 599.5090 610.1969\n", + " [2,] 958.4081 952.2526 610.1969\n", + " [3,] . . 965.9120\n", + " [4,] . . . \n", + " [5,] 605.0013 952.2526 . \n", + " [6,] . . . \n", + " [7,] 605.0013 599.5090 965.9120\n", + " [8,] 1209.0169 . 965.9120\n", + " [9,] . 599.5090 1413.3168\n", + "[10,] 1209.0169 . 610.1969\n" ] } ], @@ -695,7 +688,7 @@ "# Calculate Counts Per Million using the Bioconductor scuttle package\n", "# with a matrix in a Seurat object\n", "cpm <- scuttle::calculateCPM(Seurat::GetAssayData(seurat, slot = \"counts\"))\n", - "cpm[1:10, 1:10]" + "cpm[1:10, 1:10]\n" ] }, { @@ -722,17 +715,17 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<2000x100 sparse matrix of type ''\n", - "\twith 126516 stored elements in Compressed Sparse Column format>" + "\twith 126146 stored elements in Compressed Sparse Column format>" ] }, - "execution_count": 8, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -741,7 +734,7 @@ "counts_mat = adata.layers[\"counts\"].T\n", "rpy2.robjects.globalenv[\"counts_mat\"] = counts_mat\n", "cpm = rpy2.robjects.r(\"scuttle::calculateCPM(counts_mat)\")\n", - "cpm" + "cpm\n" ] }, { @@ -754,35 +747,35 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "%%R -i counts_mat -o magic_cpm\n", "# R code running using IPython magic\n", - "magic_cpm <- scuttle::calculateCPM(counts_mat)" + "magic_cpm <- scuttle::calculateCPM(counts_mat)\n" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<2000x100 sparse matrix of type ''\n", - "\twith 126516 stored elements in Compressed Sparse Column format>" + "\twith 126146 stored elements in Compressed Sparse Column format>" ] }, - 
"execution_count": 10, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Python code accessing the results\n", - "magic_cpm" + "magic_cpm\n" ] }, { @@ -797,14 +790,14 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "/Users/luke.zappia/miniconda3/envs/interoperability/lib/python3.9/functools.py:888: NotConvertedWarning: Conversion 'py2rpy' not defined for objects of type ''\n", + "/Users/luke.zappia/miniconda3/envs/interoperability2/lib/python3.9/functools.py:888: NotConvertedWarning: Conversion 'py2rpy' not defined for objects of type ''\n", " return dispatch(args[0].__class__)(*args, **kw)\n" ] }, @@ -813,19 +806,19 @@ "output_type": "stream", "text": [ " sum detected total\n", - "Cell_0 2021 1279 2021\n", - "Cell_1 1914 1242 1914\n", - "Cell_2 1995 1248 1995\n", - "Cell_3 2044 1285 2044\n", - "Cell_4 2009 1277 2009\n", - "Cell_5 1916 1237 1916\n" + "Cell_0 2005 1274 2005\n", + "Cell_1 1941 1233 1941\n", + "Cell_2 2011 1270 2011\n", + "Cell_3 1947 1268 1947\n", + "Cell_4 1933 1265 1933\n", + "Cell_5 2031 1289 2031\n" ] } ], "source": [ "%%R -i adata\n", "qc <- scuttle::perCellQCMetrics(adata)\n", - "head(qc)" + "head(qc)\n" ] }, { @@ -852,7 +845,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -860,7 +853,7 @@ "output_type": "stream", "text": [ "List (26 items)\n", - "\n" + "\n" ] } ], @@ -869,7 +862,7 @@ "reticulate_list <- reticulate::r_to_py(LETTERS)\n", "print(reticulate_list)\n", "py_builtins <- reticulate::import_builtins()\n", - "py_builtins$zip(letters, LETTERS)" + "py_builtins$zip(letters, LETTERS)\n" ] }, { @@ -1087,7 +1080,239 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The developers of the `MuData` object, which we introduced in the [analysis frameworks and tools chapter]({ref}`analysis frameworks and tools chapter 
`) as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`." + "The developers of the `MuData` object, which we introduced in the [analysis frameworks and tools chapter]({ref}`analysis frameworks and tools chapter `) as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`.\n", + "\n", + "Below is an example of reading and writing a small example dataset using the Python and R packages." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Python" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MuData object with n_obs × n_vars = 411 × 56\n", + " obs:\t'louvain', 'leiden', 'leiden_wnn', 'celltype'\n", + " var:\t'gene_ids', 'feature_types', 'highly_variable'\n", + " obsm:\t'X_mofa', 'X_mofa_umap', 'X_umap', 'X_wnn_umap'\n", + " varm:\t'LFs'\n", + " obsp:\t'connectivities', 'distances', 'wnn_connectivities', 'wnn_distances'\n", + " 2 modalities\n", + " prot:\t411 x 29\n", + " var:\t'gene_ids', 'feature_types', 'highly_variable'\n", + " uns:\t'neighbors', 'pca', 'umap'\n", + " obsm:\t'X_pca', 'X_umap'\n", + " varm:\t'PCs'\n", + " layers:\t'counts'\n", + " obsp:\t'connectivities', 'distances'\n", + " rna:\t411 x 27\n", + " obs:\t'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'celltype'\n", + " var:\t'gene_ids', 'feature_types', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'\n", + " uns:\t'celltype_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'\n", + " obsm:\t'X_pca', 'X_umap'\n", + " varm:\t'PCs'\n", + " obsp:\t'connectivities', 'distances'\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/luke.zappia/miniconda3/envs/interoperability2/lib/python3.9/site-packages/anndata/_core/anndata.py:1230: SettingWithCopyWarning: \n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + " df[key] = c\n" + ] + } + ], + "source": [ + "# Read file\n", + "mdata = 
mudata.read_h5mu(\"../../datasets/original.h5mu\")\n", + "print(mdata)\n", + "\n", + "# Write new file\n", + "python_h5mu_file = os.path.join(temp_dir.name, \"python.h5mu\")\n", + "mdata.write_h5mu(python_h5mu_file)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### R" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Bioconductor\n", + "\n", + "Read/write from/to a `MultiAssayExperiment` object" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A MultiAssayExperiment object of 2 listed\n", + " experiments with user-defined names and respective classes.\n", + " Containing an ExperimentList class object of length 2:\n", + " [1] prot: SingleCellExperiment with 29 rows and 411 columns\n", + " [2] rna: SingleCellExperiment with 27 rows and 411 columns\n", + "Functionality:\n", + " experiments() - obtain the ExperimentList instance\n", + " colData() - the primary/phenotype DataFrame\n", + " sampleMap() - the sample coordination DataFrame\n", + " `$`, `[`, `[[` - extract colData columns, subset, or experiment\n", + " *Format() - convert into a long or wide DataFrame\n", + " assays() - convert ExperimentList to a SimpleList of matrices\n", + " exportClass() - save data to flat files\n" + ] + } + ], + "source": [ + "%%R\n", + "mae <- MuData::readH5MU(\"../../datasets/original.h5mu\")\n", + "print(mae)\n", + "\n", + "bioc_h5mu_file <- tempfile(fileext = \".h5mu\")\n", + "MuData::writeH5MU(mae, bioc_h5mu_file)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Seurat\n", + "\n", + "Read/write from/to a `Seurat` object" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,\n", + 
"which was just loaded, will retire in October 2023.\n", + "Please refer to R-spatial evolution reports for details, especially\n", + "https://r-spatial.org/r/2023/05/15/evolution4.html.\n", + "It may be desirable to make the sf package available;\n", + "package maintainers should consider adding sf to Suggests:.\n", + "The sp package is now running under evolution status 2\n", + " (status 2 uses the sf package in place of rgdal)\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + " WARNING: The R package \"reticulate\" only fixed recently\n", + " an issue that caused a segfault when used with rpy2:\n", + " https://github.com/rstudio/reticulate/pull/1188\n", + " Make sure that you use a version of that package that includes\n", + " the fix.\n", + " " + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: Warnung:\n", + "R[write to console]: Keys should be one or more alphanumeric characters followed by an underscore, setting key from prot to prot_\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'MOFA_1:30'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: Keys should be one or more alphanumeric characters followed by an underscore, setting key from MOFA_UMAP_ to MOFAUMAP_\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: All keys should be one or more alphanumeric characters followed by an underscore '_', setting key to MOFAUMAP_\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'MOFAUMAP_1:2'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'UMAP_1:2'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: Keys should be one or more alphanumeric characters followed by an underscore, setting key 
from WNN_UMAP_ to WNNUMAP_\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: All keys should be one or more alphanumeric characters followed by an underscore '_', setting key to WNNUMAP_\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'WNNUMAP_1:2'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'protPCA_1:31'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'protUMAP_1:2'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'rnaPCA_1:50'\n", + "\n", + "R[write to console]: Warnung:\n", + "R[write to console]: No columnames present in cell embeddings, setting to 'rnaUMAP_1:2'\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "An object of class Seurat \n", + "56 features across 411 samples within 2 assays \n", + "Active assay: prot (29 features, 29 variable features)\n", + " 1 other assay present: rna\n", + " 8 dimensional reductions calculated: MOFA, MOFA_UMAP, UMAP, WNN_UMAP, protPCA, protUMAP, rnaPCA, rnaUMAP\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: Added .var['highly_variable'] with highly variable features\n", + "\n", + "R[write to console]: Added .var['highly_variable'] with highly variable features\n", + "\n" + ] + } + ], + "source": [ + "%%R\n", + "seurat <- MuDataSeurat::ReadH5MU(\"../../datasets/original.h5mu\")\n", + "print(seurat)\n", + "\n", + "seurat_h5mu_file <- tempfile(fileext = \".h5mu\")\n", + "MuDataSeurat::WriteH5MU(seurat, seurat_h5mu_file)\n" ] }, { @@ -1135,7 +1360,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -1147,6 +1372,7 @@ "-----\n", "anndata 0.9.2\n", 
"anndata2ri 1.2\n", + "mudata 0.2.3\n", "numpy 1.24.4\n", "rpy2 3.5.11\n", "scanpy 1.9.3\n", @@ -1230,7 +1456,6 @@ "psutil 5.9.5\n", "ptyprocess 0.7.0\n", "pure_eval 0.2.2\n", - "pycparser 2.21\n", "pydev_ipython NA\n", "pydevconsole NA\n", "pydevd 2.9.5\n", @@ -1238,7 +1463,6 @@ "pydevd_plugins NA\n", "pydevd_tracing NA\n", "pygments 2.15.1\n", - "pynndescent 0.5.10\n", "pyparsing 3.0.9\n", "pythonjsonlogger NA\n", "pytz 2023.3\n", @@ -1247,8 +1471,6 @@ "rfc3339_validator 0.1.4\n", "rfc3986_validator 0.1.1\n", "rpds NA\n", - "rpycall NA\n", - "rpytools NA\n", "send2trash NA\n", "six 1.16.0\n", "sklearn 1.3.0\n", @@ -1257,11 +1479,9 @@ "stack_data 0.6.2\n", "threadpoolctl 3.2.0\n", "tornado 6.3.2\n", - "tqdm 4.65.0\n", "traitlets 5.9.0\n", "typing_extensions NA\n", "tzlocal NA\n", - "umap 0.5.3\n", "uri_template NA\n", "urllib3 2.0.4\n", "wcwidth 0.2.6\n", @@ -1284,7 +1504,7 @@ "Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20) [Clang 14.0.6 ]\n", "macOS-13.4.1-x86_64-i386-64bit\n", "-----\n", - "Session information updated at 2023-08-08 12:03\n", + "Session information updated at 2023-11-15 15:48\n", "\n", "" ], @@ -1292,7 +1512,7 @@ "" ] }, - "execution_count": 13, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -1300,7 +1520,7 @@ "source": [ "import session_info\n", "\n", - "session_info.show()" + "session_info.show()\n" ] }, { @@ -1313,20 +1533,160 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "NULL\n" + "─ Session info ───────────────────────────────────────────────────────────────\n", + " setting value\n", + " version R version 4.2.3 (2023-03-15)\n", + " os macOS Ventura 13.4.1\n", + " system x86_64, darwin13.4.0\n", + " ui unknown\n", + " language (EN)\n", + " collate C\n", + " ctype UTF-8\n", + " tz Europe/Berlin\n", + " date 2023-11-15\n", + " pandoc 2.19.2 @ 
/Users/luke.zappia/miniconda3/envs/interoperability2/bin/pandoc\n", + "\n", + "─ Packages ───────────────────────────────────────────────────────────────────\n", + " package * version date (UTC) lib source\n", + " abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.3)\n", + " Biobase 2.58.0 2022-11-01 [1] Bioconductor\n", + " BiocGenerics 0.44.0 2022-11-01 [1] Bioconductor\n", + " bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.3)\n", + " bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.3)\n", + " bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.3)\n", + " cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.3)\n", + " cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.3)\n", + " codetools 0.2-19 2023-02-01 [1] CRAN (R 4.2.3)\n", + " colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.3)\n", + " cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.2.3)\n", + " data.table 1.14.8 2023-02-17 [1] CRAN (R 4.2.3)\n", + " DelayedArray 0.24.0 2022-11-01 [1] Bioconductor\n", + " deldir 1.0-9 2023-05-17 [1] CRAN (R 4.2.3)\n", + " digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.3)\n", + " dplyr 1.1.2 2023-04-20 [1] CRAN (R 4.2.3)\n", + " ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.3)\n", + " fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.3)\n", + " fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.3)\n", + " fitdistrplus 1.1-11 2023-04-25 [1] CRAN (R 4.2.3)\n", + " future 1.33.0 2023-07-01 [1] CRAN (R 4.2.3)\n", + " future.apply 1.11.0 2023-05-21 [1] CRAN (R 4.2.3)\n", + " generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.3)\n", + " GenomeInfoDb 1.34.9 2023-02-02 [1] Bioconductor\n", + " GenomeInfoDbData 1.2.9 2023-11-10 [1] Bioconductor\n", + " GenomicRanges 1.50.0 2022-11-01 [1] Bioconductor\n", + " ggplot2 3.4.2 2023-04-03 [1] CRAN (R 4.2.3)\n", + " ggrepel 0.9.3 2023-02-03 [1] CRAN (R 4.2.3)\n", + " ggridges 0.5.4 2022-09-26 [1] CRAN (R 4.2.3)\n", + " globals 0.16.2 2022-11-21 [1] CRAN (R 4.2.3)\n", + " glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.3)\n", + " goftest 1.2-3 2021-10-07 [1] CRAN (R 4.2.3)\n", + " gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.3)\n", + " gtable 0.3.3 2023-03-21 [1] CRAN (R 
4.2.3)\n", + " hdf5r 1.3.8 2023-01-21 [1] CRAN (R 4.2.3)\n", + " htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.2.3)\n", + " htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.2.3)\n", + " httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.2.3)\n", + " httr 1.4.6 2023-05-08 [1] CRAN (R 4.2.3)\n", + " ica 1.0-3 2022-07-08 [1] CRAN (R 4.2.3)\n", + " igraph 1.5.0.1 2023-07-23 [1] CRAN (R 4.2.3)\n", + " IRanges 2.32.0 2022-11-01 [1] Bioconductor\n", + " irlba 2.3.5.1 2022-10-03 [1] CRAN (R 4.2.3)\n", + " jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.2.3)\n", + " KernSmooth 2.23-22 2023-07-10 [1] CRAN (R 4.2.3)\n", + " later 1.3.1 2023-05-02 [1] CRAN (R 4.2.3)\n", + " lattice 0.21-8 2023-04-05 [1] CRAN (R 4.2.3)\n", + " lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.3)\n", + " leiden 0.4.3 2022-09-10 [1] CRAN (R 4.2.3)\n", + " lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.3)\n", + " listenv 0.9.0 2022-12-16 [1] CRAN (R 4.2.3)\n", + " lmtest 0.9-40 2022-03-21 [1] CRAN (R 4.2.3)\n", + " magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.3)\n", + " MASS 7.3-60 2023-05-04 [1] CRAN (R 4.2.3)\n", + " Matrix 1.6-0 2023-07-08 [1] CRAN (R 4.2.3)\n", + " MatrixGenerics 1.10.0 2022-11-01 [1] Bioconductor\n", + " matrixStats 1.0.0 2023-06-02 [1] CRAN (R 4.2.3)\n", + " mime 0.12 2021-09-28 [1] CRAN (R 4.2.3)\n", + " miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.3)\n", + " MuData 1.2.0 2022-11-01 [1] Bioconductor\n", + " MuDataSeurat 0.0.0.9000 2023-11-15 [1] Github (PMBio/MuDataSeurat@e34e908)\n", + " MultiAssayExperiment 1.24.0 2022-11-01 [1] Bioconductor\n", + " munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.3)\n", + " nlme 3.1-162 2023-01-31 [1] CRAN (R 4.2.3)\n", + " parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.2.3)\n", + " patchwork 1.1.2 2022-08-19 [1] CRAN (R 4.2.3)\n", + " pbapply 1.7-2 2023-06-27 [1] CRAN (R 4.2.3)\n", + " pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.3)\n", + " pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.3)\n", + " plotly 4.10.2 2023-06-03 [1] CRAN (R 4.2.3)\n", + " plyr 1.8.8 2022-11-11 [1] CRAN (R 4.2.3)\n", + " png 
0.1-8 2022-11-29 [1] CRAN (R 4.2.3)\n", + " polyclip 1.10-4 2022-10-20 [1] CRAN (R 4.2.3)\n", + " progressr 0.13.0 2023-01-10 [1] CRAN (R 4.2.3)\n", + " promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.3)\n", + " purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.3)\n", + " R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.3)\n", + " RANN 2.6.1 2019-01-08 [1] CRAN (R 4.2.3)\n", + " RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.3)\n", + " Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.2.3)\n", + " RcppAnnoy 0.0.21 2023-07-02 [1] CRAN (R 4.2.3)\n", + " RCurl 1.98-1.12 2023-03-27 [1] CRAN (R 4.2.3)\n", + " reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.2.3)\n", + " reticulate 1.30 2023-06-09 [1] CRAN (R 4.2.3)\n", + " rhdf5 2.42.0 2022-11-01 [1] Bioconductor\n", + " rhdf5filters 1.10.0 2022-11-01 [1] Bioconductor\n", + " Rhdf5lib 1.20.0 2022-11-01 [1] Bioconductor\n", + " rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.3)\n", + " ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.2.3)\n", + " Rtsne 0.16 2022-04-17 [1] CRAN (R 4.2.3)\n", + " S4Vectors 0.36.0 2022-11-01 [1] Bioconductor\n", + " scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.3)\n", + " scattermore 1.2 2023-06-12 [1] CRAN (R 4.2.3)\n", + " sctransform 0.3.5 2022-09-21 [1] CRAN (R 4.2.3)\n", + " sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.3)\n", + " Seurat 4.3.0.1 2023-06-22 [1] CRAN (R 4.2.3)\n", + " SeuratObject 4.1.3 2022-11-07 [1] CRAN (R 4.2.3)\n", + " shiny 1.7.4.1 2023-07-06 [1] CRAN (R 4.2.3)\n", + " SingleCellExperiment 1.20.0 2022-11-01 [1] Bioconductor\n", + " sp 2.0-0 2023-06-22 [1] CRAN (R 4.2.3)\n", + " spatstat.data 3.0-1 2023-03-12 [1] CRAN (R 4.2.3)\n", + " spatstat.explore 3.2-1 2023-05-13 [1] CRAN (R 4.2.3)\n", + " spatstat.geom 3.2-4 2023-07-20 [1] CRAN (R 4.2.3)\n", + " spatstat.random 3.1-5 2023-05-11 [1] CRAN (R 4.2.3)\n", + " spatstat.sparse 3.0-2 2023-06-25 [1] CRAN (R 4.2.3)\n", + " spatstat.utils 3.0-3 2023-05-09 [1] CRAN (R 4.2.3)\n", + " stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.3)\n", + " stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.3)\n", + " 
SummarizedExperiment 1.28.0 2022-11-01 [1] Bioconductor\n", + " survival 3.5-5 2023-03-12 [1] CRAN (R 4.2.3)\n", + " tensor 1.5 2012-05-05 [1] CRAN (R 4.2.3)\n", + " tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.3)\n", + " tidyr 1.3.0 2023-01-24 [1] CRAN (R 4.2.3)\n", + " tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.3)\n", + " utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.3)\n", + " uwot 0.1.16 2023-06-29 [1] CRAN (R 4.2.3)\n", + " vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.3)\n", + " viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.2.3)\n", + " xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.3)\n", + " XVector 0.38.0 2022-11-01 [1] Bioconductor\n", + " zlibbioc 1.44.0 2022-11-01 [1] Bioconductor\n", + " zoo 1.8-12 2023-04-13 [1] CRAN (R 4.2.3)\n", + "\n", + " [1] /Users/luke.zappia/miniconda3/envs/interoperability2/lib/R/library\n", + "\n", + "──────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "%%R\n", - "# sessioninfo::session_info()" + "sessioninfo::session_info()\n" ] }, { From dc4c1cf50d9f11a0227131a8fb0cf40502cefdb1 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Wed, 15 Nov 2023 16:01:56 +0100 Subject: [PATCH 11/15] Add text about multimodal objects --- jupyter-book/introduction/interoperability.ipynb | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index f9c63eac..d86b55a0 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -1080,9 +1080,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The developers of the `MuData` object, which we introduced in the [analysis frameworks and tools chapter]({ref}`analysis frameworks and tools chapter `) as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. 
While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`.\n", + "The complexity of multimodal data presents additional challenges for interoperability. Both the `SingleCellExperiment` (through \"alternative experiments\", which must match in the column dimension (cells)) and `Seurat` (using \"assays\") objects can store multiple modalities but the `AnnData` object is restricted to unimodal data.\n", + "\n", + "To address this limitation the `MuData` object (introduced in the [analysis frameworks and tools chapter]({ref}`analysis frameworks and tools chapter `)) was developed as an extension of `AnnData` for multimodal datasets. The developers have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. 
The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`.\n", + "\n", + "Below is an example of reading and writing a small example `MuData` dataset using the Python and R packages." ] }, { From 26d90b7aeca499413aed29e2adf49f62b11023d7 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Wed, 15 Nov 2023 16:10:22 +0100 Subject: [PATCH 12/15] Add summary box to interoperability --- .../introduction/interoperability.ipynb | 22 ++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index d86b55a0..737b0d7c 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -7,6 +7,19 @@ "# Interoperability" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{admonition} Summary\n", + "\n", + "- Interoperability between languages allows analysts to take advantage of the strengths of different ecosystems\n", + "- On-disk interoperability uses standard file formats to transfer data and is typically more reliable\n", + "- In-memory interoperability transfers data directly between parallel sessions and is convenient for interactive analysis\n", + "- While interoperability is currently possible, developers continue to improve the experience\n", + "```" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -21,7 +34,14 @@ "source": [ "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? 
While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and focus on using the most appropriate platform for their method. Another motivation for being comfortable with multiple is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", "\n", - "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. 
We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages.\n", + "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Nomenclature\n", "\n", "Because talking about different languages can get confusing we try to use the following conventions:\n", "\n", From 0db1503f45cc30511f9a5222069eb6951bab315a Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Wed, 15 Nov 2023 16:26:37 +0100 Subject: [PATCH 13/15] Proofread interoperabilty text --- .../introduction/interoperability.ipynb | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 737b0d7c..344547c1 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -32,9 +32,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? 
While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and focus on using the most appropriate platform for their method. Another motivation for being comfortable with multiple is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", + "As we have discussed in the {ref}`analysis frameworks and tools chapter ` there are three main ecosystems for single-cell analysis, the [Bioconductor](https://bioconductor.org/) and [Seurat](https://satijalab.org/seurat/index.html) ecosystems in R and the Python-based [scverse](https://scverse.org/) ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using? 
While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems they often tend to use packages that are easy to access, even when they have been shown to have shortcomings compared to packages in another ecosystem. The ability of analysts to move between ecosystems allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures developers can be confident that analysts can access their package and can use the platform that is most appropriate for their method. Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format and analysts will need to be familiar with that format in order to access it. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.\n", "\n", - "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. 
We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages." + "While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully, lots of work has been done in this area and it is now relatively simple in most cases using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems via disk or in-memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases but we also touch on multimodal data and other languages." ] }, { @@ -131,7 +131,7 @@ "source": [ "The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the [**anndata** package](https://anndata.readthedocs.io/en/latest/index.html) (read more about the format [here](https://anndata.readthedocs.io/en/latest/fileformat-prose.html)).\n", "\n", - "To demonstrate interoperability we will use a small randomly generated dataset that has gone through some of the steps of a standard analysis workflow to populate the various slots." + "To demonstrate interoperability we will use a small, randomly generated dataset that has gone through some of the steps of a standard analysis workflow to populate the various slots." ] }, { @@ -220,7 +220,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Unfortunately, because of the way this book is made we are unable to run the code directly here. 
Instead we will show the code and what the output looks like when run in an R session:\n", + "Unfortunately, because of the way this book is made, we are unable to run the code directly here. Instead, we will show the code and what the output looks like when run in an R session:\n", "\n", "```r\n", "sce <- zellkonverter::readH5AD(h5ad_file, verbose = TRUE)\n", @@ -334,7 +334,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If this the first time that you run a **{zellkonverter}** you will see that it first creates a special conda environment to use (which can take a while). Once that environment exists it will be re-used by following function calls. **{zellkonverter}** has additional options such as allowing you to selectively read or write parts for an object, please refer to the documentation for more details. Similar functionality for writing a `SingleCellExperimentObject` to an H5AD file can be found in the [**{sceasy}** package](https://github.com/cellgeni/sceasy). While these packages are effective, wrapping Python requires some overhead which we hope will be addressed by native R H5AD writers/readers in the future." + "If this is the first time that you run a **{zellkonverter}** function you will see that it first creates a special conda environment to use (which can take a while). Once that environment exists it will be re-used by subsequent function calls. **{zellkonverter}** has additional options such as allowing you to selectively read or write parts of an object. Please refer to the package documentation for more details. Similar functionality for writing a `SingleCellExperiment` object to an H5AD file can be found in the [**{sceasy}** package](https://github.com/cellgeni/sceasy). While these packages are effective, wrapping Python requires some overhead which should be addressed by native R H5AD writers/readers in the future."
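To make the H5AD structure being read and written here more concrete, the sketch below lays out the top-level HDF5 groups of an H5AD file as a plain Python dict, following the anndata on-disk format. The descriptions are our own informal paraphrase, and the helper is an illustration rather than a reader implementation:

```python
# Schematic of the top-level groups in an H5AD file, following the anndata
# on-disk format. This illustrates the layout only; it does not read files.
H5AD_LAYOUT = {
    "X": "main data matrix (cells x genes)",
    "layers": "alternative matrices with the same dimensions as X",
    "obs": "per-cell annotations (a data frame)",
    "var": "per-gene annotations (a data frame)",
    "obsm": "per-cell arrays, e.g. PCA or UMAP embeddings",
    "varm": "per-gene arrays, e.g. PCA loadings",
    "obsp": "cell-by-cell matrices, e.g. neighbour graphs",
    "varp": "gene-by-gene matrices",
    "uns": "unstructured metadata",
}


def missing_groups(found):
    """Return the standard group names absent from a file's top-level listing."""
    return sorted(set(H5AD_LAYOUT) - set(found))


print(missing_groups(["X", "obs", "var", "uns"]))
# → ['layers', 'obsm', 'obsp', 'varm', 'varp']
```

A check along these lines can be a quick sanity aid when an H5AD file produced in one ecosystem loads incompletely in another.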
] }, { @@ -499,9 +499,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects the conversion process is more complex. See the [documentation of the conversion function] for more details on how this is done.\n", + "Note that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects the conversion process is more complex. See the [documentation of the conversion function](https://mojaveazure.github.io/seurat-disk/reference/Convert.html) for more details on how this is done.\n", "\n", - "The **{sceasy}** package also provides a function for reading H5AD files as `Seurat` or `SingleCellExperiment` objects in a single step. **{sceasy}** also does this by wrapping Python but unlike **{zellkonverter}** it doesn't use a special Python environment. This means you need to be responsible for setting up the environment, making sure that R can find it and that the correct packages are installed (again, this code is not run here)." + "The **{sceasy}** package also provides a function for reading H5AD files as `Seurat` or `SingleCellExperiment` objects in a single step. **{sceasy}** also wraps Python functions but unlike **{zellkonverter}** it doesn't use a special Python environment. This means you need to be responsible for setting up the environment, making sure that R can find it and that the correct packages are installed (again, this code is not run here)." ] }, { @@ -545,7 +545,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [Loom file format](http://loompy.org/) is an older HDF5 specification for omics data. Unlike H5AD it is not linked to a specific analysis ecosystem, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. 
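Conversion between these objects is, at its core, a mapping between slots. As a rough orientation aid (our own approximate summary, not an official table; exact behaviour depends on the conversion package used):

```python
# Approximate correspondence between AnnData slots and their closest
# equivalents in the two R ecosystems. Conversion tools must do more than
# rename slots (e.g. AnnData is cells x genes while the R objects are
# genes x cells), so treat this purely as a guide.
SLOT_MAP = {
    "X/layers": {"SingleCellExperiment": "assays", "Seurat": "assay data slots"},
    "obs": {"SingleCellExperiment": "colData", "Seurat": "meta.data"},
    "var": {"SingleCellExperiment": "rowData", "Seurat": "assay feature metadata"},
    "obsm": {"SingleCellExperiment": "reducedDims", "Seurat": "reductions"},
    "obsp": {"SingleCellExperiment": "colPairs", "Seurat": "graphs"},
    "uns": {"SingleCellExperiment": "metadata", "Seurat": "misc"},
}


def equivalent(anndata_slot, ecosystem):
    """Look up the closest equivalent of an AnnData slot in another ecosystem."""
    return SLOT_MAP[anndata_slot][ecosystem]


print(equivalent("obsm", "SingleCellExperiment"))
# → reducedDims
```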
Packages implementing the Loom format exist for both [R](https://github.com/mojaveazure/loomR) and [Python](https://pypi.org/project/loompy/) as well as a [Bioconductor package](https://bioconductor.org/packages/LoomExperiment/) for writing Loom files. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocycto](http://velocyto.org/) for {ref}`RNA velocity analysis `." + "The [Loom file format](http://loompy.org/) is an older HDF5 specification for omics data. Unlike H5AD, it is not linked to a specific analysis ecosystem, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R](https://github.com/mojaveazure/loomR) and [Python](https://pypi.org/project/loompy/) as well as a [Bioconductor package](https://bioconductor.org/packages/LoomExperiment/) for writing Loom files. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets, another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocyto](http://velocyto.org/) for {ref}`RNA velocity analysis `." ] }, { @@ -1102,7 +1102,7 @@ "source": [ "The complexity of multimodal data presents additional challenges for interoperability. Both the `SingleCellExperiment` (through \"alternative experiments\", which must match in the column dimension (cells)) and `Seurat` (using \"assays\") objects can store multiple modalities but the `AnnData` object is restricted to unimodal data.\n", "\n", - "To address this limitation the `MuData` object (introduced in the [analysis frameworks and tools chapter]({ref}`analysis frameworks and tools chapter `) was developed as as an extension of `AnnData` for multimodal datasets. 
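To sketch what the Loom layout looks like, here is a toy check over group names, with a list standing in for a real HDF5 file's top-level listing. The names follow our reading of the Loom specification; exact requirements differ between spec versions, so treat this as illustrative only:

```python
# Toy check of the Loom file layout: a main "/matrix" dataset plus attribute
# (and optionally graph) groups for each axis. Illustration only; a real
# check would open the HDF5 file, e.g. with h5py or loompy.
REQUIRED = {"matrix", "row_attrs", "col_attrs"}
OPTIONAL = {"layers", "row_graphs", "col_graphs", "attrs"}


def check_loom_layout(groups):
    """Return (missing required groups, unrecognised groups)."""
    groups = set(groups)
    missing = sorted(REQUIRED - groups)
    unknown = sorted(groups - REQUIRED - OPTIONAL)
    return missing, unknown


print(check_loom_layout(["matrix", "col_attrs", "extra"]))
# → (['row_attrs'], ['extra'])
```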
The developers have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`.\n", + "To address this limitation, the `MuData` object (introduced in the [analysis frameworks and tools chapter]({ref}`analysis frameworks and tools chapter `)) was developed as an extension of `AnnData` for multimodal datasets. The developers have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package](https://pmbio.github.io/MuDataSeurat/) for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package](https://bioconductor.org/packages/MuData/) for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation](https://docs.juliahub.com/Muon/QfqCh/0.1.1/objects/) of `AnnData` and `MuData`.\n", "\n", "Below is an example of reading and writing a small example `MuData` dataset using the Python and R packages."
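The idea behind `MuData` — several modalities that may each cover a different subset of cells, tied together along a shared cell axis — can be sketched with a toy class. This is plain Python and purely illustrative; real `MuData` objects hold one `AnnData` per modality:

```python
# Toy stand-in for the MuData idea: per-modality data keyed by cell name,
# with a global cell (obs) axis formed from the union across modalities.
class ToyMuData:
    def __init__(self, modalities):
        self.mod = modalities
        # Modalities may cover different cells; the global obs axis is the
        # union, as in real MuData objects.
        self.obs_names = sorted({c for m in modalities.values() for c in m})

    @property
    def n_obs(self):
        return len(self.obs_names)


mdata = ToyMuData(
    {
        "rna": {"cell1": [1, 0], "cell2": [0, 2]},
        "atac": {"cell2": [5], "cell3": [7]},
    }
)
print(mdata.n_obs, mdata.obs_names)
# → 3 ['cell1', 'cell2', 'cell3']
```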
] From 77e12683614b5ef92f64c25f090bf220b0c0ef97 Mon Sep 17 00:00:00 2001 From: Luke Zappia Date: Wed, 15 Nov 2023 16:48:10 +0100 Subject: [PATCH 14/15] Update interoperability reviewers --- jupyter-book/introduction/interoperability.ipynb | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 344547c1..5f8fc62b 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -762,6 +762,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "Common Python objects (lists, matrices, `DataFrame`s etc.) can also be passed to R.\n", + "\n", "If you are using a Jupyter notebook (as we are for this book) you can use the IPython magic interface to create cells with native R code (passing objects as required). For example, starting a cell with `%%R -i input -o output` says to take `input` as input, run R code and then return `output` as output." ] }, @@ -805,7 +807,9 @@ "source": [ "This is the approach you will most commonly see in later chapters. For more information about using **rpy2** please refer to [the documentation](https://rpy2.github.io/doc/latest/html/index.html).\n", "\n", - "To work with single-cell data in this way the [**anndata2ri** package](https://icb-anndata2ri.readthedocs-hosted.com/en/latest/) is especially useful. This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object. It also enables the conversion of sparse **scipy** matrices like we saw above." + "To work with single-cell data in this way the [**anndata2ri** package](https://icb-anndata2ri.readthedocs-hosted.com/en/latest/) is especially useful. This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. 
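To illustrate what the `-i`/`-o` flags in the magic line are doing, here is a toy parser for that flag syntax. It is purely illustrative: the real argument handling lives inside **rpy2**'s magic, and this sketch makes no claim about its internals:

```python
import shlex


# Toy parser for the cell-magic flag style shown above: "-i" names variables
# copied into the R session, "-o" names variables copied back afterwards.
def parse_r_magic_args(arg_line):
    inputs, outputs = [], []
    current = None
    for tok in shlex.split(arg_line):
        if tok == "-i":
            current = inputs
        elif tok == "-o":
            current = outputs
        elif current is not None:
            current.extend(tok.split(","))
    return inputs, outputs


print(parse_r_magic_args("-i counts_mat -o cpm"))
# → (['counts_mat'], ['cpm'])
```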
This avoids unnecessary conversion and makes it easy to run R code on a Python object. It also enables the conversion of sparse **scipy** matrices like we saw above.\n", + "\n", + "In this example, we pass an `AnnData` object in the Python session to R which views it as a `SingleCellExperiment` that can be used by R functions." ] }, { @@ -1743,7 +1747,10 @@ "### Reviewers\n", "\n", "* Lukas Heumos\n", - "* Isaac Virshup" + "* Isaac Virshup\n", + "* Anastasia Litinetskaya\n", + "* Ludwig Geistlinger\n", + "* Peter Hickey" ] } ], From b0fbd8f90ede2af9f107860cd914a287eb865789 Mon Sep 17 00:00:00 2001 From: zethson Date: Wed, 15 Nov 2023 17:06:27 +0100 Subject: [PATCH 15/15] pre-commit Signed-off-by: zethson --- jupyter-book/introduction/interoperability.ipynb | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/jupyter-book/introduction/interoperability.ipynb b/jupyter-book/introduction/interoperability.ipynb index 5f8fc62b..3fb33f33 100644 --- a/jupyter-book/introduction/interoperability.ipynb +++ b/jupyter-book/introduction/interoperability.ipynb @@ -70,7 +70,7 @@ "from scipy.sparse import csr_matrix\n", "\n", "anndata2ri.activate()\n", - "%load_ext rpy2.ipython\n" + "%load_ext rpy2.ipython" ] }, { @@ -172,7 +172,7 @@ "scanpy.tl.pca(adata)\n", "scanpy.pp.neighbors(adata)\n", "scanpy.tl.umap(adata)\n", - "adata\n" + "adata" ] }, { @@ -191,7 +191,7 @@ "temp_dir = tempfile.TemporaryDirectory()\n", "h5ad_file = os.path.join(temp_dir.name, \"example.h5ad\")\n", "\n", - "adata.write_h5ad(h5ad_file)\n" + "adata.write_h5ad(h5ad_file)" ] }, { @@ -754,7 +754,7 @@ "counts_mat = adata.layers[\"counts\"].T\n", "rpy2.robjects.globalenv[\"counts_mat\"] = counts_mat\n", "cpm = rpy2.robjects.r(\"scuttle::calculateCPM(counts_mat)\")\n", - "cpm\n" + "cpm" ] }, { @@ -797,7 +797,7 @@ ], "source": [ "# Python code accessing the results\n", - "magic_cpm\n" + "magic_cpm" ] }, { @@ -1170,7 +1170,7 @@ "\n", "# Write new file\n", "python_h5mu_file = 
os.path.join(temp_dir.name, \"python.h5mu\")\n", - "mdata.write_h5mu(python_h5mu_file)\n" + "mdata.write_h5mu(python_h5mu_file)" ] }, { @@ -1546,7 +1546,7 @@ "source": [ "import session_info\n", "\n", - "session_info.show()\n" + "session_info.show()" ] }, {