
Polars 1xx #149

Open · wants to merge 43 commits into main
Conversation

@ghuls (Member) commented Jul 16, 2024

No description provided.

ghuls added 30 commits June 10, 2024 17:22
Format all python files with "ruff format".
Cleanup code of get_barcodes_passing_qc_for_sample.
Update Polars syntax to 1.0.0+ version and fix some type checking.
Skip empty lines when reading barcode file in "read_barcodes_file_to_polars_series".
Add a workaround for very rare QC cases where the KDE of the duplication
ratio cannot be calculated because all duplication ratio values are 0.0%
(the fragment count for each fragment was 1).
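The failure mode can be sketched with scipy's gaussian_kde, which raises a LinAlgError when every input value is identical (the covariance matrix is singular). This is an illustrative fallback, not pycisTopic's exact workaround; the function name is made up:

```python
import numpy as np
from scipy.stats import gaussian_kde


def duplication_ratio_density(values: np.ndarray) -> np.ndarray:
    """Evaluate a KDE over duplication ratios, with a fallback for the
    rare case where every value is identical (e.g. all 0.0%)."""
    try:
        # gaussian_kde fails with a singular covariance matrix when
        # all values are the same.
        return gaussian_kde(values)(values)
    except np.linalg.LinAlgError:
        # Fall back to a flat density so downstream QC/plotting still
        # gets one value per barcode.
        return np.full(values.shape, 1.0 / max(len(values), 1))
```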
Add mypy stub generation and mypy numpy typing plugin.

Later stubs can be generated with:

    stubgen -o stubs -p pycisTopic
Only keep one copy of each barcode when reading barcode file in "read_barcodes_file_to_polars_series".
Replace df.columns with df.collect_schema().names() so it also works on lazy Polars dataframes.
Update "write_csv" usage to Polars syntax 1.0.0+ by using "include_header" instead of "has_header".
Do not use the "by=" keyword in group_by, as Polars 1.0.0+ treats it as
renaming the grouped column to "by".
Also change "sort" and "partition_by" to avoid the "by" keyword, although
those cases still work as before.
Reformat TSS profile chained code by adding no-op "clone" calls.
Restrict version numbers for some dependencies.
Expose "engine" option in "pycistopic qc".
Support adding sample ID to cell barcodes when reading fragments or
cell barcode files by adding extra arguments to:
  - `read_fragments_to_polars_df`
  - `read_barcodes_file_to_polars_series`
Remove unused argument from `get_insert_size_distribution` docstring.
Use greater than or equal for threshold filters in "get_barcodes_passing_qc_for_sample".
Change `pycistopic qc` to `pycistopic qc run`.
Add `pycistopic qc filter` to be able to filter cell barcodes based on QC stats.
Fix "get_tss_profile" so it works both with "polars" and "polars-u64-idx".

Previously, None values were not set to 0.0 when run with "polars-u64-idx".
Cast some columns to pl.UInt32 so dataframes have the same schema with
both "polars" and "polars-u64-idx".
Add "create_fragment_matrix_from_fragments" to create a sparse fragment
matrix directly from a fragments file for consensus peaks of interest.
The new code uses much less memory as it never builds a full dense matrix.

    import pycisTopic.cistopic_class
    import pycisTopic.fragments

    # Create fragments matrix for fragment file for consensus regions and
    # for cell barcodes selected after pycistopic QC.
    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        fragments_bed_filename="fragments_GSM7822226_MM_566.tsv.gz",
        regions_bed_filename="consensus_regions.bed",
        barcodes_tsv_filename="cbs_after_pycistopic_qc.tsv",
        blacklist_bed_filename="hg38-blacklist.v2.bed",  # Or None
    )

    # Define sample ID (project name).
    sample_id = "Sample1"

    # Create cisTopic object from sparse fragment matrix.
    cistopic_obj = pycisTopic.cistopic_class.create_cistopic_object(
        fragment_matrix=counts_fragments_matrix,
        cell_names=cbs,
        region_names=region_ids,
        path_to_fragments={sample_id: "fragments.tsv.gz"},
        project=sample_id
    )
Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
counts file as input, to replace "load_word_topics" and "get_topics" in the future.

"load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
a lot of regions and cells as input.

Mallet state files are quite big (even several hundreds of GB) and take quite a bit
of time to be written to disk when Mallet is run with a lot of regions and cells.
Getting count values for each region-topic pair is quite memory intensive in the
current code too.
Mallet region topics counts files are much smaller and don't require much
post-processing or time to be written to disk.
Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix"
and keep the code for creating the frequency matrix in "get_topics" as both matrices
are needed.
Allow creating the Mallet serialized corpus file directly from a sparse matrix.
This roughly replaces LDAMallet.convert_corpus_to_mallet_corpus_file and
allows the new method to be used independently of the Mallet topic modeling
itself. As generating the Mallet serialized corpus only needs to be done once
before all topic modeling runs, being able to run it independently is a huge plus.
Expose creation of Mallet corpus file from pycistopic CLI interface:

    pycistopic topic_modeling create_mallet_corpus

Usage:

  Create binary accessibility matrix in Matrix Market format:

    import pycisTopic.fragments
    import scipy

    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        "fragments.tsv.gz",
        "consensus_regions.bed",
        "cbs.tsv"
    )

    # Create binary matrix:
    binary_matrix = counts_fragments_matrix.copy()
    binary_matrix.data.fill(1)

    # Write binary matrix in Matrix Market format.
    scipy.io.mmwrite("binary_accessibility.mtx", binary_matrix)

  Create Mallet corpus file from binary accessibility matrix in Matrix Market format:

    $ pycistopic topic_modeling create_mallet_corpus -i "binary_accessibility.mtx" -o "corpus.mallet"
Add some basic sanity checking to "convert_binary_matrix_to_mallet_corpus_file".
Pass the correct alpha and eta to the loglikelihood function so it is the
same as the one that was actually used during topic modeling.
Remove Mallet text corpus after conversion to serialised corpus file.
Add LDAMalletFilenames class to automatically generate filenames
which will be used later in new LDAMallet class.
Use --word-topic-counts-file instead of --output-state file when running Mallet.
The former directly generates the counts we need, with minimal post-processing.
The state file could also take ages to write and read for datasets with a lot
of regions, and could increase the topic modeling runtime by 24 hours.

Changed `LDAMallet.train` to `LDAMallet.run_mallet_topic_modeling` and made
it a static method.

Reading of the output produced by Mallet will be added in further commits.
Add static methods to read cell-topic probabilities from Mallet "--output-doc-topics"
txt output to parquet file.
Add static methods to read cell-topic counts and probabilities from Mallet
"--word-topic-counts-file" txt output to parquet file.

- `LDAMallet.convert_region_topic_counts_txt_to_parquet()` replaces
   `LDAMallet.create_regions_topics_frequency_matrix()`, which recently
   replaced `LDAMallet.load_word_topics()` (read Mallet state file):

     See commit: b074950

     Add "create_regions_topics_frequency_matrix" function to replace "load_word_topics" in the future.

     Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
     counts file as input, to replace "load_word_topics" and "get_topics" in the future.

     "load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
     a lot of regions and cells as input.

     Mallet state files are quite big (even several hundreds of GB) and take quite a bit
     of time to be written to disk when Mallet is run with a lot of regions and cells.
     Getting count values for each region-topic pair is quite memory intensive in the
     current code too.
     Mallet region topics counts files are much smaller and don't require much
     post-processing or time to be written to disk.

- `LDAMallet.read_region_topic_counts_parquet_file_to_region_topic_probabilities()`
  replaces the functionality of `LDAMallet.get_topics()`, but reads the original
  data directly from disk.
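The counts-to-probabilities step can be sketched as a row-wise normalisation (an illustrative stand-in for reading the parquet file; the function name and normalisation axis are assumptions):

```python
import numpy as np


def counts_to_probabilities(region_topic_counts: np.ndarray) -> np.ndarray:
    # Normalise each topic's region counts to sum to 1, guarding
    # against empty topics (all-zero rows).
    totals = region_topic_counts.sum(axis=1, keepdims=True)
    return region_topic_counts / np.where(totals == 0, 1, totals)
```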
Add static method to read JSON parameter file written by `LDAMallet.run_mallet_topic_modeling`.
Use new functions for reading Mallet "--word-topic-counts-file" and
"--output-doc-topics" output to parquet files and saving parameters
JSON file.
Remove deprecated code from LDAMallet class.
Add/update some logging statements in LDAMallet class.
Add "--verbose" parameter to `pycistopic topic_modeling` subcommands.
Update `pycistopic topic_modeling mallet` CLI code to use the new LDAMallet class.
Move "create_mallet_corpus" argument parsing code above "mallet"
argument parsing code, so when both get moved behind a new "mallet"
subcommand, the diff will be less noisy.
Create `pycistopic topic_modeling mallet` subparser and move Mallet related
subcommands under it:
  - `pycistopic topic_modeling create_mallet_corpus` => `pycistopic topic_modeling mallet create_corpus`
  - `pycistopic topic_modeling mallet` => `pycistopic topic_modeling mallet run`
Rework `run_cgs_model_mallet` into `calculate_model_evaluation_stats`, which
starts from precalculated Mallet output.
Removing the Mallet run from this function allows easier parallelization of
Mallet topic modeling.
Add `pycistopic topic_modeling mallet stats` subcommand, using
`calculate_model_evaluation_stats`.
Rename `binary_matrix` to `binary_accessibility_matrix`.