Polars 1xx #149

Open. Wants to merge 43 commits into base: main

Commits on Jun 10, 2024

  1. Format all python files with "ruff format".

    ghuls committed Jun 10, 2024
    ef010bc

Commits on Jun 11, 2024

  1. Cleanup code of get_barcodes_passing_qc_for_sample.

    ghuls committed Jun 11, 2024
    9bb9fda

Commits on Jul 10, 2024

  1. Update Polars syntax to 1.0.0+ version.

    Update Polars syntax to 1.0.0+ version and fix some type checking.
    ghuls committed Jul 10, 2024
    0f73b45
  2. Skip empty lines when reading barcode file in "read_barcodes_file_to_polars_series".

    ghuls committed Jul 10, 2024
    2915437
  3. Add workaround for very rare cases in QC where KDE for duplication ratio cannot be calculated.

    Add a workaround for very rare cases in QC where the KDE for the duplication
    ratio cannot be calculated because all duplication ratio values are 0.0%
    (the fragment count for each fragment was 1).
    ghuls committed Jul 10, 2024
    7d9a9cc
  4. Add mypy stub generation and mypy numpy typing plugin.

    Stubs can be generated later with:

        stubgen -o stubs -p pycisTopic
    ghuls committed Jul 10, 2024
    35e13ea

Commits on Jul 15, 2024

  1. Only keep one copy of each barcode when reading barcode file in "read_barcodes_file_to_polars_series".

    ghuls committed Jul 15, 2024
    14fce9a
  2. Replace df.columns with df.collect_schema().names() so it works on lazy Polars dataframes.

    ghuls committed Jul 15, 2024
    39ace8a
  3. Update "write_csv" usage to Polars syntax 1.0.0+ by using "include_header" instead of "has_header".

    ghuls committed Jul 15, 2024
    9892252
  4. Do not use "by=" keyword in group_by as Polars 1.0.0+ treats that as renaming the grouped column to "by".

    Do not use the "by=" keyword in group_by, as Polars 1.0.0+ treats it
    as renaming the grouped column to "by".
    Also drop the "by" keyword from "sort" and "partition_by" calls,
    although those cases still work as before.
    ghuls committed Jul 15, 2024
    ca9b230
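
The two barcode-file commits (skip empty lines, keep one copy of each barcode) describe behaviour that can be illustrated without Polars. Below is a minimal plain-Python sketch, assuming one barcode per line. `read_unique_barcodes` is a hypothetical helper, not the PR's `read_barcodes_file_to_polars_series`, and the barcodes are made up.

```python
from __future__ import annotations

import io
from typing import Iterable


def read_unique_barcodes(lines: Iterable[str]) -> list[str]:
    """Return barcodes in first-seen order, skipping empty lines and duplicates."""
    seen: set[str] = set()
    barcodes: list[str] = []
    for line in lines:
        barcode = line.strip()
        if not barcode:
            # Empty (or whitespace-only) line: nothing to record.
            continue
        if barcode in seen:
            # Keep only one copy of each barcode.
            continue
        seen.add(barcode)
        barcodes.append(barcode)
    return barcodes


# In-memory stand-in for a barcode file (hypothetical content).
example_file = io.StringIO("AAAC-1\n\nAAAG-1\nAAAC-1\n\n")
print(read_unique_barcodes(example_file))
```

Whether the real function preserves first-seen order is not stated in the commit messages; the sketch keeps it only because that is the least surprising choice.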

Commits on Jul 16, 2024

  1. Reformat TSS profile chained code by adding no-op "clone" calls.

    ghuls committed Jul 16, 2024
    861f9b7
  2. Restrict version numbers for some dependencies.

    ghuls committed Jul 16, 2024
    1e2e0bf
  3. Expose "engine" option in "pycistopic qc".

    ghuls committed Jul 16, 2024
    bfa6e7b
  4. Support adding sample ID to cell barcodes when reading fragments or cell barcode files.

    Support adding sample ID to cell barcodes when reading fragments or
    cell barcode files by adding extra arguments to:
      - `read_fragments_to_polars_df`
      - `read_barcodes_file_to_polars_series`
    ghuls committed Jul 16, 2024
    deef721
  5. Remove unused argument from get_insert_size_distribution docstring.

    ghuls committed Jul 16, 2024
    49200f5
  6. Use greater than or equal for threshold filters in "get_barcodes_passing_qc_for_sample".

    ghuls committed Jul 16, 2024
    15899de
  7. Change `pycistopic qc` to `pycistopic qc run`.

    ghuls committed Jul 16, 2024
    419a536
  8. Add `pycistopic qc filter` to be able to filter cell barcodes based on QC stats.

    ghuls committed Jul 16, 2024
    f9d78a1
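
The switch to "greater than or equal" in "get_barcodes_passing_qc_for_sample" only matters for barcodes that sit exactly on a cutoff. Below is a minimal sketch of that difference. The `filter_barcodes` helper, the QC metric and the threshold value are all hypothetical; the real function's signature and thresholds are not shown on this page.

```python
from __future__ import annotations

import operator
from typing import Callable


def filter_barcodes(
    stats: dict[str, float],
    threshold: float,
    compare: Callable[[float, float], bool] = operator.ge,
) -> list[str]:
    """Return barcodes whose QC value passes `compare(value, threshold)`."""
    return [barcode for barcode, value in stats.items() if compare(value, threshold)]


# Hypothetical per-barcode QC values (e.g. number of unique fragments).
unique_fragments = {"AAAC-1": 999.0, "AAAG-1": 1000.0, "AAAT-1": 1500.0}

# Inclusive filter: the barcode exactly on the threshold is kept.
inclusive = filter_barcodes(unique_fragments, 1000.0)
# Strict filter: the barcode exactly on the threshold is dropped.
strict = filter_barcodes(unique_fragments, 1000.0, operator.gt)
print(inclusive, strict)
```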

Commits on Jul 17, 2024

  1. Fix "get_tss_profile" so it works both with "polars" and "polars-u64-idx".

    Previously, None values were not set to 0.0 when run with "polars-u64-idx".
    ghuls committed Jul 17, 2024
    4661c7d
  2. Fix some columns to pl.UInt32 so dataframes have same schema both with "polars" and "polars-u64-idx".

    ghuls committed Jul 17, 2024
    7b285a4

Commits on Jul 18, 2024

  1. Add "create_fragment_matrix_from_fragments" to directly create a sparse fragment matrix.

    Add "create_fragment_matrix_from_fragments" to directly create a
    sparse fragment matrix from a fragments file for consensus peaks
    of interest. The new code uses a lot less memory as it never builds
    a full dense matrix.
    
        import pycisTopic.cistopic_class
        import pycisTopic.fragments
    
        # Create fragments matrix for fragment file for consensus regions and
        # for cell barcodes selected after pycistopic QC.
        counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
            fragments_bed_filename="fragments_GSM7822226_MM_566.tsv.gz",
            regions_bed_filename="consensus_regions.bed",
            barcodes_tsv_filename="cbs_after_pycistopic_qc.tsv",
            blacklist_bed_filename="hg38-blacklist.v2.bed",  # Or None
        )
    
        # Define sample ID (project name).
        sample_id = "Sample1"
    
        # Create cisTopic object from sparse fragment matrix.
        cistopic_obj = pycisTopic.cistopic_class.create_cistopic_object(
            fragment_matrix=counts_fragments_matrix,
            cell_names=cbs,
            region_names=region_ids,
            path_to_fragments={sample_id: "fragments.tsv.gz"},
            project=sample_id
        )
    ghuls committed Jul 18, 2024
    a40e47c
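
The memory saving described in this commit comes from storing only non-zero counts. Below is a minimal plain-Python sketch of that idea, assuming the fragment-to-region overlaps have already been computed. The helper name, the input layout and the regions-by-cells orientation are illustrative assumptions, not taken from `create_fragment_matrix_from_fragments`.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def count_overlaps_as_coo(
    overlaps: Iterable[tuple[str, str]],
    barcodes: list[str],
    region_ids: list[str],
) -> tuple[list[int], list[int], list[int]]:
    """Count (region, barcode) overlaps and return COO triplets (rows, cols, data).

    Rows are regions and columns are cell barcodes (an assumption). Only
    non-zero entries are stored, so no dense matrix is ever allocated.
    """
    barcode_index = {barcode: j for j, barcode in enumerate(barcodes)}
    region_index = {region: i for i, region in enumerate(region_ids)}

    counts: Counter[tuple[int, int]] = Counter()
    for region, barcode in overlaps:
        # Ignore overlaps for barcodes or regions that were not selected.
        if region in region_index and barcode in barcode_index:
            counts[(region_index[region], barcode_index[barcode])] += 1

    rows: list[int] = []
    cols: list[int] = []
    data: list[int] = []
    for (i, j), n in sorted(counts.items()):
        rows.append(i)
        cols.append(j)
        data.append(n)
    return rows, cols, data


# Hypothetical overlaps: one (region, barcode) pair per overlapping fragment.
example_overlaps = [("r1", "c1"), ("r1", "c1"), ("r2", "c2"), ("r3", "cX")]
rows, cols, data = count_overlaps_as_coo(example_overlaps, ["c1", "c2"], ["r1", "r2"])
print(rows, cols, data)
```

Triplets in this form can be handed to a sparse matrix constructor such as `scipy.sparse.coo_matrix((data, (rows, cols)), shape=...)`.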

Commits on Aug 2, 2024

  1. Add "create_regions_topics_frequency_matrix" function to replace "load_word_topics" in the future.

    Add "create_regions_topics_frequency_matrix" function, which uses the Mallet region
    topics counts file as input, to replace "load_word_topics" and "get_topics" in the future.

    "load_word_topics" uses the Mallet state file. Mallet state files can be huge
    (even several hundreds of GB) and take quite a bit of time to be written to disk
    when Mallet is run with a lot of regions and cells. Getting count values for each
    region-topic pair is also quite memory intensive in the current code.
    Mallet region topics counts files are much smaller, take little time to be
    written to disk and need little post-processing.
    ghuls committed Aug 2, 2024
    b074950
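
To make the "little post-processing" point concrete, here is a minimal sketch of reading a region-topic counts file and deriving per-topic frequencies. It assumes each line has the form `<index> <region> <topic>:<count> ...`, which should be checked against real Mallet output. The function names are hypothetical, and normalising each topic over all regions is an assumption about what the frequency matrix contains.

```python
from __future__ import annotations

from typing import Iterable


def parse_region_topic_counts(lines: Iterable[str], n_topics: int) -> dict[str, list[int]]:
    """Parse lines of the assumed form '<index> <region> <topic>:<count> ...'."""
    counts: dict[str, list[int]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Skip empty or malformed lines.
            continue
        region = fields[1]
        row = [0] * n_topics
        for pair in fields[2:]:
            topic, count = pair.split(":")
            row[int(topic)] = int(count)
        counts[region] = row
    return counts


def counts_to_topic_frequencies(counts: dict[str, list[int]]) -> dict[str, list[float]]:
    """Divide each count by its topic total, so every non-empty topic sums to 1."""
    if not counts:
        return {}
    n_topics = len(next(iter(counts.values())))
    totals = [sum(row[t] for row in counts.values()) for t in range(n_topics)]
    return {
        region: [row[t] / totals[t] if totals[t] else 0.0 for t in range(n_topics)]
        for region, row in counts.items()
    }


# Made-up example lines with two regions and three topics.
example_lines = ["0 chr1:100-600 0:3 2:1", "1 chr1:900-1400 1:4 0:1"]
region_counts = parse_region_topic_counts(example_lines, n_topics=3)
region_frequencies = counts_to_topic_frequencies(region_counts)
print(region_counts)
print(region_frequencies)
```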

Commits on Aug 5, 2024

  1. Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix".

    Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix"
    and keep the code for creating the frequency matrix in "get_topics" as both matrices
    are needed.
    ghuls committed Aug 5, 2024
    fab1300
  2. Allow creating a Mallet serialized corpus file directly from a sparse matrix.

    Allow creating a Mallet serialized corpus file directly from a sparse matrix.
    This roughly replaces LDAMallet.convert_corpus_to_mallet_corpus_file and
    allows the new method to be used independently of the Mallet topic modeling
    itself. As the Mallet serialized corpus only needs to be generated once
    before all topic modeling runs, being able to run it independently is a huge plus.
    ghuls committed Aug 5, 2024
    ad9c756
  3. Expose creation of Mallet corpus file from pycistopic CLI interface.

    Expose creation of Mallet corpus file from pycistopic CLI interface:
    
        pycistopic topic_modeling create_mallet_corpus
    
    Usage:
    
      Create binary accessibility matrix in Matrix Market format:
    
        import pycisTopic.fragments
        import scipy.io
    
        counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
            "fragments.tsv.gz",
            "consensus_regions.bed",
            "cbs.tsv"
        )
    
        # Create binary matrix:
        binary_matrix = counts_fragments_matrix.copy()
        binary_matrix.data.fill(1)
    
        # Write binary matrix in Matrix Market format.
        scipy.io.mmwrite("binary_accessibility.mtx", binary_matrix)
    
      Create Mallet corpus file from binary accessibility matrix in Matrix Market format:
    
        $ pycistopic topic_modeling create_mallet_corpus -i "binary_accessibility.mtx" -o "corpus.mallet"
    ghuls committed Aug 5, 2024
    2d54473
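
As a rough illustration of what "creating a corpus from a binary matrix" involves, the sketch below turns the non-zero entries of a binary accessibility matrix into one text line per cell. It assumes a one-instance-per-line "name, label, data" text layout (as accepted by Mallet's text import) with region indices as tokens; the helper name and the exact layout are assumptions, and the intermediate format actually written by the PR is not shown on this page.

```python
from __future__ import annotations


def binary_triplets_to_corpus_lines(
    region_idx: list[int],
    cell_idx: list[int],
    n_cells: int,
) -> list[str]:
    """Build one 'name<TAB>label<TAB>tokens' line per cell.

    `region_idx[k]` and `cell_idx[k]` give the position of the k-th non-zero
    entry of a regions-by-cells binary matrix (orientation is an assumption).
    """
    regions_per_cell: list[list[int]] = [[] for _ in range(n_cells)]
    for region, cell in zip(region_idx, cell_idx):
        regions_per_cell[cell].append(region)

    lines: list[str] = []
    for cell, regions in enumerate(regions_per_cell):
        # Each accessible region becomes one token in the cell's "document".
        tokens = " ".join(str(region) for region in sorted(regions))
        lines.append(f"{cell}\t0\t{tokens}")
    return lines


# Hypothetical 2-region x 2-cell binary matrix: cell 0 has regions 0 and 1, cell 1 has region 1.
corpus_lines = binary_triplets_to_corpus_lines([0, 1, 1], [0, 0, 1], n_cells=2)
print("\n".join(corpus_lines))
```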

Commits on Aug 22, 2024

  1. Add some basic sanity checking to "convert_binary_matrix_to_mallet_corpus_file".

    ghuls committed Aug 22, 2024
    b680420

Commits on Aug 26, 2024

  1. Pass correct alpha and eta to loglikelihood function.

    Pass the correct alpha and eta to the loglikelihood function so they
    match the values that were actually used during topic modeling.
    ghuls committed Aug 26, 2024
    646db07
  2. Remove Mallet text corpus after conversion to serialised corpus file.

    ghuls committed Aug 26, 2024
    a59ae9e
  3. Add LDAMalletFilenames class.

    Add LDAMalletFilenames class to automatically generate filenames
    which will be used later in new LDAMallet class.
    ghuls committed Aug 26, 2024
    b6d0973
  4. Use --word-topic-counts-file instead of --output-state file when running Mallet.

    Use --word-topic-counts-file instead of --output-state file when running Mallet.
    The former directly generates the counts we need, with minimal post-processing.
    The state file could also take ages to write and read for datasets with a lot
    of regions and could increase the runtime of topic modeling by 24 hours.

    Changed `LDAMallet.train` to `LDAMallet.run_mallet_topic_modeling` and made
    it a static method.

    Reading of the output produced by Mallet will be added in further commits.
    ghuls committed Aug 26, 2024
    5d8cd37
  5. Add static methods to read cell-topic probabilities from Mallet "--output-doc-topics" output.

    Add static methods to read cell-topic probabilities from Mallet "--output-doc-topics"
    txt output to parquet file.
    ghuls committed Aug 26, 2024
    12c7c7b
  6. Add static methods to read region-topic counts and probabilities from Mallet "--word-topic-counts-file" output.

    Add static methods to read region-topic counts and probabilities from Mallet
    "--word-topic-counts-file" txt output to parquet file.

    - `LDAMallet.convert_region_topic_counts_txt_to_parquet()` replaces
      `LDAMallet.create_regions_topics_frequency_matrix()`, which recently
      replaced `LDAMallet.load_word_topics()` (read Mallet state file):

         See commit: b074950

         Add "create_regions_topics_frequency_matrix" function to replace "load_word_topics" in the future.

         Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
         counts file as input, to replace "load_word_topics" and "get_topics" in the future.

         "load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
         a lot of regions and cells as input.

         Mallet state files are quite big (even several hundreds of GB) and take quite a bit
         of time to be written to disk when Mallet is run with a lot of regions and cells.
         Getting count values for each region-topic pair is quite memory intensive in the
         current code too.
         Mallet region topics counts files are much smaller and don't require much post
         processing and time to be written to disk.

    - `LDAMallet.read_region_topic_counts_parquet_file_to_region_topic_probabilities()`
      replaces the functionality of `LDAMallet.get_topics()`, but reads the original
      data directly from disk.
    ghuls committed Aug 26, 2024
    cbfe3bb
  7. Add static method to read JSON parameter file written by `LDAMallet.run_mallet_topic_modeling`.

    ghuls committed Aug 26, 2024
    8dbe474
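
The idea behind a filenames helper class such as `LDAMalletFilenames` can be sketched as a small dataclass that derives every output path from one prefix and a topic count. Everything below is hypothetical: the class name, attribute names and filename pattern are illustrative assumptions and are not taken from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MalletFilenamesSketch:
    """Hypothetical stand-in: derive all output filenames from one prefix."""

    output_prefix: str
    n_topics: int

    def _path(self, suffix: str) -> str:
        # Assumed pattern: <prefix>.<n>_topics.<suffix>
        return f"{self.output_prefix}.{self.n_topics}_topics.{suffix}"

    @property
    def region_topic_counts_txt(self) -> str:
        return self._path("region_topic_counts.txt")

    @property
    def region_topic_counts_parquet(self) -> str:
        return self._path("region_topic_counts.parquet")

    @property
    def cell_topic_probabilities_txt(self) -> str:
        return self._path("cell_topic_probabilities.txt")

    @property
    def cell_topic_probabilities_parquet(self) -> str:
        return self._path("cell_topic_probabilities.parquet")

    @property
    def parameters_json(self) -> str:
        return self._path("parameters.json")


filenames = MalletFilenamesSketch(output_prefix="out/model", n_topics=40)
print(filenames.region_topic_counts_parquet)
print(filenames.parameters_json)
```

Deriving the names in one place means the code that runs Mallet and the code that later reads its output cannot drift apart on file locations.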

Commits on Aug 27, 2024

  1. Use new functions for reading Mallet "--word-topic-counts-file" output.

    Use new functions for reading Mallet "--word-topic-counts-file" and
    "--output-doc-topics" output to parquet files and saving parameters
    JSON file.
    ghuls committed Aug 27, 2024
    3bbc709
  2. Remove deprecated code from LDAMallet class.

    ghuls committed Aug 27, 2024
    52de7e1
  3. Add/update some logging statements in LDAMallet class.

    ghuls committed Aug 27, 2024
    223981c
  4. Add "--verbose" parameter to `pycistopic topic_modeling` subcommands.

    ghuls committed Aug 27, 2024
    14b6330
  5. Update `pycistopic topic_modeling mallet` CLI code to use the new LDAMallet class.

    ghuls committed Aug 27, 2024
    bb8aa47
  6. Move "create_mallet_corpus" argument parsing code above "mallet" argument parsing code.

    Move "create_mallet_corpus" argument parsing code above "mallet"
    argument parsing code, so when both get moved behind a new "mallet"
    subcommand, the diff will be less noisy.
    ghuls committed Aug 27, 2024
    3daefd2
  7. Create `pycistopic topic_modeling mallet` subparser and move Mallet related subcommands under it.

    Create `pycistopic topic_modeling mallet` subparser and move Mallet related
    subcommands under it:
      - `pycistopic topic_modeling create_mallet_corpus` => `pycistopic topic_modeling mallet create_corpus`
      - `pycistopic topic_modeling mallet` => `pycistopic topic_modeling mallet run`
    ghuls committed Aug 27, 2024
    64ea33f
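
The resulting command layout (`pycistopic topic_modeling mallet create_corpus` and `pycistopic topic_modeling mallet run`) can be sketched with nested `argparse` subparsers. This is a minimal sketch of the nesting only, not the PR's parser: the `dest` names are assumptions, and only the `-i`, `-o` and `--verbose` options mentioned in the commit messages are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a three-level command tree: topic_modeling -> mallet -> subcommand."""
    parser = argparse.ArgumentParser(prog="pycistopic")
    subparsers = parser.add_subparsers(dest="command", required=True)

    topic_modeling = subparsers.add_parser("topic_modeling")
    topic_modeling_subparsers = topic_modeling.add_subparsers(
        dest="topic_modeling_command", required=True
    )

    # All Mallet-related subcommands live under "topic_modeling mallet".
    mallet = topic_modeling_subparsers.add_parser("mallet")
    mallet_subparsers = mallet.add_subparsers(dest="mallet_command", required=True)

    create_corpus = mallet_subparsers.add_parser("create_corpus")
    create_corpus.add_argument("-i", dest="input", required=True)
    create_corpus.add_argument("-o", dest="output", required=True)
    create_corpus.add_argument("--verbose", action="store_true")

    run = mallet_subparsers.add_parser("run")
    run.add_argument("--verbose", action="store_true")

    return parser


args = build_parser().parse_args(
    [
        "topic_modeling",
        "mallet",
        "create_corpus",
        "-i",
        "binary_accessibility.mtx",
        "-o",
        "corpus.mallet",
    ]
)
print(args.command, args.topic_modeling_command, args.mallet_command)
print(args.input, args.output, args.verbose)
```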

Commits on Aug 28, 2024

  1. Rework `run_cgs_model_mallet` to `calculate_model_evaluation_stats` but start from precalculated Mallet output.

    Rework `run_cgs_model_mallet` into `calculate_model_evaluation_stats`,
    which starts from precalculated Mallet output.
    Removing the Mallet run from this function makes it easier to parallelize
    Mallet topic modeling.
    ghuls committed Aug 28, 2024
    b6e7006
  2. Add `pycistopic topic_modeling mallet stats` subcommand.

    Add `pycistopic topic_modeling mallet stats` subcommand, using
    `calculate_model_evaluation_stats`.
    ghuls committed Aug 28, 2024
    801f334
  3. Rename `binary_matrix` to `binary_accessibility_matrix`.

    ghuls committed Aug 28, 2024
    8f2faef