
Polars 1xx #149

Open · wants to merge 43 commits into main
Conversation

@ghuls (Member) commented Jul 16, 2024

No description provided.

ghuls added 30 commits June 10, 2024 17:22
Format all python files with "ruff format".
Cleanup code of get_barcodes_passing_qc_for_sample.
Update Polars syntax to 1.0.0+ version and fix some type checking.
Skip empty lines when reading barcode file in "read_barcodes_file_to_polars_series".
Add a workaround for very rare QC cases where the KDE of the duplication
ratio cannot be calculated because all duplication ratio values are 0.0%
(the fragment count for each fragment was 1).
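The failure mode can be sketched with scipy's gaussian_kde, which raises a LinAlgError when every input value is identical (the covariance matrix is singular). This is an illustrative fallback, not pycisTopic's exact workaround; the function name is made up:

```python
import numpy as np
from scipy.stats import gaussian_kde


def duplication_ratio_density(values: np.ndarray) -> np.ndarray:
    """Evaluate a KDE over duplication ratios, with a fallback for the
    rare case where every value is identical (e.g. all 0.0%)."""
    try:
        # gaussian_kde fails with a singular covariance matrix when
        # all values are the same.
        return gaussian_kde(values)(values)
    except np.linalg.LinAlgError:
        # Fall back to a flat density so downstream QC/plotting still
        # gets one value per barcode.
        return np.full(values.shape, 1.0 / max(len(values), 1))
```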
Add mypy stub generation and mypy numpy typing plugin.

Later stubs can be generated with:

    stubgen -o stubs -p pycisTopic
Only keep one copy of each barcode when reading barcode file in "read_barcodes_file_to_polars_series".
Replace df.columns with df.collect_schema().names() so it also works on lazy Polars dataframes.
Update "write_csv" usage to Polars syntax 1.0.0+ by using "include_header" instead of "has_header".
Do not use the "by=" keyword in group_by, as Polars 1.0.0+ treats it as
renaming the grouped column to "by".
Also change "sort" and "partition_by" to avoid the "by" keyword, although
those cases still work as before.
Reformat TSS profile chained code by adding no-op "clone" calls.
Restrict version numbers for some dependencies.
Expose "engine" option in "pycistopic qc".
Support adding sample ID to cell barcodes when reading fragments or
cell barcode files by adding extra arguments to:
  - `read_fragments_to_polars_df`
  - `read_barcodes_file_to_polars_series`
Remove unused argument from `get_insert_size_distribution` docstring.
Use greater than or equal for threshold filters in "get_barcodes_passing_qc_for_sample".
Change `pycistopic qc` to `pycistopic qc run`.
Add `pycistopic qc filter` to be able to filter cell barcodes based on QC stats.
Fix "get_tss_profile" so it works both with "polars" and "polars-u64-idx".

Previously, None values were not set to 0.0 when run with "polars-u64-idx".
Cast some columns to pl.UInt32 so dataframes have the same schema with
both "polars" and "polars-u64-idx".
Add "create_fragment_matrix_from_fragments" to create a sparse fragment
matrix directly from a fragments file for consensus peaks of interest.
The new code uses much less memory as it never builds a full dense matrix.

    import pycisTopic.cistopic_class
    import pycisTopic.fragments

    # Create fragments matrix for fragment file for consensus regions and
    # for cell barcodes selected after pycistopic QC.
    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        fragments_bed_filename="fragments_GSM7822226_MM_566.tsv.gz",
        regions_bed_filename="consensus_regions.bed",
        barcodes_tsv_filename="cbs_after_pycistopic_qc.tsv",
        blacklist_bed_filename="hg38-blacklist.v2.bed",  # Or None
    )

    # Define sample ID (project name).
    sample_id = "Sample1"

    # Create cisTopic object from sparse fragment matrix.
    cistopic_obj = pycisTopic.cistopic_class.create_cistopic_object(
        fragment_matrix=counts_fragments_matrix,
        cell_names=cbs,
        region_names=region_ids,
        path_to_fragments={sample_id: "fragments.tsv.gz"},
        project=sample_id
    )
Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
counts file as input, to replace "load_word_topics" and "get_topics" in the future.

"load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
a lot of regions and cells as input.

Mallet state files are quite big (even several hundreds of GB) and take quite a bit
of time to be written to disk when Mallet is run with a lot of regions and cells.
Getting count values for each region-topic pair is quite memory intensive in the
current code too.
Mallet region topics counts files are much smaller and don't require much
post-processing or time to be written to disk.
Rename "create_regions_topics_frequency_matrix" to "create_regions_topics_count_matrix"
and keep the code for creating the frequency matrix in "get_topics" as both matrices
are needed.
Allow creating the Mallet serialized corpus file directly from a sparse matrix.
This roughly replaces LDAMallet.convert_corpus_to_mallet_corpus_file and
allows the new method to be used independently of the Mallet topic modeling
itself. As generating the Mallet serialized corpus only needs to be done once
before all topic modeling runs, being able to run it independently is a huge plus.
Expose creation of Mallet corpus file from pycistopic CLI interface:

    pycistopic topic_modeling create_mallet_corpus

Usage:

  Create binary accessibility matrix in Matrix Market format:

    import pycisTopic.fragments
    import scipy

    counts_fragments_matrix, cbs, region_ids = pycisTopic.fragments.create_fragment_matrix_from_fragments(
        "fragments.tsv.gz",
        "consensus_regions.bed",
        "cbs.tsv"
    )

    # Create binary matrix:
    binary_matrix = counts_fragments_matrix.copy()
    binary_matrix.data.fill(1)

    # Write binary matrix in Matrix Market format.
    scipy.io.mmwrite("binary_accessibility.mtx", binary_matrix)

  Create Mallet corpus file from binary accessibility matrix in Matrix Market format:

    $ pycistopic topic_modeling create_mallet_corpus -i "binary_accessibility.mtx" -o "corpus.mallet"
Add some basic sanity checking to "convert_binary_matrix_to_mallet_corpus_file".
Pass the correct alpha and eta to the loglikelihood function so it is the
same as the one that was actually used during topic modeling.
Remove Mallet text corpus after conversion to serialised corpus file.
Add LDAMalletFilenames class to automatically generate filenames
which will be used later in new LDAMallet class.
Use --word-topic-counts-file instead of --output-state file when running Mallet.
The former directly generates the counts we need, with minimal post-processing.
The state file could also take ages to write and read for datasets with a lot
of regions, and could increase the topic modeling runtime by 24 hours.

Changed `LDAMallet.train` to `LDAMallet.run_mallet_topic_modeling` and made
it a static method.

Reading of the output produced by Mallet will be added in further commits.
Add static methods to read cell-topic probabilities from Mallet "--output-doc-topics"
txt output to parquet file.
Add static methods to read cell-topic counts and probabilities from Mallet
"--word-topic-counts-file" txt output to parquet file.

- `LDAMallet.convert_region_topic_counts_txt_to_parquet()` replaces
   `LDAMallet.create_regions_topics_frequency_matrix()`, which recently
   replaced `LDAMallet.load_word_topics()` (read Mallet state file):

     See commit: b074950

     Add "create_regions_topics_frequency_matrix" function to replace "load_word_topics" in the future.

     Add "create_regions_topics_frequency_matrix" function, which uses Mallet region topics
     counts file as input, to replace "load_word_topics" and "get_topics" in the future.

     "load_word_topics" uses Mallet state file, which can be huge if Mallet is run with
     a lot of regions and cells as input.

     Mallet state files are quite big (even several hundreds of GB) and take quite a bit
     of time to be written to disk when Mallet is run with a lot of regions and cells.
     Getting count values for each region-topic pair is quite memory intensive in the
     current code too.
     Mallet region topics counts files are much smaller and don't require much
     post-processing or time to be written to disk.

- `LDAMallet.read_region_topic_counts_parquet_file_to_region_topic_probabilities()`
  replaces the functionality of `LDAMallet.get_topics()`, but reads the original
  data directly from disk.
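The counts-to-probabilities step can be sketched as a row-wise normalisation (an illustrative stand-in for reading the parquet file; the function name and normalisation axis are assumptions):

```python
import numpy as np


def counts_to_probabilities(region_topic_counts: np.ndarray) -> np.ndarray:
    # Normalise each topic's region counts to sum to 1, guarding
    # against empty topics (all-zero rows).
    totals = region_topic_counts.sum(axis=1, keepdims=True)
    return region_topic_counts / np.where(totals == 0, 1, totals)
```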
Add static method to read JSON parameter file written by `LDAMallet.run_mallet_topic_modeling`.
Use new functions for reading Mallet "--word-topic-counts-file" and
"--output-doc-topics" output to parquet files and saving parameters
JSON file.
Remove deprecated code from LDAMallet class.
Add/update some logging statements in LDAMallet class.
Add "--verbose" parameter to `pycistopic topic_modeling` subcommands.
Update `pycistopic topic_modeling mallet` CLI code to use the new LDAMallet class.
Move "create_mallet_corpus" argument parsing code above "mallet"
argument parsing code, so when both get moved behind a new "mallet"
subcommand, the diff will be less noisy.
Create `pycistopic topic_modeling mallet` subparser and move Mallet related
subcommands under it:
  - `pycistopic topic_modeling create_mallet_corpus` => `pycistopic topic_modeling mallet create_corpus`
  - `pycistopic topic_modeling mallet` => `pycistopic topic_modeling mallet run`
Rework `run_cgs_model_mallet` into `calculate_model_evaluation_stats`, which
starts from precalculated Mallet output.
Removing the Mallet run from this function allows easier parallelization of
Mallet topic modeling.
Add `pycistopic topic_modeling mallet stats` subcommand, using
`calculate_model_evaluation_stats`.
Rename `binary_matrix` to `binary_accessibility_matrix`.