hicberg package

Submodules

hicberg.align module

hicberg.align.hic_align(index: str, fq_for: str, fq_rev: str, sensitivity: str = 'very-sensitive', max_alignment: int = None, cpus: int = 1, output_dir: str = None, verbose: bool = False) None[source]

Alignment of reads from HiC experiments along an indexed genome.

Parameters:
  • index (str) – Path to the index of the genome along which reads are going to be aligned (path to .bt2l files). Default to None, index files are searched to sample_name/data/index/sample_name.

  • fq_for (str) – Path to .fasta containing set of reads to align (forward mate).

  • fq_rev (str) – Path to .fasta containing set of reads to align (forward mate).

  • sensitivity (str, optional) – Sensitivity of the alignment., by default ‘very_sensitive’

  • max_alignment (int, optional) – Maximum number of alignments to be returned, by default None

  • cpus (int, optional) – Number of threads allocated for the alignment, by default 1

  • output_dir (str, optional) – Path where the alignment files (.sam) should be stored, by default None

  • verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_build_index(genome: str, output_dir: str = None, cpus: int = 1, verbose: bool = False) None[source]

Building of bowtie2 index (.bt2l files) for read alignment.

Parameters:
  • genome (str) – Path to the genome file along which reads are going to be aligned.

  • cpus (int, optional) – Number of threads allocated for the alignment, by default 1

  • output_dir (str, optional) – Path where the Bowtie2 index files should be stored, by default None

  • verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_index(bam_for: str = '1.sorted.bam', bam_rev: str = '2.sorted.bam', cpus: int = 1, output_dir: str = None, verbose: bool = False) None[source]

Index a coordinate-sorted BGZIP-compressed SAM, BAM or CRAM file for fast random access.

Parameters:
  • bam_for (str, optional) – Forward alignment file to be indexed, by default “1.sorted.bam”

  • bam_rev (str, optional) – Reverse alignment file to be indexed,, by default “2.sorted.bam”

  • cpus (int, optional) – Number of threads allocated for the alignment, by default 1

  • output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None

  • verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_sort(bam_for: str = '1.bam', bam_rev: str = '2.bam', cpus: int = 1, output_dir: str = None, verbose: bool = False) None[source]

Sort .bam alignment files by read_name (using samtools).

Parameters:
  • bam_for (str, optional) – Forward alignment file to be sorted, by default “1.bam”

  • bam_rev (str, optional) – Reverse alignment file to be sorted, by default “2.bam”

  • cpus (int, optional) – Number of threads allocated for the alignment, by default 1

  • output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None

  • verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_view(sam_for: str = '1.sam', sam_rev: str = '2.sam', cpus: int = 1, output_dir: str = None, verbose: bool = False) None[source]

Conversion of .sam alignment files to .bam alignment format (using samtools).

Parameters:
  • sam_for (str, optional) – Path to forward .sam alignment file, by default “1.sam”

  • sam_rev (str, optional) – Path to reverse .sam alignment file, by default “2.sam”

  • cpus (int, optional) – Number of threads allocated for the alignment, by default 1

  • output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None

  • verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.benchmark module

hicberg.benchmark.benchmark(output_dir: str = None, chromosome: str = '', position: int = 0, trans_chromosome: str = None, trans_position: int = None, strides: list[int] = [], mode: str = 'full', auto: int = None, kernel_size: int = 11, deviation: float = 0.5, bins: int = None, circular: str = '', genome: str = None, pattern: str = None, threshold: float = 0.0, jitter: int = 0, trend: bool = True, top: int = 100, force: bool = False, iterations: int = 3, cpus: int = 8)[source]

Performs a benchmark of the HiC-Berg method by simulating the deletion of a genomic region and evaluating the method’s ability to recover the original signal.

The benchmark simulates the deletion of a genomic region by removing reads that map to that region from a Hi-C dataset. It then applies the HiC-Berg method to the depleted dataset to try to recover the original signal. The benchmark evaluates the method’s performance by comparing the rescued contact map to the original contact map.

Parameters:
  • output_dir (str, optional) – Path to the output directory.

  • chromosome (str, optional) – Chromosome to perform the benchmark on.

  • position (int, optional) – Position of the deletion on the chromosome.

  • trans_chromosome (str, optional) – Chromosome to consider for trans interactions.

  • trans_position (int, optional) – Position of the deletion on the trans chromosome.

  • strides (list[int], optional) – List of strides to use for the deletion.

  • mode (str, optional) – Mode of the HiC-Berg method to use. Can be “full” or “density”.

  • auto (int, optional) – Automatically determine the size of the deletion based on the given number of bins.

  • kernel_size (int, optional) – Size of the kernel to use for the density estimation.

  • deviation (float, optional) – Standard deviation of the kernel to use for the density estimation.

  • bins (int, optional) – Number of bins to use for the deletion.

  • circular (str, optional) – Whether the chromosome is circular.

  • genome (str, optional) – Genome assembly to use.

  • pattern (str, optional) – Pattern to use for Chromosight pre-call.

  • threshold (float, optional) – Threshold to use for Chromosight pre-call.

  • jitter (int, optional) – Jitter to use for Chromosight pre-call.

  • trend (bool, optional) – Whether to use trend for Chromosight pre-call.

  • top (int, optional) – Top patterns to use for Chromosight pre-call.

  • force (bool, optional) – Whether to force the benchmark to run even if the output directory already exists.

  • iterations (int, optional) – Number of iterations to run the HiC-Berg method.

  • cpus (int, optional) – Number of CPUs to use for parallel processing.

Raises:
  • ValueError – If the output directory does not exist.

  • ValueError – If the restriction map file does not exist.

hicberg.eval module

hicberg.eval.arange_multi(starts, stops=None, lengths=None)[source]

Create concatenated ranges of integers for multiple start/length. :param starts: Starts for each range :type starts: numpy.ndarray :param stops: Stops for each range :type stops: numpy.ndarray :param lengths: Lengths for each range. Either stops or lengths must be provided. :type lengths: numpy.ndarray

Returns:

concat_ranges – Concatenated ranges.

Return type:

numpy.ndarray

Notes

See the following illustrative example: starts = np.array([1, 3, 4, 6]) stops = np.array([1, 5, 7, 6]) print arange_multi(starts, lengths) >>> [3 4 4 5 6] From: https://codereview.stackexchange.com/questions/83018/vectorized-numpy-version-of-arange-with-multiple-start-stop

hicberg.eval.check_emptiness(intervals: dict[str, list[int, int]], matrix: Cooler = None) bool[source]

Check if intervals are empty in a given matrix.

Parameters:
  • intervals (dict[str, list[(int, int)]) – Lists of intervals (sets) where the value is not in between as element 0 and where the value is in between as element 1. One list per chromosome in a dictionary.

  • matrix (cooler.Cooler, optional) – Cooler Hi-C matrix to be checked for emptiness, by default None

Returns:

Returns True if one of the intervals are empty in the matrix, False otherwise.

Return type:

bool

hicberg.eval.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.

Parameters:
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice

which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')
hicberg.eval.chromosight_cmd_generator(file: str = None, pattern: str = 'loops', untrend: bool = True, mode: bool = False, output_dir: str = None) str[source]

Generate chromosight command line to run pattern detection.

Parameters:
  • file (str, optional) – Path to Hi-C balanced contact matrix in .cool format , by default None

  • pattern (str, optional) – Pattern to detect, by default “loops”

  • untrend (bool, optional) – Set if map has to be detrended, by default True

  • mode (bool, optional) – Set either the detection has to be performed before or after map reconstruction, by default False

  • output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Chromosight command line to run to run pattern detection

Return type:

str

hicberg.eval.draw_intervals(nb_intervals: int = 1, bins: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy') dict[str, list[tuple[int, int]]][source]

Draw intervals from a given chromosome sizes dictionary.

Parameters:
  • nb_intervals (int, optional) – Number of intervals to draw, by default 1

  • bins (int, optional) – Size of the desired bin, by default 2000., by default 2000

  • chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

Returns:

Dictionary containing the intervals as {chromosome : [(start, end), …]}.

Return type:

dict[str, list[tuple[int, int]]]

hicberg.eval.generate_dict_coordinates(matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) dict[str, list[int, int]][source]

Generates a dictionary of genomic coordinates for selecting regions in a Hi-C contact map.

This function defines genomic intervals based on provided parameters such as chromosome, position, strides, and bin size. It can generate intervals for both intra-chromosomal and inter-chromosomal interactions. The intervals can be used to select specific regions in a Hi-C contact map for further analysis, such as simulating deletions or duplications.

Parameters:
  • matrix_file (str, optional) – Path to the cooler file containing the Hi-C contact map, by default “unrescued_map.cool”

  • position (int, optional) – Genomic position on the chromosome, by default 0

  • chromosome (str | list[str], optional) – Chromosome or list of chromosomes, by default “”

  • bin_size (int, optional) – Size of the bins in the Hi-C contact map, by default 2000

  • chrom_sizes_dict (str, optional) – Path to the chromosome sizes dictionary, by default “chromosome_sizes.npy”

  • strides (list[int], optional) – List of strides to use for generating intervals, by default [0]

  • trans_chromosome (str, optional) – Chromosome for trans-interactions, by default None

  • output_dir (str, optional) – Path to the output directory, by default None

  • trans_position (list[int], optional) – List of positions on the trans-chromosome, by default None

  • nb_bins (int, optional) – Number of bins to include in each interval, by default 1

  • random (bool, optional) – Whether to generate random intervals, by default False

  • auto (int, optional) – Number of random intervals to generate, by default None

Returns:

A dictionary where keys are chromosomes and values are lists of (start, end) genomic intervals.

Return type:

dict[str, list[(int, int)]]

Raises:

ValueError – If the output path, chromosome sizes file, or matrix file does not exist.

hicberg.eval.get_FN_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) DataFrame[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns

Parameters:
  • df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.

  • df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.

  • chromosome ([str]) – chromosome to select position within.

  • bin_size ([int]) – bin size used to construct the HiC map.

  • jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.

  • threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

dataframe – Table containing components before and after pattern recall (True positives)

Return type:

[pandas DataFrame]

hicberg.eval.get_FP_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None)[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns

Parameters:
  • df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.

  • df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.

  • chromosome ([str]) – chromosome to select position within.

  • bin_size ([int]) – bin size used to construct the HiC map.

  • jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.

  • threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

dataframe – Table containing components before and after pattern recall (True positives)

Return type:

[pandas DataFrame]

hicberg.eval.get_TP_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) DataFrame[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns

Parameters:
  • df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.

  • df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.

  • chromosome ([str]) – chromosome to select position within.

  • bin_size ([int]) – bin size used to construct the HiC map.

  • jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.

  • threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

dataframe – Table containing components before and after pattern recall (True positives)

Return type:

[pandas DataFrame]

hicberg.eval.get_bin_indexes(matrix: str = None, dictionary: dict = None)[source]

Return the bins indexes (in the cooler matrix) corresponding to the genomic coordinates in the dictionary.

Parameters:
  • matrix (str, optional) – Path to a cooler matrix file, by default None

  • dictionary (dict, optional) – Dictionary of genomic coordinates where reads are going to be selected for duplication, by default None

hicberg.eval.get_boundaries(position: int = None, bins: int = 2000, chromosome: str | list[str] = None, chrom_sizes_dict: str = 'chromosome_sizes.npy') tuple[int, int][source]

Return boundaries surrounding position considering regular binning of bins.

Parameters:
  • position (int, optional) – Position to get surrounding boundaries from, by default None

  • bins (int, optional) – Size of the desired bin, by default 2000., by default 2000

  • chromosome (str or list[str], optional) – Chromosomes associated to positions to get surrounding boundaries from, by default None

  • chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

Returns:

Tuple containing the boundaries of the position : (lower_bound, upper_bound).

Return type:

tuple[int, int]

hicberg.eval.get_chromosomes_intervals(bins: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', chromosome: str = '') list[int, int][source]

Get all possible intervals from a given chromosome considering a chromosome sizes dictionary.

Parameters:
  • bins (int, optional) – Size of the desired bin, by default 2000., by default 2000

  • chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

  • chromosome (str, optional) – Chromosome to get intervals from, by default “”

Returns:

List containing the intervals as [(start, end), …]

Return type:

list[(int, int)]

hicberg.eval.get_f1_score(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) float[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return f1 score i.e. 2 * ((precision * recall) / (precision + recall)).

Parameters:
  • df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.

  • df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.

  • chromosome ([str]) – chromosome to select position within.

  • bin_size ([int]) – bin size used to construct the HiC map.

  • jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.

  • threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

f1_score – 2 * ((precision * recall) / (precision + recall))

Return type:

float

hicberg.eval.get_interval_index(chromosome: str = '', value: int = None, intervals_dict: dict[str, list[int, int]] = None, chrom_sizes_dict: dict[str, int] = None) dict[str, list[int, int]][source]
Parameters:
  • chromosome ([str]) – chromosome linked to value and intervals to search in.

  • value ([int]) – Value to search in intervals.

  • intervals_dict ([dict]) – Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

  • chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

Returns:

Lists of intervals (sets) where the value is not in between as element 0 and where the value is in between as element 1. One list per chromosome in a dictionary. If the value in not in any interval, return None.

Return type:

[dict]

hicberg.eval.get_intervals_proportions(chrom_sizes_dict: str = 'chromosome_sizes.npy', nb_intervals: int = 1) dict[str, float][source]

Extract a dictionary containing the number of intervals to draw considering the size of each chromosome.

Parameters:
  • chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

  • nb_intervals (int, optional) – Number of intervals to draw, by default 1

Returns:

Dictionary containing proportion by intervals as {chromosome : proportion}.

Return type:

dict[str, int]

hicberg.eval.get_precision(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) float[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return precision score corresponding to the number of true positives over the number of true positives and false positives.

Parameters:
  • df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.

  • df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.

  • chromosome ([str]) – chromosome to select position within.

  • bin_size ([int]) – bin size used to construct the HiC map.

  • jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.

  • threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

precision – number of true positives over the number of true positives and false positives

Return type:

float

hicberg.eval.get_recall(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) float[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return recall score corresponding to the number of true positives over the number of true positives and false negatives.

Parameters:
  • df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.

  • df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.

  • chromosome ([str]) – chromosome to select position within.

  • bin_size ([int]) – bin size used to construct the HiC map.

  • jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.

  • threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

recall – number of true positives over the number of true positives and false negatives

Return type:

float

hicberg.eval.get_top_pattern(file: str = None, top: int = 10, threshold: float = 0.0, chromosome: str = None) DataFrame[source]

Get top patterns from a dataframe

Parameters:
  • df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None

  • top (int, optional) – Percentage of top patterns to get, by default 10

  • threshold (float, optional) – Pattern Pearson score to consider to select pattern, by default 0.0

  • chromosome (str, optional) – Chromosome to consider, by default None

Returns:

Dataframe containing top percentage patterns.

Return type:

pd.DataFrame

hicberg.eval.hicberg_benchmark_cmd_generator(file: str = None, top: int = 10, threshold: float = 0.0, chromosome: str = None, mode: str = 'full', genome: str = None, bins: int = 0, output: str = None) str[source]

Get top patterns from a dataframe

Parameters:
  • df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None

  • top (int, optional) – Percentage of top patterns to get, by default 10

  • threshold (float, optional) – Pattern Pearson score to consider to select pattern, by default 0.0

  • chromosome (str, optional) – Chromosome to consider, by default None

  • mode (str, optional) – Mode to consider for hicberg benchmark, by default “full”

  • genome (str, optional) – Path to the genome to consider for hicberg benchmark, by default None

  • bins (int, optional) – Number of bins to consider for hicberg benchmark, by default 0

  • output (str, optional) – Path to the output folder to consider for hicberg benchmark, by default None

Returns:

HiC-BERG command line to run to evaluate reconstruction after pattern discarding

Return type:

str

hicberg.eval.intersect2D(a, b)[source]

Find row intersection between 2D numpy arrays, a and b. Returns another numpy array with shared rows

hicberg.eval.overlap_intervals(starts1, ends1, starts2, ends2, closed=False, sort=False)[source]

Take two sets of intervals and return the indices of pairs of overlapping intervals.

Parameters:
  • starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • closed (bool) – If True, then treat intervals as closed and report single-point overlaps.

Returns:

overlap_ids – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.

Return type:

numpy.ndarray

hicberg.eval.overlap_intervals_outer(starts1, ends1, starts2, ends2, closed=False)[source]

Take two sets of intervals and return the indices of pairs of overlapping intervals, as well as the indices of the intervals that do not overlap any other interval.

Parameters:
  • starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • closed (bool) – If True, then treat intervals as closed and report single-point overlaps.

Returns:

  • overlap_ids (numpy.ndarray) – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.

  • no_overlap_ids1, no_overlap_ids2 (numpy.ndarray) – Two 1D arrays containing the indices of intervals in sets 1 and 2 respectively that do not overlap with any interval in the other set.

hicberg.eval.select_reads(bam_for: str = 'group1.1.bam', bam_rev: str = 'group1.2.bam', matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) dict[str, list[int, int]][source]

Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.

Parameters:
  • bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”

  • bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”

  • matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”

  • position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0

  • chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”

  • bin_size (int, optional) – Resolution of the given matrix, by default 2000

  • chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”

  • strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]

  • trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None

  • trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None

  • nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1

  • random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False

  • auto (int, optional) – Set the number of intervals to get while picking at random, by default None

  • output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

Return type:

dict[str, list[(int, int)]]

hicberg.eval.select_reads_multithreads(bam_couple: tuple[str, str], interval_dictionary: dict[str, (<class 'int'>, <class 'int'>)], chrom_sizes_dict: str = 'chromosome_sizes.npy', output_dir: str = None) None[source]

Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.

Parameters:
  • bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”

  • bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”

  • matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”

  • position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0

  • chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”

  • bin_size (int, optional) – Resolution of the given matrix, by default 2000

  • chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”

  • strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]

  • trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None

  • trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None

  • nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1

  • random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False

  • auto (int, optional) – Set the number of intervals to get while picking at random, by default None

  • output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

Return type:

dict[str, list[(int, int)]]

hicberg.eval.select_reads_multithreads_bckp(bam_couple: tuple[str, str], matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) dict[str, list[int, int]][source]

Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.

Parameters:
  • bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”

  • bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”

  • matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”

  • position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0

  • chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”

  • bin_size (int, optional) – Resolution of the given matrix, by default 2000

  • chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”

  • strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]

  • trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None

  • trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None

  • nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1

  • random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False

  • auto (int, optional) – Set the number of intervals to get while picking at random, by default None

  • output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

Return type:

dict[str, list[(int, int)]]

hicberg.io module

hicberg.io.build_matrix(bins: str = 'fragments_fixed_sizes.txt', pairs: str = 'group1.pairs', mode: bool = False, balance: bool = False, cpus: int = 8, output_dir: str = None) None[source]

Take table of bins and .pairs file and build a matrix in .cool format.

Parameters:
  • bins (str) – Path to the bin file.

  • pairs (str, optional) – Path to pairs file, by default “group1.pairs”

  • mode (bool, optional) – Choose wether the mode is rescued or un-rescued to construct associated .cool file, by default False

  • balance (bool, optional) – Set wether or not to balance the matrix, by default False

  • output_dir (str, optional) – Path to the folder where to save the cooler matrix file, by default None, by default None

hicberg.io.build_pairs(bam_for: str = 'group1.1.bam', bam_rev: str = 'group1.2.bam', bam_for_rescued: str = 'group2.1.rescued.bam', bam_rev_rescued: str = 'group2.2.rescued.bam', mode: bool = False, output_dir: str = None) None[source]

Build pairs of reads from the aligned reads.

Parameters:
  • bam_for (str, optional) – Path to forward .bam file for the construction of the .pairs equivalent file (non rescued)., by default “group1.1.bam”

  • bam_rev (str, optional) – Path to reverse .bam file for the construction of the .pairs equivalent file (non rescued)., by default None, by default “group1.2.bam”

  • bam_for_rescued (str, optional) – Path to forward .bam file for the construction of the .pairs equivalent file (rescued)., by default “group2.1.rescued.bam”

  • bam_rev_rescued (str, optional) – Path to reverse .bam file for the construction of the .pairs equivalent file (rescued)., by default “group2.2.rescued.bam”

  • mode (bool, optional) – Choose wether the mode is rescued or unrescued to construct associated .pairs file, by default False

  • output_dir (str, optional) – Path where the alignment files (.sam) should be stored, by default None

hicberg.io.create_folder(sample_name: str = None, output_dir: str = None, force: bool = False) None[source]

Creates folder architecture to store results and intermediate files for the full HiC-BERG pipeline.

Parameters:
  • sample_name (str) – Name of the folder to be created.

  • force (bool) – Set if existing folder has to be deleted before folder creation.

  • output_dir (str) – Path where the folder will be created.

Returns:

Path of the folder created

Return type:

[str]

hicberg.io.load_cooler(matrix: str = None) Cooler[source]

Load cooler matrix.

Parameters:

matrix (str, optional) – Path to a cooler matrix, by default None

Returns:

Cooler matrix object

Return type:

cooler.Cooler

hicberg.io.load_dictionary(dictionary: str = None) dict[source]

Load dictionary save into numpy format (.npy).

Parameters:

dictionary (str, optional) – Path to a the dictionary saved in numpy (.npy) format, by default None

Returns:

Python native dictionary

Return type:

dict

hicberg.io.merge_predictions(output_dir: str = None, clean: bool = True, stage='prediction', cpus: int = 1) None[source]

Merge predictions of all chunks of ambiguous reads predictions.

Parameters:
  • output_dir (str, optional) – Path to the folder where to save the fused alignment file, by default None

  • clean (bool, optional) – Set weither or not to remove temporary chunks, by default True

  • stage (str, optional) – Set the stage of the merging, by default “prediction”. Can be “prediction” or “classification”

  • cpus (int, optional) – Number of cpus to use for the merging, by default 1

hicberg.io.tidy_folder(output_dir: str = None) None[source]

Tidy all the files in the output folder.

Parameters:

output_dir (str, optional) – Path to the folder where to save the fused alignment file, by default None

hicberg.omics module

hicberg.omics.bedgraph_to_bigwig(bedgraph_file: str = 'coverage.bedgraph', chromosome_sizes: str = 'chromosome_sizes.txt', output_dir: str = None) None[source]

Convert bedgraph to bigwig format.

Parameters:
  • bedgraph_file (str, optional) – Path to coverage (.bedgraph), by default “coverage.bedgraph”

  • chromosome_sizes (str, optional) – Path to chromosome sizes file (chrom_id, size), by default “chromosome_sizes.txt”

  • output_dir (str, optional) – [description], by default None

Raises:
  • IOError – [description]

  • IOError – [description]

  • IOError – [description]

hicberg.omics.format_chrom_sizes(chromosome_sizes: str = 'chromosome_sizes.npy', output_dir: str = None) None[source]

Format chromosome sizes to bed and txt format. - bed format : chrom, start, end - txt format : chrom, size :param chrom_sizes: Path to chromosome sizes file (.npy), by default “chromosome_sizes.npy” :type chrom_sizes: str, optional :param output_dir: Path where the formatted chromosome sizes will be saved, by default None :type output_dir: str, optional

hicberg.omics.get_bed_coverage(chromosome_sizes: str = 'chromosome_sizes.bed', pairs_file: str = 'preprocessed_pairs.pairs', output_dir: str = None) None[source]

Get bed coverage from pairs file (using bedtools).

Parameters:
  • chromosome_sizes (str, optional) – Path to chromsomes sizes files (.bed format), by default “chromosome_sizes.bed”

  • pairs_file (str, optional) – Path to processed pairs files (columns : chrom, start, end, count), by default “preprocessed_pairs.pairs”

  • output_dir (str, optional) – Path where the coverage (.bed) will be saved, by default None

hicberg.omics.get_bedgraph(bed_coverage: str = 'coverage.bed', output_dir: str = None) None[source]

Convert bed coverage to bedgraph format. Format is : chrom, start, end, count. Start and end are different by 1bp (end = start + 1).

Parameters:
  • bed_coverage (str, optional) – Path to coverage (.bed), by default “coverage.bed”

  • output_dir (str, optional) – Path where the coverage (.bedgraph) will be saved, by default None

hicberg.omics.preprocess_pairs(pairs_file: str = 'all_group.pairs', threshold: int = 1000, output_dir: str = None) None[source]

Preprocess pairs file to remove pairs that are not in the same chromosome or are greater than a threshold. Retain columns are : chromosome, start, end, count.

Parameters:
  • pairs_file (str, optional) – Path to the pairs file, by default “all_group.pairs”

  • threshold (int, optional) – Threshold distance beyond which pairs will not be kept, by default 1000

  • output_dir (str, optional) – Path where the formatted pairs will be saved, by default None

hicberg.pipeline module

hicberg.pipeline.check_tool(name: str) bool[source]

Check whether name is on PATH and marked as executable.

Parameters:

name ([str]) – Depenedency to check.

Returns:

True if the tool is available, False otherwise.

Return type:

[bool]

hicberg.pipeline.pipeline(name: str = 'sample', start_stage: str = 'fastq', exit_stage: str = 'None', genome: str = None, index=None, fq_for: str = None, fq_rev: str = None, sensitivity: str = 'very-sensitive', max_alignment: int = None, mapq: int = 35, enzyme: list[str] = ['DpnII', 'HinfI'], circular: str = '', rate: float = 1.0, distance: int = 1000, bins: int = 2000, nb_chunks: int = 1, mode: str = 'full', kernel_size: int = 11, deviation: float = 0.5, verbose: bool = False, cpus: int = 1, output_dir: str = None, force: bool = False, blacklist: str = None) None[source]

hicberg.plot module

hicberg.plot.plot_benchmark(original_matrix: str = None, depleted_matrix: str = None, rescued_matrix: str = None, chromosomes: list[str] = None, output_dir: str = None) None[source]

Plot benchmark results (original, depleted and rescued matrices with associated log ratios). One plot per chromosome.

Parameters:
  • original_matrix (str, optional) – Path to the original matrix, by default None

  • rescued_matrix (str, optional) – Path to the rescued matrix (re-attributed reads), by default None

  • chromosomes (list[str], optional) – List of chromosomes to plot, by default None

  • output_dir (str, optional) – Path to where to save plots, by default None

hicberg.plot.plot_couple_repartition(forward_bam_file: str = 'group2.1.rescued.bam', reverse_bam_file: str = 'group2.2.rescued.bam', output_dir: str = None) None[source]

Plot read couples sizes distribution

Parameters:
  • forward_bam_file (str, optional) – Path to forward .bam alignment file, by default 1.sorted.bam

  • reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default 2.sorted.bam Minimal read quality under which a Hi-C read pair will not be kept, by default 30

  • output_dir (str, optional) – Path to the folder where to save the plot, by default None

hicberg.plot.plot_coverages(bins: int = 2000, output_dir: str = None) None[source]

Plot coverages of chromosomes

Parameters:
  • bins (int, optional) – Size of the desired bin., by default 2000

  • output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.plot.plot_d1d2(output_dir: str = None) None[source]

Plot d1d2 law

Parameters:

output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.plot.plot_density(output_dir: str = None) None[source]

Plot density maps

Parameters:

output_dir (str, optional) – Path to the folder where to save the plots (one plot per chromosome couple), by default None.

hicberg.plot.plot_laws(output_dir: str = None) None[source]

Plot P(s) patterns laws

Parameters:

output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.plot.plot_matrix(unrescued_matrix: str = 'unrescued_map.cool', rescued_matrix: str = 'rescued_map.cool', restriction_map: str = 'restriction_map.npy', genome: str = '', vmin: float = 0.0, vmax: float = 3.5, bins: int = 2000, output_dir: str = None) None[source]

Plot matrix with additional trackss

Parameters:
  • unrescued_matrix (str, optional) – Path to the unrescued map file, by default unrescued_map.cool

  • rescued_matrix (str, optional) – Path to rescued map file, by default rescued_map.cool

  • restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default dist.frag.npy

  • genome (str, optional) – Path to the genome to digest, by default None, by default None

  • vmin (float, optional) – Inferior limit for the colorscale, by default 0.0

  • vmax (float, optional) – Superior limit for the colorscale, by default 3.5

  • bins (int, optional) – Size of the desired bin., by default 2000

  • output_dir (str, optional) – Path to the folder where to save the plot, by default None

hicberg.plot.plot_pattern_reconstruction(table: DataFrame = None, original_cool: str = None, rescued_cool: str = None, chromosome: str = None, threshold: float = 0.0, case: str = '', output_dir: str = None) None[source]

Create a plot of pattern reconstruction quality.

Parameters:
  • table (pd.DataFrame, optional) – Table containing either true positives, false positives or false negatives patterns, by default None

  • original_cool (str, optional) – Path to the original matrix in .cool format, by default None

  • rescued_cool (str, optional) – Path to the rescued matrix in .cool format, by default None

  • chromosome (str, optional) – Selected chromosome, by default None

  • threshold (float, optional) – Threshold for pattern score significance, by default 0.0

  • case (str, optional) – Mode to consider, either true positives, false positives or false negatives, by default “”

  • output_dir (str, optional) – Path to save plots, by default None

hicberg.plot.plot_trans_ps(output_dir: str = None) None[source]

Plot P(s) patterns laws

Parameters:

output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.statistics module

hicberg.statistics.attribute_xs(xs: ndarray[int], distance: int) int[source]

Attibute genomic distance to the corresponding log bin of xs.

Parameters:
  • xs (np.ndarray[int]) – Array containing the log bins.

  • distance (int) – Genomic distance in bp.

Returns:

Index of the corresponding bins where distance has to be attributed.

Return type:

int

hicberg.statistics.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.

Parameters:
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice

which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')
hicberg.statistics.compute_density(cooler_file: str = 'unrescued_map.cool', threads: int = 2, kernel_size: int = 11, deviation: float = 0.5, output_dir: str = None) None[source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.

Parameters:
  • cooler_file (str, optional) – [description], by default None

  • threads (int, optional) – [description], by default 2

  • output_dir (str, optional) – [description], by default None

hicberg.statistics.compute_propensity(read_forward: AlignedSegment, read_reverse: AlignedSegment, restriction_map: dict = None, xs: dict = None, weirds: dict = None, uncuts: dict = None, loops: dict = None, circular: str = '', trans_ps: dict = None, coverage: dict = None, bins: int = 2000, d1d2: dict = None, density_map: dict = None, mode: str = 'full') float[source]

Compute propensity for read pair to be selected among all plausible pairs related to multi-mapping reads.

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.

  • read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.

  • restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None

  • xs (dict) – Dictionary containing log binning values for each chromosome.

  • weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.

  • uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.

  • loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.

  • circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.

  • trans_ps (dict) – Dictionary of trans-chromosomal P(s)

  • coverage (dict) – Dictionary containing the coverage of each chromosome.

  • bins (int) – Size of the desired bin, by default 2000

  • d1d2 (np.array, optional) – Distribution of d1d2 values, by default None

  • density (dict) – Dictionary of contact density by chromosome couple.

  • mode (str, optional) – Mode to use to compute propensity among, by default “full”

Returns:

Propensity to use for read couple drawing

Return type:

float

hicberg.statistics.draw_read_couple(propensities: array) int[source]

Draw an index respecting distribution of propensities. This function is used to draw a couple of reads considering the propensity of each couple.

Parameters:

propensities (np.array) – Array containing all the propensities of each couple of reads.

Returns:

Index of the couple of reads drawn.

Return type:

int

hicberg.statistics.generate_coverages(genome: str = None, bins: int = 2000, forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', output_dir: str = None) None[source]

Take a genome and both forward and reverse bam files for unambiguous group and return a dictionary containing the coverage in terms of reads overs chromosomes.

Parameters:
  • genome (str, optional) – Path to the genome file to get coverage on, by default None

  • bins (int, optional) – Size of the desired bin., by default 2000

  • forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam

  • reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.2.bam

  • output_dir (str, optional) – Path to the folder where to save the classified alignment files, by default None, by default None

hicberg.statistics.generate_d1d2(forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', restriction_map: str = 'restriction_map.npy', output_dir: str = None) None[source]

Compute d1d2 distance laws with the given alignments and restriction map.

Parameters:
  • forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”

  • reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.1.bam, by default “group1.2.bam”

  • restriction_map (str, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default “dist.frag.npy”

  • output_dir (str, optional) – Path to the folder where to save the dictionary, by default None, by default None

hicberg.statistics.generate_density_map(matrix: str = 'unrescued_map.cool', size: int = 5, sigma: int = 2, n_mads: int = 2, nan_threshold: bool = False, output_dir: str = None) dict[str, ndarray[float]][source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.

Parameters:
  • matrix (str, optional) – Path to a cooler matrix, by default None

  • rounds (int, optional) – Number of times the matrix has to be shaken, by default 1

  • magnitude (float, optional) – Blending ratio between the native matrix and the shaken one, by default 1.0

  • output_dir (str, optional) – Path to the folder where to save the density map, by default None

Returns:

Density maps as a dictionary where keys are chromosomes names couples as tuples and values are density maps.

Return type:

dict[str, np.ndarray[float]]

hicberg.statistics.generate_density_map_backup(matrix: str = 'unrescued_map.cool', rounds: int = 1, magnitude: float = 1.0, output_dir: str = None) dict[str, ndarray[float]][source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.

Parameters:
  • matrix (str, optional) – Path to a cooler matrix, by default None

  • rounds (int, optional) – Number of times the matrix has to be shaken, by default 1

  • magnitude (float, optional) – Blending ratio between the native matrix and the shaken one, by default 1.0

  • output_dir (str, optional) – Path to the folder where to save the density map, by default None

Returns:

Density maps as a dictionary where keys are chromosomes names couples as tuples and values are density maps.

Return type:

dict[str, np.ndarray[float]]

hicberg.statistics.generate_trans_ps(matrix: str = 'unrescued_map.cool', chrom_sizes: str = 'chromosome_sizes.npy', output_dir: str = None) None[source]

Generate pseudo-P(s) while considering reads originating from different chromosomes. Pseudo-P(s) are computed as the number of interactions between two chromosomes divided by the product of the length of these two chromosomes.

Parameters:
  • matrix (str, optional) – Path to a cooler matrix, by default “unrescued_map.cool”

  • chrom_sizes (str, optional) – Path to the chromosome sizes dictionary, by default “chromosome_sizes.npy”

  • output_dir (str, optional) – Path to the folder where to save the dictionary, by default None.

hicberg.statistics.generate_xs(chromosome_size: int, base: float = 1.1) ndarray[int][source]

Generate xs array for computing P(s). Return xs array which is log space.

Parameters:
  • chromosome_size (int) – Size of the chromosome to be binned in bp.

  • base (float, optional) – Base of the log space., by default 1.1

Returns:

Array of log bins related to the chromosome.

Return type:

np.ndarray[int]

hicberg.statistics.get_d1d2(read_forward: AlignedSegment, read_reverse: AlignedSegment, restriction_map: dict = None, d1d2: array = None) int[source]

Get the d1d2 value of a pair of reads.

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.

  • read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.

  • restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None

  • d1d2 (np.array, optional) – Distribution of d1d2 values, by default None

Returns:

d1d2 value

Return type:

int

hicberg.statistics.get_density(read_forward: ~pysam.libcalignedsegment.AlignedSegment, read_reverse: ~pysam.libcalignedsegment.AlignedSegment, density_map: dict[slice((<class 'str'>, <class 'str'>), <built-in function array>, None)], bin_size: int = 2000) float[source]

Get density from density map dictionary.

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to get density from.

  • read_reverse (pysam.AlignedSegment) – Reverse read to get density from.

  • density_map (dict) – Dictionary containing density maps for each chromosome couple.

  • bin_size (int, optional) – Resolution of the matrix on which density map has been computed, by default 2000

Returns:

Density corresponding to the pair of reads.

Return type:

float

hicberg.statistics.get_dist_frags(genome: str = None, restriction_map: dict = None, circular: str = '', rate: float = 1.0, output_dir: str = None) None[source]

Get the distribution of fragments’ distance from a genome distribution .

Parameters:
  • genome (str, optional) – Path to the genome, by default None

  • restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None

  • circular (str, optional) – Name of the chromosomes to consider as circular, by default “”

  • rate (float, optional) – Set the proportion of restriction sites to consider. Avoid memory overflow when restriction maps are very dense, by default 1.0

  • output_dir (str, optional) – Path to the folder where to save the dictionary, by default None

Returns:

Dictionary of sub-sampled restriction map with keys as chromosome names and values as lists of restriction sites’ position.

Return type:

dict

hicberg.statistics.get_pair_cover(read_forward: AlignedSegment, read_reverse: AlignedSegment, coverage: dict, bins: int) float[source]

Get the coverage of a pair of reads.

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.

  • read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.

  • coverage (dict) – Dictionary containing the coverage of each chromosome.

  • bins (int) – Size of the desired bin.

Returns:

Product of the coverage of the pair of reads.

Return type:

float

hicberg.statistics.get_pair_ps(read_forward: AlignedSegment, read_reverse: AlignedSegment, xs: dict, weirds: dict, uncuts: dict, loops: dict, circular: str = '') float[source]

Take two reads and return the P(s) value depending on event type (intra-chromosomal case only).

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.

  • read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.

  • xs (dict) – Dictionary containing log binning values for each chromosome.

  • weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.

  • uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.

  • loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.

  • circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.

Returns:

P(s) of the pair considering the event type.

Return type:

float

hicberg.statistics.get_patterns(forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', xs: str = 'xs.npy', chrom_sizes: str = 'chromosome_sizes.npy', circular: str = '', blacklist: str = None, output_dir: str = None) None[source]

Get the patterns distribution from read pairs alignment. .

Parameters:
  • forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”, by default “group1.1.bam”

  • reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”, by default “group1.2.bam”

  • xs (str, optional) – Path to the dictionary containing the xs values, by default “xs.npy”

  • dist_frag (str, optional) – Path to the dictionary containing the inter fragment distances, by default “dist.frag.npy”

  • circular (str, optional) – Name of the chromosomes to consider as circular, by default “”

  • output_dir (str, optional) – Path to the folder where to save the dictionary, by default None, by default None, by default None

hicberg.statistics.get_restriction_map(genome: str = None, enzyme: list[str] = ['DpnII'], output_dir: str = None) dict[str, ndarray[int]][source]

Get ordered restriction map (including 0 and n) from a chromosome sequence. Return a dictionary where keys are chromosomes names and values are restrictions sites positions.

Parameters:
  • genome (str, optional) – Path to the genome to digest, by default None, by default None

  • enzyme (list[str], optional) – Enzyme or list of enzyme to digest the genome with. If integer passed, micro-C mode using Mnase is used, and the integer correspond to the size of nucleosomal fragment, by default None, by default [“DpnII”]

  • output_dir (str, optional) – Path to the folder where to save the plot, by default None

Returns:

Dictionary of the product of digestion where keys are chromosomes names and values are restrictions sites positions.

Return type:

dict

hicberg.statistics.get_top_pattern(file: str = None, top: int = 10, chromosome: str = None) DataFrame[source]

Get top patterns from a dataframe

Parameters:
  • df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None

  • top (int, optional) – Percentage of top patterns to get, by default 10

  • chromosome (str, optional) – Chromosome to consider, by default None

Returns:

Dataframe containing top percentage patterns.

Return type:

pd.DataFrame

hicberg.statistics.get_trans_ps(read_forward: AlignedSegment, read_reverse: AlignedSegment, trans_ps: dict) float[source]
Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.

  • read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.

  • trans_ps (dict) – Dictionary of trans-chromosomal P(s)

Returns:

Trans P(s) value

Return type:

float

hicberg.statistics.log_bin_genome(genome: str, base: float = 1.1, output_dir: str = None) dict[str, ndarray[int]][source]
hicberg.statistics.pearson_score(original_matrix: Cooler, rescued_matrix: Cooler, markers: list[int]) float[source]

Compute Pearson correlation between concatenated matrix bins which have been deleted and reconstructed.

Parameters:
  • original_matrix (cooler.Cooler) – Cooler object containing the original matrix.

  • rescued_matrix (cooler.Cooler) – Cooler object containing the reconstructed matrix.

  • markers (list[int]) – List of markers to consider. Markers are the bins which have been deleted.

Returns:

Pearson correlation between original and reconstructed matrix.

Return type:

float

hicberg.statistics.reattribute_reads(reads_couple: tuple[str, str] = ('group2.1.bam', 'group2.2.bam'), restriction_map: dict = None, xs: dict = 'xs.npy', weirds: dict = 'weirds.npy', uncuts: dict = 'uncuts.npy', loops: dict = 'loops.npy', circular: str = '', trans_ps: dict = 'trans_ps.npy', coverage: dict = 'coverage.npy', bins: int = 2000, d1d2: dict = 'd1d2.npy', density_map: dict = 'density_map.npy', mode: str = 'full', output_dir: str = None) None[source]

Re-attribute multi-mapping (ambiguous) reads considering sets of statistical laws.

Parameters:
  • reads_couple (tuple[str, str], optional) – Paths to ambiguous reads alignment files (.bam), by default (“group2.1.bam”, “group2.2.bam”)

  • restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None

  • xs (dict) – Dictionary containing log binning values for each chromosome.

  • weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.

  • uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.

  • loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.

  • circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.

  • trans_ps (dict) – Dictionary of trans-chromosomal P(s)

  • coverage (dict) – Dictionary containing the coverage of each chromosome.

  • bins (int) – Size of the desired bin, by default 2000

  • d1d2 (np.array, optional) – Distribution of d1d2 values, by default None

  • density (np.array, optional) – Dictionary containing density maps per chromosome couples as, by default None

  • mode (str, optional) – Mode to use to compute propensity among, by default “full”

  • output_dir (str, optional) – Path to the re-attributed ambiguous reads alignment files are saved, by default None

hicberg.utils module

hicberg.utils.bam_iterator(bam_file: str = None) Iterator[AlignedSegment][source]

Returns an iterator for the given SAM/BAM file (must be query-sorted). In each call, the alignments of a single read are yielded as a 3-tuple: (list of primary pysam.AlignedSegment, list of supplementary pysam.AlignedSegment, list of secondary pysam.AlignedSegment).

Parameters:

bam ([str]) – Path to alignment file in .sam or .bam format.

Yields:

Iterator[pysam.AlignedSegment] – Yields a list containing pysam AlignmentSegment objects, within which all the reads have the same id.

hicberg.utils.block_counter(forward_bam_file: str, reverse_bam_file: str) Tuple[int, int][source]

Return as a tuple the number of blocks in the forward and reverse bam files.

Parameters:
  • forward_bam_file (str, optional) – Path to forward .bam alignment file.

  • reverse_bam_file (str, optional) – Path to reverse .bam alignment file.

Returns:

Number of blocks in the forward and reverse bam files.

Return type:

Tuple[int, int]

hicberg.utils.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.

Parameters:
  • a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)

  • size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

  • replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.

  • p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice

which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')
hicberg.utils.chunk_bam(forward_bam_file: str = 'group2.1.bam', reverse_bam_file: str = 'group2.2.bam', nb_chunks: int = 2, output_dir: str = None) None[source]

Split a .bam file into chunks .bam files. :param forward_bam_file: Path to forward .bam alignment file, by default group2.1.bam :type forward_bam_file: str, optional :param reverse_bam_file: Path to reverse .bam alignment file, by default group2.2.bam :type reverse_bam_file: str, optional :param nb_chunks: Number of chunks to create, by default 2 :type nb_chunks: int, optional :param output_dir: Path to the folder where to save the classified alignment files, by default None :type output_dir: str, optional

hicberg.utils.classify_reads(bam_couple: tuple[str, str] = ('1.sorted.bam', '2.sorted.bam'), chromosome_sizes: str = 'chromosome_sizes.npy', mapq: int = 35, output_dir: str = None) None[source]
Classification of pairs of reads in 2 different groups:

Group 0) –> (Unmappable) - files :group0.1.bam and group0.2.bam Group 1) –> (Uniquely Mapped Uniquely Mapped) - files :group1.1.bam and group1.2.bam Group 2) –> (Uniquely Mapped Multi Mapped) or (Multi Mapped Multi Mapped).- files :group2.1.bam and group2.2.bam

Parameters:
  • bam_couple (tuple[str, str]) – Tuple containing the paths to the forward and reverse alignment files. By default (“1.sorted.bam”, “2.sorted.bam”)

  • chromosome_sizes (str, optional) – Path to a chromosome size dictionary save in .npy format, by default chromosome_sizes.npy

  • mapq (int, optional) – Minimal read quality under which a Hi-C read pair will not be kept, by default 30

  • output_dir (str, optional) – Path to the folder where to save the classified alignment files, by default None

hicberg.utils.detrend_matrix(matrix: array) array[source]

Detrend a matrix by P(s).

Parameters:

matrix (np.array) – Hi-C matrix to detrend.

Returns:

Detrended Hi-C matrix.

Return type:

np.array

hicberg.utils.format_blacklist(blacklist: str = None) dict[str, Tuple[int, int]][source]

Format a blacklist file into a dictionary.

Parameters:

blacklist (str, optional) – Path to the blacklist file. If set to -1 this is equivalent to None for workflow managers purpose, by default None

Returns:

Dictionary with chromosome name as key and tuple of start and end as value.

Return type:

dict[str, Tuple[int, int]]

hicberg.utils.generate_gaussian_kernel(size: int = 1, sigma: int = 2) array[source]

Generate a 2D Gaussian kernel of a given size and standard deviation.

Parameters:
  • size (int, optional) – Size of the kernel, by default 1

  • sigma (int, optional) – Standard deviation to use for the kernel build, by default 2

Returns:

2D Gaussian kernel.

Return type:

np.array

hicberg.utils.get_bad_bins(matrix: array = None, n_mads: int = 2) array[source]

Detect bad bins (poor interacting bins) in a normalized Hi-C matrix and return their indexes. Bins where the nan sum of interactions is zero are considered as bad bins.

Parameters:
  • matrix (Normalized Hi-C matrix to detect bad bins from, by default None.) –

  • n_mads (int, optional) – Number of median absolute deviations to set poor interacting bins threshold, by default 2

Returns:

Indexes of bad bins.

Return type:

np.array

hicberg.utils.get_bin_table(chrom_sizes_dict: str = 'chromosome_sizes.npy', bins: int = 2000, output_dir: str = None) None[source]

Create bin table containing start and end position for fixed size bin per chromosome.

Parameters:
  • chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

  • bins (int) – Size of the desired bin, by default 2000.

  • output_dir (str, optional) – Path to the folder where to save the dictionary, by default None

hicberg.utils.get_chromosomes_sizes(genome: str = None, output_dir: str = None) None[source]

Generate a dictionary save in .npy format where keys are chromosome name and value are size in bp.

Parameters:
  • genome (str, optional) – Path to the genome, by default None

  • output_dir (str, optional) – Path to the folder where to save the dictionary, by default None

hicberg.utils.get_chunks(output_dir: str = None)[source]

Return a tuple containing the paths to the forward and reverse chunks.

Parameters:

output_dir (str, optional) – Path to get chunks from, by default None

Returns:

Tuple containing the paths to the forward and reverse chunks.

Return type:

tuple([List[str], List[str]]

hicberg.utils.get_cis_distance(read_forward: AlignedSegment, read_reverse: AlignedSegment, circular: str = '') int[source]

Calculate the distance between two reads in the same pairwise alignment .

Parameters:
  • read_forward (pysam.aligned_segment) – Forward read of the pair

  • read_reverse (pysam.AlignedSegment) – Reverse read of the pair

  • circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.

Returns:

Genomic distance separating the two reads (bp).

Return type:

int

hicberg.utils.get_local_density(cooler_file: str = None, chrom_name: tuple = (None, None), size: int = 11, sigma: int = 0.2, n_mads: int = 2, nan_threshold: bool = False) array[source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps. Density is obtained by getting the local density of each pairwise bin using a gaussian kernel convolution.

Parameters:
  • cooler_file (str, optional) – Path to Hi-C matrix (or sub-matrix) to dget density from, by default None, by default None

  • chrom_name (tuple, optional) – Tuple containing the sub-matrix to fetch, by default (None, None)

  • size (int, optional) – Size of the gaussian kernel to use, by default 5

  • sigma (int, optional) – Standard deviation to use for the kernel, by default 2

  • n_mads (int, optional) – Number of median absolute deviations to set poor interacting bins threshold, by default 2

  • nan_threshold (bool, optional) – Set wether or not convolution return nan if not enough value are caught, by default None

Returns:

Map of local contact density.

Return type:

np.array

hicberg.utils.get_ordered_reads(read_forward: AlignedSegment, read_reverse: AlignedSegment) Tuple[AlignedSegment, AlignedSegment][source]

Returns the ordered pair of reads in the same chromosome as the two reads .

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.

  • read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.

Returns:

The ordered pair of reads in the same chromosome as the two reads.

Return type:

Tuple[pysam.AlignedSegment, pysam.AlignedSegment]

hicberg.utils.is_blacklisted(read_forward: AlignedSegment, read_reverse: AlignedSegment, blacklist: dict[str, Tuple[int, int]] = None) bool[source]

Check if a read pair is blacklisted based on a list of coordiantes.

Parameters:
  • read_forward (pysam.AlignmentSegment) – Forward read of the pair.

  • read_reverse (pysam.AlignedSegment) – Reverse read of the pair.

  • blacklist (dict[str, Tuple[int, int]]) – Blacklist of coordinates to check against. Chromsome name as key and tuple of start and end as value. Chromosome names should be formatted as ‘chr1_A’, ‘chr1_B’, etc. With A and B being the index of the coordinates to blacklist in a given chromosome.

Returns:

True if the read pair is blacklisted, False otherwise.

Return type:

bool

hicberg.utils.is_circle(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool[source]

Check if two reads are forming a loop pattern .

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read of the pair

  • read_reverse (pysam.AlignedSegment) – Reverse read of the pair

Returns:

True if the two reads are forming a loop pattern, False otherwise.

Return type:

bool

hicberg.utils.is_duplicated(read: AlignedSegment) bool[source]

Check if read from pysam AlignmentFile is mapping more than once along the genome.

Parameters:

read (pysam.AlignedSegment) – pysam AlignedSegment object.

Returns:

True if the read is duplicated i.e. mapping to more than one position.

Return type:

bool

hicberg.utils.is_empty_alignment(alignment_file: str) bool[source]

Check if an alignment file is empty. If empty, return True, else return False.

Parameters:

alignment_file (str) – Path to the alignment file to check.

Returns:

Return True if the file is empty, False otherwise.

Return type:

bool

hicberg.utils.is_intra_chromosome(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool[source]

Return True if two reads of a pair came from the same chromosome.

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read of the pair.

  • read_reverse (pysam.AlignedSegment) – Reverse read of the pair.

Returns:

True if the pair is intra-chromosomic, False otherwise.

Return type:

bool

hicberg.utils.is_poor_quality(read: AlignedSegment, mapq: int) bool[source]

Check if read from pysam AlignmentFile is under mapping quality threshold

Parameters:
  • read (pysam.AlignedSegment) – pysam AlignedSegment object.

  • mapq (int) – Mapping quality threshold.

Returns:

True if the read quality is below mapq threshold.

Return type:

bool

hicberg.utils.is_reverse(read: AlignedSegment) bool[source]

Check if read from pysam AlignmentFile is reverse

Parameters:

read (pysam.AlignedSegment) – pysam AlignedSegment object.

Returns:

True if the read is reverse.

Return type:

bool

hicberg.utils.is_uncut(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool[source]

Check if two reads are forming an uncut pattern .

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read of the pair

  • read_reverse (pysam.AlignedSegment) – Reverse read of the pair

Returns:

True if the two reads are forming an uncut pattern, False otherwise.

Return type:

bool

hicberg.utils.is_unmapped(read: AlignedSegment) bool[source]

Check if read from pysam AlignmentFile is unmapped

Parameters:

read (pysam.AlignedSegment) – pysam AlignedSegment object.

Returns:

True if the read is unmapped.

Return type:

bool

hicberg.utils.is_unqualitative(read: AlignedSegment) bool[source]

Check if the read is not qualitative.

Parameters:

read (pysam.AlignedSegment) – pysam AlignedSegment object.

Returns:

True if the read is not qualitative, False otherwise.

Return type:

bool

hicberg.utils.is_weird(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool[source]

Check if two reads are forming a weird pattern .

Parameters:
  • read_forward (pysam.AlignedSegment) – Forward read of the pair

  • read_reverse (pysam.AlignedSegment) – Reverse read of the pair

Returns:

True if the two reads are forming a weird pattern, False otherwise.

Return type:

bool

hicberg.utils.mad_smoothing(vector: ndarray[int] = None, window_size: int | str = 'auto', nmads: int = 1) ndarray[int][source]

Apply MAD smoothing to an vector .

Parameters:
  • vector (np.ndarray[int], optional) – Data to smooth, by default None

  • window_size (int or str, optional) – Size of the window to perform mean sliding average in. Window is center on current value as [current_value - window_size/2] U [current_value + window_size/2], by default “auto”

  • nmads (int, optional) – number of median absolute deviation tu use, by default 1

Returns:

MAD smoothed vector.

Return type:

np.ndarray[int]

hicberg.utils.max_consecutive_nans(vector: ndarray) int[source]

Return the maximum number of consecutive NaN values in a vector.

Parameters:

vector (np.ndarray) – Vector to get the maximum number of consecutive NaN values from.

Returns:

Number of maximum consecutive NaN values.

Return type:

int

hicberg.utils.nan_conv(matrix: array = None, kernel: array = None, nan_threshold: bool = False) array[source]

Custom convolution function that takes into account nan values when convolving. Used to compute the local density of a Hi-C matrix.

Parameters:
  • matrix (np.array, optional) – Hi-C matrix to detect bad bins from, by default None

  • kernel (np.array, optional) – Kernel to use for convolution (dimension must be odd), by default None

  • nan_threshold (bool, optional) – Set wether or not convolution return nan if not enough value are caught, by default False

Returns:

Convolution product of the matrix and the kernel.

Return type:

np.array

hicberg.utils.replace_consecutive_zeros_with_mean(vector: ndarray[float]) ndarray[float][source]

Replace consecutive zeros in a vector with the mean of the flanking values.

Parameters:

vector (np.ndarray[float]) – Array to replace consecutive zeros in.

Returns:

Array with consecutive zeros replaced by the mean of the flanking values.

Return type:

np.ndarray[float]

hicberg.utils.subsample_restriction_map(restriction_map: dict = None, rate: float = 1.0) dict[str, ndarray[int]][source]

Subsample a restriction map by a given rate.

Parameters:
  • restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None

  • rate (float, optional) – Set the proportion of restriction sites to consider. Avoid memory overflow when restriction maps are very dense, by default 1.0

Returns:

Dictionary of sub-sampled restriction map with keys as chromosome names and values as lists of restriction sites’ position.

Return type:

dict[str, np.ndarray[int]]

hicberg.utils.sum_mat_bins(matrix: array) array[source]

Adapted from : https://github.com/koszullab/hicstuff/tree/master/hicstuff Compute the sum of matrices bins (i.e. rows or columns) using only the upper triangle, assuming symmetrical matrices.

Parameters:

mat (scipy.sparse.coo_matrix) – Contact map in sparse format, either in upper triangle or full matrix.

Returns:

1D array of bin sums.

Return type:

numpy.ndarray

hicberg.version module

Module contents