hicberg package

Submodules

hicberg.align module

hicberg.align.hic_align(index: str, fq_for: str, fq_rev: str, sensitivity: str = 'very-sensitive', max_alignment: int = None, cpus: int = 1, output_dir: str = None, verbose: bool = False) → None[source]

Alignment of reads from HiC experiments along an indexed genome.

Parameters:

index (str) – Path to the index of the genome along which reads are going to be aligned (path to .bt2l files). Default to None, index files are searched to sample_name/data/index/sample_name.
fq_for (str) – Path to .fasta containing set of reads to align (forward mate).
fq_rev (str) – Path to .fasta containing set of reads to align (forward mate).
sensitivity (str, optional) – Sensitivity of the alignment., by default ‘very_sensitive’
max_alignment (int, optional) – Maximum number of alignments to be returned, by default None
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.sam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_build_index(genome: str, output_dir: str = None, cpus: int = 1, verbose: bool = False) → None[source]

Building of bowtie2 index (.bt2l files) for read alignment.

Parameters:

genome (str) – Path to the genome file along which reads are going to be aligned.
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the Bowtie2 index files should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_index(bam_for: str = '1.sorted.bam', bam_rev: str = '2.sorted.bam', cpus: int = 1, output_dir: str = None, verbose: bool = False) → None[source]

Index a coordinate-sorted BGZIP-compressed SAM, BAM or CRAM file for fast random access.

Parameters:

bam_for (str, optional) – Forward alignment file to be indexed, by default “1.sorted.bam”
bam_rev (str, optional) – Reverse alignment file to be indexed,, by default “2.sorted.bam”
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_sort(bam_for: str = '1.bam', bam_rev: str = '2.bam', cpus: int = 1, output_dir: str = None, verbose: bool = False) → None[source]

Sort .bam alignment files by read_name (using samtools).

Parameters:

bam_for (str, optional) – Forward alignment file to be sorted, by default “1.bam”
bam_rev (str, optional) – Reverse alignment file to be sorted, by default “2.bam”
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.align.hic_view(sam_for: str = '1.sam', sam_rev: str = '2.sam', cpus: int = 1, output_dir: str = None, verbose: bool = False) → None[source]

Conversion of .sam alignment files to .bam alignment format (using samtools).

Parameters:

sam_for (str, optional) – Path to forward .sam alignment file, by default “1.sam”
sam_rev (str, optional) – Path to reverse .sam alignment file, by default “2.sam”
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False

hicberg.benchmark module

hicberg.benchmark.benchmark(output_dir: str = None, chromosome: str = '', position: int = 0, trans_chromosome: str = None, trans_position: int = None, strides: list[int] = [], mode: str = 'full', auto: int = None, kernel_size: int = 11, deviation: float = 0.5, bins: int = None, circular: str = '', genome: str = None, pattern: str = None, threshold: float = 0.0, jitter: int = 0, trend: bool = True, top: int = 100, force: bool = False, iterations: int = 3, cpus: int = 8)[source]

Performs a benchmark of the HiC-Berg method by simulating the deletion of a genomic region and evaluating the method’s ability to recover the original signal.

The benchmark simulates the deletion of a genomic region by removing reads that map to that region from a Hi-C dataset. It then applies the HiC-Berg method to the depleted dataset to try to recover the original signal. The benchmark evaluates the method’s performance by comparing the rescued contact map to the original contact map.

Parameters:

output_dir (str, optional) – Path to the output directory.
chromosome (str, optional) – Chromosome to perform the benchmark on.
position (int, optional) – Position of the deletion on the chromosome.
trans_chromosome (str, optional) – Chromosome to consider for trans interactions.
trans_position (int, optional) – Position of the deletion on the trans chromosome.
strides (list[int], optional) – List of strides to use for the deletion.
mode (str, optional) – Mode of the HiC-Berg method to use. Can be “full” or “density”.
auto (int, optional) – Automatically determine the size of the deletion based on the given number of bins.
kernel_size (int, optional) – Size of the kernel to use for the density estimation.
deviation (float, optional) – Standard deviation of the kernel to use for the density estimation.
bins (int, optional) – Number of bins to use for the deletion.
circular (str, optional) – Whether the chromosome is circular.
genome (str, optional) – Genome assembly to use.
pattern (str, optional) – Pattern to use for Chromosight pre-call.
threshold (float, optional) – Threshold to use for Chromosight pre-call.
jitter (int, optional) – Jitter to use for Chromosight pre-call.
trend (bool, optional) – Whether to use trend for Chromosight pre-call.
top (int, optional) – Top patterns to use for Chromosight pre-call.
force (bool, optional) – Whether to force the benchmark to run even if the output directory already exists.
iterations (int, optional) – Number of iterations to run the HiC-Berg method.
cpus (int, optional) – Number of CPUs to use for parallel processing.

Raises:

ValueError – If the output directory does not exist.
ValueError – If the restriction map file does not exist.

hicberg.eval module

hicberg.eval.arange_multi(starts, stops=None, lengths=None)[source]

Create concatenated ranges of integers for multiple start/length. :param starts: Starts for each range :type starts: numpy.ndarray :param stops: Stops for each range :type stops: numpy.ndarray :param lengths: Lengths for each range. Either stops or lengths must be provided. :type lengths: numpy.ndarray

Returns:: concat_ranges – Concatenated ranges.
Return type:: numpy.ndarray

Notes

See the following illustrative example: starts = np.array([1, 3, 4, 6]) stops = np.array([1, 5, 7, 6]) print arange_multi(starts, lengths) >>> [3 4 4 5 6] From: https://codereview.stackexchange.com/questions/83018/vectorized-numpy-version-of-arange-with-multiple-start-stop

hicberg.eval.check_emptiness(intervals: dict[str, list[int, int]], matrix: Cooler = None) → bool[source]

Check if intervals are empty in a given matrix.

Parameters:

intervals (dict[str, list[(int, int)]) – Lists of intervals (sets) where the value is not in between as element 0 and where the value is in between as element 1. One list per chromosome in a dictionary.
matrix (cooler.Cooler, optional) – Cooler Hi-C matrix to be checked for emptiness, by default None

Returns:

Returns True if one of the intervals are empty in the matrix, False otherwise.

Return type:

bool

hicberg.eval.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.

Parameters:

a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.
p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice: which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')

hicberg.eval.chromosight_cmd_generator(file: str = None, pattern: str = 'loops', untrend: bool = True, mode: bool = False, output_dir: str = None) → str[source]

Generate chromosight command line to run pattern detection.

Parameters:

file (str, optional) – Path to Hi-C balanced contact matrix in .cool format , by default None
pattern (str, optional) – Pattern to detect, by default “loops”
untrend (bool, optional) – Set if map has to be detrended, by default True
mode (bool, optional) – Set either the detection has to be performed before or after map reconstruction, by default False
output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Chromosight command line to run to run pattern detection

Return type:

str

hicberg.eval.draw_intervals(nb_intervals: int = 1, bins: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy') → dict[str, list[tuple[int, int]]][source]

Draw intervals from a given chromosome sizes dictionary.

Parameters:

nb_intervals (int, optional) – Number of intervals to draw, by default 1
bins (int, optional) – Size of the desired bin, by default 2000., by default 2000
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

Returns:

Dictionary containing the intervals as {chromosome : [(start, end), …]}.

Return type:

dict[str, list[tuple[int, int]]]

hicberg.eval.generate_dict_coordinates(matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) → dict[str, list[int, int]][source]

Generates a dictionary of genomic coordinates for selecting regions in a Hi-C contact map.

This function defines genomic intervals based on provided parameters such as chromosome, position, strides, and bin size. It can generate intervals for both intra-chromosomal and inter-chromosomal interactions. The intervals can be used to select specific regions in a Hi-C contact map for further analysis, such as simulating deletions or duplications.

Parameters:

matrix_file (str, optional) – Path to the cooler file containing the Hi-C contact map, by default “unrescued_map.cool”
position (int, optional) – Genomic position on the chromosome, by default 0
chromosome (str | list[str], optional) – Chromosome or list of chromosomes, by default “”
bin_size (int, optional) – Size of the bins in the Hi-C contact map, by default 2000
chrom_sizes_dict (str, optional) – Path to the chromosome sizes dictionary, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to use for generating intervals, by default [0]
trans_chromosome (str, optional) – Chromosome for trans-interactions, by default None
output_dir (str, optional) – Path to the output directory, by default None
trans_position (list[int], optional) – List of positions on the trans-chromosome, by default None
nb_bins (int, optional) – Number of bins to include in each interval, by default 1
random (bool, optional) – Whether to generate random intervals, by default False
auto (int, optional) – Number of random intervals to generate, by default None

Returns:

A dictionary where keys are chromosomes and values are lists of (start, end) genomic intervals.

Return type:

dict[str, list[(int, int)]]

Raises:

ValueError – If the output path, chromosome sizes file, or matrix file does not exist.

hicberg.eval.get_FN_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) → DataFrame[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns

Parameters:

df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

dataframe – Table containing components before and after pattern recall (True positives)

Return type:

[pandas DataFrame]

hicberg.eval.get_FP_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None)[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns

Parameters:

df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

dataframe – Table containing components before and after pattern recall (True positives)

Return type:

[pandas DataFrame]

hicberg.eval.get_TP_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) → DataFrame[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns

Parameters:

df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

dataframe – Table containing components before and after pattern recall (True positives)

Return type:

[pandas DataFrame]

hicberg.eval.get_bin_indexes(matrix: str = None, dictionary: dict = None)[source]

Return the bins indexes (in the cooler matrix) corresponding to the genomic coordinates in the dictionary.

Parameters:

matrix (str, optional) – Path to a cooler matrix file, by default None
dictionary (dict, optional) – Dictionary of genomic coordinates where reads are going to be selected for duplication, by default None

hicberg.eval.get_boundaries(position: int = None, bins: int = 2000, chromosome: str | list[str] = None, chrom_sizes_dict: str = 'chromosome_sizes.npy') → tuple[int, int][source]

Return boundaries surrounding position considering regular binning of bins.

Parameters:

position (int, optional) – Position to get surrounding boundaries from, by default None
bins (int, optional) – Size of the desired bin, by default 2000., by default 2000
chromosome (str or list[str], optional) – Chromosomes associated to positions to get surrounding boundaries from, by default None
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

Returns:

Tuple containing the boundaries of the position : (lower_bound, upper_bound).

Return type:

tuple[int, int]

hicberg.eval.get_chromosomes_intervals(bins: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', chromosome: str = '') → list[int, int][source]

Get all possible intervals from a given chromosome considering a chromosome sizes dictionary.

Parameters:

bins (int, optional) – Size of the desired bin, by default 2000., by default 2000
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
chromosome (str, optional) – Chromosome to get intervals from, by default “”

Returns:

List containing the intervals as [(start, end), …]

Return type:

list[(int, int)]

hicberg.eval.get_f1_score(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) → float[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return f1 score i.e. 2 * ((precision * recall) / (precision + recall)).

Parameters:

df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

f1_score – 2 * ((precision * recall) / (precision + recall))

Return type:

float

hicberg.eval.get_interval_index(chromosome: str = '', value: int = None, intervals_dict: dict[str, list[int, int]] = None, chrom_sizes_dict: dict[str, int] = None) → dict[str, list[int, int]][source]

Parameters:

chromosome ([str]) – chromosome linked to value and intervals to search in.
value ([int]) – Value to search in intervals.
intervals_dict ([dict]) – Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy

Returns:

Lists of intervals (sets) where the value is not in between as element 0 and where the value is in between as element 1. One list per chromosome in a dictionary. If the value in not in any interval, return None.

Return type:

[dict]

hicberg.eval.get_intervals_proportions(chrom_sizes_dict: str = 'chromosome_sizes.npy', nb_intervals: int = 1) → dict[str, float][source]

Extract a dictionary containing the number of intervals to draw considering the size of each chromosome.

Parameters:

chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
nb_intervals (int, optional) – Number of intervals to draw, by default 1

Returns:

Dictionary containing proportion by intervals as {chromosome : proportion}.

Return type:

dict[str, int]

hicberg.eval.get_precision(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) → float[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return precision score corresponding to the number of true positives over the number of true positives and false positives.

Parameters:

df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

precision – number of true positives over the number of true positives and false positives

Return type:

float

hicberg.eval.get_recall(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) → float[source]

Take dataframe of pattern call (Chromosight) before and after reconstruction and return recall score corresponding to the number of true positives over the number of true positives and false negatives.

Parameters:

df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.

Returns:

recall – number of true positives over the number of true positives and false negatives

Return type:

float

hicberg.eval.get_top_pattern(file: str = None, top: int = 10, threshold: float = 0.0, chromosome: str = None) → DataFrame[source]

Get top patterns from a dataframe

Parameters:

df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None
top (int, optional) – Percentage of top patterns to get, by default 10
threshold (float, optional) – Pattern Pearson score to consider to select pattern, by default 0.0
chromosome (str, optional) – Chromosome to consider, by default None

Returns:

Dataframe containing top percentage patterns.

Return type:

pd.DataFrame

hicberg.eval.hicberg_benchmark_cmd_generator(file: str = None, top: int = 10, threshold: float = 0.0, chromosome: str = None, mode: str = 'full', genome: str = None, bins: int = 0, output: str = None) → str[source]

Get top patterns from a dataframe

Parameters:

df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None
top (int, optional) – Percentage of top patterns to get, by default 10
threshold (float, optional) – Pattern Pearson score to consider to select pattern, by default 0.0
chromosome (str, optional) – Chromosome to consider, by default None
mode (str, optional) – Mode to consider for hicberg benchmark, by default “full”
genome (str, optional) – Path to the genome to consider for hicberg benchmark, by default None
bins (int, optional) – Number of bins to consider for hicberg benchmark, by default 0
output (str, optional) – Path to the output folder to consider for hicberg benchmark, by default None

Returns:

HiC-BERG command line to run to evaluate reconstruction after pattern discarding

Return type:

str

hicberg.eval.intersect2D(a, b)[source]: Find row intersection between 2D numpy arrays, a and b. Returns another numpy array with shared rows

hicberg.eval.overlap_intervals(starts1, ends1, starts2, ends2, closed=False, sort=False)[source]

Take two sets of intervals and return the indices of pairs of overlapping intervals.

Parameters:

starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
closed (bool) – If True, then treat intervals as closed and report single-point overlaps.

Returns:

overlap_ids – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.

Return type:

numpy.ndarray

hicberg.eval.overlap_intervals_outer(starts1, ends1, starts2, ends2, closed=False)[source]

Take two sets of intervals and return the indices of pairs of overlapping intervals, as well as the indices of the intervals that do not overlap any other interval.

Parameters:

starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
closed (bool) – If True, then treat intervals as closed and report single-point overlaps.

Returns:

overlap_ids (numpy.ndarray) – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.
no_overlap_ids1, no_overlap_ids2 (numpy.ndarray) – Two 1D arrays containing the indices of intervals in sets 1 and 2 respectively that do not overlap with any interval in the other set.

hicberg.eval.select_reads(bam_for: str = 'group1.1.bam', bam_rev: str = 'group1.2.bam', matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) → dict[str, list[int, int]][source]

Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.

Parameters:

bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”
bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”
matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”
position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0
chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”
bin_size (int, optional) – Resolution of the given matrix, by default 2000
chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]
trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None
trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None
nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1
random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False
auto (int, optional) – Set the number of intervals to get while picking at random, by default None
output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

Return type:

dict[str, list[(int, int)]]

hicberg.eval.select_reads_multithreads(bam_couple: tuple[str, str], interval_dictionary: dict[str, (<class 'int'>, <class 'int'>)], chrom_sizes_dict: str = 'chromosome_sizes.npy', output_dir: str = None) → None[source]

Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.

Parameters:

bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”
bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”
matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”
position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0
chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”
bin_size (int, optional) – Resolution of the given matrix, by default 2000
chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]
trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None
trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None
nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1
random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False
auto (int, optional) – Set the number of intervals to get while picking at random, by default None
output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

Return type:

dict[str, list[(int, int)]]

hicberg.eval.select_reads_multithreads_bckp(bam_couple: tuple[str, str], matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) → dict[str, list[int, int]][source]

Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.

Parameters:

bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”
bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”
matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”
position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0
chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”
bin_size (int, optional) – Resolution of the given matrix, by default 2000
chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]
trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None
trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None
nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1
random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False
auto (int, optional) – Set the number of intervals to get while picking at random, by default None
output_dir (str, optional) – Path to the folder where to save alignments, by default None

Returns:

Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].

Return type:

dict[str, list[(int, int)]]

hicberg.io module

hicberg.io.build_matrix(bins: str = 'fragments_fixed_sizes.txt', pairs: str = 'group1.pairs', mode: bool = False, balance: bool = False, cpus: int = 8, output_dir: str = None) → None[source]

Take table of bins and .pairs file and build a matrix in .cool format.

Parameters:

bins (str) – Path to the bin file.
pairs (str, optional) – Path to pairs file, by default “group1.pairs”
mode (bool, optional) – Choose wether the mode is rescued or un-rescued to construct associated .cool file, by default False
balance (bool, optional) – Set wether or not to balance the matrix, by default False
output_dir (str, optional) – Path to the folder where to save the cooler matrix file, by default None, by default None

hicberg.io.build_pairs(bam_for: str = 'group1.1.bam', bam_rev: str = 'group1.2.bam', bam_for_rescued: str = 'group2.1.rescued.bam', bam_rev_rescued: str = 'group2.2.rescued.bam', mode: bool = False, output_dir: str = None) → None[source]

Build pairs of reads from the aligned reads.

Parameters:

bam_for (str, optional) – Path to forward .bam file for the construction of the .pairs equivalent file (non rescued)., by default “group1.1.bam”
bam_rev (str, optional) – Path to reverse .bam file for the construction of the .pairs equivalent file (non rescued)., by default None, by default “group1.2.bam”
bam_for_rescued (str, optional) – Path to forward .bam file for the construction of the .pairs equivalent file (rescued)., by default “group2.1.rescued.bam”
bam_rev_rescued (str, optional) – Path to reverse .bam file for the construction of the .pairs equivalent file (rescued)., by default “group2.2.rescued.bam”
mode (bool, optional) – Choose wether the mode is rescued or unrescued to construct associated .pairs file, by default False
output_dir (str, optional) – Path where the alignment files (.sam) should be stored, by default None

hicberg.io.create_folder(sample_name: str = None, output_dir: str = None, force: bool = False) → None[source]

Creates folder architecture to store results and intermediate files for the full HiC-BERG pipeline.

Parameters:

sample_name (str) – Name of the folder to be created.
force (bool) – Set if existing folder has to be deleted before folder creation.
output_dir (str) – Path where the folder will be created.

Returns:

Path of the folder created

Return type:

[str]

hicberg.io.load_cooler(matrix: str = None) → Cooler[source]

Load cooler matrix.

Parameters:: matrix (str, optional) – Path to a cooler matrix, by default None
Returns:: Cooler matrix object
Return type:: cooler.Cooler

hicberg.io.load_dictionary(dictionary: str = None) → dict[source]

Load dictionary save into numpy format (.npy).

Parameters:: dictionary (str, optional) – Path to a the dictionary saved in numpy (.npy) format, by default None
Returns:: Python native dictionary
Return type:: dict

hicberg.io.merge_predictions(output_dir: str = None, clean: bool = True, stage='prediction', cpus: int = 1) → None[source]

Merge predictions of all chunks of ambiguous reads predictions.

Parameters:

output_dir (str, optional) – Path to the folder where to save the fused alignment file, by default None
clean (bool, optional) – Set weither or not to remove temporary chunks, by default True
stage (str, optional) – Set the stage of the merging, by default “prediction”. Can be “prediction” or “classification”
cpus (int, optional) – Number of cpus to use for the merging, by default 1

hicberg.io.tidy_folder(output_dir: str = None) → None[source]

Tidy all the files in the output folder.

Parameters:: output_dir (str, optional) – Path to the folder where to save the fused alignment file, by default None

hicberg.omics module

hicberg.omics.bedgraph_to_bigwig(bedgraph_file: str = 'coverage.bedgraph', chromosome_sizes: str = 'chromosome_sizes.txt', output_dir: str = None) → None[source]

Convert bedgraph to bigwig format.

Parameters:

bedgraph_file (str, optional) – Path to coverage (.bedgraph), by default “coverage.bedgraph”
chromosome_sizes (str, optional) – Path to chromosome sizes file (chrom_id, size), by default “chromosome_sizes.txt”
output_dir (str, optional) – [description], by default None

Raises:

IOError – [description]
IOError – [description]
IOError – [description]

hicberg.omics.format_chrom_sizes(chromosome_sizes: str = 'chromosome_sizes.npy', output_dir: str = None) → None[source]: Format chromosome sizes to bed and txt format. - bed format : chrom, start, end - txt format : chrom, size :param chrom_sizes: Path to chromosome sizes file (.npy), by default “chromosome_sizes.npy” :type chrom_sizes: str, optional :param output_dir: Path where the formatted chromosome sizes will be saved, by default None :type output_dir: str, optional

hicberg.omics.get_bed_coverage(chromosome_sizes: str = 'chromosome_sizes.bed', pairs_file: str = 'preprocessed_pairs.pairs', output_dir: str = None) → None[source]

Get bed coverage from pairs file (using bedtools).

Parameters:

chromosome_sizes (str, optional) – Path to chromsomes sizes files (.bed format), by default “chromosome_sizes.bed”
pairs_file (str, optional) – Path to processed pairs files (columns : chrom, start, end, count), by default “preprocessed_pairs.pairs”
output_dir (str, optional) – Path where the coverage (.bed) will be saved, by default None

hicberg.omics.get_bedgraph(bed_coverage: str = 'coverage.bed', output_dir: str = None) → None[source]

Convert bed coverage to bedgraph format. Format is : chrom, start, end, count. Start and end are different by 1bp (end = start + 1).

Parameters:

bed_coverage (str, optional) – Path to coverage (.bed), by default “coverage.bed”
output_dir (str, optional) – Path where the coverage (.bedgraph) will be saved, by default None

hicberg.omics.preprocess_pairs(pairs_file: str = 'all_group.pairs', threshold: int = 1000, output_dir: str = None) → None[source]

Preprocess pairs file to remove pairs that are not in the same chromosome or are greater than a threshold. Retain columns are : chromosome, start, end, count.

Parameters:

pairs_file (str, optional) – Path to the pairs file, by default “all_group.pairs”
threshold (int, optional) – Threshold distance beyond which pairs will not be kept, by default 1000
output_dir (str, optional) – Path where the formatted pairs will be saved, by default None

hicberg.pipeline module

hicberg.pipeline.check_tool(name: str) → bool[source]

Check whether name is on PATH and marked as executable.

Parameters:: name ([str]) – Depenedency to check.
Returns:: True if the tool is available, False otherwise.
Return type:: [bool]

hicberg.pipeline.pipeline(name: str = 'sample', start_stage: str = 'fastq', exit_stage: str = 'None', genome: str = None, index=None, fq_for: str = None, fq_rev: str = None, sensitivity: str = 'very-sensitive', max_alignment: int = None, mapq: int = 35, enzyme: list[str] = ['DpnII', 'HinfI'], circular: str = '', rate: float = 1.0, distance: int = 1000, bins: int = 2000, nb_chunks: int = 1, mode: str = 'full', kernel_size: int = 11, deviation: float = 0.5, verbose: bool = False, cpus: int = 1, output_dir: str = None, force: bool = False, blacklist: str = None) → None[source]

hicberg.plot module

hicberg.plot.plot_benchmark(original_matrix: str = None, depleted_matrix: str = None, rescued_matrix: str = None, chromosomes: list[str] = None, output_dir: str = None) → None[source]

Plot benchmark results (original, depleted and rescued matrices with associated log ratios). One plot per chromosome.

Parameters:

original_matrix (str, optional) – Path to the original matrix, by default None
rescued_matrix (str, optional) – Path to the rescued matrix (re-attributed reads), by default None
chromosomes (list[str], optional) – List of chromosomes to plot, by default None
output_dir (str, optional) – Path to where to save plots, by default None

hicberg.plot.plot_couple_repartition(forward_bam_file: str = 'group2.1.rescued.bam', reverse_bam_file: str = 'group2.2.rescued.bam', output_dir: str = None) → None[source]

Plot read couples sizes distribution

Parameters:

forward_bam_file (str, optional) – Path to forward .bam alignment file, by default 1.sorted.bam
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default 2.sorted.bam Minimal read quality under which a Hi-C read pair will not be kept, by default 30
output_dir (str, optional) – Path to the folder where to save the plot, by default None

hicberg.plot.plot_coverages(bins: int = 2000, output_dir: str = None) → None[source]

Plot coverages of chromosomes

Parameters:

bins (int, optional) – Size of the desired bin., by default 2000
output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.plot.plot_d1d2(output_dir: str = None) → None[source]

Plot d1d2 law

Parameters:: output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.plot.plot_density(output_dir: str = None) → None[source]

Plot density maps

Parameters:: output_dir (str, optional) – Path to the folder where to save the plots (one plot per chromosome couple), by default None.

hicberg.plot.plot_laws(output_dir: str = None) → None[source]

Plot P(s) patterns laws

Parameters:: output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.plot.plot_matrix(unrescued_matrix: str = 'unrescued_map.cool', rescued_matrix: str = 'rescued_map.cool', restriction_map: str = 'restriction_map.npy', genome: str = '', vmin: float = 0.0, vmax: float = 3.5, bins: int = 2000, output_dir: str = None) → None[source]

Plot matrix with additional trackss

Parameters:

unrescued_matrix (str, optional) – Path to the unrescued map file, by default unrescued_map.cool
rescued_matrix (str, optional) – Path to rescued map file, by default rescued_map.cool
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default dist.frag.npy
genome (str, optional) – Path to the genome to digest, by default None, by default None
vmin (float, optional) – Inferior limit for the colorscale, by default 0.0
vmax (float, optional) – Superior limit for the colorscale, by default 3.5
bins (int, optional) – Size of the desired bin., by default 2000
output_dir (str, optional) – Path to the folder where to save the plot, by default None

hicberg.plot.plot_pattern_reconstruction(table: DataFrame = None, original_cool: str = None, rescued_cool: str = None, chromosome: str = None, threshold: float = 0.0, case: str = '', output_dir: str = None) → None[source]

Create a plot of pattern reconstruction quality.

Parameters:

table (pd.DataFrame, optional) – Table containing either true positives, false positives or false negatives patterns, by default None
original_cool (str, optional) – Path to the original matrix in .cool format, by default None
rescued_cool (str, optional) – Path to the rescued matrix in .cool format, by default None
chromosome (str, optional) – Selected chromosome, by default None
threshold (float, optional) – Threshold for pattern score significance, by default 0.0
case (str, optional) – Mode to consider, either true positives, false positives or false negatives, by default “”
output_dir (str, optional) – Path to save plots, by default None

hicberg.plot.plot_trans_ps(output_dir: str = None) → None[source]

Plot P(s) patterns laws

Parameters:: output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.

hicberg.statistics module

hicberg.statistics.attribute_xs(xs: ndarray[int], distance: int) → int[source]

Attibute genomic distance to the corresponding log bin of xs.

Parameters:

xs (np.ndarray[int]) – Array containing the log bins.
distance (int) – Genomic distance in bp.

Returns:

Index of the corresponding bins where distance has to be attributed.

Return type:

int

hicberg.statistics.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.

Parameters:

a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.
p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

See also

randint, shuffle, permutation

random.Generator.choice: which should be used in new code

Notes

Setting user-specified probabilities through p uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element of p is 1 / len(a).

Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its axis keyword.

Examples

Generate a uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3)
array([0, 3, 4]) # random
>>> #This is equivalent to np.random.randint(0,5,3)

Generate a non-uniform random sample from np.arange(5) of size 3:

>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
array([3, 3, 0]) # random

Generate a uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0]) # random
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:

>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
array([2, 3, 0]) # random

Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:

>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random
      dtype='<U11')

hicberg.statistics.compute_density(cooler_file: str = 'unrescued_map.cool', threads: int = 2, kernel_size: int = 11, deviation: float = 0.5, output_dir: str = None) → None[source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.

Parameters:

cooler_file (str, optional) – [description], by default None
threads (int, optional) – [description], by default 2
output_dir (str, optional) – [description], by default None

hicberg.statistics.compute_propensity(read_forward: AlignedSegment, read_reverse: AlignedSegment, restriction_map: dict = None, xs: dict = None, weirds: dict = None, uncuts: dict = None, loops: dict = None, circular: str = '', trans_ps: dict = None, coverage: dict = None, bins: int = 2000, d1d2: dict = None, density_map: dict = None, mode: str = 'full') → float[source]

Compute propensity for read pair to be selected among all plausible pairs related to multi-mapping reads.

Parameters:

read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
xs (dict) – Dictionary containing log binning values for each chromosome.
weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.
uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.
loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.
trans_ps (dict) – Dictionary of trans-chromosomal P(s)
coverage (dict) – Dictionary containing the coverage of each chromosome.
bins (int) – Size of the desired bin, by default 2000
d1d2 (np.array, optional) – Distribution of d1d2 values, by default None
density (dict) – Dictionary of contact density by chromosome couple.
mode (str, optional) – Mode to use to compute propensity among, by default “full”

Returns:

Propensity to use for read couple drawing

Return type:

float

hicberg.statistics.draw_read_couple(propensities: array) → int[source]

Draw an index respecting distribution of propensities. This function is used to draw a couple of reads considering the propensity of each couple.

Parameters:: propensities (np.array) – Array containing all the propensities of each couple of reads.
Returns:: Index of the couple of reads drawn.
Return type:: int

hicberg.statistics.generate_coverages(genome: str = None, bins: int = 2000, forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', output_dir: str = None) → None[source]

Take a genome and both forward and reverse bam files for unambiguous group and return a dictionary containing the coverage in terms of reads overs chromosomes.

Parameters:

genome (str, optional) – Path to the genome file to get coverage on, by default None
bins (int, optional) – Size of the desired bin., by default 2000
forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.2.bam
output_dir (str, optional) – Path to the folder where to save the classified alignment files, by default None, by default None

hicberg.statistics.generate_d1d2(forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', restriction_map: str = 'restriction_map.npy', output_dir: str = None) → None[source]

Compute d1d2 distance laws with the given alignments and restriction map.

Parameters:

forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.1.bam, by default “group1.2.bam”
restriction_map (str, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default “dist.frag.npy”
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None, by default None

hicberg.statistics.generate_density_map(matrix: str = 'unrescued_map.cool', size: int = 5, sigma: int = 2, n_mads: int = 2, nan_threshold: bool = False, output_dir: str = None) → dict[str, ndarray[float]][source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.

Parameters:

matrix (str, optional) – Path to a cooler matrix, by default None
rounds (int, optional) – Number of times the matrix has to be shaken, by default 1
magnitude (float, optional) – Blending ratio between the native matrix and the shaken one, by default 1.0
output_dir (str, optional) – Path to the folder where to save the density map, by default None

Returns:

Density maps as a dictionary where keys are chromosomes names couples as tuples and values are density maps.

Return type:

dict[str, np.ndarray[float]]

hicberg.statistics.generate_density_map_backup(matrix: str = 'unrescued_map.cool', rounds: int = 1, magnitude: float = 1.0, output_dir: str = None) → dict[str, ndarray[float]][source]

Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.

Parameters:

matrix (str, optional) – Path to a cooler matrix, by default None
rounds (int, optional) – Number of times the matrix has to be shaken, by default 1
magnitude (float, optional) – Blending ratio between the native matrix and the shaken one, by default 1.0
output_dir (str, optional) – Path to the folder where to save the density map, by default None

Returns:

Density maps as a dictionary where keys are chromosomes names couples as tuples and values are density maps.

Return type:

dict[str, np.ndarray[float]]

hicberg.statistics.generate_trans_ps(matrix: str = 'unrescued_map.cool', chrom_sizes: str = 'chromosome_sizes.npy', output_dir: str = None) → None[source]

Generate pseudo-P(s) while considering reads originating from different chromosomes. Pseudo-P(s) are computed as the number of interactions between two chromosomes divided by the product of the length of these two chromosomes.

Parameters:

matrix (str, optional) – Path to a cooler matrix, by default “unrescued_map.cool”
chrom_sizes (str, optional) – Path to the chromosome sizes dictionary, by default “chromosome_sizes.npy”
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None.

hicberg.statistics.generate_xs(chromosome_size: int, base: float = 1.1) → ndarray[int][source]

Generate xs array for computing P(s). Return xs array which is log space.

Parameters:

chromosome_size (int) – Size of the chromosome to be binned in bp.
base (float, optional) – Base of the log space., by default 1.1

Returns:

Array of log bins related to the chromosome.

Return type:

np.ndarray[int]

hicberg.statistics.get_d1d2(read_forward: AlignedSegment, read_reverse: AlignedSegment, restriction_map: dict = None, d1d2: array = None) → int[source]

Get the d1d2 value of a pair of reads.

Parameters:

read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
d1d2 (np.array, optional) – Distribution of d1d2 values, by default None

Returns:

d1d2 value

Return type:

int

hicberg.statistics.get_density(read_forward: ~pysam.libcalignedsegment.AlignedSegment, read_reverse: ~pysam.libcalignedsegment.AlignedSegment, density_map: dict[slice((<class 'str'>, <class 'str'>), <built-in function array>, None)], bin_size: int = 2000) → float[source]

Get density from density map dictionary.

Parameters:

read_forward (pysam.AlignedSegment) – Forward read to get density from.
read_reverse (pysam.AlignedSegment) – Reverse read to get density from.
density_map (dict) – Dictionary containing density maps for each chromosome couple.
bin_size (int, optional) – Resolution of the matrix on which density map has been computed, by default 2000

Returns:

Density corresponding to the pair of reads.

Return type:

float

hicberg.statistics.get_dist_frags(genome: str = None, restriction_map: dict = None, circular: str = '', rate: float = 1.0, output_dir: str = None) → None[source]

Get the distribution of fragments’ distance from a genome distribution .

Parameters:

genome (str, optional) – Path to the genome, by default None
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
circular (str, optional) – Name of the chromosomes to consider as circular, by default “”
rate (float, optional) – Set the proportion of restriction sites to consider. Avoid memory overflow when restriction maps are very dense, by default 1.0
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None

Returns:

Dictionary of sub-sampled restriction map with keys as chromosome names and values as lists of restriction sites’ position.

Return type:

dict

hicberg.statistics.get_pair_cover(read_forward: AlignedSegment, read_reverse: AlignedSegment, coverage: dict, bins: int) → float[source]

Get the coverage of a pair of reads.

Parameters:

read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
coverage (dict) – Dictionary containing the coverage of each chromosome.
bins (int) – Size of the desired bin.

Returns:

Product of the coverage of the pair of reads.

Return type:

float

hicberg.statistics.get_pair_ps(read_forward: AlignedSegment, read_reverse: AlignedSegment, xs: dict, weirds: dict, uncuts: dict, loops: dict, circular: str = '') → float[source]

Take two reads and return the P(s) value depending on event type (intra-chromosomal case only).

Parameters:

read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
xs (dict) – Dictionary containing log binning values for each chromosome.
weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.
uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.
loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.

Returns:

P(s) of the pair considering the event type.

Return type:

float

hicberg.statistics.get_patterns(forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', xs: str = 'xs.npy', chrom_sizes: str = 'chromosome_sizes.npy', circular: str = '', blacklist: str = None, output_dir: str = None) → None[source]

Get the patterns distribution from read pairs alignment. .

Parameters:

forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”, by default “group1.1.bam”
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”, by default “group1.2.bam”
xs (str, optional) – Path to the dictionary containing the xs values, by default “xs.npy”
dist_frag (str, optional) – Path to the dictionary containing the inter fragment distances, by default “dist.frag.npy”
circular (str, optional) – Name of the chromosomes to consider as circular, by default “”
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None, by default None, by default None

hicberg.statistics.get_restriction_map(genome: str = None, enzyme: list[str] = ['DpnII'], output_dir: str = None) → dict[str, ndarray[int]][source]

Get ordered restriction map (including 0 and n) from a chromosome sequence. Return a dictionary where keys are chromosomes names and values are restrictions sites positions.

Parameters:

genome (str, optional) – Path to the genome to digest, by default None, by default None
enzyme (list[str], optional) – Enzyme or list of enzyme to digest the genome with. If integer passed, micro-C mode using Mnase is used, and the integer correspond to the size of nucleosomal fragment, by default None, by default [“DpnII”]
output_dir (str, optional) – Path to the folder where to save the plot, by default None

Returns:

Dictionary of the product of digestion where keys are chromosomes names and values are restrictions sites positions.

Return type:

dict

hicberg.statistics.get_top_pattern(file: str = None, top: int = 10, chromosome: str = None) → DataFrame[source]

Get top patterns from a dataframe

Parameters:

df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None
top (int, optional) – Percentage of top patterns to get, by default 10
chromosome (str, optional) – Chromosome to consider, by default None

Returns:

Dataframe containing top percentage patterns.

Return type:

pd.DataFrame

hicberg.statistics.get_trans_ps(read_forward: AlignedSegment, read_reverse: AlignedSegment, trans_ps: dict) → float[source]

Parameters:

read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
trans_ps (dict) – Dictionary of trans-chromosomal P(s)

Returns:

Trans P(s) value

Return type:

float

hicberg.statistics.log_bin_genome(genome: str, base: float = 1.1, output_dir: str = None) → dict[str, ndarray[int]][source]

hicberg.statistics.pearson_score(original_matrix: Cooler, rescued_matrix: Cooler, markers: list[int]) → float[source]

Compute Pearson correlation between concatenated matrix bins which have been deleted and reconstructed.

Parameters:

original_matrix (cooler.Cooler) – Cooler object containing the original matrix.
rescued_matrix (cooler.Cooler) – Cooler object containing the reconstructed matrix.
markers (list[int]) – List of markers to consider. Markers are the bins which have been deleted.

Returns:

Pearson correlation between original and reconstructed matrix.

Return type:

float

hicberg.statistics.reattribute_reads(reads_couple: tuple[str, str] = ('group2.1.bam', 'group2.2.bam'), restriction_map: dict = None, xs: dict = 'xs.npy', weirds: dict = 'weirds.npy', uncuts: dict = 'uncuts.npy', loops: dict = 'loops.npy', circular: str = '', trans_ps: dict = 'trans_ps.npy', coverage: dict = 'coverage.npy', bins: int = 2000, d1d2: dict = 'd1d2.npy', density_map: dict = 'density_map.npy', mode: str = 'full', output_dir: str = None) → None[source]

Re-attribute multi-mapping (ambiguous) reads considering sets of statistical laws.

Parameters:

reads_couple (tuple[str, str], optional) – Paths to ambiguous reads alignment files (.bam), by default (“group2.1.bam”, “group2.2.bam”)
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
xs (dict) – Dictionary containing log binning values for each chromosome.
weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.
uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.
loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.
trans_ps (dict) – Dictionary of trans-chromosomal P(s)
coverage (dict) – Dictionary containing the coverage of each chromosome.
bins (int) – Size of the desired bin, by default 2000
d1d2 (np.array, optional) – Distribution of d1d2 values, by default None
density (np.array, optional) – Dictionary containing density maps per chromosome couples as, by default None
mode (str, optional) – Mode to use to compute propensity among, by default “full”
output_dir (str, optional) – Path to the re-attributed ambiguous reads alignment files are saved, by default None

hicberg.utils module

hicberg.utils.bam_iterator(bam_file: str = None) → Iterator[AlignedSegment][source]

Returns an iterator for the given SAM/BAM file (must be query-sorted). In each call, the alignments of a single read are yielded as a 3-tuple: (list of primary pysam.AlignedSegment, list of supplementary pysam.AlignedSegment, list of secondary pysam.AlignedSegment).

Parameters:: bam ([str]) – Path to alignment file in .sam or .bam format.
Yields:: Iterator[pysam.AlignedSegment] – Yields a list containing pysam AlignmentSegment objects, within which all the reads have the same id.

hicberg.utils.block_counter(forward_bam_file: str, reverse_bam_file: str) → Tuple[int, int][source]

Return as a tuple the number of blocks in the forward and reverse bam files.

Parameters:

forward_bam_file (str, optional) – Path to forward .bam alignment file.
reverse_bam_file (str, optional) – Path to reverse .bam alignment file.

Returns:

Number of blocks in the forward and reverse bam files.

Return type:

Tuple[int, int]

hicberg.utils.choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

New in version 1.7.0.

Note

New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.

Parameters:

a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of a can be selected multiple times.
p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Returns:

samples – The generated random samples

Return type:

single item or ndarray

Raises:

ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size

hicberg package

Submodules

hicberg.align module

hicberg.benchmark module

hicberg.eval module

hicberg.io module

hicberg.omics module

hicberg.pipeline module

hicberg.plot module

hicberg.statistics module

hicberg.utils module

hicberg.version module

Module contents