hicberg package
Submodules
hicberg.align module
- hicberg.align.hic_align(index: str, fq_for: str, fq_rev: str, sensitivity: str = 'very-sensitive', max_alignment: int = None, cpus: int = 1, output_dir: str = None, verbose: bool = False) None [source]
Alignment of reads from HiC experiments along an indexed genome.
- Parameters:
index (str) – Path to the index of the genome along which reads are going to be aligned (path to .bt2l files). Default to None, index files are searched to sample_name/data/index/sample_name.
fq_for (str) – Path to .fasta containing set of reads to align (forward mate).
fq_rev (str) – Path to .fasta containing set of reads to align (forward mate).
sensitivity (str, optional) – Sensitivity of the alignment., by default ‘very_sensitive’
max_alignment (int, optional) – Maximum number of alignments to be returned, by default None
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.sam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False
- hicberg.align.hic_build_index(genome: str, output_dir: str = None, cpus: int = 1, verbose: bool = False) None [source]
Building of bowtie2 index (.bt2l files) for read alignment.
- Parameters:
genome (str) – Path to the genome file along which reads are going to be aligned.
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the Bowtie2 index files should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False
- hicberg.align.hic_index(bam_for: str = '1.sorted.bam', bam_rev: str = '2.sorted.bam', cpus: int = 1, output_dir: str = None, verbose: bool = False) None [source]
Index a coordinate-sorted BGZIP-compressed SAM, BAM or CRAM file for fast random access.
- Parameters:
bam_for (str, optional) – Forward alignment file to be indexed, by default “1.sorted.bam”
bam_rev (str, optional) – Reverse alignment file to be indexed,, by default “2.sorted.bam”
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False
- hicberg.align.hic_sort(bam_for: str = '1.bam', bam_rev: str = '2.bam', cpus: int = 1, output_dir: str = None, verbose: bool = False) None [source]
Sort .bam alignment files by read_name (using samtools).
- Parameters:
bam_for (str, optional) – Forward alignment file to be sorted, by default “1.bam”
bam_rev (str, optional) – Reverse alignment file to be sorted, by default “2.bam”
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False
- hicberg.align.hic_view(sam_for: str = '1.sam', sam_rev: str = '2.sam', cpus: int = 1, output_dir: str = None, verbose: bool = False) None [source]
Conversion of .sam alignment files to .bam alignment format (using samtools).
- Parameters:
sam_for (str, optional) – Path to forward .sam alignment file, by default “1.sam”
sam_rev (str, optional) – Path to reverse .sam alignment file, by default “2.sam”
cpus (int, optional) – Number of threads allocated for the alignment, by default 1
output_dir (str, optional) – Path where the alignment files (.bam) should be stored, by default None
verbose (bool, optional) – Set wether or not the shell command should be printed, by default False
hicberg.benchmark module
- hicberg.benchmark.benchmark(output_dir: str = None, chromosome: str = '', position: int = 0, trans_chromosome: str = None, trans_position: int = None, strides: list[int] = [], mode: str = 'full', auto: int = None, kernel_size: int = 11, deviation: float = 0.5, bins: int = None, circular: str = '', genome: str = None, pattern: str = None, threshold: float = 0.0, jitter: int = 0, trend: bool = True, top: int = 100, force: bool = False, iterations: int = 3, cpus: int = 8)[source]
Performs a benchmark of the HiC-Berg method by simulating the deletion of a genomic region and evaluating the method’s ability to recover the original signal.
The benchmark simulates the deletion of a genomic region by removing reads that map to that region from a Hi-C dataset. It then applies the HiC-Berg method to the depleted dataset to try to recover the original signal. The benchmark evaluates the method’s performance by comparing the rescued contact map to the original contact map.
- Parameters:
output_dir (str, optional) – Path to the output directory.
chromosome (str, optional) – Chromosome to perform the benchmark on.
position (int, optional) – Position of the deletion on the chromosome.
trans_chromosome (str, optional) – Chromosome to consider for trans interactions.
trans_position (int, optional) – Position of the deletion on the trans chromosome.
strides (list[int], optional) – List of strides to use for the deletion.
mode (str, optional) – Mode of the HiC-Berg method to use. Can be “full” or “density”.
auto (int, optional) – Automatically determine the size of the deletion based on the given number of bins.
kernel_size (int, optional) – Size of the kernel to use for the density estimation.
deviation (float, optional) – Standard deviation of the kernel to use for the density estimation.
bins (int, optional) – Number of bins to use for the deletion.
circular (str, optional) – Whether the chromosome is circular.
genome (str, optional) – Genome assembly to use.
pattern (str, optional) – Pattern to use for Chromosight pre-call.
threshold (float, optional) – Threshold to use for Chromosight pre-call.
jitter (int, optional) – Jitter to use for Chromosight pre-call.
trend (bool, optional) – Whether to use trend for Chromosight pre-call.
top (int, optional) – Top patterns to use for Chromosight pre-call.
force (bool, optional) – Whether to force the benchmark to run even if the output directory already exists.
iterations (int, optional) – Number of iterations to run the HiC-Berg method.
cpus (int, optional) – Number of CPUs to use for parallel processing.
- Raises:
ValueError – If the output directory does not exist.
ValueError – If the restriction map file does not exist.
hicberg.eval module
- hicberg.eval.arange_multi(starts, stops=None, lengths=None)[source]
Create concatenated ranges of integers for multiple start/length. :param starts: Starts for each range :type starts: numpy.ndarray :param stops: Stops for each range :type stops: numpy.ndarray :param lengths: Lengths for each range. Either stops or lengths must be provided. :type lengths: numpy.ndarray
- Returns:
concat_ranges – Concatenated ranges.
- Return type:
numpy.ndarray
Notes
See the following illustrative example: starts = np.array([1, 3, 4, 6]) stops = np.array([1, 5, 7, 6]) print arange_multi(starts, lengths) >>> [3 4 4 5 6] From: https://codereview.stackexchange.com/questions/83018/vectorized-numpy-version-of-arange-with-multiple-start-stop
- hicberg.eval.check_emptiness(intervals: dict[str, list[int, int]], matrix: Cooler = None) bool [source]
Check if intervals are empty in a given matrix.
- Parameters:
intervals (dict[str, list[(int, int)]) – Lists of intervals (sets) where the value is not in between as element 0 and where the value is in between as element 1. One list per chromosome in a dictionary.
matrix (cooler.Cooler, optional) – Cooler Hi-C matrix to be checked for emptiness, by default None
- Returns:
Returns True if one of the intervals are empty in the matrix, False otherwise.
- Return type:
bool
- hicberg.eval.choice(a, size=None, replace=True, p=None)
Generates a random sample from a given 1-D array
New in version 1.7.0.
Note
New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.
- Parameters:
a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were
np.arange(a)
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. Default is None, in which case a single value is returned.replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of
a
can be selected multiple times.p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in
a
.
- Returns:
samples – The generated random samples
- Return type:
single item or ndarray
- Raises:
ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size
See also
randint
,shuffle
,permutation
random.Generator.choice
which should be used in new code
Notes
Setting user-specified probabilities through
p
uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element ofp
is 1 / len(a).Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its
axis
keyword.Examples
Generate a uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3) array([0, 3, 4]) # random >>> #This is equivalent to np.random.randint(0,5,3)
Generate a non-uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0]) array([3, 3, 0]) # random
Generate a uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False) array([3,1,0]) # random >>> #This is equivalent to np.random.permutation(np.arange(5))[:3]
Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0]) array([2, 3, 0]) # random
Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher'] >>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3]) array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random dtype='<U11')
- hicberg.eval.chromosight_cmd_generator(file: str = None, pattern: str = 'loops', untrend: bool = True, mode: bool = False, output_dir: str = None) str [source]
Generate chromosight command line to run pattern detection.
- Parameters:
file (str, optional) – Path to Hi-C balanced contact matrix in .cool format , by default None
pattern (str, optional) – Pattern to detect, by default “loops”
untrend (bool, optional) – Set if map has to be detrended, by default True
mode (bool, optional) – Set either the detection has to be performed before or after map reconstruction, by default False
output_dir (str, optional) – Path to the folder where to save alignments, by default None
- Returns:
Chromosight command line to run to run pattern detection
- Return type:
str
- hicberg.eval.draw_intervals(nb_intervals: int = 1, bins: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy') dict[str, list[tuple[int, int]]] [source]
Draw intervals from a given chromosome sizes dictionary.
- Parameters:
nb_intervals (int, optional) – Number of intervals to draw, by default 1
bins (int, optional) – Size of the desired bin, by default 2000., by default 2000
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
- Returns:
Dictionary containing the intervals as {chromosome : [(start, end), …]}.
- Return type:
dict[str, list[tuple[int, int]]]
- hicberg.eval.generate_dict_coordinates(matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) dict[str, list[int, int]] [source]
Generates a dictionary of genomic coordinates for selecting regions in a Hi-C contact map.
This function defines genomic intervals based on provided parameters such as chromosome, position, strides, and bin size. It can generate intervals for both intra-chromosomal and inter-chromosomal interactions. The intervals can be used to select specific regions in a Hi-C contact map for further analysis, such as simulating deletions or duplications.
- Parameters:
matrix_file (str, optional) – Path to the cooler file containing the Hi-C contact map, by default “unrescued_map.cool”
position (int, optional) – Genomic position on the chromosome, by default 0
chromosome (str | list[str], optional) – Chromosome or list of chromosomes, by default “”
bin_size (int, optional) – Size of the bins in the Hi-C contact map, by default 2000
chrom_sizes_dict (str, optional) – Path to the chromosome sizes dictionary, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to use for generating intervals, by default [0]
trans_chromosome (str, optional) – Chromosome for trans-interactions, by default None
output_dir (str, optional) – Path to the output directory, by default None
trans_position (list[int], optional) – List of positions on the trans-chromosome, by default None
nb_bins (int, optional) – Number of bins to include in each interval, by default 1
random (bool, optional) – Whether to generate random intervals, by default False
auto (int, optional) – Number of random intervals to generate, by default None
- Returns:
A dictionary where keys are chromosomes and values are lists of (start, end) genomic intervals.
- Return type:
dict[str, list[(int, int)]]
- Raises:
ValueError – If the output path, chromosome sizes file, or matrix file does not exist.
- hicberg.eval.get_FN_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) DataFrame [source]
Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns
- Parameters:
df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.
- Returns:
dataframe – Table containing components before and after pattern recall (True positives)
- Return type:
[pandas DataFrame]
- hicberg.eval.get_FP_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None)[source]
Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns
- Parameters:
df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.
- Returns:
dataframe – Table containing components before and after pattern recall (True positives)
- Return type:
[pandas DataFrame]
- hicberg.eval.get_TP_table(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) DataFrame [source]
Take dataframe of pattern call (Chromosight) before and after reconstruction and return table of retieved patterns
- Parameters:
df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.
- Returns:
dataframe – Table containing components before and after pattern recall (True positives)
- Return type:
[pandas DataFrame]
- hicberg.eval.get_bin_indexes(matrix: str = None, dictionary: dict = None)[source]
Return the bins indexes (in the cooler matrix) corresponding to the genomic coordinates in the dictionary.
- Parameters:
matrix (str, optional) – Path to a cooler matrix file, by default None
dictionary (dict, optional) – Dictionary of genomic coordinates where reads are going to be selected for duplication, by default None
- hicberg.eval.get_boundaries(position: int = None, bins: int = 2000, chromosome: str | list[str] = None, chrom_sizes_dict: str = 'chromosome_sizes.npy') tuple[int, int] [source]
Return boundaries surrounding position considering regular binning of bins.
- Parameters:
position (int, optional) – Position to get surrounding boundaries from, by default None
bins (int, optional) – Size of the desired bin, by default 2000., by default 2000
chromosome (str or list[str], optional) – Chromosomes associated to positions to get surrounding boundaries from, by default None
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
- Returns:
Tuple containing the boundaries of the position : (lower_bound, upper_bound).
- Return type:
tuple[int, int]
- hicberg.eval.get_chromosomes_intervals(bins: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', chromosome: str = '') list[int, int] [source]
Get all possible intervals from a given chromosome considering a chromosome sizes dictionary.
- Parameters:
bins (int, optional) – Size of the desired bin, by default 2000., by default 2000
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
chromosome (str, optional) – Chromosome to get intervals from, by default “”
- Returns:
List containing the intervals as [(start, end), …]
- Return type:
list[(int, int)]
- hicberg.eval.get_f1_score(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) float [source]
Take dataframe of pattern call (Chromosight) before and after reconstruction and return f1 score i.e. 2 * ((precision * recall) / (precision + recall)).
- Parameters:
df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.
- Returns:
f1_score – 2 * ((precision * recall) / (precision + recall))
- Return type:
float
- hicberg.eval.get_interval_index(chromosome: str = '', value: int = None, intervals_dict: dict[str, list[int, int]] = None, chrom_sizes_dict: dict[str, int] = None) dict[str, list[int, int]] [source]
- Parameters:
chromosome ([str]) – chromosome linked to value and intervals to search in.
value ([int]) – Value to search in intervals.
intervals_dict ([dict]) – Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
- Returns:
Lists of intervals (sets) where the value is not in between as element 0 and where the value is in between as element 1. One list per chromosome in a dictionary. If the value in not in any interval, return None.
- Return type:
[dict]
- hicberg.eval.get_intervals_proportions(chrom_sizes_dict: str = 'chromosome_sizes.npy', nb_intervals: int = 1) dict[str, float] [source]
Extract a dictionary containing the number of intervals to draw considering the size of each chromosome.
- Parameters:
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
nb_intervals (int, optional) – Number of intervals to draw, by default 1
- Returns:
Dictionary containing proportion by intervals as {chromosome : proportion}.
- Return type:
dict[str, int]
- hicberg.eval.get_precision(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) float [source]
Take dataframe of pattern call (Chromosight) before and after reconstruction and return precision score corresponding to the number of true positives over the number of true positives and false positives.
- Parameters:
df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.
- Returns:
precision – number of true positives over the number of true positives and false positives
- Return type:
float
- hicberg.eval.get_recall(df_pattern: DataFrame, df_pattern_recall: DataFrame, chromosome: str, bin_size: int, jitter: int = 0, threshold: float = None) float [source]
Take dataframe of pattern call (Chromosight) before and after reconstruction and return recall score corresponding to the number of true positives over the number of true positives and false negatives.
- Parameters:
df_pattern ([dataframe]) – Pandas DataFrame given by chromosight before HiC map reconstruction.
df_pattern_recall ([dataframe]) – Pandas DataFrame given by chromosight after HiC map reconstruction.
chromosome ([str]) – chromosome to select position within.
bin_size ([int]) – bin size used to construct the HiC map.
jitter ([int]) – jitter to apply to the pattern recall table to allow overlapping with the pattern table. By default 0.
threshold ([float]) – threshold to apply to the pattern table to select patterns to consider. By default None.
- Returns:
recall – number of true positives over the number of true positives and false negatives
- Return type:
float
- hicberg.eval.get_top_pattern(file: str = None, top: int = 10, threshold: float = 0.0, chromosome: str = None) DataFrame [source]
Get top patterns from a dataframe
- Parameters:
df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None
top (int, optional) – Percentage of top patterns to get, by default 10
threshold (float, optional) – Pattern Pearson score to consider to select pattern, by default 0.0
chromosome (str, optional) – Chromosome to consider, by default None
- Returns:
Dataframe containing top percentage patterns.
- Return type:
pd.DataFrame
- hicberg.eval.hicberg_benchmark_cmd_generator(file: str = None, top: int = 10, threshold: float = 0.0, chromosome: str = None, mode: str = 'full', genome: str = None, bins: int = 0, output: str = None) str [source]
Get top patterns from a dataframe
- Parameters:
df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None
top (int, optional) – Percentage of top patterns to get, by default 10
threshold (float, optional) – Pattern Pearson score to consider to select pattern, by default 0.0
chromosome (str, optional) – Chromosome to consider, by default None
mode (str, optional) – Mode to consider for hicberg benchmark, by default “full”
genome (str, optional) – Path to the genome to consider for hicberg benchmark, by default None
bins (int, optional) – Number of bins to consider for hicberg benchmark, by default 0
output (str, optional) – Path to the output folder to consider for hicberg benchmark, by default None
- Returns:
HiC-BERG command line to run to evaluate reconstruction after pattern discarding
- Return type:
str
- hicberg.eval.intersect2D(a, b)[source]
Find row intersection between 2D numpy arrays, a and b. Returns another numpy array with shared rows
- hicberg.eval.overlap_intervals(starts1, ends1, starts2, ends2, closed=False, sort=False)[source]
Take two sets of intervals and return the indices of pairs of overlapping intervals.
- Parameters:
starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
closed (bool) – If True, then treat intervals as closed and report single-point overlaps.
- Returns:
overlap_ids – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.
- Return type:
numpy.ndarray
- hicberg.eval.overlap_intervals_outer(starts1, ends1, starts2, ends2, closed=False)[source]
Take two sets of intervals and return the indices of pairs of overlapping intervals, as well as the indices of the intervals that do not overlap any other interval.
- Parameters:
starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.
closed (bool) – If True, then treat intervals as closed and report single-point overlaps.
- Returns:
overlap_ids (numpy.ndarray) – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.
no_overlap_ids1, no_overlap_ids2 (numpy.ndarray) – Two 1D arrays containing the indices of intervals in sets 1 and 2 respectively that do not overlap with any interval in the other set.
- hicberg.eval.select_reads(bam_for: str = 'group1.1.bam', bam_rev: str = 'group1.2.bam', matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) dict[str, list[int, int]] [source]
Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.
- Parameters:
bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”
bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”
matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”
position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0
chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”
bin_size (int, optional) – Resolution of the given matrix, by default 2000
chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]
trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None
trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None
nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1
random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False
auto (int, optional) – Set the number of intervals to get while picking at random, by default None
output_dir (str, optional) – Path to the folder where to save alignments, by default None
- Returns:
Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].
- Return type:
dict[str, list[(int, int)]]
- hicberg.eval.select_reads_multithreads(bam_couple: tuple[str, str], interval_dictionary: dict[str, (<class 'int'>, <class 'int'>)], chrom_sizes_dict: str = 'chromosome_sizes.npy', output_dir: str = None) None [source]
Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.
- Parameters:
bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”
bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”
matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”
position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0
chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”
bin_size (int, optional) – Resolution of the given matrix, by default 2000
chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]
trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None
trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None
nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1
random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False
auto (int, optional) – Set the number of intervals to get while picking at random, by default None
output_dir (str, optional) – Path to the folder where to save alignments, by default None
- Returns:
Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].
- Return type:
dict[str, list[(int, int)]]
- hicberg.eval.select_reads_multithreads_bckp(bam_couple: tuple[str, str], matrix_file: str = 'unrescued_map.cool', position: int = 0, chromosome: str | list[str] = '', bin_size: int = 2000, chrom_sizes_dict: str = 'chromosome_sizes.npy', strides: list[int] = [0], trans_chromosome: str = None, output_dir: str = None, trans_position: list[int] = None, nb_bins: int = 1, random: bool = False, auto: int = None) dict[str, list[int, int]] [source]
Select reads from a given matrix and alignment files. Two groups of alignment files are produced (.bam): - group1.1.in.bam and group1.2.in.bam : Forward and reverse reads from the source and target genomic interval. These reads are going to be duplicated un each intervals. - group1.1.out.bam and group1.2.out.bam : Forward and reverse reads not included the source and target genomic interval. These reads are going to be conserved and not duplicated.
- Parameters:
bam_for (str, optional) – Forward alignment file to select reads from, by default “group1.1.bam”
bam_rev (str, optional) – Reverse alignment file to select reads from, by default “group1.2.bam”
matrix_file (str, optional) – Name of the matrix from which the selection will be based (.cool format), by default “unrescued_map.cool”
position (int, optional) – Genomic coordinates (0-based) used to defined the source genomic interval, by default 0
chromosome (str or list[str], optional) – Chromosome associated to the 0-based coordinates given by “position’, by default “”
bin_size (int, optional) – Resolution of the given matrix, by default 2000
chrom_sizes_dict (str, optional) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format, by default “chromosome_sizes.npy”
strides (list[int], optional) – List of strides to apply from “position” to define target genomics intervals (in bp.), by default [0]
trans_chromosome (str, optional) – Chromosomes to consider as trans-chromosomes to define genomic target intervals, by default None
trans_position (list[int], optional) – Genomic coordinates (0-based) used to defined the trans-chromosomal target genomic interval, by default None
nb_bins (int, optional) – Number of bins to consider on each side of the defined source and target intervals, by default 1
random (bool, optional) – Set wether or not the source and target intervals are defined at random, by default False
auto (int, optional) – Set the number of intervals to get while picking at random, by default None
output_dir (str, optional) – Path to the folder where to save alignments, by default None
- Returns:
Dictionary of intervals as key : chromosome, value : [(low_limit_0, up_limit_0), (low_limit_1, up_limit_1), …, (low_limit_n, up_limit_n)].
- Return type:
dict[str, list[(int, int)]]
hicberg.io module
- hicberg.io.build_matrix(bins: str = 'fragments_fixed_sizes.txt', pairs: str = 'group1.pairs', mode: bool = False, balance: bool = False, cpus: int = 8, output_dir: str = None) None [source]
Take table of bins and .pairs file and build a matrix in .cool format.
- Parameters:
bins (str) – Path to the bin file.
pairs (str, optional) – Path to pairs file, by default “group1.pairs”
mode (bool, optional) – Choose wether the mode is rescued or un-rescued to construct associated .cool file, by default False
balance (bool, optional) – Set wether or not to balance the matrix, by default False
output_dir (str, optional) – Path to the folder where to save the cooler matrix file, by default None, by default None
- hicberg.io.build_pairs(bam_for: str = 'group1.1.bam', bam_rev: str = 'group1.2.bam', bam_for_rescued: str = 'group2.1.rescued.bam', bam_rev_rescued: str = 'group2.2.rescued.bam', mode: bool = False, output_dir: str = None) None [source]
Build pairs of reads from the aligned reads.
- Parameters:
bam_for (str, optional) – Path to forward .bam file for the construction of the .pairs equivalent file (non rescued)., by default “group1.1.bam”
bam_rev (str, optional) – Path to reverse .bam file for the construction of the .pairs equivalent file (non rescued)., by default None, by default “group1.2.bam”
bam_for_rescued (str, optional) – Path to forward .bam file for the construction of the .pairs equivalent file (rescued)., by default “group2.1.rescued.bam”
bam_rev_rescued (str, optional) – Path to reverse .bam file for the construction of the .pairs equivalent file (rescued)., by default “group2.2.rescued.bam”
mode (bool, optional) – Choose wether the mode is rescued or unrescued to construct associated .pairs file, by default False
output_dir (str, optional) – Path where the alignment files (.sam) should be stored, by default None
- hicberg.io.create_folder(sample_name: str = None, output_dir: str = None, force: bool = False) None [source]
Creates folder architecture to store results and intermediate files for the full HiC-BERG pipeline.
- Parameters:
sample_name (str) – Name of the folder to be created.
force (bool) – Set if existing folder has to be deleted before folder creation.
output_dir (str) – Path where the folder will be created.
- Returns:
Path of the folder created
- Return type:
[str]
- hicberg.io.load_cooler(matrix: str = None) Cooler [source]
Load cooler matrix.
- Parameters:
matrix (str, optional) – Path to a cooler matrix, by default None
- Returns:
Cooler matrix object
- Return type:
cooler.Cooler
- hicberg.io.load_dictionary(dictionary: str = None) dict [source]
Load dictionary save into numpy format (.npy).
- Parameters:
dictionary (str, optional) – Path to a the dictionary saved in numpy (.npy) format, by default None
- Returns:
Python native dictionary
- Return type:
dict
- hicberg.io.merge_predictions(output_dir: str = None, clean: bool = True, stage='prediction', cpus: int = 1) None [source]
Merge predictions of all chunks of ambiguous reads predictions.
- Parameters:
output_dir (str, optional) – Path to the folder where to save the fused alignment file, by default None
clean (bool, optional) – Set weither or not to remove temporary chunks, by default True
stage (str, optional) – Set the stage of the merging, by default “prediction”. Can be “prediction” or “classification”
cpus (int, optional) – Number of cpus to use for the merging, by default 1
hicberg.omics module
- hicberg.omics.bedgraph_to_bigwig(bedgraph_file: str = 'coverage.bedgraph', chromosome_sizes: str = 'chromosome_sizes.txt', output_dir: str = None) None [source]
Convert bedgraph to bigwig format.
- Parameters:
bedgraph_file (str, optional) – Path to coverage (.bedgraph), by default “coverage.bedgraph”
chromosome_sizes (str, optional) – Path to chromosome sizes file (chrom_id, size), by default “chromosome_sizes.txt”
output_dir (str, optional) – [description], by default None
- Raises:
IOError – [description]
IOError – [description]
IOError – [description]
- hicberg.omics.format_chrom_sizes(chromosome_sizes: str = 'chromosome_sizes.npy', output_dir: str = None) None [source]
Format chromosome sizes to bed and txt format. - bed format : chrom, start, end - txt format : chrom, size :param chrom_sizes: Path to chromosome sizes file (.npy), by default “chromosome_sizes.npy” :type chrom_sizes: str, optional :param output_dir: Path where the formatted chromosome sizes will be saved, by default None :type output_dir: str, optional
- hicberg.omics.get_bed_coverage(chromosome_sizes: str = 'chromosome_sizes.bed', pairs_file: str = 'preprocessed_pairs.pairs', output_dir: str = None) None [source]
Get bed coverage from pairs file (using bedtools).
- Parameters:
chromosome_sizes (str, optional) – Path to chromsomes sizes files (.bed format), by default “chromosome_sizes.bed”
pairs_file (str, optional) – Path to processed pairs files (columns : chrom, start, end, count), by default “preprocessed_pairs.pairs”
output_dir (str, optional) – Path where the coverage (.bed) will be saved, by default None
- hicberg.omics.get_bedgraph(bed_coverage: str = 'coverage.bed', output_dir: str = None) None [source]
Convert bed coverage to bedgraph format. Format is : chrom, start, end, count. Start and end are different by 1bp (end = start + 1).
- Parameters:
bed_coverage (str, optional) – Path to coverage (.bed), by default “coverage.bed”
output_dir (str, optional) – Path where the coverage (.bedgraph) will be saved, by default None
- hicberg.omics.preprocess_pairs(pairs_file: str = 'all_group.pairs', threshold: int = 1000, output_dir: str = None) None [source]
Preprocess pairs file to remove pairs that are not in the same chromosome or are greater than a threshold. Retain columns are : chromosome, start, end, count.
- Parameters:
pairs_file (str, optional) – Path to the pairs file, by default “all_group.pairs”
threshold (int, optional) – Threshold distance beyond which pairs will not be kept, by default 1000
output_dir (str, optional) – Path where the formatted pairs will be saved, by default None
hicberg.pipeline module
- hicberg.pipeline.check_tool(name: str) bool [source]
Check whether name is on PATH and marked as executable.
- Parameters:
name ([str]) – Depenedency to check.
- Returns:
True if the tool is available, False otherwise.
- Return type:
[bool]
- hicberg.pipeline.pipeline(name: str = 'sample', start_stage: str = 'fastq', exit_stage: str = 'None', genome: str = None, index=None, fq_for: str = None, fq_rev: str = None, sensitivity: str = 'very-sensitive', max_alignment: int = None, mapq: int = 35, enzyme: list[str] = ['DpnII', 'HinfI'], circular: str = '', rate: float = 1.0, distance: int = 1000, bins: int = 2000, nb_chunks: int = 1, mode: str = 'full', kernel_size: int = 11, deviation: float = 0.5, verbose: bool = False, cpus: int = 1, output_dir: str = None, force: bool = False, blacklist: str = None) None [source]
hicberg.plot module
- hicberg.plot.plot_benchmark(original_matrix: str = None, depleted_matrix: str = None, rescued_matrix: str = None, chromosomes: list[str] = None, output_dir: str = None) None [source]
Plot benchmark results (original, depleted and rescued matrices with associated log ratios). One plot per chromosome.
- Parameters:
original_matrix (str, optional) – Path to the original matrix, by default None
rescued_matrix (str, optional) – Path to the rescued matrix (re-attributed reads), by default None
chromosomes (list[str], optional) – List of chromosomes to plot, by default None
output_dir (str, optional) – Path to where to save plots, by default None
- hicberg.plot.plot_couple_repartition(forward_bam_file: str = 'group2.1.rescued.bam', reverse_bam_file: str = 'group2.2.rescued.bam', output_dir: str = None) None [source]
Plot read couples sizes distribution
- Parameters:
forward_bam_file (str, optional) – Path to forward .bam alignment file, by default 1.sorted.bam
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default 2.sorted.bam Minimal read quality under which a Hi-C read pair will not be kept, by default 30
output_dir (str, optional) – Path to the folder where to save the plot, by default None
- hicberg.plot.plot_coverages(bins: int = 2000, output_dir: str = None) None [source]
Plot coverages of chromosomes
- Parameters:
bins (int, optional) – Size of the desired bin., by default 2000
output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.
- hicberg.plot.plot_d1d2(output_dir: str = None) None [source]
Plot d1d2 law
- Parameters:
output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.
- hicberg.plot.plot_density(output_dir: str = None) None [source]
Plot density maps
- Parameters:
output_dir (str, optional) – Path to the folder where to save the plots (one plot per chromosome couple), by default None.
- hicberg.plot.plot_laws(output_dir: str = None) None [source]
Plot P(s) patterns laws
- Parameters:
output_dir (str, optional) – Path to the folder where to save the plot, by default None, by default None.
- hicberg.plot.plot_matrix(unrescued_matrix: str = 'unrescued_map.cool', rescued_matrix: str = 'rescued_map.cool', restriction_map: str = 'restriction_map.npy', genome: str = '', vmin: float = 0.0, vmax: float = 3.5, bins: int = 2000, output_dir: str = None) None [source]
Plot matrix with additional trackss
- Parameters:
unrescued_matrix (str, optional) – Path to the unrescued map file, by default unrescued_map.cool
rescued_matrix (str, optional) – Path to rescued map file, by default rescued_map.cool
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default dist.frag.npy
genome (str, optional) – Path to the genome to digest, by default None, by default None
vmin (float, optional) – Inferior limit for the colorscale, by default 0.0
vmax (float, optional) – Superior limit for the colorscale, by default 3.5
bins (int, optional) – Size of the desired bin., by default 2000
output_dir (str, optional) – Path to the folder where to save the plot, by default None
- hicberg.plot.plot_pattern_reconstruction(table: DataFrame = None, original_cool: str = None, rescued_cool: str = None, chromosome: str = None, threshold: float = 0.0, case: str = '', output_dir: str = None) None [source]
Create a plot of pattern reconstruction quality.
- Parameters:
table (pd.DataFrame, optional) – Table containing either true positives, false positives or false negatives patterns, by default None
original_cool (str, optional) – Path to the original matrix in .cool format, by default None
rescued_cool (str, optional) – Path to the rescued matrix in .cool format, by default None
chromosome (str, optional) – Selected chromosome, by default None
threshold (float, optional) – Threshold for pattern score significance, by default 0.0
case (str, optional) – Mode to consider, either true positives, false positives or false negatives, by default “”
output_dir (str, optional) – Path to save plots, by default None
hicberg.statistics module
- hicberg.statistics.attribute_xs(xs: ndarray[int], distance: int) int [source]
Attibute genomic distance to the corresponding log bin of xs.
- Parameters:
xs (np.ndarray[int]) – Array containing the log bins.
distance (int) – Genomic distance in bp.
- Returns:
Index of the corresponding bins where distance has to be attributed.
- Return type:
int
- hicberg.statistics.choice(a, size=None, replace=True, p=None)
Generates a random sample from a given 1-D array
New in version 1.7.0.
Note
New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.
- Parameters:
a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were
np.arange(a)
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. Default is None, in which case a single value is returned.replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of
a
can be selected multiple times.p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in
a
.
- Returns:
samples – The generated random samples
- Return type:
single item or ndarray
- Raises:
ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size
See also
randint
,shuffle
,permutation
random.Generator.choice
which should be used in new code
Notes
Setting user-specified probabilities through
p
uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element ofp
is 1 / len(a).Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its
axis
keyword.Examples
Generate a uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3) array([0, 3, 4]) # random >>> #This is equivalent to np.random.randint(0,5,3)
Generate a non-uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0]) array([3, 3, 0]) # random
Generate a uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False) array([3,1,0]) # random >>> #This is equivalent to np.random.permutation(np.arange(5))[:3]
Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0]) array([2, 3, 0]) # random
Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher'] >>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3]) array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random dtype='<U11')
- hicberg.statistics.compute_density(cooler_file: str = 'unrescued_map.cool', threads: int = 2, kernel_size: int = 11, deviation: float = 0.5, output_dir: str = None) None [source]
Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.
- Parameters:
cooler_file (str, optional) – [description], by default None
threads (int, optional) – [description], by default 2
output_dir (str, optional) – [description], by default None
- hicberg.statistics.compute_propensity(read_forward: AlignedSegment, read_reverse: AlignedSegment, restriction_map: dict = None, xs: dict = None, weirds: dict = None, uncuts: dict = None, loops: dict = None, circular: str = '', trans_ps: dict = None, coverage: dict = None, bins: int = 2000, d1d2: dict = None, density_map: dict = None, mode: str = 'full') float [source]
Compute propensity for read pair to be selected among all plausible pairs related to multi-mapping reads.
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
xs (dict) – Dictionary containing log binning values for each chromosome.
weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.
uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.
loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.
trans_ps (dict) – Dictionary of trans-chromosomal P(s)
coverage (dict) – Dictionary containing the coverage of each chromosome.
bins (int) – Size of the desired bin, by default 2000
d1d2 (np.array, optional) – Distribution of d1d2 values, by default None
density (dict) – Dictionary of contact density by chromosome couple.
mode (str, optional) – Mode to use to compute propensity among, by default “full”
- Returns:
Propensity to use for read couple drawing
- Return type:
float
- hicberg.statistics.draw_read_couple(propensities: array) int [source]
Draw an index respecting distribution of propensities. This function is used to draw a couple of reads considering the propensity of each couple.
- Parameters:
propensities (np.array) – Array containing all the propensities of each couple of reads.
- Returns:
Index of the couple of reads drawn.
- Return type:
int
- hicberg.statistics.generate_coverages(genome: str = None, bins: int = 2000, forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', output_dir: str = None) None [source]
Take a genome and both forward and reverse bam files for unambiguous group and return a dictionary containing the coverage in terms of reads overs chromosomes.
- Parameters:
genome (str, optional) – Path to the genome file to get coverage on, by default None
bins (int, optional) – Size of the desired bin., by default 2000
forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.2.bam
output_dir (str, optional) – Path to the folder where to save the classified alignment files, by default None, by default None
- hicberg.statistics.generate_d1d2(forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', restriction_map: str = 'restriction_map.npy', output_dir: str = None) None [source]
Compute d1d2 distance laws with the given alignments and restriction map.
- Parameters:
forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.1.bam, by default “group1.2.bam”
restriction_map (str, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default “dist.frag.npy”
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None, by default None
- hicberg.statistics.generate_density_map(matrix: str = 'unrescued_map.cool', size: int = 5, sigma: int = 2, n_mads: int = 2, nan_threshold: bool = False, output_dir: str = None) dict[str, ndarray[float]] [source]
Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.
- Parameters:
matrix (str, optional) – Path to a cooler matrix, by default None
rounds (int, optional) – Number of times the matrix has to be shaken, by default 1
magnitude (float, optional) – Blending ratio between the native matrix and the shaken one, by default 1.0
output_dir (str, optional) – Path to the folder where to save the density map, by default None
- Returns:
Density maps as a dictionary where keys are chromosomes names couples as tuples and values are density maps.
- Return type:
dict[str, np.ndarray[float]]
- hicberg.statistics.generate_density_map_backup(matrix: str = 'unrescued_map.cool', rounds: int = 1, magnitude: float = 1.0, output_dir: str = None) dict[str, ndarray[float]] [source]
Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps.
- Parameters:
matrix (str, optional) – Path to a cooler matrix, by default None
rounds (int, optional) – Number of times the matrix has to be shaken, by default 1
magnitude (float, optional) – Blending ratio between the native matrix and the shaken one, by default 1.0
output_dir (str, optional) – Path to the folder where to save the density map, by default None
- Returns:
Density maps as a dictionary where keys are chromosomes names couples as tuples and values are density maps.
- Return type:
dict[str, np.ndarray[float]]
- hicberg.statistics.generate_trans_ps(matrix: str = 'unrescued_map.cool', chrom_sizes: str = 'chromosome_sizes.npy', output_dir: str = None) None [source]
Generate pseudo-P(s) while considering reads originating from different chromosomes. Pseudo-P(s) are computed as the number of interactions between two chromosomes divided by the product of the length of these two chromosomes.
- Parameters:
matrix (str, optional) – Path to a cooler matrix, by default “unrescued_map.cool”
chrom_sizes (str, optional) – Path to the chromosome sizes dictionary, by default “chromosome_sizes.npy”
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None.
- hicberg.statistics.generate_xs(chromosome_size: int, base: float = 1.1) ndarray[int] [source]
Generate xs array for computing P(s). Return xs array which is log space.
- Parameters:
chromosome_size (int) – Size of the chromosome to be binned in bp.
base (float, optional) – Base of the log space., by default 1.1
- Returns:
Array of log bins related to the chromosome.
- Return type:
np.ndarray[int]
- hicberg.statistics.get_d1d2(read_forward: AlignedSegment, read_reverse: AlignedSegment, restriction_map: dict = None, d1d2: array = None) int [source]
Get the d1d2 value of a pair of reads.
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
d1d2 (np.array, optional) – Distribution of d1d2 values, by default None
- Returns:
d1d2 value
- Return type:
int
- hicberg.statistics.get_density(read_forward: ~pysam.libcalignedsegment.AlignedSegment, read_reverse: ~pysam.libcalignedsegment.AlignedSegment, density_map: dict[slice((<class 'str'>, <class 'str'>), <built-in function array>, None)], bin_size: int = 2000) float [source]
Get density from density map dictionary.
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to get density from.
read_reverse (pysam.AlignedSegment) – Reverse read to get density from.
density_map (dict) – Dictionary containing density maps for each chromosome couple.
bin_size (int, optional) – Resolution of the matrix on which density map has been computed, by default 2000
- Returns:
Density corresponding to the pair of reads.
- Return type:
float
- hicberg.statistics.get_dist_frags(genome: str = None, restriction_map: dict = None, circular: str = '', rate: float = 1.0, output_dir: str = None) None [source]
Get the distribution of fragments’ distance from a genome distribution .
- Parameters:
genome (str, optional) – Path to the genome, by default None
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
circular (str, optional) – Name of the chromosomes to consider as circular, by default “”
rate (float, optional) – Set the proportion of restriction sites to consider. Avoid memory overflow when restriction maps are very dense, by default 1.0
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None
- Returns:
Dictionary of sub-sampled restriction map with keys as chromosome names and values as lists of restriction sites’ position.
- Return type:
dict
- hicberg.statistics.get_pair_cover(read_forward: AlignedSegment, read_reverse: AlignedSegment, coverage: dict, bins: int) float [source]
Get the coverage of a pair of reads.
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
coverage (dict) – Dictionary containing the coverage of each chromosome.
bins (int) – Size of the desired bin.
- Returns:
Product of the coverage of the pair of reads.
- Return type:
float
- hicberg.statistics.get_pair_ps(read_forward: AlignedSegment, read_reverse: AlignedSegment, xs: dict, weirds: dict, uncuts: dict, loops: dict, circular: str = '') float [source]
Take two reads and return the P(s) value depending on event type (intra-chromosomal case only).
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
xs (dict) – Dictionary containing log binning values for each chromosome.
weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.
uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.
loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.
- Returns:
P(s) of the pair considering the event type.
- Return type:
float
- hicberg.statistics.get_patterns(forward_bam_file: str = 'group1.1.bam', reverse_bam_file: str = 'group1.2.bam', xs: str = 'xs.npy', chrom_sizes: str = 'chromosome_sizes.npy', circular: str = '', blacklist: str = None, output_dir: str = None) None [source]
Get the patterns distribution from read pairs alignment. .
- Parameters:
forward_bam_file (str, optional) – Path to forward .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”, by default “group1.1.bam”
reverse_bam_file (str, optional) – Path to reverse .bam alignment file, by default None, by default group1.1.bam, by default “group1.1.bam”, by default “group1.2.bam”
xs (str, optional) – Path to the dictionary containing the xs values, by default “xs.npy”
dist_frag (str, optional) – Path to the dictionary containing the inter fragment distances, by default “dist.frag.npy”
circular (str, optional) – Name of the chromosomes to consider as circular, by default “”
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None, by default None, by default None
- hicberg.statistics.get_restriction_map(genome: str = None, enzyme: list[str] = ['DpnII'], output_dir: str = None) dict[str, ndarray[int]] [source]
Get ordered restriction map (including 0 and n) from a chromosome sequence. Return a dictionary where keys are chromosomes names and values are restrictions sites positions.
- Parameters:
genome (str, optional) – Path to the genome to digest, by default None, by default None
enzyme (list[str], optional) – Enzyme or list of enzyme to digest the genome with. If integer passed, micro-C mode using Mnase is used, and the integer correspond to the size of nucleosomal fragment, by default None, by default [“DpnII”]
output_dir (str, optional) – Path to the folder where to save the plot, by default None
- Returns:
Dictionary of the product of digestion where keys are chromosomes names and values are restrictions sites positions.
- Return type:
dict
- hicberg.statistics.get_top_pattern(file: str = None, top: int = 10, chromosome: str = None) DataFrame [source]
Get top patterns from a dataframe
- Parameters:
df (pd.DataFrame, optional) – Dataframe containing patterns given by Chromosight, by default None
top (int, optional) – Percentage of top patterns to get, by default 10
chromosome (str, optional) – Chromosome to consider, by default None
- Returns:
Dataframe containing top percentage patterns.
- Return type:
pd.DataFrame
- hicberg.statistics.get_trans_ps(read_forward: AlignedSegment, read_reverse: AlignedSegment, trans_ps: dict) float [source]
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
trans_ps (dict) – Dictionary of trans-chromosomal P(s)
- Returns:
Trans P(s) value
- Return type:
float
- hicberg.statistics.log_bin_genome(genome: str, base: float = 1.1, output_dir: str = None) dict[str, ndarray[int]] [source]
- hicberg.statistics.pearson_score(original_matrix: Cooler, rescued_matrix: Cooler, markers: list[int]) float [source]
Compute Pearson correlation between concatenated matrix bins which have been deleted and reconstructed.
- Parameters:
original_matrix (cooler.Cooler) – Cooler object containing the original matrix.
rescued_matrix (cooler.Cooler) – Cooler object containing the reconstructed matrix.
markers (list[int]) – List of markers to consider. Markers are the bins which have been deleted.
- Returns:
Pearson correlation between original and reconstructed matrix.
- Return type:
float
- hicberg.statistics.reattribute_reads(reads_couple: tuple[str, str] = ('group2.1.bam', 'group2.2.bam'), restriction_map: dict = None, xs: dict = 'xs.npy', weirds: dict = 'weirds.npy', uncuts: dict = 'uncuts.npy', loops: dict = 'loops.npy', circular: str = '', trans_ps: dict = 'trans_ps.npy', coverage: dict = 'coverage.npy', bins: int = 2000, d1d2: dict = 'd1d2.npy', density_map: dict = 'density_map.npy', mode: str = 'full', output_dir: str = None) None [source]
Re-attribute multi-mapping (ambiguous) reads considering sets of statistical laws.
- Parameters:
reads_couple (tuple[str, str], optional) – Paths to ambiguous reads alignment files (.bam), by default (“group2.1.bam”, “group2.2.bam”)
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
xs (dict) – Dictionary containing log binning values for each chromosome.
weirds (dict) – Dictionary containing number of weird events considering distance for each chromosome.
uncuts (dict) – Dictionary containing number of uncuts events considering distance for each chromosome.
loops (dict) – Dictionary containing number of loops events considering distance for each chromosome.
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.
trans_ps (dict) – Dictionary of trans-chromosomal P(s)
coverage (dict) – Dictionary containing the coverage of each chromosome.
bins (int) – Size of the desired bin, by default 2000
d1d2 (np.array, optional) – Distribution of d1d2 values, by default None
density (np.array, optional) – Dictionary containing density maps per chromosome couples as, by default None
mode (str, optional) – Mode to use to compute propensity among, by default “full”
output_dir (str, optional) – Path to the re-attributed ambiguous reads alignment files are saved, by default None
hicberg.utils module
- hicberg.utils.bam_iterator(bam_file: str = None) Iterator[AlignedSegment] [source]
Returns an iterator for the given SAM/BAM file (must be query-sorted). In each call, the alignments of a single read are yielded as a 3-tuple: (list of primary pysam.AlignedSegment, list of supplementary pysam.AlignedSegment, list of secondary pysam.AlignedSegment).
- Parameters:
bam ([str]) – Path to alignment file in .sam or .bam format.
- Yields:
Iterator[pysam.AlignedSegment] – Yields a list containing pysam AlignmentSegment objects, within which all the reads have the same id.
- hicberg.utils.block_counter(forward_bam_file: str, reverse_bam_file: str) Tuple[int, int] [source]
Return as a tuple the number of blocks in the forward and reverse bam files.
- Parameters:
forward_bam_file (str, optional) – Path to forward .bam alignment file.
reverse_bam_file (str, optional) – Path to reverse .bam alignment file.
- Returns:
Number of blocks in the forward and reverse bam files.
- Return type:
Tuple[int, int]
- hicberg.utils.choice(a, size=None, replace=True, p=None)
Generates a random sample from a given 1-D array
New in version 1.7.0.
Note
New code should use the ~numpy.random.Generator.choice method of a ~numpy.random.Generator instance instead; please see the random-quick-start.
- Parameters:
a (1-D array-like or int) – If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were
np.arange(a)
size (int or tuple of ints, optional) – Output shape. If the given shape is, e.g.,
(m, n, k)
, thenm * n * k
samples are drawn. Default is None, in which case a single value is returned.replace (boolean, optional) – Whether the sample is with or without replacement. Default is True, meaning that a value of
a
can be selected multiple times.p (1-D array-like, optional) – The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in
a
.
- Returns:
samples – The generated random samples
- Return type:
single item or ndarray
- Raises:
ValueError – If a is an int and less than zero, if a or p are not 1-dimensional, if a is an array-like of size 0, if p is not a vector of probabilities, if a and p have different lengths, or if replace=False and the sample size is greater than the population size
See also
randint
,shuffle
,permutation
random.Generator.choice
which should be used in new code
Notes
Setting user-specified probabilities through
p
uses a more general but less efficient sampler than the default. The general sampler produces a different sample than the optimized sampler even if each element ofp
is 1 / len(a).Sampling random rows from a 2-D array is not possible with this function, but is possible with Generator.choice through its
axis
keyword.Examples
Generate a uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3) array([0, 3, 4]) # random >>> #This is equivalent to np.random.randint(0,5,3)
Generate a non-uniform random sample from np.arange(5) of size 3:
>>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0]) array([3, 3, 0]) # random
Generate a uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False) array([3,1,0]) # random >>> #This is equivalent to np.random.permutation(np.arange(5))[:3]
Generate a non-uniform random sample from np.arange(5) of size 3 without replacement:
>>> np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0]) array([2, 3, 0]) # random
Any of the above can be repeated with an arbitrary array-like instead of just integers. For instance:
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher'] >>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3]) array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], # random dtype='<U11')
- hicberg.utils.chunk_bam(forward_bam_file: str = 'group2.1.bam', reverse_bam_file: str = 'group2.2.bam', nb_chunks: int = 2, output_dir: str = None) None [source]
Split a .bam file into chunks .bam files. :param forward_bam_file: Path to forward .bam alignment file, by default group2.1.bam :type forward_bam_file: str, optional :param reverse_bam_file: Path to reverse .bam alignment file, by default group2.2.bam :type reverse_bam_file: str, optional :param nb_chunks: Number of chunks to create, by default 2 :type nb_chunks: int, optional :param output_dir: Path to the folder where to save the classified alignment files, by default None :type output_dir: str, optional
- hicberg.utils.classify_reads(bam_couple: tuple[str, str] = ('1.sorted.bam', '2.sorted.bam'), chromosome_sizes: str = 'chromosome_sizes.npy', mapq: int = 35, output_dir: str = None) None [source]
- Classification of pairs of reads in 2 different groups:
Group 0) –> (Unmappable) - files :group0.1.bam and group0.2.bam Group 1) –> (Uniquely Mapped Uniquely Mapped) - files :group1.1.bam and group1.2.bam Group 2) –> (Uniquely Mapped Multi Mapped) or (Multi Mapped Multi Mapped).- files :group2.1.bam and group2.2.bam
- Parameters:
bam_couple (tuple[str, str]) – Tuple containing the paths to the forward and reverse alignment files. By default (“1.sorted.bam”, “2.sorted.bam”)
chromosome_sizes (str, optional) – Path to a chromosome size dictionary save in .npy format, by default chromosome_sizes.npy
mapq (int, optional) – Minimal read quality under which a Hi-C read pair will not be kept, by default 30
output_dir (str, optional) – Path to the folder where to save the classified alignment files, by default None
- hicberg.utils.detrend_matrix(matrix: array) array [source]
Detrend a matrix by P(s).
- Parameters:
matrix (np.array) – Hi-C matrix to detrend.
- Returns:
Detrended Hi-C matrix.
- Return type:
np.array
- hicberg.utils.format_blacklist(blacklist: str = None) dict[str, Tuple[int, int]] [source]
Format a blacklist file into a dictionary.
- Parameters:
blacklist (str, optional) – Path to the blacklist file. If set to -1 this is equivalent to None for workflow managers purpose, by default None
- Returns:
Dictionary with chromosome name as key and tuple of start and end as value.
- Return type:
dict[str, Tuple[int, int]]
- hicberg.utils.generate_gaussian_kernel(size: int = 1, sigma: int = 2) array [source]
Generate a 2D Gaussian kernel of a given size and standard deviation.
- Parameters:
size (int, optional) – Size of the kernel, by default 1
sigma (int, optional) – Standard deviation to use for the kernel build, by default 2
- Returns:
2D Gaussian kernel.
- Return type:
np.array
- hicberg.utils.get_bad_bins(matrix: array = None, n_mads: int = 2) array [source]
Detect bad bins (poor interacting bins) in a normalized Hi-C matrix and return their indexes. Bins where the nan sum of interactions is zero are considered as bad bins.
- Parameters:
matrix (Normalized Hi-C matrix to detect bad bins from, by default None.) –
n_mads (int, optional) – Number of median absolute deviations to set poor interacting bins threshold, by default 2
- Returns:
Indexes of bad bins.
- Return type:
np.array
- hicberg.utils.get_bin_table(chrom_sizes_dict: str = 'chromosome_sizes.npy', bins: int = 2000, output_dir: str = None) None [source]
Create bin table containing start and end position for fixed size bin per chromosome.
- Parameters:
chrom_sizes_dict (str) – Path to a dictionary containing chromosome sizes as {chromosome : size} saved in .npy format. By default chromosome_sizes.npy
bins (int) – Size of the desired bin, by default 2000.
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None
- hicberg.utils.get_chromosomes_sizes(genome: str = None, output_dir: str = None) None [source]
Generate a dictionary save in .npy format where keys are chromosome name and value are size in bp.
- Parameters:
genome (str, optional) – Path to the genome, by default None
output_dir (str, optional) – Path to the folder where to save the dictionary, by default None
- hicberg.utils.get_chunks(output_dir: str = None)[source]
Return a tuple containing the paths to the forward and reverse chunks.
- Parameters:
output_dir (str, optional) – Path to get chunks from, by default None
- Returns:
Tuple containing the paths to the forward and reverse chunks.
- Return type:
tuple([List[str], List[str]]
- hicberg.utils.get_cis_distance(read_forward: AlignedSegment, read_reverse: AlignedSegment, circular: str = '') int [source]
Calculate the distance between two reads in the same pairwise alignment .
- Parameters:
read_forward (pysam.aligned_segment) – Forward read of the pair
read_reverse (pysam.AlignedSegment) – Reverse read of the pair
circular (str, optional) – Name of the chromosomes to consider as circular, by default None, by default “”.
- Returns:
Genomic distance separating the two reads (bp).
- Return type:
int
- hicberg.utils.get_local_density(cooler_file: str = None, chrom_name: tuple = (None, None), size: int = 11, sigma: int = 0.2, n_mads: int = 2, nan_threshold: bool = False) array [source]
Create density map from a Hi-C matrix. Return a dictionary where keys are chromosomes names and values are density maps. Density is obtained by getting the local density of each pairwise bin using a gaussian kernel convolution.
- Parameters:
cooler_file (str, optional) – Path to Hi-C matrix (or sub-matrix) to dget density from, by default None, by default None
chrom_name (tuple, optional) – Tuple containing the sub-matrix to fetch, by default (None, None)
size (int, optional) – Size of the gaussian kernel to use, by default 5
sigma (int, optional) – Standard deviation to use for the kernel, by default 2
n_mads (int, optional) – Number of median absolute deviations to set poor interacting bins threshold, by default 2
nan_threshold (bool, optional) – Set wether or not convolution return nan if not enough value are caught, by default None
- Returns:
Map of local contact density.
- Return type:
np.array
- hicberg.utils.get_ordered_reads(read_forward: AlignedSegment, read_reverse: AlignedSegment) Tuple[AlignedSegment, AlignedSegment] [source]
Returns the ordered pair of reads in the same chromosome as the two reads .
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read to compare with the reverse read.
read_reverse (pysam.AlignedSegment) – Reverse read to compare with the forward read.
- Returns:
The ordered pair of reads in the same chromosome as the two reads.
- Return type:
Tuple[pysam.AlignedSegment, pysam.AlignedSegment]
- hicberg.utils.is_blacklisted(read_forward: AlignedSegment, read_reverse: AlignedSegment, blacklist: dict[str, Tuple[int, int]] = None) bool [source]
Check if a read pair is blacklisted based on a list of coordiantes.
- Parameters:
read_forward (pysam.AlignmentSegment) – Forward read of the pair.
read_reverse (pysam.AlignedSegment) – Reverse read of the pair.
blacklist (dict[str, Tuple[int, int]]) – Blacklist of coordinates to check against. Chromsome name as key and tuple of start and end as value. Chromosome names should be formatted as ‘chr1_A’, ‘chr1_B’, etc. With A and B being the index of the coordinates to blacklist in a given chromosome.
- Returns:
True if the read pair is blacklisted, False otherwise.
- Return type:
bool
- hicberg.utils.is_circle(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool [source]
Check if two reads are forming a loop pattern .
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read of the pair
read_reverse (pysam.AlignedSegment) – Reverse read of the pair
- Returns:
True if the two reads are forming a loop pattern, False otherwise.
- Return type:
bool
- hicberg.utils.is_duplicated(read: AlignedSegment) bool [source]
Check if read from pysam AlignmentFile is mapping more than once along the genome.
- Parameters:
read (pysam.AlignedSegment) – pysam AlignedSegment object.
- Returns:
True if the read is duplicated i.e. mapping to more than one position.
- Return type:
bool
- hicberg.utils.is_empty_alignment(alignment_file: str) bool [source]
Check if an alignment file is empty. If empty, return True, else return False.
- Parameters:
alignment_file (str) – Path to the alignment file to check.
- Returns:
Return True if the file is empty, False otherwise.
- Return type:
bool
- hicberg.utils.is_intra_chromosome(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool [source]
Return True if two reads of a pair came from the same chromosome.
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read of the pair.
read_reverse (pysam.AlignedSegment) – Reverse read of the pair.
- Returns:
True if the pair is intra-chromosomic, False otherwise.
- Return type:
bool
- hicberg.utils.is_poor_quality(read: AlignedSegment, mapq: int) bool [source]
Check if read from pysam AlignmentFile is under mapping quality threshold
- Parameters:
read (pysam.AlignedSegment) – pysam AlignedSegment object.
mapq (int) – Mapping quality threshold.
- Returns:
True if the read quality is below mapq threshold.
- Return type:
bool
- hicberg.utils.is_reverse(read: AlignedSegment) bool [source]
Check if read from pysam AlignmentFile is reverse
- Parameters:
read (pysam.AlignedSegment) – pysam AlignedSegment object.
- Returns:
True if the read is reverse.
- Return type:
bool
- hicberg.utils.is_uncut(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool [source]
Check if two reads are forming an uncut pattern .
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read of the pair
read_reverse (pysam.AlignedSegment) – Reverse read of the pair
- Returns:
True if the two reads are forming an uncut pattern, False otherwise.
- Return type:
bool
- hicberg.utils.is_unmapped(read: AlignedSegment) bool [source]
Check if read from pysam AlignmentFile is unmapped
- Parameters:
read (pysam.AlignedSegment) – pysam AlignedSegment object.
- Returns:
True if the read is unmapped.
- Return type:
bool
- hicberg.utils.is_unqualitative(read: AlignedSegment) bool [source]
Check if the read is not qualitative.
- Parameters:
read (pysam.AlignedSegment) – pysam AlignedSegment object.
- Returns:
True if the read is not qualitative, False otherwise.
- Return type:
bool
- hicberg.utils.is_weird(read_forward: AlignedSegment, read_reverse: AlignedSegment) bool [source]
Check if two reads are forming a weird pattern .
- Parameters:
read_forward (pysam.AlignedSegment) – Forward read of the pair
read_reverse (pysam.AlignedSegment) – Reverse read of the pair
- Returns:
True if the two reads are forming a weird pattern, False otherwise.
- Return type:
bool
- hicberg.utils.mad_smoothing(vector: ndarray[int] = None, window_size: int | str = 'auto', nmads: int = 1) ndarray[int] [source]
Apply MAD smoothing to an vector .
- Parameters:
vector (np.ndarray[int], optional) – Data to smooth, by default None
window_size (int or str, optional) – Size of the window to perform mean sliding average in. Window is center on current value as [current_value - window_size/2] U [current_value + window_size/2], by default “auto”
nmads (int, optional) – number of median absolute deviation tu use, by default 1
- Returns:
MAD smoothed vector.
- Return type:
np.ndarray[int]
- hicberg.utils.max_consecutive_nans(vector: ndarray) int [source]
Return the maximum number of consecutive NaN values in a vector.
- Parameters:
vector (np.ndarray) – Vector to get the maximum number of consecutive NaN values from.
- Returns:
Number of maximum consecutive NaN values.
- Return type:
int
- hicberg.utils.nan_conv(matrix: array = None, kernel: array = None, nan_threshold: bool = False) array [source]
Custom convolution function that takes into account nan values when convolving. Used to compute the local density of a Hi-C matrix.
- Parameters:
matrix (np.array, optional) – Hi-C matrix to detect bad bins from, by default None
kernel (np.array, optional) – Kernel to use for convolution (dimension must be odd), by default None
nan_threshold (bool, optional) – Set wether or not convolution return nan if not enough value are caught, by default False
- Returns:
Convolution product of the matrix and the kernel.
- Return type:
np.array
- hicberg.utils.replace_consecutive_zeros_with_mean(vector: ndarray[float]) ndarray[float] [source]
Replace consecutive zeros in a vector with the mean of the flanking values.
- Parameters:
vector (np.ndarray[float]) – Array to replace consecutive zeros in.
- Returns:
Array with consecutive zeros replaced by the mean of the flanking values.
- Return type:
np.ndarray[float]
- hicberg.utils.subsample_restriction_map(restriction_map: dict = None, rate: float = 1.0) dict[str, ndarray[int]] [source]
Subsample a restriction map by a given rate.
- Parameters:
restriction_map (dict, optional) – Restriction map saved as a dictionary like chrom_name : list of restriction sites’ position, by default None
rate (float, optional) – Set the proportion of restriction sites to consider. Avoid memory overflow when restriction maps are very dense, by default 1.0
- Returns:
Dictionary of sub-sampled restriction map with keys as chromosome names and values as lists of restriction sites’ position.
- Return type:
dict[str, np.ndarray[int]]
- hicberg.utils.sum_mat_bins(matrix: array) array [source]
Adapted from : https://github.com/koszullab/hicstuff/tree/master/hicstuff Compute the sum of matrices bins (i.e. rows or columns) using only the upper triangle, assuming symmetrical matrices.
- Parameters:
mat (scipy.sparse.coo_matrix) – Contact map in sparse format, either in upper triangle or full matrix.
- Returns:
1D array of bin sums.
- Return type:
numpy.ndarray