dataset package

Submodules

dataset.config module

Configuration for development - change these with caution!

dataset.dataset module

The Dataset object aggregates information about your whole-slide images, their labels, any filtration that should be applied to their regions, and any augmentation functions that should be applied to regions as they are fetched from disk. This class inherits from PyTorch’s torch.utils.data.Dataset.

class dataset.dataset.Dataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.Dataset

Dataset implements automatic label inference, optional augmentation, and optional dynamic filtration.

augment_region(region: numpy.ndarray) → numpy.ndarray

augment_region augments a region using self.augmentation

Parameters: region (np.ndarray) – the region to be (potentially) augmented
Returns: the (potentially) augmented region
Return type: np.ndarray

get_label(filename: str) → Any

get_label returns the label associated with a certain filename

Parameters: filename (str) – the filename in question
Returns: the label for the filename in question
Return type: Any

get_label_distribution() → dict

get_label_distribution gives the count of regions belonging to each label, assuming each region’s label is the label of its image

Returns: a dictionary mapping each label to the number of regions carrying it
Return type: dict
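
Because every region inherits the label of its source image, the distribution can be derived from per-image region counts. The following is a minimal sketch of that idea, not the package's implementation; the function name and the `labels_by_file` and `regions_by_file` inputs are hypothetical stand-ins for state the Dataset would hold internally.

```python
from collections import Counter
from typing import Any, Dict


def label_distribution(labels_by_file: Dict[str, Any],
                       regions_by_file: Dict[str, int]) -> Dict[Any, int]:
    """Count regions per label, giving every region its image's label."""
    counts: Counter = Counter()
    for filename, region_count in regions_by_file.items():
        counts[labels_by_file[filename]] += region_count
    return dict(counts)


# Hypothetical data: two tumor slides and one normal slide.
labels = {"a.svs": "tumor", "b.svs": "normal", "c.svs": "tumor"}
regions = {"a.svs": 120, "b.svs": 80, "c.svs": 40}
distribution = label_distribution(labels, regions)
```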

get_region(filename: str, region_num: int) → numpy.ndarray

get_region returns the region at location region_num in filename, applying filtration and the filtration cache if necessary

Parameters

filename (str) – the filename from which to get the region
region_num (int) – the number of the region in the image at filename

Returns

the region in question

Return type

np.ndarray

get_region_labels_as_list() → List[Any]

get_region_labels_as_list returns an ordered list of region labels corresponding to the order of regions in the dataset

Returns: a list of labels
Return type: list[Any]

get_region_location_from_index(index: int) → Tuple[str, int]

get_region_location_from_index returns the (filename, region_num) location of the region at dataset[index]

Parameters: index (int) – the position of the region in the dataset, as in dataset[index]
Raises: IndexError – if index is out of bounds
Returns: the location of the region at dataset[index]
Return type: Tuple[str, int]
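
One plausible way to map a flat dataset index to a (filename, region_num) pair is to keep running totals of the region counts and binary-search them. The sketch below illustrates that approach under stated assumptions: the function name, the `filenames` and `region_counts` arguments, and the choice to reject negative indices are all hypothetical, and the package may compute the location differently.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def region_location_from_index(filenames: List[str],
                               region_counts: List[int],
                               index: int) -> Tuple[str, int]:
    """Map a flat dataset index to (filename, region_num)."""
    # ends[k] is the total number of regions in files 0..k.
    ends = list(accumulate(region_counts))
    if not ends or not 0 <= index < ends[-1]:
        raise IndexError(f"index {index} is out of bounds")
    # The first file whose running total exceeds index owns the region.
    file_pos = bisect_right(ends, index)
    start = ends[file_pos - 1] if file_pos > 0 else 0
    return filenames[file_pos], index - start


# Hypothetical dataset: a.svs has 3 regions, b.svs has 2.
first = region_location_from_index(["a.svs", "b.svs"], [3, 2], 0)
fourth = region_location_from_index(["a.svs", "b.svs"], [3, 2], 3)
```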

iterate_by_file(as_pytorch_datasets=False) → Generator[Tuple[str, Any, Generator], None, None]

iterate_by_file allows for users to iterate over regions in an image given the filename and the label

Parameters: as_pytorch_datasets (bool) – Whether to return pytorch datasets instead of the normal generators, defaults to False
Yield: the filename, the label, and an iterator for the regions in the image
Return type: Tuple[str, Any, Generator]
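
The yielded triple can be pictured as an outer generator over files whose third element is an inner generator over that file's regions. This is a simplified sketch of the shape only; the `regions_by_file` and `labels_by_file` dictionaries are hypothetical, and in the real Dataset the regions would be read from the image lazily rather than held in a list.

```python
from typing import Any, Dict, Generator, Iterator, List, Tuple


def iterate_by_file(regions_by_file: Dict[str, List[Any]],
                    labels_by_file: Dict[str, Any]
                    ) -> Generator[Tuple[str, Any, Iterator[Any]], None, None]:
    """Yield (filename, label, region iterator) for each image."""
    for filename, regions in regions_by_file.items():
        # The inner generator yields one region at a time.
        yield filename, labels_by_file[filename], (r for r in regions)


# Hypothetical data: strings stand in for region arrays.
regions_by_file = {"a.svs": ["a0", "a1"], "b.svs": ["b0"]}
labels_by_file = {"a.svs": "tumor", "b.svs": "normal"}

collected = [(name, label, list(regions))
             for name, label, regions
             in iterate_by_file(regions_by_file, labels_by_file)]
```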

number_of_regions(filename: Optional[str] = None) → int

number_of_regions gets the number of regions in the dataset or in a single image managed by the dataset

Parameters: filename (Optional[str]) – An optional parameter which will get the number of regions in that image (if it’s in the dataset) instead of the number of regions in the entire dataset, defaults to None
Returns: the number of regions in the dataset or image at filename
Return type: int

dataset.filtration_cache module

A system for caching information about which regions of an image pass through a filter. Includes metadata verification and context management.

class dataset.filtration_cache.FiltrationCache(h5filepath: Optional[Filepath] = 'filtration_cache.h5', h5filetitle: Optional[str] = 'filtration_cache', region_dims: Optional[RegionDimensions] = (512, 512))

Bases: contextlib.AbstractContextManager

FiltrationCache tracks images’ regions’ filtration statuses in a PyTables HDF5 database

Raises

NotImplementedError – when region_index is region coordinates instead
TypeError – when region_index is of the wrong type

class Description(*args: Any, **kwargs: Any)

Bases: tables.IsDescription

Description - the structure for tables in the database

get_metadata(filtration: FiltrationRepr, filepath: Filepath) → FiltrationCacheMetadata

get_metadata gets the metadata for a table if it exists

Parameters

filtration (util.FiltrationRepr) – the filtration for which to get metadata
filepath (util.FilePath) – the image for which to get metadata

Returns

the metadata in question, if available

Return type

util.FiltrationCacheMetadata

get_status(filtration: Union[FiltrationRepr, str], filepath: Filepath, region_index: Optional[RegionIndex] = None) → FiltrationStatus

get_status gets one or all records from the table for (filtration, os.path.basename(filepath))

Parameters

filtration (Union[util.FiltrationRepr, str]) – a string representing the filtration applied to the image
filepath (util.FilePath) – a filepath representing the image filtration was applied to
region_index (Optional[util.RegionIndex], optional) – the index of the region in question, defaults to None

Raises

NotImplementedError – if region_index is coordinates
TypeError – if region_index is of the wrong type

Returns

a tuple of (region index, target region index, region index filtration status)

Return type

util.FiltrationStatus
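
The documented behavior of region_index (one record for an index, all records for None, NotImplementedError for coordinates, TypeError otherwise) can be sketched as a simple type dispatch. This is an illustration only: it uses a plain dictionary in place of the PyTables table, and it assumes that coordinates arrive as a tuple and that a filtration status is a boolean, neither of which the documentation above confirms.

```python
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical record layout, following the documented return value:
# (region index, target region index, filtration status).
Record = Tuple[int, int, bool]


def get_status(records: Dict[int, Record],
               region_index: Optional[Union[int, tuple]] = None
               ) -> Union[Record, List[Record]]:
    """Return one record, or all records when region_index is None."""
    if region_index is None:
        return [records[i] for i in sorted(records)]
    if isinstance(region_index, int):
        return records[region_index]
    if isinstance(region_index, tuple):
        # Assumption: a tuple means region coordinates.
        raise NotImplementedError("coordinate lookup is not supported")
    raise TypeError(f"unsupported region_index type: {type(region_index)}")


# Hypothetical table contents: region 1 fails filtration.
table = {0: (0, 0, True), 1: (1, -1, False), 2: (2, 1, True)}
one_record = get_status(table, 2)
all_records = get_status(table)
```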

has_data(filtration: FiltrationRepr, filepath: Filepath, **kwargs) → bool

has_data checks if the FiltrationCache has a table at filtration/filepath

Parameters

filtration (util.FiltrationRepr) – the filtration for which statuses are checked
filepath (util.FilePath) – a key for the image in question

Returns

whether the FiltrationCache has the data in question

Return type

bool

metadata_fields = ['_image_filepath', '_image_size', '_image_region_count', '_image_dark_regions_count', '_image_regions_discounted', '_image_region_dims']

preprocess(filtration: FiltrationRepr, filepath: Filepath, loadingbars: bool, overwrite: bool = True, **kwargs) → None

preprocess applies filtration to the image(s) at filepath (directories are listed recursively)

Parameters

filtration (util.FiltrationRepr) – filtration applied to the images’ regions
filepath (util.FilePath) – if a directory, applied to all files in directory
overwrite (bool, optional) – whether to overwrite existing data, if applicable, defaults to True

dataset.filtration_cache.clear_table(table: tables.Table) → None

clear_table clears a pytables table of all contents (currently leaves metadata)

Parameters: table (pt.Table) – the table to clear

dataset.filtration_cache.postprocess_filepath(filepath: Filepath) → str

postprocess_filepath undoes the preprocessing applied by preprocess_filepath

Parameters: filepath (util.FilePath) – the filepath to un-preprocess
Returns: the natural filepath
Return type: str

dataset.filtration_cache.preprocess(filtration: Callable, filepath: Filepath, region_dims: RegionDimensions, loadingbars: bool, **kwargs) → Tuple[Dict[int, FiltrationStatus], int]

preprocess returns a mapping of region index to (filtration status and dark-region-mapped region index)

Parameters

filtration (util.FiltrationRepr) – the filtration to apply to the image’s regions. If callable and not strictly filtration, then a ranked threshold approach is used - see _apply_filtration_to_regions_ranked_threshold
filepath (util.FilePath) – the filepath where the image in question is found
region_dims (unified_image_reader.util.RegionDimensions) – the dimensions of the regions to which filtration is applied

Returns

filtration status records and dark region count

Return type

Tuple[Dict[int, util.FiltrationStatus], int]

dataset.filtration_cache.preprocess_filepath(filepath: Filepath) → str

preprocess_filepath rewrites a filepath for use with PyTables, which can’t handle certain characters

Parameters: filepath (util.FilePath) – the filepath to preprocess
Returns: a filepath that conforms to PyTables’s restrictions
Return type: str

dataset.filtration_cache.preprocess_filtration(filtration: FiltrationRepr, **kwargs) → FiltrationRepr

preprocess_filtration removes whitespace for PyTables compatibility

Parameters: filtration (util.FiltrationRepr) – the filtration to represent
Returns: a pytables-agreeable filtration representation
Return type: util.FiltrationRepr

dataset.filtration_cache.process_region(filepath: util.FilePath, filtration: util.FiltrationRepr, region_index: unified_image_reader.util.RegionIndex, region_dims: util.RegionDimensions) → Tuple[int, Dict[str, Any]]

process_region applies filtration to the specified region of the given image

Returns: the region index and the filtration status
Return type: Tuple[int, Dict[str, Any]]

dataset.label_extractor module

An automatic (or overloaded) label inference class

class dataset.label_extractor.LabelExtractor

Bases: abc.ABC

Strategy pattern: extracts labels from path for dictionary-based lookup

abstract static extract_labels(path: str): extracts labels from path for dictionary-based lookup

class dataset.label_extractor.LabelExtractorCSV

Bases: dataset.label_extractor.LabelExtractor

labels in csv file

static extract_labels(path: str): labels are inside of a csv file at path of structure (each line) <key><sep><label>

class dataset.label_extractor.LabelExtractorJSON

Bases: dataset.label_extractor.LabelExtractor

labels in json file

static extract_labels(path: str): labels are inside of a json file at path of structure {key: label, …}

class dataset.label_extractor.LabelExtractorNoLabels

Bases: dataset.label_extractor.LabelExtractor

class DefaultDictWithGet

Bases: collections.defaultdict

get(*args, **kwargs): Return the value for key if key is in the dictionary, else default.

static extract_labels(path: str): returns ‘LabelExtractorNoLabels’ for all labels

class dataset.label_extractor.LabelExtractorParentDir

Bases: dataset.label_extractor.LabelExtractor

labels represented by relative path

static extract_labels(path: str): labels are path relative to path arg (label_postprocessor recommended)

dataset.label_manager module

Label inference for the custom dataset

class dataset.label_manager.LabelManager(path: Filepath, label_extraction: Optional[dataset.label_extractor.LabelExtractor] = None, label_preprocessor: Optional[Callable] = None, label_postprocessor: Optional[Callable] = None, error_if_no_labels: bool = True)

Bases: object

A dictionary wrapper for managing labels

Raises

NotImplementedError – when a given file extension cannot be parsed natively
TypeError – when label_extraction isn’t a LabelExtractor
TypeError – when label_preprocessor isn’t Callable
TypeError – when label_postprocessor isn’t Callable
IndexError – when a key doesn’t have a value

dataset.util module

Utility functions, etc. for the custom dataset

class dataset.util.ThreadingLock

Bases: contextlib.AbstractContextManager

A wrapper on threading.Lock that implements a context manager so that when the context closes the lock will unlock

Example:

    with status_lock as permission:  # will hang until it gets permission
        # do things
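
A wrapper of this kind can be sketched with contextlib.AbstractContextManager, acquiring the lock on entry and releasing it on exit. This is an illustrative sketch under assumptions, not the package's source: the constructor, the value returned by __enter__, and the `locked` helper are all hypothetical.

```python
import threading
from contextlib import AbstractContextManager


class ThreadingLock(AbstractContextManager):
    """Hold a threading.Lock for the duration of a with-block."""

    def __init__(self):
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()  # blocks until the lock is available
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._lock.release()  # always unlock when the context closes
        return None           # do not suppress exceptions

    def locked(self) -> bool:
        """Hypothetical helper: report whether the lock is currently held."""
        return self._lock.locked()


status_lock = ThreadingLock()
with status_lock as permission:
    held_inside = status_lock.locked()
held_after = status_lock.locked()
```

Releasing in __exit__ means the lock is freed even if the body of the with-block raises.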

exception dataset.util.UnsupportedFileType

Bases: Exception

UnsupportedFileType is raised when a file extension can’t be parsed natively

dataset.util.apply_args_and_kwargs(fn, args, kwargs): https://stackoverflow.com/a/53173433/13747259

dataset.util.listdir_recursive(path: Filepath) → List[Filepath]

listdir_recursive lists files (not directories) recursively from path

Parameters: path (FilePath) – the path to the directory whose files should be listed recursively
Returns: a list of filepaths relative to path
Return type: List[FilePath]
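
The described behavior (files only, searched recursively, paths relative to the starting directory) can be reproduced with os.walk. This is a sketch rather than the package's implementation; in particular, sorting the result is an assumption made here so that the output is deterministic.

```python
import os
import tempfile
from typing import List


def listdir_recursive(path: str) -> List[str]:
    """List files (not directories) under path, relative to path."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            files.append(os.path.relpath(os.path.join(dirpath, name), path))
    return sorted(files)  # assumption: sorted for a stable order


# Build a small hypothetical directory tree and list it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "tumor"))
    for rel in ("a.svs", os.path.join("tumor", "b.svs")):
        open(os.path.join(root, rel), "w").close()
    listing = listdir_recursive(root)
```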

dataset.util.starmap_with_kwargs(pool, fn, args_iter, kwargs_iter): https://stackoverflow.com/a/53173433/13747259
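
The two helpers apply_args_and_kwargs and starmap_with_kwargs work together: pool.starmap only unpacks positional arguments, so each task is bundled as (fn, args, kwargs) and unpacked by a small helper. The sketch below follows the pattern from the linked Stack Overflow answer; the package's own function bodies are not shown in this reference, and the `scale` function and the use of a thread pool are illustrative choices.

```python
from itertools import repeat
from multiprocessing.pool import ThreadPool


def apply_args_and_kwargs(fn, args, kwargs):
    """Call fn with unpacked positional and keyword arguments."""
    return fn(*args, **kwargs)


def starmap_with_kwargs(pool, fn, args_iter, kwargs_iter):
    """Like pool.starmap, but each call also receives keyword arguments."""
    args_for_starmap = zip(repeat(fn), args_iter, kwargs_iter)
    return pool.starmap(apply_args_and_kwargs, args_for_starmap)


def scale(value, factor=1):
    """Hypothetical worker function."""
    return value * factor


with ThreadPool(2) as pool:
    results = starmap_with_kwargs(
        pool, scale,
        [(1,), (2,), (3,)],
        [{"factor": 10}, {"factor": 20}, {}],
    )
```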
