dataset package

Submodules

dataset.config module

Configuration for development - change these with caution!

dataset.dataset module

The Dataset object aggregates information about your whole-slide images, their labels, any filtration that should be applied to their regions, and any augmentation functions that should be applied to regions as they are fetched from disk. This class inherits from PyTorch’s torch.utils.data.Dataset.

class dataset.dataset.Dataset(*args: Any, **kwargs: Any)

Bases: torch.utils.data.Dataset

Dataset implements automatic label inference, optional augmentation, and optional dynamic filtration.

augment_region(region: numpy.ndarray) numpy.ndarray

augment_region augments a region using self.augmentation

Parameters

region (np.ndarray) – the region to be (potentially) augmented

Returns

the (potentially) augmented region

Return type

np.ndarray

get_label(filename: str) Any

get_label returns the label associated with a certain filename

Parameters

filename (str) – the filename in question

Returns

the label for the filename in question

Return type

Any

get_label_distribution() dict

get_label_distribution gives the count of regions belonging to each label. It assumes that each region’s label is the label of the image it comes from.

Returns

a dictionary representing the counts of the images’ labels

Return type

dict
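As a rough illustration of the counting described above, the following sketch tallies regions per label under the stated assumption that every region inherits its image's label. The function name and the `region_counts` and `labels` dictionaries are hypothetical stand-ins, not the package's internal state.

```python
from collections import Counter
from typing import Any, Dict


def label_distribution(region_counts: Dict[str, int],
                       labels: Dict[str, Any]) -> Dict[Any, int]:
    """Count regions per label, giving each region its image's label."""
    distribution: Counter = Counter()
    for filename, n_regions in region_counts.items():
        # Every region in the image contributes one count to the image's label.
        distribution[labels[filename]] += n_regions
    return dict(distribution)


# Hypothetical data: two tumor slides and one normal slide.
region_counts = {"a.svs": 10, "b.svs": 5, "c.svs": 7}
labels = {"a.svs": "tumor", "b.svs": "normal", "c.svs": "tumor"}
print(label_distribution(region_counts, labels))  # {'tumor': 17, 'normal': 5}
```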

get_region(filename: str, region_num: int) numpy.ndarray

get_region returns the region at index region_num in the image at filename, applying filtration and the filtration cache if necessary

Parameters
  • filename (str) – the filename from which to get the region

  • region_num (int) – the number of the region in the image at filename

Returns

the region in question

Return type

np.ndarray

get_region_labels_as_list() List[Any]

get_region_labels_as_list returns an ordered list of region labels corresponding to the order of regions in the dataset

Returns

a list of labels

Return type

list[Any]

get_region_location_from_index(index: int) Tuple[str, int]

get_region_location_from_index returns the (filename, region_num) location of the region at dataset[index]

Parameters

index (int) – the position of the region in the dataset, as in dataset[index]

Raises

IndexError – if index is out of bounds

Returns

the location of the region at dataset[index]

Return type

Tuple[str, int]
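One common way to map a flat dataset index to a (filename, region_num) pair is a binary search over cumulative region counts. The sketch below shows that technique only; the function name and the `filenames` and `region_counts` arguments are hypothetical, and the package may store and search this information differently.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def region_location_from_index(index: int,
                               filenames: List[str],
                               region_counts: List[int]) -> Tuple[str, int]:
    """Map a flat dataset index to (filename, region_num)."""
    # Cumulative region totals, e.g. counts [3, 2, 4] -> [3, 5, 9].
    totals = list(accumulate(region_counts))
    if not totals or index < 0 or index >= totals[-1]:
        raise IndexError(f"index {index} is out of bounds")
    # Position of the first file whose cumulative total exceeds the index.
    file_idx = bisect_right(totals, index)
    regions_before = totals[file_idx - 1] if file_idx > 0 else 0
    return filenames[file_idx], index - regions_before


# Hypothetical dataset with three images holding 3, 2 and 4 regions.
names = ["a.svs", "b.svs", "c.svs"]
counts = [3, 2, 4]
print(region_location_from_index(4, names, counts))  # ('b.svs', 1)
```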

iterate_by_file(as_pytorch_datasets=False) Generator[Tuple[str, Any, Generator], None, None]

iterate_by_file lets users iterate over the dataset one image at a time, yielding each image’s filename and label together with an iterator over its regions

Parameters

as_pytorch_datasets (bool) – Whether to return pytorch datasets instead of the normal generators, defaults to False

Yields

the filename, the label, and an iterator for the regions in the image

Return type

Tuple[str, Any, Generator]
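The shape of the yielded tuples can be sketched with plain generators. This is an illustration only: the `region_counts` and `labels` dictionaries and the `get_region` callable are hypothetical stand-ins for state the Dataset manages internally, and the `as_pytorch_datasets` option is not modeled.

```python
from typing import Any, Callable, Dict, Generator, Iterator, Tuple


def _regions_of(filename: str, n_regions: int,
                get_region: Callable[[str, int], Any]) -> Iterator[Any]:
    """Lazily yield every region of one image."""
    for region_num in range(n_regions):
        yield get_region(filename, region_num)


def iterate_by_file(region_counts: Dict[str, int],
                    labels: Dict[str, Any],
                    get_region: Callable[[str, int], Any]
                    ) -> Generator[Tuple[str, Any, Iterator[Any]], None, None]:
    """Yield (filename, label, region iterator) for each image."""
    for filename, n_regions in region_counts.items():
        # A helper generator binds filename per image, so each region
        # iterator stays valid even if it is consumed later.
        yield filename, labels[filename], _regions_of(filename, n_regions, get_region)


# Hypothetical region loader that just describes the region it would read.
def fake_get_region(filename: str, region_num: int) -> str:
    return f"{filename}#{region_num}"


for name, label, regions in iterate_by_file({"a.svs": 2}, {"a.svs": "tumor"}, fake_get_region):
    print(name, label, list(regions))
```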

number_of_regions(filename: Optional[str] = None) int

number_of_regions gets the number of regions in the dataset or in a single image managed by the dataset

Parameters

filename (Optional[str]) – An optional parameter which will get the number of regions in that image (if it’s in the dataset) instead of the number of regions in the entire dataset, defaults to None

Returns

the number of regions in the dataset or image at filename

Return type

int

dataset.filtration_cache module

A system for caching information about which regions of an image pass through a filter. Includes metadata verification and context management.

class dataset.filtration_cache.FiltrationCache(h5filepath: Optional[Filepath] = 'filtration_cache.h5', h5filetitle: Optional[str] = 'filtration_cache', region_dims: Optional[RegionDimensions] = (512, 512))

Bases: contextlib.AbstractContextManager

FiltrationCache tracks the filtration status of each image’s regions in a PyTables HDF5 database

Raises
  • NotImplementedError – when region_index is region coordinates instead

  • TypeError – when region_index is of the wrong type

class Description(*args: Any, **kwargs: Any)

Bases: tables.IsDescription

Description - the structure for tables in the database

get_metadata(filtration: FiltrationRepr, filepath: Filepath) FiltrationCacheMetadata

get_metadata gets the metadata for a table if it exists

Parameters
  • filtration (util.FiltrationRepr) – the filtration for which to get metadata

  • filepath (util.FilePath) – the image for which to get metadata

Returns

the metadata in question, if available

Return type

util.FiltrationCacheMetadata

get_status(filtration: Union[FiltrationRepr, str], filepath: Filepath, region_index: Optional[RegionIndex] = None) FiltrationStatus

get_status gets one or all records from the table identified by filtration and os.path.basename(filepath)

Parameters
  • filtration (Union[util.FiltrationRepr, str]) – a string representing the filtration applied to the image

  • filepath (util.FilePath) – a filepath representing the image filtration was applied to

  • region_index (Optional[util.RegionIndex], optional) – the index of the region in question, defaults to None

Raises
  • NotImplementedError – if region_index is coordinates

  • TypeError – if region_index is of the wrong type

Returns

a tuple of (region index, target region index, region index filtration status)

Return type

util.FiltrationStatus

has_data(filtration: FiltrationRepr, filepath: Filepath, **kwargs) bool

has_data checks whether the FiltrationCache has a table at filtration/filepath

Parameters
  • filtration (util.FiltrationRepr) – the filtration for which statuses are checked

  • filepath (util.FilePath) – a key for the image in question

Returns

whether the FiltrationCache has the data in question

Return type

bool

metadata_fields = ['_image_filepath', '_image_size', '_image_region_count', '_image_dark_regions_count', '_image_regions_discounted', '_image_region_dims']

preprocess(filtration: FiltrationRepr, filepath: Filepath, loadingbars: bool, overwrite: bool = True, **kwargs) None

preprocess applies filtration to the image(s) at filepath; if filepath is a directory, its files are listed recursively

Parameters
  • filtration (util.FiltrationRepr) – the filtration applied to the images’ regions

  • filepath (util.FilePath) – if a directory, applied to all files in directory

  • overwrite (bool, optional) – whether to overwrite existing data, if applicable, defaults to True

dataset.filtration_cache.clear_table(table: tables.Table) None

clear_table clears all rows from a PyTables table (the table’s metadata is currently left in place)

Parameters

table (pt.Table) – the table to clear

dataset.filtration_cache.postprocess_filepath(filepath: Filepath) str

postprocess_filepath undoes the transformation applied by preprocess_filepath

Parameters

filepath (util.FilePath) – the filepath to un-preprocess

Returns

the natural filepath

Return type

str

dataset.filtration_cache.preprocess(filtration: Callable, filepath: Filepath, region_dims: RegionDimensions, loadingbars: bool, **kwargs) Tuple[Dict[int, FiltrationStatus], int]

preprocess returns a mapping of region index to (filtration status and dark-region-mapped region index)

Parameters
  • filtration (util.FiltrationRepr) – the filtration to apply to the image’s regions. If callable and not strictly filtration, then a ranked threshold approach is used - see _apply_filtration_to_regions_ranked_threshold

  • filepath (util.FilePath) – the filepath where the image in question is found

  • region_dims (unified_image_reader.util.RegionDimensions) – the dimensions of the regions to which filtration is applied

Returns

filtration status records and dark region count

Return type

Tuple[Dict[int, util.FiltrationStatus], int]

dataset.filtration_cache.preprocess_filepath(filepath: Filepath) str

preprocess_filepath rewrites a filepath into a form PyTables accepts, since PyTables can’t handle certain characters

Parameters

filepath (util.FilePath) – the filepath to preprocess

Returns

a filepath that conforms to PyTables’ restrictions

Return type

str
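The pairing of preprocess_filepath and postprocess_filepath implies a reversible encoding. The sketch below shows one such scheme and is an assumption throughout: the function names and the escape format are hypothetical, and the package's actual character substitutions are not documented here.

```python
import re


def encode_filepath(filepath: str) -> str:
    """Escape every character that is not an ASCII letter or digit."""
    out = []
    for ch in filepath:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            # Hypothetical escape: underscore plus a six-digit hex code point.
            # Underscores themselves are escaped, so decoding is unambiguous.
            out.append(f"_{ord(ch):06x}")
    return "".join(out)


def decode_filepath(encoded: str) -> str:
    """Reverse encode_filepath by expanding each escape sequence."""
    return re.sub(r"_([0-9a-f]{6})",
                  lambda match: chr(int(match.group(1), 16)),
                  encoded)


original = "slides/tumor 01.svs"
encoded = encode_filepath(original)
print(encoded)
print(decode_filepath(encoded) == original)  # True
```

Escaping the escape character itself is what makes the round trip safe for arbitrary input paths.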

dataset.filtration_cache.preprocess_filtration(filtration: FiltrationRepr, **kwargs) FiltrationRepr

preprocess_filtration removes whitespace for pytables compatibility

Parameters

filtration (util.FiltrationRepr) – the filtration to represent

Returns

a pytables-agreeable filtration representation

Return type

util.FiltrationRepr
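Removing whitespace from a string representation can be sketched in one line. This assumes the filtration representation is a plain string; the function name is hypothetical, and the real function also accepts keyword arguments that are not modeled here.

```python
def strip_whitespace(filtration_repr: str) -> str:
    """Remove all whitespace (spaces, tabs, newlines) from a string."""
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(filtration_repr.split())


print(strip_whitespace("Filter( threshold = 0.5 )"))  # Filter(threshold=0.5)
```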

dataset.filtration_cache.process_region(filepath: util.FilePath, filtration: util.FiltrationRepr, region_index: unified_image_reader.util.RegionIndex, region_dims: util.RegionDimensions) Tuple[int, Dict[str, Any]]

process_region applies filtration to the specified region of the given image

Returns

the region index and the filtration status

Return type

Tuple[int, Dict[str, Any]]

dataset.label_extractor module

An automatic (or overloaded) label inference class

class dataset.label_extractor.LabelExtractor

Bases: abc.ABC

Strategy pattern: extracts labels from a path for dictionary-based lookup

abstract static extract_labels(path: str)

extracts labels from path for dictionary-based lookup

class dataset.label_extractor.LabelExtractorCSV

Bases: dataset.label_extractor.LabelExtractor

labels in csv file

static extract_labels(path: str)

labels are read from a CSV file at path, where each line has the structure <key><sep><label>
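A minimal sketch of parsing that line structure into a lookup dictionary follows. It reads from an open text handle rather than a path so that it is self-contained; the function name, the default separator, and the handling of malformed lines are assumptions.

```python
import csv
import io
from typing import Dict, TextIO


def extract_labels_from_csv(handle: TextIO, sep: str = ",") -> Dict[str, str]:
    """Build a {key: label} dictionary from <key><sep><label> lines."""
    labels: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter=sep):
        if len(row) < 2:
            continue  # skip blank or malformed lines (an assumption)
        labels[row[0].strip()] = row[1].strip()
    return labels


# In-memory stand-in for a label file on disk.
sample = io.StringIO("slide_a.svs,tumor\nslide_b.svs,normal\n")
print(extract_labels_from_csv(sample))  # {'slide_a.svs': 'tumor', 'slide_b.svs': 'normal'}
```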

class dataset.label_extractor.LabelExtractorJSON

Bases: dataset.label_extractor.LabelExtractor

labels in json file

static extract_labels(path: str)

labels are inside of a json file at path of structure {key: label, …}

class dataset.label_extractor.LabelExtractorNoLabels

Bases: dataset.label_extractor.LabelExtractor

class DefaultDictWithGet

Bases: collections.defaultdict

get(*args, **kwargs)

Return the value for key if key is in the dictionary, else default.

static extract_labels(path: str)

returns ‘LabelExtractorNoLabels’ for all labels
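The nested DefaultDictWithGet class suggests a workaround for a known quirk: `dict.get` does not trigger a defaultdict's default factory, so a plain defaultdict returns None from `get` for missing keys. The sketch below is one plausible way to close that gap and is an assumption about the approach, not the package's code.

```python
from collections import defaultdict


class DefaultDictWithGet(defaultdict):
    """A defaultdict whose get() also falls back to the default factory."""

    def get(self, key, default=None):
        # Indexing triggers __missing__, which calls the default factory
        # (and stores the result under key); `default` is ignored here.
        return self[key]


plain = defaultdict(lambda: "LabelExtractorNoLabels")
print(plain.get("slide.svs"))  # None

labels = DefaultDictWithGet(lambda: "LabelExtractorNoLabels")
print(labels.get("slide.svs"))  # LabelExtractorNoLabels
```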

class dataset.label_extractor.LabelExtractorParentDir

Bases: dataset.label_extractor.LabelExtractor

labels represented by relative path

static extract_labels(path: str)

each label is the file’s directory path relative to the path argument (using a label_postprocessor is recommended)
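Deriving labels from directory structure can be sketched with `os.walk` and `os.path.relpath`. The function name and the choice of full file paths as dictionary keys are assumptions; the package may key its lookup differently.

```python
import os
import tempfile
from typing import Dict


def extract_labels_from_parent_dirs(root: str) -> Dict[str, str]:
    """Label each file with its directory path relative to root."""
    labels: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        # Files directly inside root receive the label ".".
        label = os.path.relpath(dirpath, root)
        for name in filenames:
            labels[os.path.join(dirpath, name)] = label
    return labels


# Build a throwaway directory tree to demonstrate the labeling.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "tumor"))
    slide = os.path.join(root, "tumor", "a.svs")
    open(slide, "w").close()
    print(extract_labels_from_parent_dirs(root)[slide])  # tumor
```

For nested directories the raw label is a multi-component relative path, which is presumably why a postprocessing step is recommended.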

dataset.label_manager module

Label inference for the custom dataset

class dataset.label_manager.LabelManager(path: Filepath, label_extraction: Optional[dataset.label_extractor.LabelExtractor] = None, label_preprocessor: Optional[Callable] = None, label_postprocessor: Optional[Callable] = None, error_if_no_labels: bool = True)

Bases: object

A dictionary wrapper for managing labels

Raises
  • NotImplementedError – when a given file extension cannot be parsed natively

  • TypeError – when the label_extractor isn’t a LabelExtractor

  • TypeError – when label_preprocessor isn’t Callable

  • TypeError – when label_postprocessor isn’t Callable

  • IndexError – when a key doesn’t have a value

dataset.util module

Utility functions, etc. for the custom dataset

class dataset.util.ThreadingLock

Bases: contextlib.AbstractContextManager

A wrapper on threading.Lock that implements a context manager so that when the context closes the lock will unlock

Example:

    with status_lock as permission:  # blocks until the lock is acquired
        ...  # do things while holding the lock
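A context-manager wrapper of this kind can be sketched as follows. The class name `SketchLock` and its `locked` helper are hypothetical and chosen to avoid implying that this is the package's own implementation.

```python
import threading
from contextlib import AbstractContextManager


class SketchLock(AbstractContextManager):
    """Acquire a threading.Lock on entry and release it on exit."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def __enter__(self) -> "SketchLock":
        self._lock.acquire()  # blocks until the lock is available
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self._lock.release()  # runs even if the body raised
        return None

    def locked(self) -> bool:
        return self._lock.locked()


status_lock = SketchLock()
counter = 0


def increment() -> None:
    global counter
    for _ in range(1000):
        with status_lock:
            counter += 1


threads = [threading.Thread(target=increment) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter)  # 4000
```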

exception dataset.util.UnsupportedFileType

Bases: Exception

UnsupportedFileType is raised when a file extension can’t be parsed natively

dataset.util.apply_args_and_kwargs(fn, args, kwargs)

https://stackoverflow.com/a/53173433/13747259

dataset.util.listdir_recursive(path: Filepath) List[Filepath]

listdir_recursive lists files (not directories) recursively from path

Parameters

path (FilePath) – the path to the directory whose files should be listed recursively

Returns

a list of filepaths relative to path

Return type

List[FilePath]
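Recursive file listing is commonly built on `os.walk`. The sketch below assumes that "relative to path" means the returned entries omit the `path` prefix; the function name and the sorted output are likewise assumptions.

```python
import os
import tempfile
from typing import List


def list_files_recursive(path: str) -> List[str]:
    """List files (not directories) under path, relative to path."""
    files: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            files.append(os.path.relpath(full_path, path))
    return sorted(files)


# Build a throwaway directory tree to demonstrate the listing.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    open(os.path.join(root, "top.svs"), "w").close()
    open(os.path.join(root, "sub", "nested.svs"), "w").close()
    print(list_files_recursive(root))
```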

dataset.util.starmap_with_kwargs(pool, fn, args_iter, kwargs_iter)

https://stackoverflow.com/a/53173433/13747259
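Both helpers cite the same Stack Overflow answer, which describes passing keyword arguments through a pool's `starmap`. The sketch below follows that general pattern, assuming the package's versions behave similarly; a thread pool is used only to keep the example self-contained, and the `scale` function is hypothetical.

```python
from itertools import repeat
from multiprocessing.pool import ThreadPool


def apply_args_and_kwargs(fn, args, kwargs):
    """Call fn with positional args and keyword kwargs."""
    return fn(*args, **kwargs)


def starmap_with_kwargs(pool, fn, args_iter, kwargs_iter):
    """Run fn over paired (args, kwargs) items using pool.starmap."""
    # starmap only unpacks positional arguments, so each task becomes
    # (fn, args, kwargs) and a small adapter applies the kwargs.
    args_for_starmap = zip(repeat(fn), args_iter, kwargs_iter)
    return pool.starmap(apply_args_and_kwargs, args_for_starmap)


def scale(x, y, factor=1):
    return (x + y) * factor


with ThreadPool(2) as pool:
    results = starmap_with_kwargs(
        pool, scale, [(1, 2), (3, 4)], [{"factor": 2}, {"factor": 10}]
    )
print(results)  # [6, 70]
```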

Module contents