puggle.data_utils package

Submodules

puggle.data_utils.manipulation module

Tools for manipulating Puggle datasets. When this library is imported, these functions are added to the Dataset class.
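The attachment described above can be pictured with a minimal sketch. It assumes the functions are bound by plain attribute assignment on the class; the source does not say how Puggle actually does this. The Dataset class below is a stand-in, and count_documents is a hypothetical helper that is not part of Puggle.

```python
class Dataset:
    """Stand-in for the real Dataset class (assumption: it holds a list of documents)."""

    def __init__(self, documents):
        self.documents = documents


def count_documents(self: Dataset) -> int:
    """Hypothetical helper, written like the functions on this page (first argument is self)."""
    return len(self.documents)


# Assigning a plain function to a class attribute makes it callable as a method
# on every instance, including instances created before the assignment.
Dataset.count_documents = count_documents

dataset = Dataset(["doc one", "doc two", "doc three"])
document_total = dataset.count_documents()
```

This is why every signature on this page takes self: Dataset as its first parameter.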

puggle.data_utils.manipulation.drop_entity_class(self: Dataset, entity_class: str)

Remove all instances of the given entity class from the mentions of this Dataset. Also remove all Relations that reference the deleted mentions.

Parameters

entity_class (str) – The entity class to remove.
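The cascade from mentions to relations can be sketched on plain dictionaries. The data layout here is an assumption made for illustration (mentions with an "id" and a list of "labels", relations with "head" and "tail" mention ids), not Puggle's internal representation. The sketch also assumes that a mention is dropped entirely if any of its labels matches.

```python
def drop_entity_class(mentions, relations, entity_class):
    """Drop mentions labelled entity_class, then drop relations that touch them."""
    kept_mentions = [m for m in mentions if entity_class not in m["labels"]]
    kept_ids = {m["id"] for m in kept_mentions}
    # A relation survives only if both of its endpoints survived.
    kept_relations = [
        r for r in relations if r["head"] in kept_ids and r["tail"] in kept_ids
    ]
    return kept_mentions, kept_relations


mentions = [
    {"id": 0, "labels": ["item"]},
    {"id": 1, "labels": ["state"]},
    {"id": 2, "labels": ["activity"]},
]
relations = [
    {"head": 0, "tail": 1, "type": "has_state"},
    {"head": 2, "tail": 0, "type": "acts_on"},
]
new_mentions, new_relations = drop_entity_class(mentions, relations, "state")
```

Here the "state" mention is removed, and so is the "has_state" relation that pointed at it.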

puggle.data_utils.manipulation.drop_relation_class(self: Dataset, relation_class: str)

Remove all instances of the given relation class from the relations of this Dataset.

Parameters

relation_class (str) – The relation class to remove.

puggle.data_utils.manipulation.convert_entity_class(self: Dataset, original_ec: str, modified_ec: str)

Convert the given entity class from one label to another across the entire Dataset.

Parameters
  • original_ec (str) – The entity class to change.

  • modified_ec (str) – The entity class to change to.

puggle.data_utils.manipulation.convert_relation_class(self: Dataset, original_rc: str, modified_rc: str)

Convert the given relation class from one label to another across the entire Dataset.

Parameters
  • original_rc (str) – The relation class to change.

  • modified_rc (str) – The relation class to change to.

puggle.data_utils.manipulation.flatten_all_entities(self: Dataset)

Flatten all entities, i.e. resolve all hierarchical entities to their base class. For example, [“state/desirable”] becomes [“state”], etc.
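A minimal sketch of the flattening rule, assuming labels are strings whose hierarchy levels are separated by "/", as in the example above. Removing duplicates after flattening is an assumption of this sketch; the source does not say how Puggle handles two labels that collapse to the same base class.

```python
def flatten_label(label: str) -> str:
    """Return the base class of a hierarchical label."""
    return label.split("/", 1)[0]


def flatten_labels(labels):
    """Flatten every label, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys(flatten_label(label) for label in labels))


flat = flatten_labels(["state/desirable", "state/undesirable", "item"])
```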

puggle.data_utils.manipulation.flatten_all_relations(self: Dataset)

Flatten all relations, i.e. resolve all hierarchical relations to their base class. For example, [“state/desirable”] becomes [“state”], etc.

puggle.data_utils.manipulation.split_sentences(self: Dataset, delimiter='.')

Split each document of this Dataset into sentences.

Parameters

delimiter (str, optional) – The delimiter to use for splitting. Defaults to '.'.

Returns

A new dataset in which each document is a sentence. Each document also has a document_index that identifies the document the sentence originally came from.

Return type

Dataset
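The splitting behaviour can be sketched on token lists. This assumes each document is a list of tokens and that the delimiter token stays at the end of its sentence. Both are choices made for illustration only; the source does not specify either one.

```python
def split_sentences(documents, delimiter="."):
    """Split tokenised documents into sentences, recording each source index."""
    sentences = []
    for document_index, tokens in enumerate(documents):
        current = []
        for token in tokens:
            current.append(token)
            if token == delimiter:
                sentences.append({"tokens": current, "document_index": document_index})
                current = []
        if current:
            # Trailing tokens with no closing delimiter still form a sentence.
            sentences.append({"tokens": current, "document_index": document_index})
    return sentences


docs = [
    ["pump", "failed", ".", "replace", "seal", "."],
    ["no", "fault", "found"],
]
sentences = split_sentences(docs)
```

The first document yields two sentences that share document_index 0, and the second yields one sentence with document_index 1.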

puggle.data_utils.sampling module

Functions for generating samples from Puggle Datasets.

puggle.data_utils.sampling.random_sample(self: Dataset, num_records: int) -> Dataset

Run a ‘random sample’ over the given dataset to return a new Dataset with num_records documents.

Parameters
  • self (Dataset) – The dataset to sample.

  • num_records (int) – The number of documents that should appear in the output.

puggle.data_utils.sampling.random_split(self: Dataset) -> (Dataset, Dataset, Dataset)

Randomly split this dataset into three datasets: 80% train, 10% dev, and 10% test.

Returns

The train, dev and test datasets.

Return type

Dataset, Dataset, Dataset
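An 80/10/10 split can be sketched with the standard library. This is an illustration on a plain list, not Puggle's implementation. The seed parameter is an assumption added for reproducibility and does not appear in the documented signature.

```python
import random


def random_split(documents, seed=None):
    """Shuffle a copy of documents and cut it into 80% / 10% / 10% slices."""
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = n * 8 // 10
    dev_end = n * 9 // 10
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


train, dev, test = random_split(range(100), seed=0)
```

Every document lands in exactly one of the three outputs, so the sizes always sum to the size of the input.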

puggle.data_utils.sampling.smart_sample(self: Dataset, num_records: int, num_samples: int) -> Dataset

Warning

This function is experimental: it runs, but it has not been formally tested.

Run a ‘smart sample’ on the given dataset to return a new Dataset with num_records documents. The sampling process is repeated num_samples times and the best sample is selected. The algorithm aims to maximise the number of different tokens, entity classes and relation classes in the sample.

Parameters
  • num_records (int) – The number of documents that should appear in the output.

  • num_samples (int) – The number of times to repeat the sampling process. The more samples are run, the more likely it is that the function will generate a better-quality sample.
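The repeat-and-keep-the-best idea can be sketched as follows. The scoring function is an assumption: it adds the counts of distinct tokens, entity classes and relation classes with equal weight, whereas the source only says that the algorithm aims to maximise these. The document layout (dicts with "tokens", "entity_classes" and "relation_classes") and the seed parameter are also hypothetical.

```python
import random


def diversity_score(sample):
    """Count distinct tokens, entity classes and relation classes in a sample."""
    tokens, entity_classes, relation_classes = set(), set(), set()
    for doc in sample:
        tokens.update(doc["tokens"])
        entity_classes.update(doc["entity_classes"])
        relation_classes.update(doc["relation_classes"])
    return len(tokens) + len(entity_classes) + len(relation_classes)


def smart_sample(documents, num_records, num_samples, seed=None):
    """Draw num_samples random samples and keep the one with the highest score."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(num_samples):
        candidate = rng.sample(documents, num_records)
        score = diversity_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


documents = [
    {"tokens": ["pump", "failed"], "entity_classes": ["item", "state"], "relation_classes": ["has_state"]},
    {"tokens": ["pump", "failed"], "entity_classes": ["item", "state"], "relation_classes": ["has_state"]},
    {"tokens": ["replace", "seal"], "entity_classes": ["activity", "item"], "relation_classes": ["acts_on"]},
    {"tokens": ["seal", "leaking"], "entity_classes": ["item", "state"], "relation_classes": ["has_state"]},
]
best_sample = smart_sample(documents, num_records=2, num_samples=50, seed=0)
```

Because each draw is independent, a larger num_samples can only keep or improve the best score found so far, at the cost of more computation.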

puggle.data_utils.statistics module

Statistics-based functions for the Dataset class.

puggle.data_utils.statistics.get_unique_tokens_count(self: Dataset)

Return the number of unique tokens in this Dataset.

Parameters

self (Dataset) – The dataset to use.

Returns

The number of unique tokens in the dataset.

Return type

int

puggle.data_utils.statistics.get_entity_label_counts(self: Dataset, document_level=False)

Return a sorted list of (entity_label, freq) pairs in this Dataset. The frequency is the number of times that entity_label has been used.

Parameters

document_level (bool, optional) – If True, the counts will be the number of documents in which the entity label appears, rather than the total frequency of that entity label.

Returns

A sorted list of (entity_label, freq) pairs.

Return type

list[tuple]
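The difference between total and document-level counts can be sketched with collections.Counter. This assumes each document is reduced to the list of entity labels it contains. The sort order (highest frequency first, ties broken alphabetically) is also an assumption, since the source only says the list is sorted.

```python
from collections import Counter


def get_entity_label_counts(documents, document_level=False):
    """Return (entity_label, freq) pairs sorted by descending frequency."""
    counts = Counter()
    for labels in documents:
        # At document level, each label counts at most once per document.
        counts.update(set(labels) if document_level else labels)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


documents = [
    ["item", "item", "state"],
    ["item"],
    ["state", "activity"],
]
total_counts = get_entity_label_counts(documents)
document_counts = get_entity_label_counts(documents, document_level=True)
```

In this data "item" is used three times in total but appears in only two documents, so its count drops from 3 to 2 when document_level is True.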

Module contents

Functions for working with Datasets in Puggle.