puggle.data_utils package
Submodules
puggle.data_utils.manipulation module
Tools for manipulating Puggle datasets. When this library is imported, the functions will be added to the Dataset class.
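The attachment mechanism described above can be illustrated with plain Python. This is only a sketch of the general pattern, not Puggle's actual code: the `Dataset` class below is a toy stand-in, and `count_documents` is a hypothetical helper invented for the example.

```python
class Dataset:
    """Toy stand-in for puggle's Dataset; the real class is not reproduced here."""

    def __init__(self, documents):
        self.documents = documents


def count_documents(self: Dataset) -> int:
    """Hypothetical helper written as a free function that takes `self`."""
    return len(self.documents)


# Attaching the function to the class makes it callable as a method
# on every instance, which is why each signature here starts with `self`.
setattr(Dataset, "count_documents", count_documents)

ds = Dataset(documents=[["pump", "leaking"], ["replace", "seal"]])
print(ds.count_documents())  # prints 2
```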
- puggle.data_utils.manipulation.drop_entity_class(self: Dataset, entity_class: str)
Remove all instances of the given entity class from the mentions of this Dataset, along with every Relation that references a deleted mention.
- Parameters
entity_class (str) – The entity class to remove.
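The cascading behaviour (removing mentions, then the relations that point at them) can be sketched on plain dictionaries. The document layout used here, with `id`, `labels`, `start` and `end` keys, is an assumption made for illustration and may not match Puggle's internal representation.

```python
def drop_entity_class_sketch(document: dict, entity_class: str) -> dict:
    """Return a copy of `document` without mentions of `entity_class`.

    Assumed layout (hypothetical): each mention has an "id" and a list of
    "labels"; each relation refers to mentions via "start" and "end" ids.
    """
    kept_mentions = [
        m for m in document["mentions"] if entity_class not in m["labels"]
    ]
    kept_ids = {m["id"] for m in kept_mentions}
    # A relation survives only if both of its mentions survived.
    kept_relations = [
        r for r in document["relations"]
        if r["start"] in kept_ids and r["end"] in kept_ids
    ]
    return {**document, "mentions": kept_mentions, "relations": kept_relations}


doc = {
    "tokens": ["pump", "is", "leaking"],
    "mentions": [
        {"id": 0, "labels": ["item"]},
        {"id": 1, "labels": ["state"]},
    ],
    "relations": [{"start": 0, "end": 1, "type": "has_state"}],
}
result = drop_entity_class_sketch(doc, "state")
```

In this sketch the input document is left unchanged; whether the real method mutates the Dataset in place is not stated in this reference.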
- puggle.data_utils.manipulation.drop_relation_class(self: Dataset, relation_class: str)
Remove all instances of the given relation class from the relations of this Dataset.
- Parameters
relation_class (str) – The relation class to remove.
- puggle.data_utils.manipulation.convert_entity_class(self: Dataset, original_ec: str, modified_ec: str)
Convert the given entity class from one label (original_ec) to another (modified_ec) across the entire Dataset.
- puggle.data_utils.manipulation.convert_relation_class(self: Dataset, original_rc: str, modified_rc: str)
Convert the given relation class from one label (original_rc) to another (modified_rc) across the entire Dataset.
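Both conversion functions amount to relabelling. A minimal sketch of that idea on a plain list of labels follows; `convert_label` is a hypothetical name, and the real functions operate on a Dataset rather than on a list.

```python
def convert_label(labels: list, original: str, modified: str) -> list:
    """Replace every occurrence of `original` with `modified`; keep the rest."""
    return [modified if label == original else label for label in labels]


converted = convert_label(["item", "state", "item"], "item", "asset")
print(converted)  # prints ['asset', 'state', 'asset']
```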
- puggle.data_utils.manipulation.flatten_all_entities(self: Dataset)
Flatten all entities, i.e. resolve every hierarchical entity class to its base class. For example, [“state/desirable”] becomes [“state”].
- puggle.data_utils.manipulation.flatten_all_relations(self: Dataset)
Flatten all relations, i.e. resolve every hierarchical relation class to its base class. For example, a label of the form [“state/desirable”] becomes [“state”].
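Flattening can be sketched as keeping only the part of a label before the first "/". The helper names below are hypothetical, and the de-duplication step is an assumption: this reference does not say what happens when two labels flatten to the same base class.

```python
def flatten_label(label: str) -> str:
    """Return the base class of a hierarchical label such as 'state/desirable'."""
    return label.split("/", 1)[0]


def flatten_labels(labels: list) -> list:
    """Flatten each label, dropping duplicates while preserving order (assumed)."""
    return list(dict.fromkeys(flatten_label(label) for label in labels))


print(flatten_labels(["state/desirable"]))  # prints ['state']
print(flatten_labels(["state/desirable", "state/undesirable", "item"]))
# prints ['state', 'item']
```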
puggle.data_utils.sampling module
Functions for generating samples from Puggle Datasets.
- puggle.data_utils.sampling.random_sample(self: Dataset, num_records: int) -> Dataset
Draw a random sample from this dataset and return a new Dataset containing num_records documents.
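As a rough illustration, sampling without replacement from a list of documents can be done with the standard library. The function name and the `seed` parameter are assumptions added for reproducibility; they are not part of the documented signature.

```python
import random


def random_sample_sketch(documents: list, num_records: int, seed=None) -> list:
    """Pick `num_records` distinct documents at random (without replacement)."""
    rng = random.Random(seed)
    # random.sample raises ValueError if num_records exceeds len(documents).
    return rng.sample(documents, num_records)


docs = [f"doc-{i}" for i in range(10)]
sample = random_sample_sketch(docs, 3, seed=1)
```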
- puggle.data_utils.sampling.random_split(self: Dataset) -> (Dataset, Dataset, Dataset)
Randomly split this dataset into three datasets: 80% train, 10% dev and 10% test.
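One way to produce an 80/10/10 split is to shuffle the documents and slice the result. This sketch works on a plain list; the `seed` parameter and the rounding rule (leftover documents go to the test set) are assumptions, since this reference does not describe either.

```python
import random


def random_split_sketch(documents: list, seed=None):
    """Shuffle a copy of `documents` and slice it into train/dev/test."""
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises in the cut points.
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


train, dev, test = random_split_sketch(list(range(100)), seed=0)
print(len(train), len(dev), len(test))  # prints 80 10 10
```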
- puggle.data_utils.sampling.smart_sample(self: Dataset, num_records: int, num_samples: int) -> Dataset
Warning
This function is experimental. It appears to work, but it has not been tested.
Run a ‘smart sample’ on the given dataset to return a new Dataset with num_records documents. The sampling process is repeated num_samples times and the best sample is selected. The algorithm aims to maximise the number of different tokens, entity classes and relation classes in the sample.
puggle.data_utils.statistics module
Statistics-based functions for the Dataset class.
- puggle.data_utils.statistics.get_unique_tokens_count(self: Dataset)
Return the number of unique tokens in this Dataset.
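Counting unique tokens reduces to building a set over every token in every document. The sketch below assumes each document is a list of token strings and that comparison is case-sensitive; neither detail is confirmed by this reference.

```python
def unique_tokens_count_sketch(documents: list) -> int:
    """Count distinct tokens across all documents (case-sensitive, assumed)."""
    return len({token for doc in documents for token in doc})


docs = [["pump", "is", "leaking"], ["pump", "seal", "is", "worn"]]
print(unique_tokens_count_sketch(docs))  # prints 5
```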
Module contents
Functions for working with Datasets in Puggle.