API Reference

puggle.Dataset module

A Dataset that stores a list of Documents.

class puggle.Dataset.Dataset

Bases: object

A class representing a Dataset, which stores a list of Documents.

Variables: documents – A List of puggle.Document.Document objects.

__init__(): Create an empty Dataset. Documents may be loaded via the puggle.Dataset.Dataset.load_documents() function.

load_documents(sd_filename: os.path = None, anns_filename: os.path = None, anns_format: str = None)

Load a set of documents from the structured data (a .csv file) and the annotations (a .json file). Either file may be omitted, but at least one of the two must be provided. The rows of the two files must correspond to each other, e.g. row 3 of the structured data csv must correspond to row 3 of the annotations json.

Parameters

sd_filename (os.path, optional) – The filepath of the structured data.
anns_filename (os.path, optional) – The filepath of the annotations.
anns_format (str) – The format of the annotations file. Can be either “quickgraph” or “spert”.
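
The row-by-row correspondence described above can be sketched in plain Python. This is only an illustration, not puggle's loader: the helper name pair_rows, the CSV column names and the annotation keys are assumptions, and the annotations are assumed here to be a JSON list with one entry per row.

```python
import csv
import io
import json


def pair_rows(sd_text, anns_text):
    """Pair row i of a CSV with entry i of a JSON list (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(sd_text)))
    anns = json.loads(anns_text)
    if len(rows) != len(anns):
        raise ValueError("structured data and annotations differ in length")
    return list(zip(rows, anns))


# Two structured rows and two annotation entries, in the same order.
sd_text = "id,cost\n1,100\n2,250\n"
anns_text = json.dumps([
    {"tokens": ["replace", "pump"], "mentions": [], "relations": []},
    {"tokens": ["fix", "leak"], "mentions": [], "relations": []},
])

pairs = pair_rows(sd_text, anns_text)
```

Each element of pairs holds one structured row together with the annotation entry at the same position, which is the alignment that load_documents() expects of its two input files.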

save_to_file(filename: str, output_format: str = 'json')

Save the documents of this dataset to the given filename.

The output_format parameter selects the format to save in; the available options are described under Parameters below. See the “Basic functionality” section of the documentation for more info.

Parameters

filename (str) – The filename to save to.
output_format (str) – The format to save to. ‘json’ will save as a json file without any special formatting. ‘spert’ will save it ready for using in SPERT. ‘quickgraph’ will save as a json file that can be loaded directly into quickgraph.

load_into_neo4j(recreate=False)

Load the Dataset into a Neo4j database. Automatically creates Nodes from the entities (mentions) appearing in each document, and relationships between them via the Relations.

Parameters: recreate (bool, optional) – If true, the Neo4j db will be cleared prior to inserting the documents into it.
Raises: RuntimeError – If the Neo4j server is not running.

add_document(document: Document)

Add the given Document to this Dataset.

Parameters: document (Document) – The Document to add.

create_neo4j_csvs(documents_path: str, entities_path: str, relations_path: str, document_entities_path: str)

Generate a set of CSVs that can be loaded into Neo4j via IMPORT statements. This is an alternative for those who want to save their graph to disk and import it later or elsewhere.

Parameters

documents_path (str) – Path to save the documents (CSV).
entities_path (str) – Path to save the entities (CSV).
relations_path (str) – Path to save the relations (CSV).
document_entities_path (str) – Path to save the relationships between entities and the documents in which they appear (CSV).

get_stats()

Return a string of some useful stats of this dataset.

Returns: Stats (num docs, mentions, rels)
Return type: str

to_list()

Return a list representation of this dataset.

Returns

A list of Dicts, where each Dict is one document from this dataset.

Return type

list[Dict]

convert_entity_class(original_ec: str, modified_ec: str)

Manipulation: Convert the given entity class from one label to another across the entire Dataset.

Parameters

original_ec (str) – The entity class to change.
modified_ec (str) – The entity class to change to.

convert_relation_class(original_rc: str, modified_rc: str)

Manipulation: Convert the given relation class from one label to another across the entire Dataset.

Parameters

original_rc (str) – The relation class to change.
modified_rc (str) – The relation class to change to.

drop_entity_class(entity_class: str)

Manipulation: Remove any instances of the given entity class from the mentions of this Dataset. Also remove all Relations referencing the deleted mentions.

Parameters: entity_class (str) – The entity class to remove.
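
The cascading removal described above can be sketched on plain dictionaries. This is a minimal sketch and not puggle's implementation: the helper below is hypothetical, and it assumes mentions are dicts with a ‘label’ key and relations are dicts whose ‘start’ and ‘end’ are mention indices.

```python
def drop_entity_class(mentions, relations, entity_class):
    """Return (mentions, relations) without the given class (illustrative helper)."""
    kept = [i for i, m in enumerate(mentions) if m["label"] != entity_class]
    # Map each surviving mention's old index to its new index.
    new_index = {old: new for new, old in enumerate(kept)}
    new_mentions = [mentions[i] for i in kept]
    # Keep only relations whose two mentions both survived, re-indexed.
    new_relations = [
        {**r, "start": new_index[r["start"]], "end": new_index[r["end"]]}
        for r in relations
        if r["start"] in new_index and r["end"] in new_index
    ]
    return new_mentions, new_relations


mentions = [
    {"start": 0, "end": 1, "label": "activity"},
    {"start": 1, "end": 2, "label": "item"},
    {"start": 3, "end": 4, "label": "state"},
]
relations = [
    {"start": 0, "end": 1, "type": "has_patient"},
    {"start": 1, "end": 2, "type": "has_state"},
]

new_mentions, new_relations = drop_entity_class(mentions, relations, "activity")
```

Dropping ‘activity’ removes the first mention and, with it, the relation that referenced it; the remaining relation is re-indexed to point at the surviving mentions.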

drop_relation_class(relation_class: str)

Manipulation: Remove any instances of the given relation class from the relations of this Dataset.

Parameters: relation_class (str) – The relation class to remove.

flatten_all_entities(): Manipulation Flatten all entities, i.e. resolve all hierarchical entities to their base class. For example, [“state/desirable”] becomes [“state”], etc.

flatten_all_relations(): Manipulation Flatten all relations, i.e. resolve all hierarchical relations to their base class. For example, [“state/desirable”] becomes [“state”], etc.

get_entity_label_counts(document_level=False)

Statistics: Return a sorted list of (entity_label, freq) pairs in this Dataset. The frequency is the number of times that entity_label has been used.

Parameters: document_level (bool, optional) – If True, the counts will be the number of documents in which the entity label appears, rather than the total frequency of that entity label.
Returns: A sorted list of (entity_label, freq) pairs.
Return type: list[tuple]
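
The difference between total and document-level counts can be sketched with collections.Counter. The helper below is illustrative only: it represents each document as a plain list of mention labels, and it sorts by descending frequency, which is an assumption about the sort order.

```python
from collections import Counter


def entity_label_counts(docs, document_level=False):
    """Count entity labels over docs; each doc is a list of mention labels."""
    counts = Counter()
    for labels in docs:
        # At document level, a label counts at most once per document.
        counts.update(set(labels) if document_level else labels)
    return counts.most_common()


docs = [["item", "item", "activity"], ["item"], ["state", "state"]]
total = entity_label_counts(docs)
per_doc = entity_label_counts(docs, document_level=True)
```

Here ‘item’ is used three times in total but appears in only two documents, so its count drops from 3 to 2 when document_level is True.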

get_unique_tokens_count()

Statistics: Return the number of unique tokens in this Dataset.

Returns: The number of unique tokens in the dataset.
Return type: int

random_sample(num_records: int) → Dataset

Sampling: Run a ‘random sample’ over this Dataset to return a new Dataset with num_records documents.

Parameters

num_records (int) – The number of documents that should appear in the output.

random_split() → (Dataset, Dataset, Dataset)

Sampling: Randomly split this dataset into 3 datasets: 80% train, 10% dev, 10% test.

Returns: The train, dev and test datasets.
Return type: Dataset, Dataset, Dataset
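
An 80/10/10 split of this kind can be sketched over a plain list. This is not puggle's code: the function below works on any sequence, and its seed argument is an addition for reproducibility (the documented random_split() takes no arguments).

```python
import random


def random_split(docs, seed=None):
    """Shuffle docs and split them 80/10/10 (illustrative helper)."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


train, dev, test = random_split(range(50), seed=0)
```

With 50 documents this yields 40 train, 5 dev and 5 test documents, and every document lands in exactly one of the three parts.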

smart_sample(num_records: int, num_samples: int) → Dataset

Sampling

Warning

This function is experimental - it works, but hasn’t been tested.

Run a ‘smart sample’ on this Dataset to return a new Dataset with num_records documents. The sampling process is repeated num_samples times and the best sample is selected. The algorithm aims to maximise the number of different tokens, entity classes and relation classes in the sample.

Parameters

num_records (int) – The number of documents that should appear in the output.
num_samples (int) – Number of times to repeat the process. The idea is that the more samples are run, the more likely it is that the function will generate a better quality sample.
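
The repeat-and-keep-the-best idea can be sketched as below. This is a simplified illustration under stated assumptions, not the library's algorithm: documents are plain dicts with ‘tokens’ and ‘labels’ keys, the diversity score is simply the number of distinct tokens plus distinct labels, and the seed argument is added for reproducibility.

```python
import random


def diversity(sample):
    """Score a sample by its number of distinct tokens and labels."""
    tokens = {token for doc in sample for token in doc["tokens"]}
    labels = {label for doc in sample for label in doc["labels"]}
    return len(tokens) + len(labels)


def smart_sample(docs, num_records, num_samples, seed=None):
    """Draw num_samples random samples and keep the most diverse one."""
    rng = random.Random(seed)
    candidates = [rng.sample(docs, num_records) for _ in range(num_samples)]
    return max(candidates, key=diversity)


docs = [
    {"tokens": ["replace", "pump"], "labels": ["activity", "item"]},
    {"tokens": ["replace", "pump"], "labels": ["activity", "item"]},
    {"tokens": ["leak", "found"], "labels": ["state"]},
    {"tokens": ["pump", "leaking"], "labels": ["item", "state"]},
]
best = smart_sample(docs, num_records=2, num_samples=20, seed=1)
```

Raising num_samples gives the procedure more candidates to choose from, which is why more repetitions tend to produce a more varied sample.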

split_sentences(delimiter='.')

Split each document of this Dataset into sentences.

Parameters: delimiter (str, optional) – The delimiter to use for splitting.
Returns: A new dataset, where each document is a sentence. Each doc also has a document_index, allowing the user to know which doc the sentence originally came from.
Return type: Dataset

puggle.Document module

A class representing a single Document. Contains an optional Annotation, and a list of fields.

class puggle.Document.Document(structured_fields: List[Dict] = None, annotation: Annotation = None, document_index: int = None)

Bases: object

A class representing a single Document. Contains an optional Annotation, and a list of fields.

__init__(structured_fields: List[Dict] = None, annotation: Annotation = None, document_index: int = None)

Create a new document.

Parameters

structured_fields (List[Dict], optional) – List of fields.
annotation (Annotation, optional) – The Annotation of the textual part of this document (such as annotations over the short text)
document_index (int, optional) – The index of the original document that this sentence came from. This is useful when splitting documents into sentences.

to_dict()

Return a dict of this Document.

Returns: A dictionary representing this document.
Return type: dict

split_sentences(delimiter)

Split this document into sentences, i.e. a list of Documents that have been split by the given delimiter.

Parameters

delimiter (str) – The delimiter to use.

Returns

A list of Documents, and a list of Relations (List[Relation]) that were removed because they spanned multiple sentences.

Return type

List[Document]
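
Splitting on a delimiter and discarding relations that cross a sentence boundary can be sketched as follows. This is a simplified stand-in, not puggle's implementation: relations are reduced to pairs of token indices, and keeping the delimiter token at the end of each sentence is an assumption.

```python
def split_tokens(tokens, delimiter="."):
    """Split a token list into sentences, keeping the delimiter token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token == delimiter:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def sentence_of(token_index, sentences):
    """Return the index of the sentence containing the given token index."""
    seen = 0
    for i, sentence in enumerate(sentences):
        seen += len(sentence)
        if token_index < seen:
            return i
    raise IndexError(token_index)


tokens = ["pump", "leaking", ".", "replaced", "seal", "."]
sentences = split_tokens(tokens)

# Relations given as (head token index, tail token index) pairs.
relations = [(0, 1), (0, 4)]
kept = [r for r in relations
        if sentence_of(r[0], sentences) == sentence_of(r[1], sentences)]
removed = [r for r in relations if r not in kept]
```

The relation (0, 4) links a token in the first sentence to one in the second, so it cannot belong to either sentence-level document and ends up in the removed list.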

puggle.Annotation module

A class that stores the annotations of a Document.

class puggle.Annotation.Annotation(tokens: list[str], mentions: list[Dict], relations: list[Dict] = None)

Bases: object

An Annotation for the textual portion of a Document. Contains a list of tokens, a list of mentions, a list of relations.

Variables

tokens – A list of tokens (strings).
mentions – A list of puggle.Mention.Mention objects.
relations – A list of puggle.Relation.Relation objects.

__init__(tokens: list[str], mentions: list[Dict], relations: list[Dict] = None)

Create a new Annotation.

Parameters

tokens (list) – The list of tokens of the document.
mentions (list) – The list of mentions of the document. Each mention must follow the correct format (‘start’, ‘end’, ‘label’).
relations (list, optional) – The list of relations of the document. Each relation must follow the correct format (‘start’, ‘end’, ‘type’).

to_dict()

Return a dictionary representation of this Annotation. Format will be similar to the input dataset.

Returns: Dictionary representation of this Annotation.
Return type: dict

static from_dict(d: dict)

Create an Annotation from a dictionary.

Parameters

d (dict) – The dictionary. Must contain tokens, mentions, relations.

Returns

An Annotation.

Return type

Annotation

Raises

ValueError – If the dictionary is missing a required key.
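
The required-key check can be illustrated with a small validator. This sketch does not call puggle: the helper check_annotation_dict and the error message are hypothetical, and the example dictionary follows the mention and relation formats given for __init__ above.

```python
REQUIRED_KEYS = ("tokens", "mentions", "relations")


def check_annotation_dict(d):
    """Raise ValueError if d lacks any required key (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in d]
    if missing:
        raise ValueError(f"missing required key(s): {', '.join(missing)}")
    return d


good = {
    "tokens": ["replace", "pump"],
    "mentions": [{"start": 0, "end": 1, "label": "activity"},
                 {"start": 1, "end": 2, "label": "item"}],
    "relations": [{"start": 0, "end": 1, "type": "has_patient"}],
}
check_annotation_dict(good)

# A dictionary without mentions and relations is rejected.
try:
    check_annotation_dict({"tokens": ["replace", "pump"]})
    error = None
except ValueError as exc:
    error = str(exc)
```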

puggle.Mention module

A mention, which is stored by the Annotation class.

class puggle.Mention.Mention(start: int, end: int, tokens: list, label: str, mention_id: int)

Bases: object

A single entity mention. Captures the start, end, tokens and label of the mention.

__init__(start: int, end: int, tokens: list, label: str, mention_id: int)

Create a new Mention.

Parameters

start (int) – The index of the first token of the mention.
end (int) – The index of the last token of the mention.
tokens (list) – The list of tokens appearing in the mention.
label (str) – The label of the mention.
mention_id (int) – The index of this mention with respect to the Document in which it appears.

to_dict()

Return a dictionary representation of this mention. The mention_id is not included, as it is only used when creating Relation objects between Mentions.

Returns: The mention as a dictionary.
Return type: dict

puggle.Relation module

A relation, which captures the relationship between one Mention and another. It is stored by the Annotation class.

class puggle.Relation.Relation(start: Mention, end: Mention, label: str)

Bases: object

A single relation. Captures the start mention, end mention and label of the relation.

__init__(start: Mention, end: Mention, label: str)

Create a new Relation.

Parameters

start (Mention) – The start (head) Mention.
end (Mention) – The end (tail) Mention.
label (str) – The label (type) of the relation.

Raises

ValueError – If the start and end Mentions are the same.

to_dict()

Return a dictionary representation of this Relation. Convert ‘start’ and ‘end’ to the mention index of this relation’s start and end mention.

Returns: Dictionary representation of this Relation.
Return type: dict
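
How a relation refers to its mentions by index can be sketched with minimal stand-in classes. These are not the puggle classes: the fields are trimmed down, the ‘type’ key in the output follows the relation format given for Annotation, and treating "the same" as object identity is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """Minimal stand-in for a mention (not the puggle class)."""
    start: int
    end: int
    label: str
    mention_id: int


class Relation:
    """Minimal stand-in for a relation between two mentions."""

    def __init__(self, start, end, label):
        if start is end:
            raise ValueError("start and end must be different mentions")
        self.start, self.end, self.label = start, end, label

    def to_dict(self):
        # Refer to each mention by its index within the document.
        return {"start": self.start.mention_id,
                "end": self.end.mention_id,
                "type": self.label}


activity = Mention(0, 1, "activity", mention_id=0)
item = Mention(1, 2, "item", mention_id=1)
relation = Relation(activity, item, "has_patient")
as_dict = relation.to_dict()
```

Serialising by mention index keeps the dictionary compact and matches the input format, where relations point at positions in the mentions list.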