API Reference

puggle.Dataset module

A Dataset that stores a list of Documents.

class puggle.Dataset.Dataset

Bases: object

A class representing a Dataset, which stores a list of Documents.

Variables

documents – A List of puggle.Document.Document objects.

__init__()

Create an empty Dataset. Documents may be loaded via the puggle.Dataset.Dataset.load_documents() method.

load_documents(sd_filename: os.path = None, anns_filename: os.path = None, anns_format: str = None)

Load a set of documents given the filepath of the structured data (a .csv file) and the filepath of the annotations (a .json file). Documents can still be created if one of these files is missing, but not if both are missing. The rows of the two files must correspond to each other, e.g. row 3 of the structured data csv must correspond to row 3 of the annotations json.

Parameters
  • sd_filename (os.path, optional) – The filepath of the structured data.

  • anns_filename (os.path, optional) – The filepath of the annotations.

  • anns_format (str) – The format of the annotations file. Can be either “quickgraph” or “spert”.
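The row-correspondence requirement described above can be illustrated with a small standalone sketch. It does not use puggle itself; the helper name `pair_rows`, the in-memory file contents, and the assumption that the annotations file is a JSON list with one entry per row are all hypothetical and only serve to show the idea of index-aligned loading.

```python
import csv
import io
import json


def pair_rows(sd_text, anns_text):
    """Pair each structured-data row with the annotation at the same index."""
    rows = list(csv.DictReader(io.StringIO(sd_text)))
    anns = json.loads(anns_text)  # assumed: a JSON list, one entry per row
    if len(rows) != len(anns):
        raise ValueError(
            f"Row count mismatch: {len(rows)} CSV rows vs {len(anns)} annotations"
        )
    return list(zip(rows, anns))


# Hypothetical in-memory file contents, used only for illustration.
sd_text = "id,cost\nA1,100\nA2,250\n"
anns_text = json.dumps([
    {"tokens": ["replace", "pump"], "mentions": [], "relations": []},
    {"tokens": ["fix", "leak"], "mentions": [], "relations": []},
])

pairs = pair_rows(sd_text, anns_text)
print(pairs[1][0]["id"], pairs[1][1]["tokens"])
```

Checking the lengths up front makes a misaligned pair of files fail early rather than silently attaching annotations to the wrong rows.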

save_to_file(filename: str, output_format: str = 'json')

Save the documents of this dataset to the given filename.

There are three output_format options to choose from: json, spert and quickgraph. See the “Basic functionality” section of the documentation for more info.

Parameters
  • filename (str) – The filename to save to.

  • output_format (str) – The format to save to. ‘json’ will save as a json file without any special formatting. ‘spert’ will save it ready for using in SPERT. ‘quickgraph’ will save as a json file that can be loaded directly into quickgraph.

load_into_neo4j(recreate=False)

Load the Dataset into a Neo4j database. Automatically creates Nodes from the entities (mentions) appearing in each document, and relationships between them via the Relations.

Parameters

recreate (bool, optional) – If true, the Neo4j db will be cleared prior to inserting the documents into it.

Raises

RuntimeError – If the Neo4j server is not running.

add_document(document: Document)

Add the given Document to this Dataset.

Parameters

document (Document) – The Document to add.

create_neo4j_csvs(documents_path: str, entities_path: str, relations_path: str, document_entities_path: str)

Generate a set of CSVs that can be loaded into Neo4j via IMPORT statements. This is an alternative for those who want to save their graph to disk and import it later or elsewhere.

Parameters
  • documents_path (str) – Path to save the documents (CSV).

  • entities_path (str) – Path to save the entities (CSV).

  • relations_path (str) – Path to save the relations (CSV).

  • document_entities_path (str) – Path to save the relationships between entities and the documents in which they appear (CSV).

get_stats()

Return a string of some useful stats of this dataset.

Returns

Statistics of this dataset (number of documents, mentions and relations).

Return type

str

to_list()

Return a list representation of this dataset.

Returns

A list of Dicts, where each Dict is one document from this dataset.

Return type

list[Dict]

convert_entity_class(original_ec: str, modified_ec: str)

Manipulation: Convert the given entity class from one label to another across the entire Dataset.

Parameters
  • original_ec (str) – The entity class to change.

  • modified_ec (str) – The entity class to change to.
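As a rough illustration of what such a relabelling involves, the sketch below works on plain dictionaries rather than on puggle objects. The function name `convert_label` and the document layout (a `mentions` list whose items carry `start`, `end` and `label`) are assumptions made for the example, not puggle's implementation.

```python
def convert_label(documents, original, modified):
    """Relabel every mention whose label equals `original`, in place.

    Returns the number of mentions that were changed.
    """
    changed = 0
    for doc in documents:
        for mention in doc["mentions"]:
            if mention["label"] == original:
                mention["label"] = modified
                changed += 1
    return changed


# Hypothetical documents, used only for illustration.
docs = [
    {"mentions": [{"start": 0, "end": 0, "label": "Item"}]},
    {"mentions": [{"start": 1, "end": 2, "label": "Activity"},
                  {"start": 3, "end": 3, "label": "Item"}]},
]
n_changed = convert_label(docs, "Item", "PhysicalObject")
print(n_changed, docs[0]["mentions"][0]["label"])
```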

convert_relation_class(original_rc: str, modified_rc: str)

Manipulation: Convert the given relation class from one label to another across the entire Dataset.

Parameters
  • original_rc (str) – The relation class to change.

  • modified_rc (str) – The relation class to change to.

drop_entity_class(entity_class: str)

Manipulation: Remove any instances of the given entity class from the mentions of this Dataset. Also remove all Relations referencing the deleted mentions.

Parameters

entity_class (str) – The entity class to remove.
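Dropping an entity class has a knock-on effect on relations, which the following standalone sketch tries to make concrete. It assumes, for illustration only, that relations refer to mentions by their index in the mentions list and use the keys `start`, `end` and `type`; the function name `drop_label` is hypothetical and this is not puggle's own code.

```python
def drop_label(doc, entity_class):
    """Return a copy of `doc` without mentions of `entity_class`.

    Relations that reference a removed mention are dropped, and the
    remaining relations are re-indexed against the new mention list.
    """
    kept = [i for i, m in enumerate(doc["mentions"]) if m["label"] != entity_class]
    new_index = {old: new for new, old in enumerate(kept)}
    mentions = [doc["mentions"][i] for i in kept]
    relations = [
        {"start": new_index[r["start"]], "end": new_index[r["end"]], "type": r["type"]}
        for r in doc["relations"]
        if r["start"] in new_index and r["end"] in new_index
    ]
    return {"tokens": doc["tokens"], "mentions": mentions, "relations": relations}


# Hypothetical document, used only for illustration.
doc = {
    "tokens": ["replace", "pump", "in", "workshop"],
    "mentions": [
        {"start": 0, "end": 0, "label": "Activity"},
        {"start": 1, "end": 1, "label": "Item"},
        {"start": 3, "end": 3, "label": "Location"},
    ],
    "relations": [
        {"start": 0, "end": 1, "type": "has_patient"},
        {"start": 0, "end": 2, "type": "has_location"},
    ],
}
result = drop_label(doc, "Item")
print(result["mentions"])
print(result["relations"])
```

The re-indexing step matters in this layout: once a mention is removed, the positions of later mentions shift, so surviving relations must be updated to point at the new positions.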

drop_relation_class(relation_class: str)

Manipulation: Remove any instances of the given relation class from the relations of this Dataset.

Parameters

relation_class (str) – The relation class to remove.

flatten_all_entities()

Manipulation: Flatten all entities, i.e. resolve all hierarchical entities to their base class. For example, [“state/desirable”] becomes [“state”], etc.

flatten_all_relations()

Manipulation: Flatten all relations, i.e. resolve all hierarchical relations to their base class. For example, [“state/desirable”] becomes [“state”], etc.
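Flattening a hierarchical label amounts to keeping only the part before the first "/". A minimal sketch, assuming labels are slash-separated strings as in the example above (the helper names `flatten_label` and `flatten_labels` are hypothetical):

```python
def flatten_label(label):
    """Resolve a hierarchical label such as 'state/desirable' to its base class."""
    return label.split("/", 1)[0]


def flatten_labels(labels):
    """Flatten every label in a list, leaving non-hierarchical labels unchanged."""
    return [flatten_label(label) for label in labels]


flattened = flatten_labels(["state/desirable", "activity", "item/pump/centrifugal"])
print(flattened)
```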

get_entity_label_counts(document_level=False)

Statistics: Return a sorted list of (entity_label, freq) pairs in this Dataset. The frequency is the number of times that entity_label has been used.

Parameters

document_level (bool, optional) – If True, the counts will be the number of documents in which the entity label appears, rather than the total frequency of that entity label.

Returns

A sorted list of (entity_label, freq) pairs.

Return type

list[tuple]
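The difference between total counts and document-level counts can be sketched with `collections.Counter`. This is an illustration over plain dictionaries, not puggle's implementation; the function name `entity_label_counts` and the sort order (highest frequency first, ties broken alphabetically) are assumptions.

```python
from collections import Counter


def entity_label_counts(documents, document_level=False):
    """Count entity labels, either per mention or per document."""
    counts = Counter()
    for doc in documents:
        labels = [m["label"] for m in doc["mentions"]]
        if document_level:
            # Count each label at most once per document.
            labels = set(labels)
        counts.update(labels)
    # Assumed ordering: most frequent first, then alphabetical.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical documents, used only for illustration.
docs = [
    {"mentions": [{"label": "Item"}, {"label": "Item"}, {"label": "Activity"}]},
    {"mentions": [{"label": "Item"}, {"label": "State"}]},
]
total_counts = entity_label_counts(docs)
doc_counts = entity_label_counts(docs, document_level=True)
print(total_counts)
print(doc_counts)
```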

get_unique_tokens_count()

Statistics: Return the number of unique tokens in this Dataset.

Parameters

self (Dataset) – The dataset to use.

Returns

The number of unique tokens in the dataset.

Return type

int

random_sample(num_records: int) -> Dataset

Sampling: Run a ‘random sample’ over the given dataset to return a new Dataset with num_records documents.

Parameters
  • self (Dataset) – The dataset to sample.

  • num_records (int) – The number of documents that should appear in the output.

random_split() -> (Dataset, Dataset, Dataset)

Sampling: Randomly split this dataset into 3 datasets: 80% train, 10% dev, 10% test.

Returns

The train, dev and test datasets.

Return type

Dataset, Dataset, Dataset
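An 80/10/10 split can be sketched as a shuffle followed by slicing. The version below operates on a plain list; the `seed` argument is an assumption added for reproducibility and is not part of the documented signature, and the rounding behaviour for sizes not divisible by 10 is likewise a choice made for this example.

```python
import random


def random_split(documents, seed=None):
    """Shuffle `documents` and split them 80% / 10% / 10%."""
    rng = random.Random(seed)
    shuffled = list(documents)  # copy so the input order is untouched
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to test
    return train, dev, test


train, dev, test = random_split(range(50), seed=0)
print(len(train), len(dev), len(test))
```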

smart_sample(num_records: int, num_samples: int) -> Dataset

Sampling

Warning

This function is experimental - it works, but hasn’t been tested.

Run a ‘smart sample’ on the given dataset to return a new Dataset with num_records documents. Repeat the sampling process num_samples times and select the best example. The algorithm aims to maximise the number of different tokens, entity classes and relation classes in the sample.

Parameters
  • num_records (int) – The number of documents that should appear in the output.

  • num_samples (int) – Number of times to repeat the process. The idea is that the more samples are run, the more likely it is that the function will generate a better quality sample.
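The repeat-and-pick-the-best idea described above might look roughly like the sketch below. The scoring function `diversity` (the sum of distinct tokens, entity classes and relation classes) is an assumption, since the exact objective is not specified here; the `seed` argument and the dictionary layout of documents are also hypothetical.

```python
import random


def diversity(sample):
    """Assumed score: distinct tokens + entity classes + relation classes."""
    tokens, entity_classes, relation_classes = set(), set(), set()
    for doc in sample:
        tokens.update(doc["tokens"])
        entity_classes.update(m["label"] for m in doc["mentions"])
        relation_classes.update(r["type"] for r in doc["relations"])
    return len(tokens) + len(entity_classes) + len(relation_classes)


def smart_sample(documents, num_records, num_samples, seed=None):
    """Draw `num_samples` random samples and keep the most diverse one."""
    rng = random.Random(seed)
    candidates = [rng.sample(documents, num_records) for _ in range(num_samples)]
    return max(candidates, key=diversity)


# Hypothetical documents, used only for illustration.
docs = [
    {"tokens": ["replace", "pump"], "mentions": [{"label": "Item"}], "relations": []},
    {"tokens": ["replace", "pump"], "mentions": [{"label": "Item"}], "relations": []},
    {"tokens": ["fix", "leak"], "mentions": [{"label": "State"}],
     "relations": [{"type": "has_patient"}]},
    {"tokens": ["inspect", "valve"], "mentions": [{"label": "Activity"}], "relations": []},
]
best = smart_sample(docs, num_records=2, num_samples=20, seed=1)
print(diversity(best))
```

Because each candidate is an independent random draw, increasing num_samples can only keep or improve the best score found, at the cost of more computation.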

split_sentences(delimiter='.')

Split each document of this Dataset into sentences.

Parameters

delimiter (str, optional) – The delimiter to use for splitting.

Returns

A new dataset, where each document is a sentence. Each doc also has a document_index, allowing the user to know which doc the sentence originally came from.

Return type

Dataset
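How a document_index links each sentence back to its source document can be shown with a token-level sketch. This example works on plain token lists and ignores mentions and relations; the helper names and the choice to drop the delimiter token from the output are assumptions, not a description of puggle's behaviour.

```python
def split_tokens(tokens, delimiter="."):
    """Split a token list into sentences at each delimiter token."""
    sentences, current = [], []
    for token in tokens:
        if token == delimiter:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing sentence without a closing delimiter
        sentences.append(current)
    return sentences


def split_documents(documents, delimiter="."):
    """Return one entry per sentence, tagged with its source document index."""
    out = []
    for doc_idx, doc in enumerate(documents):
        for sentence in split_tokens(doc["tokens"], delimiter):
            out.append({"tokens": sentence, "document_index": doc_idx})
    return out


# Hypothetical documents, used only for illustration.
docs = [
    {"tokens": ["pump", "leaking", ".", "replace", "seal"]},
    {"tokens": ["ok", "."]},
]
sentences = split_documents(docs)
print(sentences)
```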

puggle.Document module

A class representing a single Document. Contains an optional Annotation, and a list of fields.

class puggle.Document.Document(structured_fields: List[Dict] = None, annotation: Annotation = None, document_index: int = None)

Bases: object

A class representing a single Document. Contains an optional Annotation, and a list of fields.

__init__(structured_fields: List[Dict] = None, annotation: Annotation = None, document_index: int = None)

Create a new document.

Parameters
  • structured_fields (List[Dict], optional) – List of fields.

  • annotation (Annotation, optional) – The Annotation of the textual part of this document (such as annotations over the short text).

  • document_index (int, optional) – The index of the original document that this sentence came from. This is set when documents are split into sentences.

to_dict()

Return a dict of this Document.

Returns

A dictionary representing this document.

Return type

dict

split_sentences(delimiter)

Split this document into sentences, i.e. a list of Documents that have been split by the given delimiter.

Parameters

delimiter (str) – The delimiter to use.

Returns

A list of Documents (one per sentence), and a list of the Relations that were removed because they spanned multiple sentences.

Return type

List[Document], List[Relation]

puggle.Annotation module

A class that stores the annotations of a Document.

class puggle.Annotation.Annotation(tokens: list[str], mentions: list[Dict], relations: list[Dict] = None)

Bases: object

An Annotation for the textual portion of a Document. Contains a list of tokens, a list of mentions, a list of relations.

__init__(tokens: list[str], mentions: list[Dict], relations: list[Dict] = None)

Create a new Annotation.

Parameters
  • tokens (list) – The list of tokens of the document.

  • mentions (list) – The list of mentions of the document. Each mention must follow the correct format (‘start’, ‘end’, ‘label’).

  • relations (list, optional) – The list of relations of the document. Each relation must follow the correct format (‘start’, ‘end’, ‘type’).

to_dict()

Return a dictionary representation of this Annotation. Format will be similar to the input dataset.

Returns

Dictionary representation of this Annotation.

Return type

dict

static from_dict(d: dict)

Create an Annotation from a dictionary.

Parameters

d (dict) – The dictionary. Must contain tokens, mentions, relations.

Returns

An Annotation.

Return type

Annotation

Raises
  • ValueError – If the dictionary is missing a required key.
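The required-key check described for from_dict can be sketched as a small validation function. The helper name `check_annotation_dict` and the error message are hypothetical; only the three key names come from the description above.

```python
REQUIRED_KEYS = ("tokens", "mentions", "relations")


def check_annotation_dict(d):
    """Raise ValueError if `d` lacks any key an Annotation needs."""
    missing = [key for key in REQUIRED_KEYS if key not in d]
    if missing:
        raise ValueError(f"Annotation dict is missing required key(s): {missing}")
    return d


# Hypothetical input, used only for illustration.
valid = check_annotation_dict(
    {"tokens": ["replace", "pump"],
     "mentions": [{"start": 1, "end": 1, "label": "Item"}],
     "relations": []}
)
print(sorted(valid))
```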

puggle.Mention module

A mention, which is stored by the Annotation class.

class puggle.Mention.Mention(start: int, end: int, tokens: list, label: str, mention_id: int)

Bases: object

A single entity mention. Captures the start, end, tokens and labels of the mention.

__init__(start: int, end: int, tokens: list, label: str, mention_id: int)

Create a new Mention.

Parameters
  • start (int) – The index of the first token of the mention.

  • end (int) – The index of the last token of the mention.

  • tokens (list) – The list of tokens appearing in the mention.

  • label (str) – The label of the mention.

  • mention_id (int) – The index of this mention with respect to the Document in which it appears.

to_dict()

Return a dictionary representation of this mention. The mention_id is not included, as it is only used when creating Relation objects between Mentions.

Returns

The mention as a dictionary.

Return type

dict

puggle.Relation module

A relation, which captures the relationship between one Mention and another. It is stored by the Annotation class.

class puggle.Relation.Relation(start: Mention, end: Mention, label: str)

Bases: object

A single relation. Captures the start mention, end mention and label of the relation.

__init__(start: Mention, end: Mention, label: str)

Create a new Relation.

Parameters
  • start (Mention) – The start (head) Mention.

  • end (Mention) – The end (tail) Mention.

  • label (str) – The label (type) of the relation.

Raises

ValueError – If the start and end Mentions are the same.

to_dict()

Return a dictionary representation of this Relation. Convert ‘start’ and ‘end’ to the mention index of this relation’s start and end mention.

Returns

Dictionary representation of this Relation.

Return type

dict
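To show how the Mention and Relation descriptions fit together, here is a minimal sketch using stand-in classes. `SketchMention` and `SketchRelation` are hypothetical names chosen to avoid confusion with the real classes; the output key `type` and the identity check used to detect "the same" mention are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SketchMention:
    """Stand-in for a mention: token span, label and per-document id."""
    start: int
    end: int
    tokens: list
    label: str
    mention_id: int


class SketchRelation:
    """Stand-in for a relation between two distinct mentions."""

    def __init__(self, start, end, label):
        if start is end:
            raise ValueError("start and end must be different Mentions")
        self.start = start
        self.end = end
        self.label = label

    def to_dict(self):
        # Refer to mentions by their index rather than embedding them.
        return {
            "start": self.start.mention_id,
            "end": self.end.mention_id,
            "type": self.label,
        }


# Hypothetical mentions over the tokens ["replace", "pump"].
activity = SketchMention(0, 0, ["replace"], "Activity", mention_id=0)
item = SketchMention(1, 1, ["pump"], "Item", mention_id=1)
relation_dict = SketchRelation(activity, item, "has_patient").to_dict()
print(relation_dict)
```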