API Reference
puggle.Dataset module
A Dataset that stores a list of Documents.
- class puggle.Dataset.Dataset
Bases: object
A class representing a Dataset, which stores a list of Documents.
- Variables
documents – A list of puggle.Document.Document objects.
- __init__()
Create an empty Dataset. Documents may be loaded via the puggle.Dataset.Dataset.load_documents() function.
- load_documents(sd_filename: os.path = None, anns_filename: os.path = None, anns_format: str = None)
Load a set of documents given the filepath of the structured data (a .csv file) and the filepath of the annotations (a .json file). Either file may be omitted, but at least one of the two must be provided. The rows of the two files must correspond to each other, e.g. row 3 of the structured data csv must correspond to row 3 of the annotations json.
- Parameters
sd_filename (os.path, optional) – The filepath of the structured data.
anns_filename (os.path, optional) – The filepath of the annotations.
anns_format (str) – The format of the annotations file. Can be either “quickgraph” or “spert”.
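The row-alignment requirement can be illustrated with a minimal sketch that uses only the standard library. This is not puggle's implementation: the helper name pair_rows and the in-memory inputs are hypothetical, and the annotations are assumed to be a JSON array with one object per document.

```python
import csv
import io
import json


def pair_rows(csv_text, json_text):
    """Pair row i of the structured data with annotation i (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    anns = json.loads(json_text)  # assumed: a JSON array, one object per document
    if len(rows) != len(anns):
        raise ValueError("structured data and annotations must have the same number of rows")
    return list(zip(rows, anns))


csv_text = "id,cost\n1,100\n2,250\n"
json_text = '[{"tokens": ["pump", "leaking"]}, {"tokens": ["replace", "seal"]}]'
pairs = pair_rows(csv_text, json_text)
```

Pairing by position is why the two files must list documents in the same order: nothing else ties a row to its annotation in this sketch.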
- save_to_file(filename: str, output_format: str = 'json')
Save the documents of this dataset to the given filename.
There are two output_format options to choose from: json and quickgraph. See the “Basic functionality” section of the documentation for more info.
- load_into_neo4j(recreate=False)
Load the Dataset into a Neo4j database. Automatically creates Nodes from the entities (mentions) appearing in each document, and relationships between them via the Relations.
- Parameters
recreate (bool, optional) – If true, the Neo4j db will be cleared prior to inserting the documents into it.
- Raises
RuntimeError – If the Neo4j server is not running.
- add_document(document: Document)
Add the given Document to this Dataset.
- Parameters
document (Document) – The Document to add.
- create_neo4j_csvs(documents_path: str, entities_path: str, relations_path: str, document_entities_path: str)
Generate a set of CSVs that can be loaded into Neo4j via IMPORT statements. This is an alternative for those who want to save their graph to disk and import it later or elsewhere.
- Parameters
- get_stats()
Return a string of some useful stats of this dataset.
- Returns
Stats (num docs, mentions, rels)
- Return type
str
- to_list()
Return a list representation of this dataset.
- Returns
A list of Dicts, where each Dict is one document from this dataset.
- Return type
list[Dict]
- convert_entity_class(original_ec: str, modified_ec: str)
Manipulation: Convert the given entity class from one label to another across the entire Dataset.
- convert_relation_class(original_rc: str, modified_rc: str)
Manipulation: Convert the given relation class from one label to another across the entire Dataset.
- drop_entity_class(entity_class: str)
Manipulation: Remove any instances of the given entity class from the mentions of this Dataset. Also remove all Relations referencing the deleted mentions.
- Parameters
entity_class (str) – The entity class to remove.
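The cascade from deleted mentions to their relations can be sketched as follows. This is an illustrative stand-in on plain dictionaries, not puggle's code: the function name drop_class_sketch is hypothetical, and relations are assumed to refer to mentions by a mention_id value.

```python
def drop_class_sketch(mentions, relations, entity_class):
    """Return (mentions, relations) with the given entity class removed."""
    kept = [m for m in mentions if m["label"] != entity_class]
    kept_ids = {m["mention_id"] for m in kept}
    # A relation survives only if both of its endpoints survive.
    kept_relations = [
        r for r in relations if r["start"] in kept_ids and r["end"] in kept_ids
    ]
    return kept, kept_relations


mentions = [
    {"mention_id": 0, "label": "Item"},
    {"mention_id": 1, "label": "Observation"},
    {"mention_id": 2, "label": "Activity"},
]
relations = [
    {"start": 0, "end": 1, "type": "has_observation"},
    {"start": 2, "end": 0, "type": "has_patient"},
]
new_mentions, new_relations = drop_class_sketch(mentions, relations, "Observation")
```

Here the "has_observation" relation is removed along with the "Observation" mention it pointed to, while "has_patient" is untouched.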
- drop_relation_class(relation_class: str)
Manipulation: Remove any instances of the given relation class from the relations of this Dataset.
- Parameters
relation_class (str) – The relation class to remove.
- flatten_all_entities()
Manipulation: Flatten all entities, i.e. resolve all hierarchical entities to their base class. For example, [“state/desirable”] becomes [“state”], etc.
- flatten_all_relations()
Manipulation: Flatten all relations, i.e. resolve all hierarchical relations to their base class. For example, [“state/desirable”] becomes [“state”], etc.
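Flattening as described for the two methods above amounts to keeping the first segment of a hierarchical label. A minimal sketch, assuming "/" is the hierarchy separator (as in the example) and using a hypothetical helper name:

```python
def flatten_label(label):
    """Resolve a hierarchical label such as 'state/desirable' to its base class."""
    return label.split("/")[0]


labels = ["state/desirable", "state", "activity/maintain/replace"]
flat = [flatten_label(label) for label in labels]
```

Labels that are already at the base level pass through unchanged.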
- get_entity_label_counts(document_level=False)
Statistics: Return a sorted list of (entity_label, freq) pairs in this Dataset. The frequency is the number of times that entity_label has been used.
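The shape of the returned list can be sketched with collections.Counter. This is not puggle's implementation: documents are represented as plain dictionaries, the helper name is hypothetical, and sorting by descending frequency is an assumption, since the sort order is not stated here.

```python
from collections import Counter


def label_counts_sketch(documents):
    """Count how often each entity label is used across all documents."""
    counts = Counter(m["label"] for doc in documents for m in doc["mentions"])
    return counts.most_common()  # assumed order: most frequent first


documents = [
    {"mentions": [{"label": "Item"}, {"label": "Observation"}]},
    {"mentions": [{"label": "Item"}]},
]
counts = label_counts_sketch(documents)
```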
- get_unique_tokens_count()
Statistics: Return the number of unique tokens in this Dataset.
- random_sample(num_records: int) -> Dataset
Sampling: Run a ‘random sample’ over the given dataset to return a new Dataset with num_records documents.
- random_split() -> (Dataset, Dataset, Dataset)
Sampling: Randomly split this dataset into 3 datasets: 80% train, 10% dev, 10% test.
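An 80/10/10 split of this kind can be sketched by shuffling and slicing. The function below is a hypothetical stand-in that works on any list of items. It is not puggle's code, and the seed parameter and the handling of remainders (they go to the test portion) are assumptions.

```python
import random


def split_80_10_10(items, seed=None):
    """Shuffle the items and slice them into train, dev and test portions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # any remainder lands here
    return train, dev, test


train, dev, test = split_80_10_10(range(100), seed=0)
```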
- smart_sample(num_records: int, num_samples: int) -> Dataset
Warning: This function is experimental - it works, but hasn’t been tested.
Sampling: Run a ‘smart sample’ on the given dataset to return a new Dataset with num_records documents. The sampling process is repeated num_samples times and the best sample is selected. The algorithm aims to maximise the number of different tokens, entity classes and relation classes in the sample.
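The repeat-and-pick-the-best idea can be sketched as below. This is only an illustration under stated assumptions, not puggle's algorithm: documents are plain dictionaries, the function names are hypothetical, and the score is assumed to be an unweighted sum of distinct tokens, entity classes and relation classes.

```python
import random


def diversity(sample):
    """Score a sample by how many distinct tokens and classes it contains."""
    tokens = {t for doc in sample for t in doc["tokens"]}
    entity_classes = {m["label"] for doc in sample for m in doc["mentions"]}
    relation_classes = {r["type"] for doc in sample for r in doc["relations"]}
    return len(tokens) + len(entity_classes) + len(relation_classes)


def smart_sample_sketch(documents, num_records, num_samples, seed=None):
    """Draw num_samples random samples and keep the most diverse one."""
    rng = random.Random(seed)
    candidates = [rng.sample(documents, num_records) for _ in range(num_samples)]
    return max(candidates, key=diversity)


documents = [
    {"tokens": ["pump", "leaking"],
     "mentions": [{"label": "Item"}, {"label": "Observation"}],
     "relations": [{"type": "has_observation"}]},
    {"tokens": ["pump", "pump"],
     "mentions": [{"label": "Item"}],
     "relations": []},
    {"tokens": ["replace", "seal"],
     "mentions": [{"label": "Activity"}, {"label": "Item"}],
     "relations": [{"type": "has_patient"}]},
]
best = smart_sample_sketch(documents, num_records=2, num_samples=20, seed=0)
```

A larger num_samples makes it more likely that a highly diverse sample is among the candidates, at the cost of more sampling work.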
- split_sentences(delimiter='.')
Split each document of this Dataset into sentences.
puggle.Document module
A class representing a single Document. Contains an optional Annotation, and a list of fields.
- class puggle.Document.Document(structured_fields: List[Dict] = None, annotation: Annotation = None, document_index: int = None)
Bases: object
A class representing a single Document. Contains an optional Annotation, and a list of fields.
- __init__(structured_fields: List[Dict] = None, annotation: Annotation = None, document_index: int = None)
Create a new document.
- Parameters
structured_fields (List[Dict], optional) – List of fields.
annotation (Annotation, optional) – The Annotation of the textual part of this document (such as annotations over the short text)
document_index (int, optional) – The index of the original document that this sentence came from. This is useful when splitting documents into sentences.
- to_dict()
Return a dict of this Document.
- Returns
A dictionary representing this document.
- Return type
dict
- split_sentences(delimiter)
Split this document into sentences, i.e. a list of Documents that have been split by the given delimiter.
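Sentence splitting on a delimiter token can be sketched on a bare token list. The helper below is hypothetical and not puggle's implementation: it assumes the delimiter token is discarded, it records the document_index of the source document on each sentence, and it ignores annotations (a real split would also need to re-index mentions and relations).

```python
def split_tokens(tokens, delimiter=".", document_index=None):
    """Split a token list into sentences at each delimiter token."""
    sentences, current = [], []
    for token in tokens:
        if token == delimiter:
            if current:  # skip empty sentences caused by repeated delimiters
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens without a closing delimiter
        sentences.append(current)
    return [{"tokens": s, "document_index": document_index} for s in sentences]


sentences = split_tokens(
    ["pump", "leaking", ".", "replace", "seal", "."], document_index=7
)
```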
puggle.Annotation module
A class that stores the annotations of a Document.
- class puggle.Annotation.Annotation(tokens: list[str], mentions: list[Dict], relations: list[Dict] = None)
Bases: object
An Annotation for the textual portion of a Document. Contains a list of tokens, a list of mentions and a list of relations.
- Variables
tokens – A list of tokens (strings).
mentions – A list of puggle.Mention.Mention objects.
relations – A list of puggle.Relation.Relation objects.
- __init__(tokens: list[str], mentions: list[Dict], relations: list[Dict] = None)
Create a new Annotation.
- Parameters
tokens (list) – The list of tokens of the document.
mentions (list) – The list of mentions of the document. Each mention must follow the correct format (‘start’, ‘end’, ‘label’).
relations (list, optional) – The list of relations of the document. Each relation must follow the correct format (‘start’, ‘end’, ‘type’).
- to_dict()
Return a dictionary representation of this Annotation. Format will be similar to the input dataset.
- Returns
Dictionary representation of this Annotation.
- Return type
dict
- static from_dict(d: dict)
Create an Annotation from a dictionary.
- Parameters
d (dict) – The dictionary. Must contain tokens, mentions, relations.
- Returns
An Annotation.
- Return type
Annotation
- Raises
ValueError – If the dictionary is missing a required key.
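The required-key check can be sketched as follows. The helper name and the error message are hypothetical; only the three required keys and the ValueError are taken from the description above.

```python
REQUIRED_KEYS = ("tokens", "mentions", "relations")


def check_annotation_dict(d):
    """Raise ValueError if the dictionary lacks any required key."""
    missing = [key for key in REQUIRED_KEYS if key not in d]
    if missing:
        raise ValueError("missing required key(s): " + ", ".join(missing))
    return d


ok = check_annotation_dict({"tokens": ["pump"], "mentions": [], "relations": []})

try:
    check_annotation_dict({"tokens": ["pump"], "mentions": []})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```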
puggle.Mention module
A mention, which is stored by the Annotation class.
- class puggle.Mention.Mention(start: int, end: int, tokens: list, label: str, mention_id: int)
Bases:
object
A single entity mention. Captures the start, end, tokens and label of the mention.
- __init__(start: int, end: int, tokens: list, label: str, mention_id: int)
Create a new Mention.
- Parameters
start (int) – The index of the first token of the mention.
end (int) – The index of the last token of the mention.
tokens (list) – The list of tokens appearing in the mention.
label (str) – The label of the mention.
mention_id (int) – The index of this mention with respect to the Document in which it appears.
puggle.Relation module
A relation, which captures the relationship between one Mention and another. It is stored by the Annotation class.
- class puggle.Relation.Relation(start: Mention, end: Mention, label: str)
Bases:
object
A single relation. Captures the start mention, end mention and label of the relation.
- __init__(start: Mention, end: Mention, label: str)
Create a new Relation.
- Parameters
- Raises
ValueError – If the start and end Mention are the same.
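The self-relation check can be sketched with two small dataclasses. MentionSketch and RelationSketch are hypothetical stand-ins, not puggle's classes, and comparing mentions by value (rather than by identity) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MentionSketch:
    start: int
    end: int
    label: str
    mention_id: int


@dataclass(frozen=True)
class RelationSketch:
    start: MentionSketch
    end: MentionSketch
    label: str

    def __post_init__(self):
        # Reject a relation that links a mention to itself.
        if self.start == self.end:
            raise ValueError("start and end Mention must differ")


pump = MentionSketch(start=0, end=0, label="Item", mention_id=0)
leak = MentionSketch(start=1, end=1, label="Observation", mention_id=1)
relation = RelationSketch(start=pump, end=leak, label="has_observation")

try:
    RelationSketch(start=pump, end=pump, label="has_observation")
    self_relation_error = None
except ValueError as exc:
    self_relation_error = str(exc)
```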