Source: keras_text/processing.py#L0


pad_sequences

pad_sequences(sequences, max_sentences=None, max_tokens=None, padding="pre", truncating="post", \
    value=0.0)

Pads each sequence to the same length (length of the longest sequence or provided override).

Args:

  • sequences: A list of lists (samples, words) or a list of lists of lists (samples, sentences, words).
  • max_sentences: The max number of sentences to use. If None, the largest sentence count is used.
  • max_tokens: The max number of words/tokens to use. If None, the largest token count is used.
  • padding: 'pre' or 'post', pad either before or after each sequence.
  • truncating: 'pre' or 'post', remove values from sequences larger than max_sentences or max_tokens, either at the beginning ('pre') or at the end ('post') of the sentence or word sequence respectively.
  • value: The padding value.

Returns:

Numpy array of (samples, max_sentences, max_tokens) or (samples, max_tokens) depending on the sequence input.

Raises:

  • ValueError: in case of invalid values for truncating or padding.
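
Example (a minimal sketch, assuming pad_sequences is imported from keras_text.processing; the input values and resulting shape are illustrative):

    from keras_text.processing import pad_sequences

    # Word-level input: list of list (samples, words).
    encoded = [[1, 2, 3], [4, 5], [6]]

    # Pad every sample to 4 tokens, padding before each sequence.
    padded = pad_sequences(encoded, max_tokens=4, padding="pre", value=0.0)
    print(padded.shape)  # (3, 4)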

unicodify

unicodify(texts)

Encodes all text sequences as unicode. This is a Python 2 hassle.

Args:

  • texts: The sequence of texts.

Returns:

Unicode encoded sequences.
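
Example (a small sketch, assuming unicodify is imported from keras_text.processing):

    from keras_text.processing import unicodify

    texts = ["hello world", "caf\xe9"]
    unicode_texts = unicodify(texts)  # every item is now a unicode string under Python 2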


Tokenizer

Tokenizer.has_vocab

Tokenizer.num_texts

The number of texts used to build the vocabulary.

Tokenizer.num_tokens

The number of unique tokens available for encoding/decoding. This can change with calls to apply_encoding_options.

Tokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

Tokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


Tokenizer.__init__

__init__(self, lang="en", lower=True)

Encodes text into (samples, aux_indices..., token) where each token is mapped to a unique index starting from 1. Note that 0 is reserved for unknown tokens.

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)

Tokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)
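
Example (a sketch using WordTokenizer as a concrete Tokenizer subclass; the corpus and thresholds are illustrative and assume the spacy English model is installed):

    from keras_text.processing import WordTokenizer

    corpus = ["The cat sat on the mat.", "The dog sat on the log."]
    tokenizer = WordTokenizer(lang="en", lower=True)
    tokenizer.build_vocab(corpus, verbose=0)

    # Keep only tokens seen at least twice and cap the vocabulary at 1000 tokens;
    # tokenization does not have to be re-run.
    tokenizer.apply_encoding_options(min_token_count=2, max_tokens=1000)
    encoded = tokenizer.encode_texts(corpus, verbose=0)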

Tokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

Tokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy and regenerate token_index using this method. The token index is subsequently used when encode_texts or decode_texts methods are called.
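
Example (a sketch of a custom filtering strategy; assumes a tokenizer whose vocabulary has already been built, e.g. the WordTokenizer from the sketch above, and an illustrative filter):

    # Keep frequent, non-numeric tokens only, then regenerate token_index.
    keep = [token for token, count in tokenizer.token_counts.items()
            if count >= 3 and not token.isdigit()]
    tokenizer.create_token_indices(keep)

    # Subsequent encode_texts/decode_texts calls use the regenerated token_index.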


Tokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.


Tokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary, with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: If True, unknown (out-of-vocabulary) tokens are mapped to 0; if False, they are excluded from the output. (Default value: False)
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.
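
Example (a round-trip sketch using WordTokenizer as a concrete subclass; the corpus is illustrative):

    from keras_text.processing import WordTokenizer

    corpus = ["The cat sat on the mat."]
    tokenizer = WordTokenizer()
    tokenizer.build_vocab(corpus, verbose=0)

    encoded = tokenizer.encode_texts(corpus, include_oov=True, verbose=0)
    decoded = tokenizer.decode_texts(encoded, unknown_token="<UNK>", inplace=False)
    # Tokens outside the vocabulary come back as "<UNK>".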


Tokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, use the get_stats method.
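
Example (a sketch of picking padding sizes from the length distributions; assumes a SentenceWordTokenizer, whose token_generator yields (text_idx, sentence_idx, word), and an illustrative percentile choice):

    import numpy as np
    from keras_text.processing import SentenceWordTokenizer

    tokenizer = SentenceWordTokenizer()
    tokenizer.build_vocab(["First sentence. Second sentence.", "Another document."], verbose=0)

    sentence_lengths = tokenizer.get_counts(0)  # sentences per text
    token_lengths = tokenizer.get_counts(1)     # words per sentence

    # Pad/truncate to the 95th percentile rather than the maximum.
    max_sentences = int(np.percentile(sentence_lengths, 95))
    max_tokens = int(np.percentile(token_lengths, 95))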


Tokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.


Tokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

Tokenizer.token_generator

token_generator(self, texts, **kwargs)

Generator for yielding tokens. You need to implement this method.

Args:

  • texts: The list of text items to tokenize.
  • **kwargs: The kwargs propagated from the build_vocab_and_encode or encode_texts call.

Returns:

(text_idx, aux_indices..., token) where aux_indices are optional. For example, if you want to vectorize texts as (text_idx, sentences, words), you should return (text_idx, sentence_idx, word_token). Similarly, you can include paragraph- or page-level information, if needed.
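
Example (a minimal sketch of a custom Tokenizer subclass; the whitespace splitting strategy is illustrative and not part of the library):

    from keras_text.processing import Tokenizer

    class WhitespaceTokenizer(Tokenizer):
        def token_generator(self, texts, **kwargs):
            # Yields (text_idx, token). Add more aux indices, e.g. sentence_idx,
            # to vectorize into deeper shapes such as (samples, sentences, words).
            for text_idx, text in enumerate(texts):
                for word in text.split():
                    yield text_idx, word.lower()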


WordTokenizer

WordTokenizer.has_vocab

WordTokenizer.num_texts

The number of texts used to build the vocabulary.

WordTokenizer.num_tokens

The number of unique tokens available for encoding/decoding. This can change with calls to apply_encoding_options.

WordTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

WordTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


WordTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \
    remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \
    exclude_entities=['PERSON'])

Encodes text into (samples, words)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also lower-cases the word irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, 1 million 300-dimensional GloVe vectors are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
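
Example (an end-to-end sketch for (samples, words) encoding with pad_sequences from this module; the corpus and max_tokens value are illustrative and assume the spacy English model is installed):

    from keras_text.processing import WordTokenizer, pad_sequences

    corpus = ["The quick brown fox jumps over the lazy dog.",
              "Pack my box with five dozen liquor jugs."]

    tokenizer = WordTokenizer(lang="en", lower=True)
    tokenizer.build_vocab(corpus, verbose=0)

    encoded = tokenizer.encode_texts(corpus, verbose=0)
    padded = pad_sequences(encoded, max_tokens=10)  # shape: (2, 10)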

WordTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

WordTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

WordTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy and regenerate token_index using this method. The token index is subsequently used when encode_texts or decode_texts methods are called.


WordTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.


WordTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary, with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: If True, unknown (out-of-vocabulary) tokens are mapped to 0; if False, they are excluded from the output. (Default value: False)
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.


WordTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, use the get_stats method.


WordTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.


WordTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

WordTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, word)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include n_threads/num_threads (the number of threads to use; defaults to num_cpus - 1) and batch_size (the number of texts to accumulate into a common working set before processing; default value: 1000).
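
Example (a sketch of forwarding the documented kwargs through build_vocab, which propagates them to token_generator; the values are illustrative):

    from keras_text.processing import WordTokenizer

    corpus = ["The quick brown fox jumps over the lazy dog."]
    tokenizer = WordTokenizer()
    tokenizer.build_vocab(corpus, verbose=1, n_threads=4, batch_size=500)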

SentenceWordTokenizer

SentenceWordTokenizer.has_vocab

SentenceWordTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceWordTokenizer.num_tokens

The number of unique tokens available for encoding/decoding. This can change with calls to apply_encoding_options.

SentenceWordTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

SentenceWordTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


SentenceWordTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \
    remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \
    exclude_entities=['PERSON'])

Encodes text into (samples, sentences, words)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also lower-cases the word irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, 1 million 300-dimensional GloVe vectors are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
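
Example (a sketch for hierarchical (samples, sentences, words) encoding padded to a fixed shape with pad_sequences; the corpus and length choices are illustrative):

    from keras_text.processing import SentenceWordTokenizer, pad_sequences

    corpus = ["This is the first document. It has two sentences.",
              "A single-sentence document."]

    tokenizer = SentenceWordTokenizer(lang="en", lower=True)
    tokenizer.build_vocab(corpus, verbose=0)

    encoded = tokenizer.encode_texts(corpus, verbose=0)
    padded = pad_sequences(encoded, max_sentences=5, max_tokens=20)
    # padded.shape == (2, 5, 20)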

SentenceWordTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

SentenceWordTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

SentenceWordTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy and regenerate token_index using this method. The token index is subsequently used when encode_texts or decode_texts methods are called.


SentenceWordTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.


SentenceWordTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary, with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: If True, unknown (out-of-vocabulary) tokens are mapped to 0; if False, they are excluded from the output. (Default value: False)
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.


SentenceWordTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, use the get_stats method.


SentenceWordTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.


SentenceWordTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

SentenceWordTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, sent_idx, word)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include n_threads/num_threads (the number of threads to use; defaults to num_cpus - 1) and batch_size (the number of texts to accumulate into a common working set before processing; default value: 1000).

CharTokenizer

CharTokenizer.has_vocab

CharTokenizer.num_texts

The number of texts used to build the vocabulary.

CharTokenizer.num_tokens

The number of unique tokens available for encoding/decoding. This can change with calls to apply_encoding_options.

CharTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

CharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


CharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, characters)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example charset = 'abc123'. If None, all characters will be used. (Default value: None)
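
Example (a sketch for character-level (samples, characters) encoding restricted to a small character set; the corpus and charset are illustrative):

    from keras_text.processing import CharTokenizer, pad_sequences

    corpus = ["abc cba", "aabbcc"]
    tokenizer = CharTokenizer(lang="en", lower=True, charset="abc ")
    tokenizer.build_vocab(corpus, verbose=0)

    encoded = tokenizer.encode_texts(corpus, verbose=0)
    padded = pad_sequences(encoded, max_tokens=16)  # shape: (2, 16)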

CharTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

CharTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

CharTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy and regenerate token_index using this method. The token index is subsequently used when encode_texts or decode_texts methods are called.


CharTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.


CharTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary, with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: If True, unknown (out-of-vocabulary) tokens are mapped to 0; if False, they are excluded from the output. (Default value: False)
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.


CharTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, use the get_stats method.


CharTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.


CharTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

CharTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, character)


SentenceCharTokenizer

SentenceCharTokenizer.has_vocab

SentenceCharTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceCharTokenizer.num_tokens

The number of unique tokens available for encoding/decoding. This can change with calls to apply_encoding_options.

SentenceCharTokenizer.token_counts

Dictionary of token -> count values for the text corpus used to build_vocab.

SentenceCharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


SentenceCharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, sentences, characters)

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example charset = 'abc123'. If None, all characters will be used. (Default value: None)

SentenceCharTokenizer.apply_encoding_options

apply_encoding_options(self, min_token_count=1, max_tokens=None)

Applies the given settings for subsequent calls to encode_texts and decode_texts. This allows you to play with different settings without having to re-run tokenization on the entire corpus.

Args:

  • min_token_count: The minimum token count (frequency) required for a token to be included during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
  • max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common max_tokens tokens will be kept. Set to None to keep everything. (Default value: None)

SentenceCharTokenizer.build_vocab

build_vocab(self, texts, verbose=1, **kwargs)

Builds the internal vocabulary and computes various statistics.

Args:

  • texts: The list of text items to encode.
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

SentenceCharTokenizer.create_token_indices

create_token_indices(self, tokens)

If apply_encoding_options is inadequate, one can retrieve tokens from self.token_counts, filter with a desired strategy and regenerate token_index using this method. The token index is subsequently used when encode_texts or decode_texts methods are called.


SentenceCharTokenizer.decode_texts

decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)

Decodes the texts using internal vocabulary. The list structure is maintained.

Args:

  • encoded_texts: The list of texts to decode.
  • unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
  • inplace: True to make changes inplace. (Default value: True)

Returns:

The decoded texts.


SentenceCharTokenizer.encode_texts

encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)

Encodes the given texts using the internal vocabulary, with optionally applied encoding options. See apply_encoding_options to set various options.

Args:

  • texts: The list of text items to encode.
  • include_oov: If True, unknown (out-of-vocabulary) tokens are mapped to 0; if False, they are excluded from the output. (Default value: False)
  • verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value: 1)
  • **kwargs: The kwargs for token_generator.

Returns:

The encoded texts.


SentenceCharTokenizer.get_counts

get_counts(self, i)

Numpy array of count values for aux_indices. For example, if token_generator generates (text_idx, sentence_idx, word), then get_counts(0) returns the numpy array of sentence lengths across texts. Similarly, get_counts(1) will return the numpy array of token lengths across sentences.

This is useful for plotting histograms or eyeballing the distributions. For standard statistics, use the get_stats method.


SentenceCharTokenizer.get_stats

get_stats(self, i)

Gets the standard statistics for aux_index i. For example, if token_generator generates (text_idx, sentence_idx, word), then get_stats(0) will return various statistics about sentence lengths across texts. Similarly, get_stats(1) will return statistics of token lengths across sentences.

This information can be used to pad or truncate inputs.


SentenceCharTokenizer.save

save(self, file_path)

Serializes this tokenizer to a file.

Args:

  • file_path: The file path to use.

SentenceCharTokenizer.token_generator

token_generator(self, texts, **kwargs)

Yields tokens from texts as (text_idx, sent_idx, character)

Args:

  • texts: The list of texts.
  • **kwargs: Supported args include n_threads/num_threads (the number of threads to use; defaults to num_cpus - 1) and batch_size (the number of texts to accumulate into a common working set before processing; default value: 1000).