Source: keras_text/processing.py#L0
pad_sequences
pad_sequences(sequences, max_sentences=None, max_tokens=None, padding="pre", truncating="post", \
value=0.0)
Pads each sequence to the same length (length of the longest sequence or provided override).
Args:
- sequences: A list of lists (samples, words) or a list of lists of lists (samples, sentences, words).
- max_sentences: The maximum number of sentences to use. If None, the largest sentence count is used.
- max_tokens: The maximum number of words (tokens) to use. If None, the largest token count is used.
- padding: 'pre' or 'post', pad either before or after each sequence.
- truncating: 'pre' or 'post', remove values from sequences larger than max_sentences or max_tokens either in the beginning or in the end of the sentence or word sequence respectively.
- value: The padding value.
Returns:
Numpy array of (samples, max_sentences, max_tokens) or (samples, max_tokens) depending on the sequence input.
Raises:
- ValueError: In case of invalid values for `truncating` or `padding`.
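A minimal usage sketch, assuming `pad_sequences` is importable from `keras_text.processing`; the index values below are purely illustrative.

```python
from keras_text.processing import pad_sequences

# (samples, words): ragged word-index sequences padded to a fixed width.
word_ids = [[4, 10, 2], [7, 1]]
X = pad_sequences(word_ids, max_tokens=5)  # numpy array of shape (2, 5)

# (samples, sentences, words): ragged on both inner axes.
doc_ids = [[[4, 10], [2]], [[7, 1, 9]]]
X3 = pad_sequences(doc_ids, max_sentences=3, max_tokens=4)  # shape (2, 3, 4)
```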
unicodify
unicodify(texts)
Encodes all text sequences as unicode. This is a Python 2 hassle.
Args:
- texts: The sequence of texts.
Returns:
Unicode encoded sequences.
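A short sketch, assuming `unicodify` is importable from `keras_text.processing`:

```python
from keras_text.processing import unicodify

texts = ["Hello world", "café", "Straße"]
texts = unicodify(texts)  # ensures every text is unicode (only relevant on Python 2)
```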
Tokenizer
Tokenizer.has_vocab
Tokenizer.num_texts
The number of texts used to build the vocabulary.
Tokenizer.num_tokens
The number of unique tokens available for encoding/decoding. This can change with calls to `apply_encoding_options`.
Tokenizer.token_counts
Dictionary of token -> count values for the text corpus used in `build_vocab`.
Tokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to `apply_encoding_options`.
Tokenizer.__init__
__init__(self, lang="en", lower=True)
Encodes text into `(samples, aux_indices..., token)` where each token is mapped to a unique index starting from 1. Note that 0 is reserved for unknown tokens.
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
Tokenizer.apply_encoding_options
apply_encoding_options(self, min_token_count=1, max_tokens=None)
Applies the given settings for subsequent calls to `encode_texts` and `decode_texts`. This allows you to play with different settings without having to re-run tokenization on the entire corpus.
Args:
- min_token_count: The minimum token count (frequency) required to include a token during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
- max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common `max_tokens` tokens will be kept. Set to None to keep everything. (Default value: None)
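A sketch of restricting the vocabulary after it has been built, shown with the `WordTokenizer` subclass documented below (names assumed importable from `keras_text.processing`; the cutoff values are illustrative):

```python
from keras_text.processing import WordTokenizer

texts = ["The cat sat on the mat.", "Dogs chase cats."]
tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)

# Drop rare tokens and cap the vocabulary at the 20,000 most frequent tokens.
# Tokens filtered out this way are encoded as 0 (the unknown token).
tokenizer.apply_encoding_options(min_token_count=5, max_tokens=20000)
encoded = tokenizer.encode_texts(texts)
```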
Tokenizer.build_vocab
build_vocab(self, texts, verbose=1, **kwargs)
Builds the internal vocabulary and computes various statistics.
Args:
- texts: The list of text items to encode.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
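A quick sketch of building a vocabulary and inspecting it. `WordTokenizer` is used here because `Tokenizer` itself leaves `token_generator` to subclasses; the printed values are indicative only.

```python
from keras_text.processing import WordTokenizer

texts = ["The quick brown fox.", "The lazy dog."]
tokenizer = WordTokenizer()
tokenizer.build_vocab(texts, verbose=1)

print(tokenizer.has_vocab)     # expected to be True once the vocabulary is built
print(tokenizer.num_texts)     # 2
print(tokenizer.num_tokens)    # number of unique tokens
print(tokenizer.token_counts)  # dict of token -> count
```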
Tokenizer.create_token_indices
create_token_indices(self, tokens)
If `apply_encoding_options` is inadequate, one can retrieve tokens from `self.token_counts`, filter with a desired strategy, and regenerate `token_index` using this method. The token index is subsequently used when the `encode_texts` or `decode_texts` methods are called.
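A hedged sketch of that workflow (the filtering rule is arbitrary and purely illustrative):

```python
from keras_text.processing import WordTokenizer

texts = ["The cat sat on the mat.", "The dog sat on the rug."]
tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)

# Keep only tokens seen at least twice, then rebuild the token index from that set.
kept = [token for token, count in tokenizer.token_counts.items() if count >= 2]
tokenizer.create_token_indices(kept)

# Subsequent encode_texts/decode_texts calls use the regenerated token_index.
encoded = tokenizer.encode_texts(texts)
```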
Tokenizer.decode_texts
decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)
Decodes the texts using internal vocabulary. The list structure is maintained.
Args:
- encoded_texts: The list of texts to decode.
- unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
- inplace: True to make changes in place. (Default value: True)
Returns:
The decoded texts.
Tokenizer.encode_texts
encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)
Encodes the given texts using the internal vocabulary with optionally applied encoding options. See `apply_encoding_options` to set various options.
Args:
- texts: The list of text items to encode.
- include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
Returns:
The encoded texts.
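A round-trip sketch combining `encode_texts` and `decode_texts`, again using `WordTokenizer`; the exact nesting of the output depends on the tokenizer subclass.

```python
from keras_text.processing import WordTokenizer

texts = ["The cat sat on the mat.", "Dogs bark."]
tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, verbose=0)        # nested lists of token ids
decoded = tokenizer.decode_texts(encoded, inplace=False)  # nested lists of token strings
```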
Tokenizer.get_counts
get_counts(self, i)
Numpy array of count values for aux_indices. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_counts(0)` returns the numpy array of sentence lengths across texts. Similarly, `get_counts(1)` will return the numpy array of token lengths across sentences.
This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the `get_stats` method.
Tokenizer.get_stats
get_stats(self, i)
Gets the standard statistics for aux_index `i`. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_stats(0)` will return various statistics about sentence lengths across texts. Similarly, `get_stats(1)` will return statistics of token lengths across sentences.
This information can be used to pad or truncate inputs.
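A hedged sketch of using the counts to pick padding lengths, shown with the `SentenceWordTokenizer` documented below (aux index 0 is sentences per text, index 1 is words per sentence); the percentile rule is an assumption about a reasonable heuristic, not library behavior.

```python
import numpy as np
from keras_text.processing import SentenceWordTokenizer, pad_sequences

texts = ["First sentence. Second sentence.", "Only one here."]
tokenizer = SentenceWordTokenizer()
tokenizer.build_vocab(texts)

print(tokenizer.get_stats(0))  # summary statistics of sentences per text
print(tokenizer.get_stats(1))  # summary statistics of words per sentence

# Use a high percentile of the raw counts to choose pad/truncate lengths.
max_sents = int(np.percentile(tokenizer.get_counts(0), 95))
max_words = int(np.percentile(tokenizer.get_counts(1), 95))
X = pad_sequences(tokenizer.encode_texts(texts),
                  max_sentences=max_sents, max_tokens=max_words)
```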
Tokenizer.save
save(self, file_path)
Serializes this tokenizer to a file.
Args:
- file_path: The file path to use.
Tokenizer.token_generator
token_generator(self, texts, **kwargs)
Generator for yielding tokens. You need to implement this method.
Args:
- texts: list of text items to tokenize.
**kwargs: The kwargs propagated from the `build_vocab_and_encode` or `encode_texts` call.
Returns:
`(text_idx, aux_indices..., token)` where aux_indices are optional. For example, if you want to vectorize texts as (text_idx, sentences, words), you should return `(text_idx, sentence_idx, word_token)`. Similarly, you can include paragraph, page level information etc., if needed.
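A minimal sketch of a custom subclass, assuming the base class only requires `token_generator` to be overridden as described above; the whitespace splitting is illustrative, not the library's behavior.

```python
from keras_text.processing import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    """Illustrative subclass: yields (text_idx, token) by splitting on whitespace."""

    def token_generator(self, texts, **kwargs):
        for text_idx, text in enumerate(texts):
            for token in text.split():
                yield text_idx, token
```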
WordTokenizer
WordTokenizer.has_vocab
WordTokenizer.num_texts
The number of texts used to build the vocabulary.
WordTokenizer.num_tokens
The number of unique tokens available for encoding/decoding. This can change with calls to `apply_encoding_options`.
WordTokenizer.token_counts
Dictionary of token -> count values for the text corpus used in `build_vocab`.
WordTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to `apply_encoding_options`.
WordTokenizer.__init__
__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \
remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \
exclude_entities=['PERSON'])
Encodes text into (samples, words)
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- lemmatize: Lemmatizes words when set to True. This also makes the word lower case irrespective of the `lower` setting. (Default value: False)
- remove_punct: Removes punctuation tokens if True. (Default value: True)
- remove_digits: Removes digit words if True. (Default value: True)
- remove_stop_words: Removes stop words if True. (Default value: False)
- exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, GloVe vectors (1 million word vocabulary, 300 dimensions) are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
- exclude_pos_tags: A list of parts of speech tags to exclude. Can be any of spacy.parts_of_speech.IDS (Default value: None)
- exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
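An end-to-end sketch using `WordTokenizer` together with `pad_sequences` to produce a `(samples, max_tokens)` array; the import path and corpus are assumptions for illustration.

```python
from keras_text.processing import WordTokenizer, pad_sequences

texts = ["The quick brown fox jumped over the lazy dog.",
         "A second, shorter document."]

tokenizer = WordTokenizer(lang="en", lower=True, lemmatize=False)
tokenizer.build_vocab(texts)
tokenizer.apply_encoding_options(min_token_count=1)

encoded = tokenizer.encode_texts(texts, verbose=0)
X = pad_sequences(encoded, max_tokens=50)  # numpy array of shape (2, 50)
```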
WordTokenizer.apply_encoding_options
apply_encoding_options(self, min_token_count=1, max_tokens=None)
Applies the given settings for subsequent calls to `encode_texts` and `decode_texts`. This allows you to play with different settings without having to re-run tokenization on the entire corpus.
Args:
- min_token_count: The minimum token count (frequency) required to include a token during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
- max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common `max_tokens` tokens will be kept. Set to None to keep everything. (Default value: None)
WordTokenizer.build_vocab
build_vocab(self, texts, verbose=1, **kwargs)
Builds the internal vocabulary and computes various statistics.
Args:
- texts: The list of text items to encode.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
WordTokenizer.create_token_indices
create_token_indices(self, tokens)
If `apply_encoding_options` is inadequate, one can retrieve tokens from `self.token_counts`, filter with a desired strategy, and regenerate `token_index` using this method. The token index is subsequently used when the `encode_texts` or `decode_texts` methods are called.
WordTokenizer.decode_texts
decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)
Decodes the texts using internal vocabulary. The list structure is maintained.
Args:
- encoded_texts: The list of texts to decode.
- unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
- inplace: True to make changes in place. (Default value: True)
Returns:
The decoded texts.
WordTokenizer.encode_texts
encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)
Encodes the given texts using the internal vocabulary with optionally applied encoding options. See `apply_encoding_options` to set various options.
Args:
- texts: The list of text items to encode.
- include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
Returns:
The encoded texts.
WordTokenizer.get_counts
get_counts(self, i)
Numpy array of count values for aux_indices. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_counts(0)` returns the numpy array of sentence lengths across texts. Similarly, `get_counts(1)` will return the numpy array of token lengths across sentences.
This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the `get_stats` method.
WordTokenizer.get_stats
get_stats(self, i)
Gets the standard statistics for aux_index `i`. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_stats(0)` will return various statistics about sentence lengths across texts. Similarly, `get_stats(1)` will return statistics of token lengths across sentences.
This information can be used to pad or truncate inputs.
WordTokenizer.save
save(self, file_path)
Serializes this tokenizer to a file.
Args:
- file_path: The file path to use.
WordTokenizer.token_generator
token_generator(self, texts, **kwargs)
Yields tokens from texts as (text_idx, word)
Args:
- texts: The list of texts.
**kwargs: Supported args include:
- n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
- batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)
SentenceWordTokenizer
SentenceWordTokenizer.has_vocab
SentenceWordTokenizer.num_texts
The number of texts used to build the vocabulary.
SentenceWordTokenizer.num_tokens
The number of unique tokens available for encoding/decoding. This can change with calls to `apply_encoding_options`.
SentenceWordTokenizer.token_counts
Dictionary of token -> count values for the text corpus used in `build_vocab`.
SentenceWordTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to `apply_encoding_options`.
SentenceWordTokenizer.__init__
__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \
remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \
exclude_entities=['PERSON'])
Encodes text into (samples, sentences, words)
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- lemmatize: Lemmatizes words when set to True. This also makes the word lower case irrespective of the `lower` setting. (Default value: False)
- remove_punct: Removes punctuation tokens if True. (Default value: True)
- remove_digits: Removes digit words if True. (Default value: True)
- remove_stop_words: Removes stop words if True. (Default value: False)
- exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, GloVe vectors (1 million word vocabulary, 300 dimensions) are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
- exclude_pos_tags: A list of parts of speech tags to exclude. Can be any of spacy.parts_of_speech.IDS (Default value: None)
- exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
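A sketch of producing a `(samples, sentences, words)` tensor by combining `SentenceWordTokenizer` with `pad_sequences`; the pad lengths are arbitrary.

```python
from keras_text.processing import SentenceWordTokenizer, pad_sequences

texts = ["First sentence here. Then a second one.",
         "A single short sentence."]

tokenizer = SentenceWordTokenizer()
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, verbose=0)
X = pad_sequences(encoded, max_sentences=5, max_tokens=20)  # shape (2, 5, 20)
```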
SentenceWordTokenizer.apply_encoding_options
apply_encoding_options(self, min_token_count=1, max_tokens=None)
Applies the given settings for subsequent calls to `encode_texts` and `decode_texts`. This allows you to play with different settings without having to re-run tokenization on the entire corpus.
Args:
- min_token_count: The minimum token count (frequency) required to include a token during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
- max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common `max_tokens` tokens will be kept. Set to None to keep everything. (Default value: None)
SentenceWordTokenizer.build_vocab
build_vocab(self, texts, verbose=1, **kwargs)
Builds the internal vocabulary and computes various statistics.
Args:
- texts: The list of text items to encode.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
SentenceWordTokenizer.create_token_indices
create_token_indices(self, tokens)
If `apply_encoding_options` is inadequate, one can retrieve tokens from `self.token_counts`, filter with a desired strategy, and regenerate `token_index` using this method. The token index is subsequently used when the `encode_texts` or `decode_texts` methods are called.
SentenceWordTokenizer.decode_texts
decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)
Decodes the texts using internal vocabulary. The list structure is maintained.
Args:
- encoded_texts: The list of texts to decode.
- unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
- inplace: True to make changes in place. (Default value: True)
Returns:
The decoded texts.
SentenceWordTokenizer.encode_texts
encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)
Encodes the given texts using the internal vocabulary with optionally applied encoding options. See `apply_encoding_options` to set various options.
Args:
- texts: The list of text items to encode.
- include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
Returns:
The encoded texts.
SentenceWordTokenizer.get_counts
get_counts(self, i)
Numpy array of count values for aux_indices. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_counts(0)` returns the numpy array of sentence lengths across texts. Similarly, `get_counts(1)` will return the numpy array of token lengths across sentences.
This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the `get_stats` method.
SentenceWordTokenizer.get_stats
get_stats(self, i)
Gets the standard statistics for aux_index `i`. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_stats(0)` will return various statistics about sentence lengths across texts. Similarly, `get_stats(1)` will return statistics of token lengths across sentences.
This information can be used to pad or truncate inputs.
SentenceWordTokenizer.save
save(self, file_path)
Serializes this tokenizer to a file.
Args:
- file_path: The file path to use.
SentenceWordTokenizer.token_generator
token_generator(self, texts, **kwargs)
Yields tokens from texts as (text_idx, sent_idx, word)
Args:
- texts: The list of texts.
**kwargs: Supported args include:
- n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
- batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)
CharTokenizer
CharTokenizer.has_vocab
CharTokenizer.num_texts
The number of texts used to build the vocabulary.
CharTokenizer.num_tokens
The number of unique tokens available for encoding/decoding. This can change with calls to `apply_encoding_options`.
CharTokenizer.token_counts
Dictionary of token -> count values for the text corpus used in `build_vocab`.
CharTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to `apply_encoding_options`.
CharTokenizer.__init__
__init__(self, lang="en", lower=True, charset=None)
Encodes text into (samples, characters)
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- charset: The character set to use. For example, `charset = 'abc123'`. If None, all characters will be used. (Default value: None)
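A small sketch restricting the tokenizer to a fixed character set; the charset value and pad length are illustrative.

```python
from keras_text.processing import CharTokenizer, pad_sequences

texts = ["Hello 123", "abc"]
tokenizer = CharTokenizer(lower=True,
                          charset="abcdefghijklmnopqrstuvwxyz0123456789 ")
tokenizer.build_vocab(texts)

encoded = tokenizer.encode_texts(texts, verbose=0)
X = pad_sequences(encoded, max_tokens=32)  # numpy array of shape (2, 32)
```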
CharTokenizer.apply_encoding_options
apply_encoding_options(self, min_token_count=1, max_tokens=None)
Applies the given settings for subsequent calls to `encode_texts` and `decode_texts`. This allows you to play with different settings without having to re-run tokenization on the entire corpus.
Args:
- min_token_count: The minimum token count (frequency) required to include a token during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
- max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common `max_tokens` tokens will be kept. Set to None to keep everything. (Default value: None)
CharTokenizer.build_vocab
build_vocab(self, texts, verbose=1, **kwargs)
Builds the internal vocabulary and computes various statistics.
Args:
- texts: The list of text items to encode.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
CharTokenizer.create_token_indices
create_token_indices(self, tokens)
If `apply_encoding_options` is inadequate, one can retrieve tokens from `self.token_counts`, filter with a desired strategy, and regenerate `token_index` using this method. The token index is subsequently used when the `encode_texts` or `decode_texts` methods are called.
CharTokenizer.decode_texts
decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)
Decodes the texts using internal vocabulary. The list structure is maintained.
Args:
- encoded_texts: The list of texts to decode.
- unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
- inplace: True to make changes in place. (Default value: True)
Returns:
The decoded texts.
CharTokenizer.encode_texts
encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)
Encodes the given texts using the internal vocabulary with optionally applied encoding options. See `apply_encoding_options` to set various options.
Args:
- texts: The list of text items to encode.
- include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
Returns:
The encoded texts.
CharTokenizer.get_counts
get_counts(self, i)
Numpy array of count values for aux_indices. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_counts(0)` returns the numpy array of sentence lengths across texts. Similarly, `get_counts(1)` will return the numpy array of token lengths across sentences.
This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the `get_stats` method.
CharTokenizer.get_stats
get_stats(self, i)
Gets the standard statistics for aux_index `i`. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_stats(0)` will return various statistics about sentence lengths across texts. Similarly, `get_stats(1)` will return statistics of token lengths across sentences.
This information can be used to pad or truncate inputs.
CharTokenizer.save
save(self, file_path)
Serializes this tokenizer to a file.
Args:
- file_path: The file path to use.
CharTokenizer.token_generator
token_generator(self, texts, **kwargs)
Yields tokens from texts as (text_idx, character)
SentenceCharTokenizer
SentenceCharTokenizer.has_vocab
SentenceCharTokenizer.num_texts
The number of texts used to build the vocabulary.
SentenceCharTokenizer.num_tokens
The number of unique tokens available for encoding/decoding. This can change with calls to `apply_encoding_options`.
SentenceCharTokenizer.token_counts
Dictionary of token -> count values for the text corpus used in `build_vocab`.
SentenceCharTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to `apply_encoding_options`.
SentenceCharTokenizer.__init__
__init__(self, lang="en", lower=True, charset=None)
Encodes text into (samples, sentences, characters)
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- charset: The character set to use. For example, `charset = 'abc123'`. If None, all characters will be used. (Default value: None)
SentenceCharTokenizer.apply_encoding_options
apply_encoding_options(self, min_token_count=1, max_tokens=None)
Applies the given settings for subsequent calls to `encode_texts` and `decode_texts`. This allows you to play with different settings without having to re-run tokenization on the entire corpus.
Args:
- min_token_count: The minimum token count (frequency) required to include a token during encoding. All tokens below this frequency will be encoded to 0, which corresponds to the unknown token. (Default value: 1)
- max_tokens: The maximum number of tokens to keep, based on their frequency. Only the most common `max_tokens` tokens will be kept. Set to None to keep everything. (Default value: None)
SentenceCharTokenizer.build_vocab
build_vocab(self, texts, verbose=1, **kwargs)
Builds the internal vocabulary and computes various statistics.
Args:
- texts: The list of text items to encode.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
SentenceCharTokenizer.create_token_indices
create_token_indices(self, tokens)
If `apply_encoding_options` is inadequate, one can retrieve tokens from `self.token_counts`, filter with a desired strategy, and regenerate `token_index` using this method. The token index is subsequently used when the `encode_texts` or `decode_texts` methods are called.
SentenceCharTokenizer.decode_texts
decode_texts(self, encoded_texts, unknown_token="<UNK>", inplace=True)
Decodes the texts using internal vocabulary. The list structure is maintained.
Args:
- encoded_texts: The list of texts to decode.
- unknown_token: The placeholder value for unknown tokens. (Default value: "<UNK>")
- inplace: True to make changes in place. (Default value: True)
Returns:
The decoded texts.
SentenceCharTokenizer.encode_texts
encode_texts(self, texts, include_oov=False, verbose=1, **kwargs)
Encodes the given texts using the internal vocabulary with optionally applied encoding options. See `apply_encoding_options` to set various options.
Args:
- texts: The list of text items to encode.
- include_oov: True to map unknown (out of vocab) tokens to 0. False to exclude the token.
- verbose: The verbosity level for progress. Can be 0, 1, 2. (Default value = 1)
**kwargs: The kwargs for `token_generator`.
Returns:
The encoded texts.
SentenceCharTokenizer.get_counts
get_counts(self, i)
Numpy array of count values for aux_indices. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_counts(0)` returns the numpy array of sentence lengths across texts. Similarly, `get_counts(1)` will return the numpy array of token lengths across sentences.
This is useful for plotting histograms or eyeballing the distributions. For standard statistics, you can use the `get_stats` method.
SentenceCharTokenizer.get_stats
get_stats(self, i)
Gets the standard statistics for aux_index `i`. For example, if `token_generator` generates `(text_idx, sentence_idx, word)`, then `get_stats(0)` will return various statistics about sentence lengths across texts. Similarly, `get_stats(1)` will return statistics of token lengths across sentences.
This information can be used to pad or truncate inputs.
SentenceCharTokenizer.save
save(self, file_path)
Serializes this tokenizer to a file.
Args:
- file_path: The file path to use.
SentenceCharTokenizer.token_generator
token_generator(self, texts, **kwargs)
Yields tokens from texts as (text_idx, sent_idx, character)
Args:
- texts: The list of texts.
**kwargs: Supported args include:
- n_threads/num_threads: The number of threads to use. Uses num_cpus - 1 by default.
- batch_size: The number of texts to accumulate into a common working set before processing. (Default value: 1000)