public class VectorIndexer extends Estimator<VectorIndexerModel>
Class for indexing categorical feature columns in a dataset of Vector.
This has 2 usage modes:
- Automatically identify categorical features (default behavior)
  - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
  - Set maxCategories to the maximum number of categories any categorical feature should have.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical
  - If maxCategories is set to be very large, then this will build an index of unique values for all features.
  - Warning: This can cause problems if features are continuous since this will collect ALL unique values to the driver.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.
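The decision rule in the two examples above can be sketched in plain Java, without any Spark dependency. This is only an illustration under stated assumptions: the class `CategoricalFeatureSketch` and the method `identifyCategorical` are hypothetical names, the data is held in a local array rather than a DataFrame, and a feature is treated as categorical when its number of distinct values is at most maxCategories, which is what the examples imply.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Spark: mimics the maxCategories decision rule.
class CategoricalFeatureSketch {

    // For each feature (column), returns true when the number of distinct
    // values observed in that column is at most maxCategories.
    static boolean[] identifyCategorical(double[][] rows, int maxCategories) {
        int numFeatures = rows[0].length;
        boolean[] categorical = new boolean[numFeatures];
        for (int j = 0; j < numFeatures; j++) {
            Set<Double> distinct = new TreeSet<>();
            for (double[] row : rows) {
                distinct.add(row[j]);
            }
            categorical[j] = distinct.size() <= maxCategories;
        }
        return categorical;
    }

    public static void main(String[] args) {
        // Feature 0 takes values {-1.0, 0.0}; feature 1 takes values {1.0, 3.0, 5.0}.
        double[][] rows = {
            {-1.0, 1.0},
            { 0.0, 3.0},
            {-1.0, 5.0},
        };
        System.out.println(Arrays.toString(identifyCategorical(rows, 2))); // [true, false]
        System.out.println(Arrays.toString(identifyCategorical(rows, 3))); // [true, true]
    }
}
```

The sketch collects every distinct value per feature, which is the cost the warning above refers to when continuous features are indexed with a very large maxCategories.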
This returns a model which can transform categorical features to use 0-based indices.
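As a rough illustration of what "transform categorical features to use 0-based indices" means, the sketch below rewrites one feature vector using per-feature lookup tables. It is plain Java, not the VectorIndexerModel implementation. The names `IndexedTransformSketch`, `transform`, and the `categoryMaps` argument are hypothetical, and rejecting unseen values is an assumption (the future-extensions list mentions an option for unknown categories as not yet available).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the VectorIndexerModel implementation.
class IndexedTransformSketch {

    // categoryMaps: feature position -> (original value -> 0-based category index).
    // Features without an entry are treated as continuous and copied unchanged.
    static double[] transform(double[] features,
                              Map<Integer, Map<Double, Integer>> categoryMaps) {
        double[] out = features.clone();
        for (Map.Entry<Integer, Map<Double, Integer>> entry : categoryMaps.entrySet()) {
            int feature = entry.getKey();
            Integer index = entry.getValue().get(features[feature]);
            if (index == null) {
                // Assumption: values not seen while building the maps are rejected.
                throw new IllegalArgumentException(
                    "Unseen value " + features[feature] + " for feature " + feature);
            }
            out[feature] = index;
        }
        return out;
    }

    public static void main(String[] args) {
        // Feature 0 is categorical with values {-1.0, 0.0}; feature 1 is continuous.
        Map<Double, Integer> feature0 = new HashMap<>();
        feature0.put(0.0, 0);
        feature0.put(-1.0, 1);
        Map<Integer, Map<Double, Integer>> categoryMaps = new HashMap<>();
        categoryMaps.put(0, feature0);

        double[] indexed = transform(new double[] {-1.0, 3.0}, categoryMaps);
        System.out.println(Arrays.toString(indexed)); // [1.0, 3.0]
    }
}
```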
Index stability:
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
- More stability may be added in the future.
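One way to honor the "value 0 maps to index 0" guarantee is sketched below in plain Java. The class `CategoryIndexSketch` and the method `buildIndex` are hypothetical names. Assigning the remaining values in ascending order is an assumption made only so the example is deterministic; as stated above, the documented behavior guarantees nothing beyond the mapping of 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Spark: builds a value -> index map for one feature.
class CategoryIndexSketch {

    static Map<Double, Integer> buildIndex(double[] values) {
        TreeSet<Double> distinct = new TreeSet<>();
        for (double v : values) {
            // Normalize -0.0 to 0.0 so both are treated as the same category.
            distinct.add(v == 0.0 ? 0.0 : v);
        }
        Map<Double, Integer> index = new LinkedHashMap<>();
        // If 0 is present, give it index 0 so zero entries stay zero after indexing.
        if (distinct.remove(0.0)) {
            index.put(0.0, 0);
        }
        // Assumption: the remaining values get the next indices in ascending order.
        for (double v : distinct) {
            index.put(v, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(buildIndex(new double[] {-1.0, 0.0, -1.0})); // {0.0=0, -1.0=1}
    }
}
```

Because 0 keeps index 0, entries that are zero in a sparse vector remain zero after indexing, which is the sparsity point made in the list above.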
TODO: Future extensions: The following functionality is planned for the future:
- Preserve metadata in transform; if a feature's metadata is already present, do not recompute.
- Specify certain features to not index, either via a parameter or via existing metadata.
- Add warning if a categorical feature has only 1 category.
- Add option for allowing unknown categories.
| Modifier and Type | Class and Description |
|---|---|
| static class | VectorIndexer.CategoryStats: Helper class for tracking unique values for each feature. |
| Constructor and Description |
|---|
| VectorIndexer() |
| VectorIndexer(String uid) |
| Modifier and Type | Method and Description |
|---|---|
| VectorIndexerModel | fit(DataFrame dataset): Fits a model to the input data. |
| int | getMaxCategories() |
| IntParam | maxCategories(): Threshold for the number of values a categorical feature can take. |
| VectorIndexer | setInputCol(String value) |
| VectorIndexer | setMaxCategories(int value) |
| VectorIndexer | setOutputCol(String value) |
| StructType | transformSchema(StructType schema): :: DeveloperApi :: |
| String | uid() |
Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Params: clear, copy, copyValues, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, setDefault, shouldOwn, validateParams
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
public VectorIndexer(String uid)
public VectorIndexer()
public String uid()
public VectorIndexer setMaxCategories(int value)
public VectorIndexer setInputCol(String value)
public VectorIndexer setOutputCol(String value)
public VectorIndexerModel fit(DataFrame dataset)
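Description copied from class: Estimator
Fits a model to the input data.
Specified by: fit in class Estimator<VectorIndexerModel>
Parameters: dataset - (undocumented)
public StructType transformSchema(StructType schema)
Description copied from class: PipelineStage
Derives the output schema from the input schema.
Specified by: transformSchema in class PipelineStage
Parameters: schema - (undocumented)
public IntParam maxCategories()
Threshold for the number of values a categorical feature can take. (default = 20)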
public int getMaxCategories()