LDAModel (Spark 2.4.6 JavaDoc)

Object
- org.apache.spark.ml.PipelineStage
  - org.apache.spark.ml.Transformer
    - org.apache.spark.ml.Model<LDAModel>
      - org.apache.spark.ml.clustering.LDAModel

All Implemented Interfaces:

java.io.Serializable, Logging, LDAParams, Params, HasCheckpointInterval, HasFeaturesCol, HasMaxIter, HasSeed, Identifiable, MLWritable

Direct Known Subclasses:

DistributedLDAModel, LocalLDAModel
```
public abstract class LDAModel
extends Model<LDAModel>
implements LDAParams, Logging, MLWritable
```
Model fitted by LDA.
param: vocabSize Vocabulary size (number of terms or words in the vocabulary)
param: sparkSession Used to construct local DataFrames for returning query results

See Also:

Serialized Form

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `Dataset<Row>` | `describeTopics()` |
| `Dataset<Row>` | `describeTopics(int maxTermsPerTopic)` Return the topics described by their top-weighted terms. |
| `Vector` | `estimatedDocConcentration()` Value for `docConcentration` estimated from data. |
| `abstract boolean` | `isDistributed()` Indicates whether this instance is of type `DistributedLDAModel`. |
| `double` | `logLikelihood(Dataset<?> dataset)` Calculates a lower bound on the log likelihood of the entire corpus. |
| `double` | `logPerplexity(Dataset<?> dataset)` Calculate an upper bound on perplexity. |
| `LDAModel` | `setFeaturesCol(String value)` The features for LDA should be a `Vector` representing the word counts in a document. |
| `LDAModel` | `setSeed(long value)` |
| `LDAModel` | `setTopicDistributionCol(String value)` |
| `Matrix` | `topicsMatrix()` Inferred topics, where each topic is represented by a distribution over terms. |
| `Dataset<Row>` | `transform(Dataset<?> dataset)` Transforms the input dataset. |
| `StructType` | `transformSchema(StructType schema)` :: DeveloperApi :: |
| `String` | `uid()` An immutable unique ID for the object and its derivatives. |
| `int` | `vocabSize()` |

Methods inherited from class org.apache.spark.ml.Model
copy, hasParent, parent, setParent

Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.clustering.LDAParams
docConcentration, getDocConcentration, getK, getKeepLastCheckpoint, getLearningDecay, getLearningOffset, getOldDocConcentration, getOldOptimizer, getOldTopicConcentration, getOptimizeDocConcentration, getOptimizer, getSubsamplingRate, getTopicConcentration, getTopicDistributionCol, k, keepLastCheckpoint, learningDecay, learningOffset, optimizeDocConcentration, optimizer, subsamplingRate, supportedOptimizers, topicConcentration, topicDistributionCol, validateAndTransformSchema

Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
featuresCol, getFeaturesCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxIter
getMaxIter, maxIter

Methods inherited from interface org.apache.spark.ml.param.shared.HasSeed
getSeed, seed

Methods inherited from interface org.apache.spark.ml.param.shared.HasCheckpointInterval
checkpointInterval, getCheckpointInterval

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copy, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString

Methods inherited from interface org.apache.spark.internal.Logging
initializeLogging, initializeLogIfNecessary, initializeLogIfNecessary, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save, write

- Method Detail
  - describeTopics
```
public Dataset<Row> describeTopics(int maxTermsPerTopic)
```
    Return the topics described by their top-weighted terms.
    
    Parameters:
    
    maxTermsPerTopic - Maximum number of terms to collect for each topic. Default value of 10.
    
    Returns:
    
    Local DataFrame with one topic per Row, with columns:
    - "topic": IntegerType: topic index
    - "termIndices": ArrayType(IntegerType): term indices, sorted in order of decreasing term importance
    - "termWeights": ArrayType(DoubleType): corresponding sorted term weights
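
    The ordering of "termIndices" can be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark's implementation: the class `TopTermsSketch`, the method `topTermIndices`, and the sample weights are hypothetical names and values chosen for illustration. It picks the indices of the highest-weighted terms of a single topic, in order of decreasing weight, capped at `maxTermsPerTopic`.

```java
import java.util.Arrays;
import java.util.Comparator;

class TopTermsSketch {
    // Returns the indices of the highest-weighted terms of one topic,
    // ordered by decreasing weight. Hypothetical helper, not part of Spark.
    static int[] topTermIndices(double[] topicWeights, int maxTermsPerTopic) {
        Integer[] order = new Integer[topicWeights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sorting objects with Arrays.sort is stable, so equal weights keep index order.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -topicWeights[i]));

        // Never return more terms than the topic has, and never a negative count.
        int n = Math.max(0, Math.min(maxTermsPerTopic, order.length));
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    public static void main(String[] args) {
        // Assumed example weights for a vocabulary of four terms.
        double[] topic = {0.10, 0.45, 0.05, 0.40};
        System.out.println(Arrays.toString(topTermIndices(topic, 2))); // prints [1, 3]
    }
}
```

    In this sketch, asking for more terms than the vocabulary holds simply returns every term index, which is a design choice of the example rather than documented Spark behavior.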
  - describeTopics
```
public Dataset<Row> describeTopics()
```
    Return the topics described by their top-weighted terms, using the default maxTermsPerTopic of 10.
  - estimatedDocConcentration
```
public Vector estimatedDocConcentration()
```
    Value for docConcentration estimated from data. If Online LDA was used and optimizeDocConcentration was set to false, then this returns the fixed (given) value for the docConcentration parameter.
    
    Returns:
    
    (undocumented)
  - isDistributed
```
public abstract boolean isDistributed()
```
    Indicates whether this instance is of type DistributedLDAModel.
  - logLikelihood
```
public double logLikelihood(Dataset<?> dataset)
```
    Calculates a lower bound on the log likelihood of the entire corpus.
    See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
    WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
    
    Parameters:
    
    dataset - test corpus to use for calculating log likelihood
    
    Returns:
    
    variational lower bound on the log likelihood of the entire corpus
  - logPerplexity
```
public double logPerplexity(Dataset<?> dataset)
```
    Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
    WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
    
    Parameters:
    
    dataset - test corpus to use for calculating perplexity
    
    Returns:
    
    Variational upper bound on log perplexity per token.
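
    The "per token" wording above suggests how the two bounds relate. The sketch below assumes that the log perplexity bound is the negated log-likelihood lower bound divided by the total number of tokens in the corpus. That relationship is an assumption made for illustration, and the class `LogPerplexitySketch`, its methods, and the sample numbers are hypothetical, not Spark API.

```java
class LogPerplexitySketch {
    // Sums the word counts of every document to get the corpus token count.
    static double tokenCount(double[][] docTermCounts) {
        double total = 0.0;
        for (double[] doc : docTermCounts) {
            for (double count : doc) {
                total += count;
            }
        }
        return total;
    }

    // Assumed relationship: negate the log-likelihood lower bound and
    // normalize by the number of tokens to get a per-token upper bound.
    static double logPerplexity(double logLikelihoodBound, double tokens) {
        if (tokens <= 0) {
            throw new IllegalArgumentException("corpus must contain at least one token");
        }
        return -logLikelihoodBound / tokens;
    }

    public static void main(String[] args) {
        // Two hypothetical documents over a three-term vocabulary (8 tokens in total).
        double[][] corpus = {{2, 0, 1}, {0, 3, 2}};
        double bound = -16.0; // made-up log-likelihood lower bound
        System.out.println(logPerplexity(bound, tokenCount(corpus))); // prints 2.0
    }
}
```

    Under this assumption, a higher (less negative) log-likelihood bound on the same corpus gives a lower log perplexity, which matches the "lower is better" note above.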
  - setFeaturesCol
```
public LDAModel setFeaturesCol(String value)
```
    The features for LDA should be a Vector representing the word counts in a document. The vector should be of length vocabSize, with counts for each term (word).
    
    Parameters:
    
    value - (undocumented)
    
    Returns:
    
    (undocumented)
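
    To make the expected feature layout concrete, here is a minimal plain-Java sketch that turns one tokenized document into a count array of length vocabSize. It does not use Spark's `Vector` type. The class `WordCountVectorSketch`, the method `toCountVector`, and the assumption that each token has already been mapped to a term index are all hypothetical, for illustration only.

```java
import java.util.Arrays;

class WordCountVectorSketch {
    // Builds a dense count array of length vocabSize from the term indices
    // of one document. Hypothetical helper, not part of Spark.
    static double[] toCountVector(int[] termIndices, int vocabSize) {
        double[] counts = new double[vocabSize];
        for (int term : termIndices) {
            if (term < 0 || term >= vocabSize) {
                throw new IllegalArgumentException("term index out of range: " + term);
            }
            counts[term] += 1.0; // one more occurrence of this term
        }
        return counts;
    }

    public static void main(String[] args) {
        // A document whose tokens map to term indices 0, 2, 2 and 4 in a 5-term vocabulary.
        int[] doc = {0, 2, 2, 4};
        System.out.println(Arrays.toString(toCountVector(doc, 5))); // prints [1.0, 0.0, 2.0, 0.0, 1.0]
    }
}
```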
  - setSeed
```
public LDAModel setSeed(long value)
```
  - setTopicDistributionCol
```
public LDAModel setTopicDistributionCol(String value)
```
  - topicsMatrix
```
public Matrix topicsMatrix()
```
    Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
    WARNING: If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization ("em") optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
    
    Returns:
    
    (undocumented)
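
    The warning above is easier to judge with a rough size estimate. The sketch below assumes the matrix is dense, stores 8-byte doubles, and is laid out column by column. These are assumptions for illustration, and the class `TopicsMatrixSketch` and its methods are hypothetical rather than Spark API. It estimates the bytes needed for a vocabSize x k matrix and reads one topic column out of a flat array.

```java
import java.util.Arrays;

class TopicsMatrixSketch {
    // Approximate size in bytes of a dense vocabSize x k matrix of doubles.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long denseBytes(long vocabSize, long k) {
        return Math.multiplyExact(Math.multiplyExact(vocabSize, k), (long) Double.BYTES);
    }

    // Copies one topic (one column) out of a flat array, assuming the
    // values are stored column by column (column-major order).
    static double[] topicColumn(double[] values, int vocabSize, int topic) {
        return Arrays.copyOfRange(values, topic * vocabSize, (topic + 1) * vocabSize);
    }

    public static void main(String[] args) {
        // One million terms and 100 topics: about 800 MB of doubles.
        System.out.println(denseBytes(1_000_000L, 100L)); // prints 800000000

        // Made-up 2 x 2 matrix: topic 0 = (0.7, 0.3), topic 1 = (0.2, 0.8).
        double[] values = {0.7, 0.3, 0.2, 0.8};
        System.out.println(Arrays.toString(topicColumn(values, 2, 1))); // prints [0.2, 0.8]
    }
}
```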
  - transform
```
public Dataset<Row> transform(Dataset<?> dataset)
```
    Transforms the input dataset.
    WARNING: If this model is an instance of DistributedLDAModel (produced when optimizer is set to "em"), this involves collecting a large topicsMatrix to the driver. This implementation may be changed in the future.
    
    Specified by:
    
    transform in class Transformer
    
    Parameters:
    
    dataset - (undocumented)
    
    Returns:
    
    (undocumented)
  - transformSchema
```
public StructType transformSchema(StructType schema)
```
    Description copied from class: PipelineStage
    
    :: DeveloperApi ::
    Check transform validity and derive the output schema from the input schema.
    We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
    A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
    
    Specified by:
    
    transformSchema in class PipelineStage
    
    Parameters:
    
    schema - (undocumented)
    
    Returns:
    
    (undocumented)
  - uid
```
public String uid()
```
    Description copied from interface: Identifiable
    
    An immutable unique ID for the object and its derivatives.
    
    Specified by:
    
    uid in interface Identifiable
    
    Returns:
    
    (undocumented)
  - vocabSize
```
public int vocabSize()
```
