DocViz: How it works

DocViz lets you visualize the sentences in your document(s) in a 3D space. Here are the details of its underlying model, its potential uses, and its limitations.

Compute Sentence Embeddings

Given an input document, DocViz computes a sentence embedding for each sentence in the input using Sentence-BERT. The model used is "distilbert-base-nli-mean-tokens".
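
A minimal sketch of this step using the sentence-transformers library (the example sentences and variable names are illustrative, not DocViz's exact code):

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained Sentence-BERT model named above
model = SentenceTransformer("distilbert-base-nli-mean-tokens")

sentences = [
    "DocViz plots sentences in 3D space.",
    "Each point represents one sentence.",
]

# One 768-dimensional vector per sentence
embeddings = model.encode(sentences)  # shape: (n_sentences, 768)
```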

The computed embeddings are 768-dimensional vectors. I use PCA to reduce the embedding dimensions to 3. In the end, you get a 3-dimensional vector representing each sentence in the input.
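
Continuing the sketch above, the reduction could look like this with scikit-learn's PCA (again illustrative; points_3d is a hypothetical name):

```python
from sklearn.decomposition import PCA

# Project the 768-dimensional embeddings down to 3 dimensions
pca = PCA(n_components=3)
points_3d = pca.fit_transform(embeddings)  # shape: (n_sentences, 3)
```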

Plot Embeddings

I use Plotly for the interactive visualizations.
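
A minimal sketch of the plotting step, assuming the points_3d array and sentences list from the sketches above:

```python
import plotly.express as px

# One marker per sentence; hovering shows the sentence text
fig = px.scatter_3d(
    x=points_3d[:, 0],
    y=points_3d[:, 1],
    z=points_3d[:, 2],
    hover_name=sentences,
)
fig.show()
```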

How should I interpret the visualization?

Please refer to the technical details list towards the end of this page.

What should I be careful about?

Just like any other machine learning model, Sentence-BERT is trained on data, and where there is data, there is bias. Sentence-BERT is trained on such a HUGE amount of data that some people think the biases in the data cancel each other out. This is not true. Sentence-BERT HAS social, gender, and racial biases, because its training data was generated by humans, and we all have biases. So while Sentence-BERT can capture the meaning of language with high accuracy, like a human, it also interprets that meaning through its own biases, like a human. You need to acknowledge that whatever it shows you is NOT absolute. With that in mind, you have a better chance of using this model to its full potential. For more information on bias in contextual representations, you can check this paper.

Remember, you cannot reach conclusions from DocViz. It does not answer your questions; instead, it helps you come up with questions. You need to actively interpret the visualization, ideally in combination with other methods such as close reading or text mining. You are responsible for understanding the model's assumptions and limitations, and you are the one who draws meaning and information from the visualization.

What can I do with DocViz?

With all that's been said above, DocViz is still a useful tool for many kinds of humanities and computational research. Let me show you with an example. Below is an interactive graph. The data is the transcripts of Obama's and Trump's inaugural speeches.


Technical Details
Sentence-BERT

Sentence-BERT is a framework from the Sentence-Transformers library that can generate sentence embeddings from pre-trained BERT models. BERT is a language model pre-trained on BooksCorpus (800M words) and English Wikipedia (2,500M words) that produces contextual word-level embeddings. The Sentence-BERT model in DocViz uses Siamese networks and average pooling to obtain sentence-level representations from BERT's output, and is fine-tuned on Natural Language Inference data (SNLI). The model and training details can be found in this paper. The paper for the original BERT model is here.
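
To make "average pooling" concrete, here is a minimal sketch of how token-level BERT output can be mean-pooled into one sentence vector (illustrative only; the real implementation lives in the Sentence-Transformers library):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("DocViz plots sentences in 3D space.", return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**inputs).last_hidden_state  # (1, seq_len, 768)

# Average pooling: mask padding tokens, then average over the sequence
mask = inputs["attention_mask"].unsqueeze(-1)            # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                          # torch.Size([1, 768])
```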

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality-reduction method for increasing interpretability while minimizing information loss. It does so by finding the largest eigenvalues, and their corresponding eigenvectors, of the data's covariance matrix. You can think of it as defining a new (and smaller) set of dimensions for the data and representing the data in the space those dimensions define, so that the number of dimensions shrinks while information loss stays small. For more information regarding PCA, you can check this link.
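
Here is a small sketch of that description in NumPy, with random data standing in for sentence embeddings (illustrative; in practice a library implementation such as scikit-learn's is used):

```python
import numpy as np

X = np.random.rand(100, 768)                 # stand-in for 100 embeddings
X_centered = X - X.mean(axis=0)              # center the data
cov = np.cov(X_centered, rowvar=False)       # 768 x 768 covariance matrix

# eigh works for symmetric matrices and returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]  # 3 largest components

X_reduced = X_centered @ top3                # project onto the new 3D basis
print(X_reduced.shape)                       # (100, 3)
```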

PCA reduces the dimension of the sentence embeddings to 3, which allows DocViz to visualize them on a graph. PCA is an adaptive data analysis technique: its output changes based on the input. This is why you cannot make cross-graph comparisons with DocViz; the dimensions (axes) of each graph have different meanings.