DocViz: How it works
DocViz allows you to visualize the sentences in your document(s) in a 3D space. This page describes its underlying model, its potential uses, and its limitations.
Compute Sentence Embeddings
Given an input document, DocViz computes a sentence embedding for each sentence in the input using Sentence-BERT. The model used is "distilbert-base-nli-mean-tokens".
The computed embeddings are 768-dimensional vectors. I use PCA to reduce the embedding dimension to 3, so in the end each sentence in the input is represented by a 3-dimensional vector.
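For the technically inclined, here is a minimal sketch of this step. It assumes the sentence-transformers and scikit-learn packages, and the example sentences are made up for illustration; this is not the exact DocViz code.

```python
# A minimal sketch of the embedding step, not the actual DocViz source.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

sentences = [
    "The economy grew faster than expected.",     # illustrative sentences,
    "Growth in the economy exceeded forecasts.",  # not from a real document
    "The cat slept on the windowsill.",
]

# Each sentence becomes a 768-dimensional vector.
embeddings = model.encode(sentences)                      # shape: (3, 768)

# PCA keeps the 3 directions with the most variance.
reduced = PCA(n_components=3).fit_transform(embeddings)  # shape: (3, 3)
```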
Plot Embeddings
I use Plotly for the interactive visualizations.
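As a rough sketch of the plotting step (reusing reduced and sentences from the example above; the exact figure options DocViz uses are not shown here):

```python
import plotly.express as px

# One 3D point per sentence; hovering shows the sentence text.
fig = px.scatter_3d(
    x=reduced[:, 0], y=reduced[:, 1], z=reduced[:, 2],
    hover_name=sentences,
)
fig.show()
```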
How should I interpret the visualization?
For model specifics, please refer to the Technical Details section towards the end of this page.
- In short, the original embeddings obtained from Sentence-BERT can be viewed as contextual representations of sentences. The goal is to map each sentence to a vector space such that semantically similar sentences are close together. The embedding is assumed to capture the semantic (and syntactic) information of the text - the meaning and the structure of the language. This 768-dimensional embedding is absolute, meaning that you can compare any pair of embeddings generated by the same Sentence-BERT model.
- Reducing the dimensions of the original embeddings makes them easier to plot while keeping as much information as possible. Since PCA is performed on the given data, the embeddings obtained after PCA are relative. This means that when you use DocViz, you can only compare documents if you input them into the same graph - you cannot make cross-graph comparisons.
- The fundamental guide to viewing an embedding graph: look at the plotted points. Each represents a sentence (you can hover over a point to see which sentence it is). What you need to notice is the distance between points - points that are closer together are sentences that Sentence-BERT considers semantically closer, and thus closer in meaning, and vice versa (see the sketch after this list). Remember, all the conclusions you draw from the visualization rest on this assumption.
- However, one thing to be aware of is that, due to how the embeddings are calculated, Sentence-BERT tends to place short sentences near other short sentences and long sentences near other long ones. You can view this as Sentence-BERT also capturing some structural information about the sentences.
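To make the distance intuition concrete, here is a small sketch reusing reduced and sentences from the embedding example above. The expected ordering is an assumption about that toy data, not a guarantee:

```python
import numpy as np

# Pairwise Euclidean distances between the 3-dimensional points on the graph.
dist = np.linalg.norm(reduced[:, None, :] - reduced[None, :, :], axis=-1)

print(dist[0, 1])  # the two paraphrases: expected to be relatively close
print(dist[0, 2])  # paraphrase vs. unrelated sentence: expected to be farther
```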
What should I be careful about?
Just like any other machine learning model, Sentence-BERT is trained on data, and where there is data, there is bias. Sentence-BERT is trained on such HUGE amounts of data that some people think the biases in the data cancel each other out. But this is not true. Sentence-BERT HAS social, gender, and racial biases, because its data was generated by humans - and we all have biases. So while Sentence-BERT is able to capture the meaning of language much like a human, it also interprets that meaning through its own biases, like a human. You need to acknowledge that whatever it shows you is NOT absolute. With that in mind, you have a better chance of using this model to its full potential. For more information on bias in contextual representations, you can check this paper.
Remember, you cannot reach conclusions from DocViz alone - it does not answer your questions; instead, it helps you come up with questions. You need to actively interpret the visualization, ideally in combination with other methods such as close reading or text mining. You are responsible for understanding the assumptions of the models and their limitations. You are the one who draws meaning and information from the visualization.
What can I do with DocViz?
With all that has been said above, DocViz is still a useful tool for many kinds of humanities or computational research. Let me show you through an example. Below is an interactive graph. The data consists of the transcripts of Obama's and Trump's inaugural speeches (a sketch of how such a graph can be assembled follows the suggestions below).
Suggestions:
- Observe the graph from different angles. Rotate it and view it in 3D space. Do you observe any clusters in the data? Do you see any anomalies? Is there a region with an uneven distribution of red and blue points (hint: Make America Great Again)? Or are the distributions mostly similar?
- After you have observed some patterns, come up with questions. For example, if you think the distributions of Trump's and Obama's data in the space are fairly even, what does that mean? Does it mean that Obama and Trump, two presidents with distinct personalities, spoke similarly in their inaugural speeches? Or if you think there is a small cluster somewhere, what do the sentences in that cluster have in common? What questions can you ask about it?
- With the questions you come up with, try to answer them - not only with DocViz, but also through other computational or traditional tools: text mining, topic modeling, close reading... Find more data if possible, and formulate concrete hypotheses if possible.
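If you want to reproduce this kind of two-document comparison yourself, here is one possible sketch. The file names, the naive sentence splitting, and the plotting options are assumptions for illustration, not DocViz internals:

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import plotly.express as px

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Hypothetical file names; substitute your own transcripts.
docs = {
    "Obama": open("obama_2009.txt").read(),
    "Trump": open("trump_2017.txt").read(),
}

sentences, labels = [], []
for speaker, text in docs.items():
    for sent in text.split(". "):  # naive sentence splitting, for brevity
        sentences.append(sent)
        labels.append(speaker)

# Embed all sentences together so they share one PCA projection.
reduced = PCA(n_components=3).fit_transform(model.encode(sentences))

fig = px.scatter_3d(
    x=reduced[:, 0], y=reduced[:, 1], z=reduced[:, 2],
    color=labels, hover_name=sentences,
)
fig.show()
```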
Technical Details
Sentence-BERT
Sentence-BERT is a framework, available through the Sentence-Transformers library, that generates sentence embeddings from pre-trained BERT models. BERT is a language model pre-trained on BooksCorpus (800M words) and English Wikipedia (2,500M words) that produces contextual word-level embeddings. The Sentence-BERT model in DocViz uses Siamese networks and average pooling to obtain sentence-level representations from BERT's output, and is fine-tuned on Natural Language Inference data (SNLI). The model and training details can be found in this paper. The paper for the original BERT model is here.
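To illustrate the average-pooling idea (not the fine-tuned DocViz model itself), here is a sketch assuming the Hugging Face transformers package and a plain DistilBERT checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A plain DistilBERT checkpoint, standing in for the fine-tuned SBERT model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
bert = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    token_embeddings = bert(**inputs).last_hidden_state  # (1, seq_len, 768)

# Average the token vectors, ignoring padding, to get one sentence vector.
mask = inputs["attention_mask"].unsqueeze(-1)                        # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)  # (1, 768)
```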
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction method that increases interpretability while minimizing information loss. It does so by finding the largest eigenvalues, and the corresponding eigenvectors, of the data's covariance matrix. You can understand it as defining a new (and smaller) set of dimensions for the data and representing the data in the space spanned by those dimensions, so that the number of dimensions is reduced while information loss is minimized. For more information regarding PCA, you can check this link.
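Here is a sketch of that idea in plain NumPy, on random stand-in data (a real implementation would typically use a library routine such as scikit-learn's PCA):

```python
import numpy as np

X = np.random.randn(100, 768)           # stand-in for 100 sentence embeddings
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)  # (768, 768) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# The eigenvectors for the 3 largest eigenvalues define the new axes.
top3 = eigvecs[:, -3:][:, ::-1]
reduced = X_centered @ top3             # (100, 3) projection onto those axes
```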
PCA reduces the dimension of the sentence embeddings to 3, which allows DocViz to visualize them on a graph. PCA is an adaptive data analysis technique - its output changes with the input. This is why you cannot make cross-graph comparisons with DocViz: the dimensions (axes) of each graph have different meanings.
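A tiny demonstration of this relativity, again on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
doc_a = rng.normal(size=(50, 768))  # stand-in embeddings for one input
doc_b = rng.normal(size=(50, 768))  # stand-in embeddings for another input

axes_a = PCA(n_components=3).fit(doc_a).components_
axes_b = PCA(n_components=3).fit(doc_b).components_

print(np.allclose(axes_a, axes_b))  # False: each fit defines its own axes
```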