doc/visual-programming/source/widgets/unsupervised/tsne.md
Lines changed: 27 additions & 15 deletions
@@ -6,51 +6,63 @@ Two-dimensional data projection with t-SNE.
6
6
**Inputs**
7
7
8
8
- Data: input dataset
9
+
- Distances: distance matrix
9
10
- Data Subset: subset of instances
10
11
11
12
**Outputs**
12
13
13
14
- Selected Data: instances selected from the plot
14
-
- Data: data with an additional column showing whether a point is selected
15
+
- Data: data with t-SNE coordinates and an additional column showing whether a point is selected
15
16
16
-
The **t-SNE** widget plots the data with a t-distributed stochastic neighbor embedding method. [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is a dimensionality reduction technique, similar to MDS, where points are mapped to 2-D space by their probability distribution.
17
+
The **t-SNE** widget creates a visualization using t-distributed stochastic neighbor embedding (t-SNE). [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is a dimensionality reduction technique, similar to MDS, where points are mapped to 2-D space so that the neighbor probability distribution of the projection resembles that of the original data.
18
+
19
+
The widget accepts either a data table or a distance matrix as input. If a data table is provided, the widget will apply the chosen preprocessing option, then calculate distances internally.
17
20
18
21

19
22
20
-
1.[Parameters](https://opentsne.readthedocs.io/en/latest/parameters.html) for plot optimization:
21
-
- measure of [perplexity](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Roughly speaking, it can be interpreted as the number of nearest neighbors to distances will be preserved from each point. Using smaller values can reveal small, local clusters, while using large values tends to reveal the broader, global relationships between data points.
23
+
1. Preprocessing is applied before t-SNE computes the distances between data points in the dataset. These parameters are ignored when the *Distances* input is provided.
24
+
- *Normalize data*: We can apply standardization before running PCA. Standardization normalizes each column by subtracting the column mean and dividing by the standard deviation.
25
+
- *Apply PCA preprocessing*: For datasets with large numbers of features, e.g. 100 or 1,000, or highly correlated variables, we can apply PCA preprocessing to speed up the algorithm and decorrelate the data.
26
+
- *PCA components*: the number of principal components to use when applying PCA preprocessing.
27
+
28
+
2. Optimization parameters. The parameters are explained in-depth [here](https://opentsne.readthedocs.io/en/latest/parameters.html):
29
+
- *Initialization*: PCA positions the initial points along principal coordinate axes. Spectral initialization calculates the spectral embedding of t-SNE's affinity matrix. Only spectral initialization is supported when using precomputed distance matrices.
30
+
- *Distance metric*: The distance metric to be used when calculating distances between data points. This setting is ignored when a precomputed distance matrix is provided.
31
+
- [*Perplexity*](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html): Roughly speaking, perplexity can be interpreted as the number of nearest neighbors to which distances will be preserved. Using smaller values can reveal small, local clusters, while using large values tends to reveal the broader, global relationships between data points.
22
32
- *Preserve global structure*: this option will combine two different perplexity values (50 and 500) to try to preserve both the local and global structure.
23
33
- *Exaggeration*: this parameter increases the attractive forces between points, and can directly be used to control the compactness of clusters. Increasing exaggeration may also better highlight the global structure of the data. t-SNE with exaggeration set to 4 is roughly equal to UMAP.
24
-
-*PCA components*: in Orange, we always run t-SNE on the principal components of the input data. This parameter controls the number of principal components to use when calculating distances between data points.
25
-
-*Normalize data*: We can apply standardization before running PCA. Standardization normalizes each column by subtracting the column mean and dividing by the standard deviation.
26
34
- Press Start to (re-)run the optimization.
27
-
2. Set the color of the displayed points. Set shape, size and label to differentiate between points. If *Label only selection and subset* is ticked, only selected and/or highlighted points will be labelled.
28
-
3. Set symbol size and opacity for all data points. Set jittering to randomly disperse data points.
29
-
4.*Show color regions* colors the graph by class, while *Show legend* displays a legend on the right. Click and drag the legend to move it.
30
-
5.*Select, zoom, pan and zoom to fit* are the options for exploring the graph. The manual selection of data instances works as an angular/square selection tool. Double click to move the projection. Scroll in or out for zoom.
31
-
6. If *Send selected automatically* is ticked, changes are communicated automatically. Alternatively, press *Send Selected*.
35
+
3. Set the color of the displayed points. Set shape, size and label to differentiate between points. If *Label only selection and subset* is ticked, only selected and/or highlighted points will be labelled.
36
+
4. Set symbol size and opacity for all data points. Set jittering to randomly disperse data points.
37
+
5. *Show color regions* colors the graph by class, while *Show legend* displays a legend on the right. Click and drag the legend to move it.
38
+
6. *Select, zoom, pan and zoom to fit* are the options for exploring the graph. The manual selection of data instances works as an angular/square selection tool. Double click to move the projection. Scroll in or out for zoom.
39
+
7. If *Send selected automatically* is ticked, changes are communicated automatically. Alternatively, press *Send Selected*.
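To make the *Perplexity* parameter described above more concrete: perplexity is commonly defined as 2 raised to the Shannon entropy of a point's neighbor probability distribution. The following is a minimal plain-Python sketch, not code from Orange or openTSNE; the function name `perplexity` and the example distributions are made up for illustration. It shows that a distribution spread evenly over k neighbors has a perplexity of k, which is why the value is read as an effective number of neighbors.

```python
import math


def perplexity(probabilities):
    """Return 2 ** H(P), where H(P) is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Hypothetical neighbor distributions for a single point.
uniform_over_5 = [0.2] * 5             # attention spread evenly over 5 neighbors
peaked = [0.97, 0.01, 0.01, 0.01]      # almost all weight on one neighbor

print(perplexity(uniform_over_5))      # close to 5
print(perplexity(peaked))              # only slightly above 1
```

A low perplexity therefore corresponds to each point attending to only a few close neighbors, while a high perplexity spreads attention over many points.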
32
40
33
41
Preprocessing
34
42
-------------
35
43
36
-
t-SNE uses default preprocessing if necessary. It executes it in the following order:
44
+
If necessary, t-SNE applies the following preprocessing steps by default, in the following order:
37
45
38
46
- continuizes categorical variables (with one feature per value)
39
47
- imputes missing values with mean values
40
48
41
49
To override default preprocessing, preprocess the data beforehand with the [Preprocess](../data/preprocess.md) widget.
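The two default steps listed above can be sketched in plain Python. This is only an illustration of the idea, assuming missing values are represented as `None`; the helper names `continuize` and `impute_mean` are hypothetical and do not correspond to Orange's actual implementation.

```python
def continuize(values, categories):
    """One-hot encode a categorical column: one 0/1 feature per category value."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def impute_mean(column):
    """Replace missing entries (None) with the mean of the known values."""
    known = [x for x in column if x is not None]
    mean = sum(known) / len(known)
    return [mean if x is None else x for x in column]


# Hypothetical toy columns.
colors = ["red", "green", "red"]
heights = [1.0, None, 3.0]

encoded = continuize(colors, ["green", "red"])  # step 1: continuize
filled = impute_mean(heights)                   # step 2: impute

print(encoded)
print(filled)
```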
42
50
51
+
The "Preprocessing" section also contains user-controllable options that are applied to a data table before distances are computed.
52
+
53
+
If a distance matrix is provided as input, preprocessing is not applied.
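As an illustration of the user-controllable *Normalize data* option, standardization subtracts the column mean and divides by the standard deviation. The sketch below uses only the Python standard library; the function name `standardize` is hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions, not a description of what Orange does internally.

```python
import statistics


def standardize(column):
    """Subtract the column mean and divide by the standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    if std == 0:
        # Assumption: a constant column carries no information, so map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical feature column.
scaled = standardize([1.0, 2.0, 3.0])
print(scaled)
```

After this step each column has zero mean and unit variance, so features measured on large scales do not dominate the distance computation.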
54
+
43
55
Examples
44
56
--------
45
57
46
58
The first example is a simple t-SNE plot of the *brown-selected* data set. Load *brown-selected* with the [File](../data/file.md) widget. Then connect **t-SNE** to it. The widget will show a 2D map of yeast samples, where samples with similar gene expression profiles will be close together. Select the region where the gene function is mixed and inspect it in a [Data Table](../data/datatable.md).
47
59
48
60

49
61
50
-
For the second example, use [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget from the Single Cell add-on to load *Bone marrow mononuclear cells with AML (sample)* data. Then pass it through **k-Means** and select 2 clusters from Silhouette Scores. Ok, it looks like there might be two distinct clusters here.
62
+
For the second example, use the [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget from the Single Cell add-on to load the *Bone marrow mononuclear cells with AML (sample)* data. We can use t-SNE to visualize the dataset. The t-SNE visualization shows that there indeed appear to be clusters of cells in our dataset.
51
63
52
-
But can we find subpopulations in these cells? Select a few marker genes with the [Marker Genes](https://orangedatamining.com/widget-catalog/bioinformatics/marker_genes/) widget, for example natural killer cells (NK cells). Pass the marker genes and k-Means results to [Score Cells](https://orangedatamining.com/widget-catalog/single-cell/score_cells/) widget. Finally, add **t-SNE** to visualize the results.
64
+
Let's try to determine which cluster of cells corresponds to natural killer cells (NK cells). The *Marker Genes* widget from the Single Cell add-on contains collections of known marker genes for different cell types. Select the markers for NK cells.
53
65
54
-
In **t-SNE**, use *Cluster* attribute to color the points and *Score* attribute to set their size. We see that killer cells are nicely clustered together and that t-SNE indeed found subpopulations.
66
+
We can then use the *Score Cells* widget to score how closely each of our cells matches these marker genes, and visualize the result in our t-SNE plot. We color and size the points according to the computed *Score*. The brightly colored, larger points correspond to cells that had high expression values of our marker genes. We can conclude that this upper-left cluster of cells corresponds to NK cells.