Labels

When your points are labeled it can be helpful to show these labels in the scatterplot.

You can draw text labels with Jupyter Scatter's label() function:

scatter.label(by='df_column_name')

The referenced column needs to be either categorical or contain strings. A single text label is drawn for all points that share the same value.

For instance, say we have the following DataFrame:

	x	y	cat	pval
0	0	0	A	0.51
1	1	0	A	0.80
2	1	1	A	0.25
3	0	1	A	0.01
4	2	2	B	0.65
5	3	2	B	0.99
6	3	3	B	1.33
7	2	3	B	0.01

We can label points using the cat column via scatter.label(by='cat').

When displaying labels, Jupyter Scatter automatically manages label collision and overcrowding. It uses an importance-based static placement strategy: when labels overlap, the ones with lower priority are shown only at a higher zoom level, once the collision is resolved. To handle many labels, Jupyter Scatter uses a tiling approach where only a limited number of labels (default: 100) are shown per tile.

You can control the label density with the max_number parameter. For instance, to show fewer labels per tile do scatter.label(by='cat', max_number=50).
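The per-tile limit can be pictured with a small sketch. This is not Jupyter Scatter's implementation: the `Label` tuple, the `limit_per_tile` helper, and the fixed tile grid are hypothetical and only illustrate the idea of keeping the `max_number` most important labels in each tile.

```python
from collections import defaultdict
from typing import NamedTuple


class Label(NamedTuple):
    text: str
    x: float
    y: float
    importance: float


def limit_per_tile(labels, tile_size, max_number):
    """Keep at most `max_number` labels per tile, preferring higher importance."""
    tiles = defaultdict(list)
    for label in labels:
        # Assign each label to the tile that contains its position
        key = (int(label.x // tile_size), int(label.y // tile_size))
        tiles[key].append(label)

    kept = []
    for tile_labels in tiles.values():
        tile_labels.sort(key=lambda label: label.importance, reverse=True)
        kept.extend(tile_labels[:max_number])
    return kept


labels = [
    Label('A', 0.5, 0.5, 4.0),
    Label('B', 0.6, 0.4, 2.0),
    Label('C', 0.7, 0.7, 3.0),
    Label('D', 2.5, 2.5, 1.0),
]
kept = limit_per_tile(labels, tile_size=1.0, max_number=2)
print(sorted(label.text for label in kept))
```

In this toy setup, labels A, B, and C share a tile, so only the two most important ones survive, while D is alone in its tile and is kept.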

INFO

For demo notebooks on how to use labels, see https://github.com/flekschas/jupyter-scatter-tutorial.

Importance

Whenever two text labels would collide, the label with the lower priority is hidden. You can specify the importance via the importance parameter as shown below. If no importance is specified, the number of points that share a label value is used instead.

scatter.label(by='cat', importance='pval')

Since importance values are point specific, we need to aggregate multiple values to derive the label importance. By default, Jupyter Scatter uses the mean but you can change this behavior to 'min', 'median', 'max', or 'sum'.

For instance, in the following we use the maximum point importance as the label importance.

scatter.label(by='cat', importance='pval', importance_aggregation='max')
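To see what the aggregation does, the following sketch reproduces it by hand for the example DataFrame above, using only the Python standard library. It does not call Jupyter Scatter; the `groups` and `aggregators` names are illustrative.

```python
from statistics import mean, median

# The `cat` and `pval` columns of the example DataFrame above
cats = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
pvals = [0.51, 0.80, 0.25, 0.01, 0.65, 0.99, 1.33, 0.01]

# Collect the point importances of each label
groups = {}
for cat, pval in zip(cats, pvals):
    groups.setdefault(cat, []).append(pval)

aggregators = {'min': min, 'mean': mean, 'median': median, 'max': max, 'sum': sum}

# Derive one importance value per label for every aggregation
for name, aggregate in aggregators.items():
    importances = {cat: round(aggregate(values), 4) for cat, values in groups.items()}
    print(name, importances)
```

With 'max', label A gets an importance of 0.80 and label B gets 1.33, so B would win a collision. With 'min', both labels tie at 0.01.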

Additionally, it's also possible to specify a custom aggregator function that takes as input an array of floats and must return a single float.
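As a sketch, such an aggregator can be any callable of that shape. The `percentile_90` function below is a hypothetical example, and passing it via `importance_aggregation` is an assumption based on the parameter shown above, so that call is left as a comment.

```python
import math


def percentile_90(values):
    """Return the 90th percentile (nearest-rank method) of the point importances."""
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least 90% of values at or below it
    rank = max(1, math.ceil(0.9 * len(ordered)))
    return float(ordered[rank - 1])


print(percentile_90([0.51, 0.80, 0.25, 0.01]))

# Assumed usage, not verified against the library:
# scatter.label(by='cat', importance='pval', importance_aggregation=percentile_90)
```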

TIP

If you want to count the points and use this as the importance, simply omit importance altogether as that's the default behavior.

Customization

Appearance

You can customize the appearance and placement of labels in various ways. The font, color, and size parameters allow you to adjust the font face, color, and size.

scatter.label(
  by='cat',
  font='arial bold',
  color='red',
  size=36,
)

By default, the label size is constant (i.e., zoom invariant) but you can also enlarge labels as you zoom in. To do this, set scale_function to "asinh", which enlarges the label using the inverse hyperbolic sine function.

scatter.label(
  by='cat',
  size=36,
  scale_function='asinh'
)

The inverse hyperbolic sine function is only applied when zooming in and increases the label size sublinearly compared to the camera zoom as follows:

label_scale = asinh(zoom_scale) / asinh(1)
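The formula can be evaluated directly to get a feel for how slowly labels grow. In the sketch below, returning 1 for zoom scales at or below 1 is an assumption derived from the statement that the function is only applied when zooming in.

```python
import math


def label_scale(zoom_scale):
    """Scale factor for the label size under the 'asinh' scale function."""
    if zoom_scale <= 1:
        # Assumption: no enlargement at or below the initial zoom
        return 1.0
    return math.asinh(zoom_scale) / math.asinh(1)


for zoom_scale in (0.5, 1, 2, 4, 8, 16):
    print(f'zoom {zoom_scale:>4}x -> label scale {label_scale(zoom_scale):.2f}x')
```

At a 2x zoom the labels are enlarged by roughly 1.64x, and at a 16x zoom by a little under 4x.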

INFO

Resolving collisions with inverse hyperbolic sine-scaled labels is computationally more involved than constant scaling. Hence, if you have many labels (i.e., >=1000) we recommend using constant scaling.

Position

You can also control the center position of labels, their alignment around this position, and offset using the positioning, align, and offset parameters.

scatter.label(
  by='cat',
  positioning='largest_cluster',
  align='top',
  offset=(2, 2),
)

Jupyter Scatter offers three positioning algorithms with different tradeoffs as outlined below. You may want to experiment to see which one works best for your specific use case.

Highest Density

The default positioning method ('highest_density') places the label at the point of highest density within the group. This algorithm:

Is fast
Calculates density based on how many points are clustered in each area
Places labels where the most points are concentrated
Works well for irregular clusters with varying densities
Gives okay results for many datasets

Center of Mass

The 'center_of_mass' method places the label at the geometric center of all points in the group:

Is fast
Calculates the center position using the Shoelace formula
Creates a balanced label position
Works well for single-cluster and evenly-distributed points
Not recommended when your labels consist of disconnected clusters
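For reference, the Shoelace-based centroid of a polygon can be sketched as follows. How Jupyter Scatter derives the polygon from a group of points is not stated here, so this example assumes the outline vertices are already known and given in order; `polygon_centroid` is a hypothetical helper.

```python
def polygon_centroid(vertices):
    """Centroid of a simple polygon whose vertices are given in order."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # Shoelace term for this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError('Degenerate polygon with zero area')
    # The centroid formula divides by 6 * area, i.e., 3 * twice_area
    return cx / (3 * twice_area), cy / (3 * twice_area)


# The four points of group A from the first example DataFrame
print(polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)]))
```

For the square formed by group A, the centroid is (0.5, 0.5), which is where the label would be centered under these assumptions.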

Largest Cluster

The 'largest_cluster' method identifies the largest sub-cluster within the group using HDBSCAN and places the label at its center of mass:

Slowest method
Detects clusters within each group
Places the label at the center of mass of the largest cluster
Works well when points naturally form multiple clusters
Typically results in the best label placement overall

INFO

"largest_cluster" is an optional feature that's not included in the default installation as it relies on HDBSCAN. To use it, install Jupyter Scatter via pip install "jupyter-scatter[all]".

Line Breaks

If your labels tend to be on the longer side, you might want to introduce line breaks. To make your life easier, you can specify a target aspect ratio, and Jupyter Scatter will then try to find optimal line breaks.

scatter.label(
  by='cat',
  target_aspect_ratio=5,
)

Point Labels

By default, all points with the same value in a column are grouped and given a single label. However, sometimes you may want to label each individual data point instead. Jupyter Scatter supports this through a special syntax: append an exclamation mark to the column name.

For example, with the following DataFrame, using scatter.label(by='city!') will label each individual city. Hence, even though there are two cities called "Berlin", they each get their own label.

	x	y	city
0	0.13	0.27	Paris
1	0.87	0.93	New York
2	0.10	0.25	Berlin
3	0.03	0.90	Rome
4	0.19	0.78	Tokyo
5	0.99	0.81	Berlin

scatter.label(by='city!')

This is useful when each data point represents a unique entity but the labels are not unique (like cities on a map).
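The difference between the two modes comes down to how rows are grouped. The following plain-Python sketch does not call Jupyter Scatter; it only counts how many labels each mode would produce for the city column above.

```python
# The `city` column of the example DataFrame above
cities = ['Paris', 'New York', 'Berlin', 'Rome', 'Tokyo', 'Berlin']

# by='city': one label per distinct value, so both Berlin rows share a label
group_labels = sorted(set(cities))

# by='city!': one label per row, so duplicates are kept
point_labels = list(enumerate(cities))

print(len(group_labels), 'group labels:', group_labels)
print(len(point_labels), 'point labels:', point_labels)
```

Grouping yields five labels, while point labels yield six, one for each row.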

INFO

Currently, only one column can be marked as a point label. If multiple columns are marked with exclamation marks, only the first one will be used as a point label.

Multiple and Hierarchical Label Types

Jupyter Scatter also supports multiple and even hierarchical label types. For instance, let's assume the DataFrame contains another categorical or string column.

	x	y	cat	sub	pval
0	0.13	0.27	A	A1	0.51
1	0.87	0.93	B	B1	0.80
2	0.10	0.25	A	A1	0.25
3	0.03	0.90	A	A2	0.01
4	0.19	0.78	B	B1	0.65

We can render out labels for both "cat" and "sub" as follows:

scatter.label(by=["cat", "sub"])

You can customize multiple labels in two ways. You can provide a list of values. For instance, to draw cat labels in black bold at 24px and sub labels in red italics at 18px, you can do:

scatter.label(
  by=['cat', 'sub'],
  font=['bold', 'italic'],
  color=['black', '#ff0000'],
  size=[24, 18],
)

If you want to be even more specific with settings, you can also pass a dictionary of <type>:<value> pairs. For instance, if you want all cat to be black but B to be green, you can do the following:

scatter.label(
  by=['cat', 'sub'],
  font=['bold', 'italic'],
  color={'cat': 'black', 'cat:B': 'green', 'sub': '#ff0000'},
  size=[24, 18],
)

When working with multiple label types, collisions are still resolved in order of importance such that the colliding labels with lower priorities appear only at higher zoom levels when they no longer collide with labels of higher importance. To adjust this behavior you can specify type-specific zoom ranges. Zoom ranges are declared as zoom levels where zoom_scale = 2 ^ zoom_level.

import math

scatter.label(
  by=['cat', 'sub'],
  zoom_ranges={'cat': (-math.inf, 2), 'sub': (2, 10)}
)

In the above example, cat labels are allowed to appear up until zoom level 2 and sub labels are allowed to appear from zoom level 2 onward.
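The relation between zoom levels and zoom scales can be made concrete with a short sketch. The `allowed_types` helper is hypothetical, and treating both range bounds as inclusive is an assumption.

```python
import math

zoom_ranges = {'cat': (-math.inf, 2), 'sub': (2, 10)}


def zoom_level(zoom_scale):
    """Convert a zoom scale to a zoom level (zoom_scale = 2 ** zoom_level)."""
    return math.log2(zoom_scale)


def allowed_types(zoom_scale):
    """Label types whose zoom range contains the given zoom scale."""
    level = zoom_level(zoom_scale)
    # Assumption: both bounds of a zoom range are inclusive
    return [t for t, (low, high) in zoom_ranges.items() if low <= level <= high]


for zoom_scale in (1, 16):
    print(f'{zoom_scale}x (level {zoom_level(zoom_scale):g}):', allowed_types(zoom_scale))
```

Zoom level 2 corresponds to a 4x zoom scale and zoom level 10 to a 1024x zoom scale, so at the initial view (1x) only cat labels are allowed, while at 16x only sub labels are.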

INFO

Note, zoom ranges do not enforce that labels are shown in that given range but rather they specify the allowed zoom range at which they can appear given the labels' importance and overlap with other labels.

If your label types describe a strict hierarchy, as is the case in the example above, then you can set hierarchical to True. For hierarchical label types, Jupyter Scatter automatically enforces that labels at a lower hierarchical level (e.g., cat) are shown before labels at a higher hierarchical level (e.g., sub).

scatter.label(
  by=['cat', 'sub'],
  hierarchical=True
)

For instance, in the example above, irrespective of the labels' importance, cat labels will be shown before sub labels if they collide. Non-colliding labels might be shown at the same time.

Exclude Labels

Sometimes you do not want to show all labels. For instance, when clustering points it's common to label unclear points as noise. You can exclude unwanted labels using the exclude parameter. For instance, in the following we exclude the label B.

scatter.label(by=['cat'], exclude=['B'])

When using multiple label types, you need to specify excluded labels via <type>:<label>. For instance, in the following we exclude the sub label A2:

scatter.label(by=['cat', 'sub'], exclude=['sub:A2'])

Precompute Labels

When working with many labels it can take a moment to compute the labels. If you want to use the same labels in multiple scenarios, it can be beneficial to precompute labels and later load them from file.

To precompute labels, use the LabelPlacement class. It accepts many of the same parameters as the Scatter's label function.

labels = LabelPlacement(data=df, x='x', y='y', by='cat')
labels.compute()

TIP

For tracking the progress while precomputing very large label sets you can show a progress bar via labels.compute(show_progress=True). This feature requires the complete installation via pip install "jupyter-scatter[all]".

Once the label placement has been precomputed, you can persist it to disk as Parquet files.

labels.to_parquet('my_labels')

Later on you can then recreate the label placement instance.

labels = LabelPlacement.from_parquet('my_labels')

Importantly, you can pass this label placement instance directly to your scatter instance.

scatter.label(using=labels)

INFO

The label placement class is responsible for statically resolving label collisions by determining at which zoom level labels should appear. This calculation is based on a tile size parameter, which defaults to 256 × 256 pixels.

It's important to understand that font size and zoom ranges are relative to this tile size. When using Scatter's label() function directly, the tile size defaults to height × height (the widget's height), which means labels are optimized for the initial view.

When precomputing labels with a different tile size than your widget's height, you may notice differences in how labels appear:

If you specify a smaller tile size (e.g., 100) for a taller visualization (e.g., height=200), labels using the 'asinh' scale function will appear larger than expected in the initial view.
With the 'constant' scale function, font size remains consistent regardless of tile size.

For most reliable results when precomputing labels for various display sizes, the 'constant' scale function is recommended.
