7. Seismogram atlas#
We have tested PCA and ICA for reducing the dimensions of the scattering coefficients and clustering tasks. We will now investigate Uniform Manifold Projection and Approximation (UMAP) for similar purposes. PCA and ICA preserve quite well the pair-wise distances of data points at the costs of losing information about the local structure. Because they preserve the distances, the techniques are often used for clustering tasks later on. UMAP basically learns the high-dimensional manifold of the data and maps it into a lower dimensional space. By doing so, it preserves much better the local structure but at the costs of global structures, meaning that distances between distant data points might be distorted. Therefore, more care has to be taken if we perform clustering in the UMAP space.
In the literature, these UMAP spaces are called atlases, since they resemble a map of the data such as the activation atlas of a convolutional neural network. Therefore, we decided to call the UMAP spaces of a continuous seismogram a “seismogram atlas”. Seismogram atlases were introduced in Steinmann et al. 2023.
Note: This notebook requires the installation of the matplotlib, sklearn and umap-learn libraries, which are not included in the default installation of the scatseisnet library. To install them, please visit the matplotlib website, the scikit-learn website and the umap-learn website.
[13]:
import pickle
from matplotlib import pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
import umap
import obspy
plt.rcParams["date.converter"] = "concise"
%config InlineBackend.figure_format = "svg"
Load scattering coefficients#
First, we load the scattering coefficients and reshape them for any dimensionality reduction model (here UMAP) of the scikit-learn package. The shape of the scattering coefficients are given by the following tuples:
Order 1:
(n_times, n_channel, octaves[0] * resolution[0])Order 2:
(n_times, n_channel, octaves[0] * octaves[1] * resolution[0] * resolution[1])Order n:
(n_times, n_channel, np.prod(octaves) * np.prod(resolution))
We then need to collect the all-order scattering coefficients into a two-dimensional matrix for the UMAP application.
[14]:
# Load data from file
with np.load(
"../example/scattering_coefficients.npz", allow_pickle=True
) as data:
order_1 = data["order_1"]
order_2 = data["order_2"]
times = data["times"]
# Reshape and stack scattering coefficients of all orders
order_1 = order_1.reshape(order_1.shape[0], -1)
order_2 = order_2.reshape(order_2.shape[0], -1)
scattering_coefficients = np.hstack((order_1, order_2))
# transform into log
scattering_coefficients = np.log10(scattering_coefficients)
# print info about shape
n_times, n_coeff = scattering_coefficients.shape
print("Collected {} samples of {} dimensions each.".format(n_times, n_coeff))
Collected 4319 samples of 504 dimensions each.
[15]:
# Load the corresponding time vector
with np.load(
"../example/independent_components.npz", allow_pickle=True
) as data:
times = data["times"]
Apply UMAP#
After loading and stacking the scattering coefficients into a matrix, we can now apply UMAP to the scattering coefficient matrix. The n_components keyword argument informs the model about how many components we want to extract. Since we want to visualize an atlas, we extract 2 components. Other important hyperparameters are n_neighbors and min_dist, which control the local versus global structure preservation. We will work with the default parameters.
[16]:
umap_model = umap.UMAP(n_components=2, random_state=42)
embedding = umap_model.fit_transform(scattering_coefficients)
# Visualize the UMAP results
fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1], s=1)
ax.set_aspect("equal")
/Users/seydoux/GitHub/scatseisnet/.venv/lib/python3.13/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
warn(
Interpret the atlas#
The seismogram atlas depicts two distinct clusters, which is very different from the two-dimensional ICA and PCA visualizations in notebook 4. Many cluster algorithms will be able to identify these two distinct structures, so we will use the basic k-means clustering.
[17]:
# clustering task
N_CLUSTERS = 2
cluster_model = KMeans(n_clusters=N_CLUSTERS)
predictions = cluster_model.fit_predict(embedding)
# show clustering results in the atlas
fig, ax = plt.subplots()
for label in np.unique(predictions):
where = label == predictions
ax.scatter(embedding[where, 0], embedding[where, 1], s=1)
ax.set_aspect("equal")
[18]:
SMOOTH_KERNEL = 20
# Convert predictions to one-hot encoding
one_hot = np.zeros((len(times), N_CLUSTERS + 1))
one_hot[np.arange(len(times)), predictions] = 1
# Plot the results
fig, ax = plt.subplots(figsize=(6, 6))
# Plot each cluster as a separate line
for i in range(N_CLUSTERS):
# Obtain the detection rate by convolving with a boxcar kernel
detection_rate = (
np.convolve(one_hot[:, i], np.ones(SMOOTH_KERNEL), mode="same")
/ SMOOTH_KERNEL
)
# Plot the detection rate
ax.plot(times, one_hot[:, i] + i, alpha=0.5)
ax.plot(times, detection_rate + i, color="black")
# Labels
ax.set_xlabel("Time")
ax.set_ylabel("Cluster index")
plt.show()
Evaluate clustering results#
Our k-means clustering in the seismogram atlas separates very sharply the tremor from the non-tremor period. The results look much cleaner than the k-means clustering in the 4-dimensional feature space of ICA from notebook 6.