Dimensionality reduction

Abstract

Modern network measurements may contain facets that are hard to discern. Metrics may be derived from basic, measurable entities; for example, quality might be a function of delay, loss, environment, access technology, background noise and so on. System delays may be coupled, and network load and server response times may be correlated. Endpoints, moving users and link types affect feedback timers. Mechanisms such as link-layer retransmissions are often in effect, with subtle consequences. Buffers in multiservice networks can act to trade off loss against delay. Discerning the quantifiable network facets present in such complex datasets is therefore our goal.

Dimensionality reduction is a key technique in data processing, but is not (yet) heavily used within the networking community. We therefore evaluated four dimensionality reduction techniques: i) Principal Component Analysis, ii) Kernel PCA, iii) Linear Discriminant Analysis, and iv) t-Distributed Stochastic Neighbour Embedding.

We measured network attributes while the most-viewed boxing match was streamed from three globally accessible servers. We recorded the network and server delays and the packet loss. The access technologies were WiFi and Ethernet, used from two environments, home and work. The four algorithms were evaluated in terms of clustering/separation visualisation and CPU/memory performance.

Introduction

Dimensionality reduction has been widely applied in many fields. Reduction techniques are used to lower the amount of data from the original dataset, leaving only the statistically relevant components in the processed output. Dimensionality reduction also improves the performance of classification algorithms, as it removes noisy, irrelevant data points, and it allows for improved visualisation of complex measurement datasets. In the ML community, dimensionality reduction is also known as feature extraction or feature selection.

In large sets of network data, we also want to quantify the major statistical contributors and their interaction. This interaction can be broadly considered as the covariance between the measured entities. In a networked setting, covariance can arise as coupled delays, which we will explore in Section [coupled].

Measurement data typically has many thousands of data points, but not many explicit features. Implicit features, often hidden, such as the network load, are embedded in the measurements. The captured environment constitutes an important feature of networked measurements. In contrast, the number of features in image processing can be in the tens of thousands, but with fewer implicit features and data points.

Our contribution is a performance and visualisation evaluation of four dimensionality reduction techniques using measurement data from a popular sporting event.

Four techniques

Rationale and evaluation

We chose PCA as it is ubiquitous in dimensionality reduction, computationally efficient, linear and parameterless. It also has many variants, kernel, probabilistic and discriminant among them; see the dedicated PCA book @Jolliffe:1986. Kernel PCA performs the analysis in a high-dimensional space, using a kernel function, to find the principal components, see Table [complexity-table]. LDA is closely related to principal component analysis (PCA) and factor analysis in that they all look for linear combinations of variables that best explain the data. t-SNE is attractive due to its visualisation effectiveness, and it has seen recent performance improvements @t-SNE.


| Tech. | Type | Parametric | Parameters | Computation | Memory |
|-------|------|------------|------------|-------------|--------|
| PCA | Linear | No | – | $O(D^3)$ | $O(D^2)$ |
| kPCA | Non-lin. | Yes | $k(\cdot, \cdot)$ | $O(N^3)$ | $O(N^3)$ |
| LDA | Linear | No | – | – | – |
| t-SNE | Non-lin. | Yes | $Perp(\cdot)$ | $O(N^2)$ | $O(N^2)$ |

[complexity-table]


Principal Component Analysis (PCA)

PCA is a linear, variance-based method for reducing the dimensionality of data. It uses an orthogonal transformation to convert a dataset of correlated variables into a set of sorted values of linearly uncorrelated variables, called the principal components @Pearson01. The transformation ensures the 1st component has the largest variance, the 2nd the next highest variance, and so on. The number of principal components is always less than or equal to the number of original features or measurement attributes. Dimensionality reduction allows features that contribute little to the dataset to be removed. A PCA visualisation is shown in Figure [pca_example].
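To make this concrete, the following is a minimal sketch of such a projection using scikit-learn (an assumed tool; the feature layout and the random placeholder data are illustrative, not the paper's measurement code):

```python
# Minimal PCA sketch, assuming scikit-learn and rows of measurements such as
# [net_delay_ms, serv_delay_ms, loss]. Data below are random placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(1000, 3))   # placeholder measurements

X_white = StandardScaler().fit_transform(X)   # whiten: zero mean, unit variance
pca = PCA(n_components=2)                     # keep the two strongest components
Y = pca.fit_transform(X_white)                # N x 2 projection for a scatter plot

print(pca.explained_variance_ratio_)          # share of variance per component
print(pca.components_)                        # loadings of each original feature
```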

Kernel PCA (kPCA)

Kernel PCA is an extension of PCA using kernel methods. In linear algebra, the kernel (or null space) of a mapping is the set of vectors in its domain that map to the zero vector; it is key when inverting a matrix, a fundamental operation. Let $T : V \rightarrow W$ be a linear transformation between vector spaces: $$\ker(T) = T^{-1}(0) = \{v \in V \mid Tv = 0\}.$$ In machine learning, kernel functions are used in SVMs and in kernel density estimates, a continuous function representing a discrete histogram. In kPCA a kernel function $k(\cdot, \cdot)$ replaces the inner products, which nonlinearises the otherwise linear dimensionality reduction method.
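A comparable sketch for kPCA, again assuming scikit-learn; the RBF kernel and its $\gamma$ value are illustrative choices, not the kernel used in our evaluation:

```python
# Minimal kernel PCA sketch with an RBF kernel k(.,.). The two synthetic
# groups and gamma are placeholder assumptions for illustration.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 3)),
               rng.normal(5, 1, (500, 3))])   # two placeholder groups

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
Y = kpca.fit_transform(X)                     # non-linear 2-D embedding
```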

Linear Discriminant analysis (LDA)

LDA explicitly attempts to model the difference between the classes of data. PCA, in contrast, does not take any class difference into account, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis also differs from factor analysis in that it is not an interdependence technique: a distinction between independent and dependent variables is made. LDA works when the measurements made on the independent variables for each observation are continuous quantities; we additionally make use of categorical independent variables such as the access labels (Ethernet/WiFi and Home/Work). A comparison is shown in Fig. [pca_vs_lda].
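A minimal LDA sketch along the same lines; the label encoding of the access categories and the placeholder data are assumptions for illustration only:

```python
# Minimal LDA sketch (scikit-learn). Labels stand in for the categorical
# access variables (e.g. 0 = Home/WiFi, 1 = Work/Ethernet); both data and
# encoding are illustrative placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (500, 3)),
               rng.normal(2, 1, (500, 3))])    # placeholder measurements
y = np.array([0] * 500 + [1] * 500)            # class label per measurement

lda = LinearDiscriminantAnalysis(n_components=1)
Y = lda.fit_transform(X, y)                    # projection maximising class separation
```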

T-distributed stochastic neighbourhood embedding (t-SNE)

t-SNE is a non-linear algorithm for clustering high dimensional data by projecting each measurement point onto a lower dimensional (LD) scatter plot @VanDerMaaten:2014. In such a low dimensional visualisation, the similarities and differences of the high dimensional measurement space become visible, which allows data exploration and analysis.

*1. High dimensional measurement space*

- $X$: High dimensional space
- $x_{i}$: Measurement point
- $P$: Probability distribution of points
- $p_{ij}$: Probability of points $x_i$ and $x_j$ being selected
- $p_{i|j}$: Probability of point $x_i$ being selected, given $x_j$
- $\sigma_i$: The variance around point $x_i$
- $N$: Number of measurement points

*2. Low dimensional visualisation map*

- $Y$: Low dimensional map
- $y_{i}$: Visualisation point in LD map
- $Q$: Probability distribution of points in LD map
- $q_{ij}$: Probability of points $y_i$ and $y_j$ being selected
- $q_{i|j}$: Probability of point $y_i$ being selected, given $y_j$

*3. Hyperparameters*

- $Perp(\cdot, \cdot)$: Perplexity (data variance, user preference)
- $\eta$: Learning rate
- $\alpha(t)$: Momentum function

*4. Helper functions*

- $Z(\cdot)$: Normalisation function
- $C(\cdot)$: Cost function
- $H(\cdot)$: Entropy
- $KL(\cdot, \cdot)$: Kullback-Leibler distance function

*5. Indexing and counters*

- $T$: Number of iterations
- $i, j$: Indices of chosen points (LD and HD)
- $k, l$: Indices of all (other) points (LD and HD)

The LD map is optimised by shifting points using the well-known gradient descent, a 1st-order iterative algorithm. The t-SNE procedure is shown in Algorithm 1.

Algorithm 1 (t-SNE):

Data: $X = \{x_1, x_2, \dots, x_n\}$; Parameters: $Perp$, $T$, $\eta$, $\alpha(t)$; Result: $Y^{(T)} = \{y_1, y_2, \dots, y_n\}$

1. Compute $p_{j|i}$ using the data and $Perp$
2. Set $p_{ij} \gets \frac{p_{j|i} + p_{i|j}}{2N}$
3. Sample $Y^{(0)} \gets \{y_1, y_2, \dots, y_n\}$ from $N(0, 10^{-4}I)$
4. For $t = 1$ to $T$:
   a. Compute LD affinities $q_{ij}$
   b. Compute gradient $\frac{\partial C}{\partial Y}$
   c. $Y^{(t)} \gets Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\,(Y^{(t-1)} - Y^{(t-2)})$

[tsne-alg]
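In practice one rarely implements Algorithm 1 directly; a sketch using scikit-learn's TSNE is given below (an assumed tool), where the perplexity, learning rate and random initialisation loosely correspond to $Perp$, $\eta$ and $Y^{(0)}$. The data are placeholders.

```python
# Minimal t-SNE sketch mirroring Algorithm 1: perplexity = Perp (20, as used
# in our plots), learning_rate = eta. Data below are random placeholders.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))                  # placeholder HD measurements

tsne = TSNE(n_components=2, perplexity=20,      # Perp
            learning_rate=200.0,                # eta
            init="random", random_state=0)      # random Y^(0)
Y = tsne.fit_transform(X)                       # LD map for visualisation
```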

Dataset – fight of the century {#dataset}

A boxing match between Floyd Mayweather (USA) and Manny Pacquiao (Philippines) took place in Las Vegas in May 2015. Ten hours of measurements were taken, comprising 40K data points. The three selected endpoints are shown in Table [Boxing-sites].

|  | **Philstar** | **Showtime** | **Sky** |
|---|---|---|---|
| Domain | philstar.com | sho.com | sky.com |
| Role | News & entertain. portal | US cable & entertain. network | UK-based entertain. network |
| Server location | Arizona, USA | Amsterdam, NL | Amsterdam, NL |
| CDN provider | Own (on prem) | Akamai | Akamai |
| Time difference | -7 | +1 | +1 |

[Boxing-sites]

traceroute was used as a reachability tool to record the number of Internet hops before and after the event. IP delays were recorded with ping, and the front-end server responses with httping, which records the round-trip times of GET requests to the remote webservers. We also recorded the packet losses as well as the access locations and technologies. 'Academic' is a node on the Swedish academic network, which peers with networks around Europe and the US and is rarely overloaded^2. 'Home' is a student residence, prone to occasional overloads plus frequent 802.11 interference.
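A rough sketch of such a collection loop is shown below. It assumes the system traceroute, ping and httping binaries with their common Linux flags, and is illustrative only, not the actual measurement scripts used.

```python
# Sketch of a periodic measurement loop (assumed tooling: traceroute, ping,
# httping; flags are the common Linux ones and may differ per platform).
import csv
import subprocess
import time

SITES = ["philstar.com", "sho.com", "sky.com"]

def run(cmd):
    """Run an external tool and return its stdout ('' on timeout or if missing)."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=60).stdout
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return ""

with open("measurements.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for site in SITES:
        hops = run(["traceroute", site])                            # hop count
        ip   = run(["ping", "-c", "5", site])                       # network (IP) delay
        http = run(["httping", "-c", "5", "-g", f"http://{site}/"]) # server delay
        writer.writerow([time.time(), site, hops, ip, http])        # parsed offline
```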

| # | Feature | Philstar | Showtime | Sky |
|---|---------|----------|----------|-----|
| 1 | #hops | 12 | 8 | 9 |
| 2a | Net. delay | 196$\pm$36 ms | 29$\pm$2 ms | 2$\pm$3 ms |
| 2b | Serv. delay | 418$\pm$128 ms | 79$\pm$130 ms | 24$\pm$62 ms |
| 3 | Loss | 3% | 0% | 0% |
| 4a | Acad. over Eth. | 418$\pm$128 ms | 79$\pm$130 ms | 24$\pm$62 ms |
| 4b | Acad. over WiFi | 634$\pm$315 ms | 154$\pm$75 ms | 62$\pm$456 ms |
| 5a | Home over Eth. | 351$\pm$227 ms | 75$\pm$105 ms | 19$\pm$86 ms |
| 5b | Home over WiFi | NA | 108$\pm$203 ms | 48$\pm$221 ms |
| 5c | All active | 536$\pm$301 ms | 245$\pm$312 ms | 115$\pm$250 ms |

[Boxing-sites-choose]

Example – coupled delays {#coupled}

Delays arising from a network and a server interact. A loaded network results in longer response times from a server, and a busy server produces longer end-to-end latency over the network. From an external measurement perspective, these delays might be indistinguishable, and their contributions change over time. Systems theory, stability, and coupled systems have a rich engineering history @PhysRevLett.80.2109. As the results section shows, the major artefact is the delay and its variance. In Figure [2sites_httping_ping] we see how both delays increase towards the event, the variance at the server increases, and the systems are indeed coupled.


[2sites_httping_ping]

Clearly, the average delay is due to the network; the variance in the delay, however, is due to the server response. To quantify the contributions of the network and the server, a scatter biplot of the delay values in the PCA space is shown in Figure [PCA_biplot].

[httping_ping]

[PCA_biplot]

From the whitened (mean removed and standard deviation normalised) data, with the components overlaid, one can see the contribution of the two delay components. Both are positive, and the contribution of the server (HTTP) delay is four times that of the network (IP) delay. The angle between the eigenvectors indicates their correlation.
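The computation behind such a biplot can be sketched as follows; the synthetic delay series merely mimic coupled ping (IP) and httping (HTTP) delays and are not our data.

```python
# Sketch of the biplot computation: whiten the two delay series and inspect
# the principal components' loadings. The delay arrays are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
ip_delay   = rng.gamma(2.0, 15.0, 5000)                  # placeholder network delays (ms)
http_delay = 2 * ip_delay + rng.gamma(2.0, 30.0, 5000)   # coupled server delays (ms)

X = StandardScaler().fit_transform(np.column_stack([ip_delay, http_delay]))
pca = PCA(n_components=2).fit(X)

# Each row of components_ is an eigenvector; one common biplot convention
# scales these loadings by the square root of the explained variance.
print(pca.components_)
print(pca.explained_variance_ratio_)
```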

Results

Using dimensionality reduction techniques it is possible to shed light on complex measurement paths and varied environments. Where the entries of the covariance matrix are similar (network 'similarity'), the points in the visualisation will be close or even overlap. Where there is a significant difference in the measurement space, this will be visible. Note, the data is whitened, meaning the distributions are shifted and scaled to have zero mean and unit covariance. No information is lost, and whitening is necessary for comparing the techniques presented here. Figures [PCA_boxing_results] and [t-SNE_boxing] show PCA and t-SNE applied to our data; all measurement attributes in Table [Boxing-sites] were used, along with combinations of the environment and technology from Table [Boxing-sites-choose]. We ran measurements from each environment and technology simultaneously, in order to correlate path, geography, and access.

Visualisation effectiveness

Given that visualisation is inherently subjective, quantifying its effectiveness is not possible; indeed, the perplexity parameter in t-SNE controls the visual impact for the user. We had 691k measurement points for the boxing event[^3]. Not all are shown in Figures [PCA_boxing_results] and [t-SNE_boxing], but all were used for the execution times in Table [perf].

PCA: The measurements expose each server clearly. 1. Three distinct groupings exist; for the Showtime and Sky servers the measurement spaces are similar, indicated by their relative closeness in the visualisation map, while the Philstar grouping is distinct from the other two. 2. The network distance and varying delays are responsible for these artefacts, as Philstar was the non-CDN hosted site; by inspecting the covariance matrix we can see the larger component is from the delay and the delay variance, see Section [coupled]. 3. The Philstar server has both orthogonal (uncorrelated) and non-orthogonal (correlated) projections; this can be seen in the points labelled 'Phi' perpendicular to the 45$^\circ$ gradient of the measurement points. Note, no explicit classification is done at this stage.

Using clustering to identify natural groupings automatically, both PCA and t-SNE would adequately reveal, in two dimensions, the three natural groupings present in the datasets. The points away from the central tendency in PCA, at the edges, were the higher delays. In this work we are not looking for particular artefacts. Categorical variables can be taken into account by LDA and t-SNE, at the cost of extra complexity and the burden of parameter choice. t-SNE also does some clustering as part of its low dimensional layout algorithm, through the perplexity parameter, which is important in terms of visualisation.


[PCA_boxing_results]

t-SNE: We used a perplexity of 20 in Figure [t-SNE_boxing], which shows clear separation. The method of selecting points according to their neighbours results in clearer separation. A central mass, according to the Student-t distribution, could be identified either by eye or by a clustering method such as K-means. Note that, at the edges of the groups, there is some bleeding of measurement points into neighbouring clusters. These are not necessarily errors, especially in the Showtime and Sky measurement datasets, but should be looked at in more detail.
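A possible sketch of that clustering step, assuming scikit-learn's KMeans and a placeholder low dimensional map standing in for the t-SNE output:

```python
# Sketch of identifying the central masses with K-means on an LD map.
# Y_ld is a placeholder for the 2-D t-SNE output; k = 3 reflects the three
# server groupings in our data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
Y_ld = np.vstack([rng.normal(c, 1.0, (300, 2)) for c in (0, 6, 12)])  # placeholder map

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(Y_ld)        # cluster label per measurement point
centres = kmeans.cluster_centers_        # central mass of each group
```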

[t-SNE_boxing]

Runtime performance and memory usage

Although the algorithms are envisaged to be used in an offline manner, performance is relevant. As far as we know, vanilla PCA is the only one of these algorithms currently being considered for integration into a streaming framework such as Flink, and it is therefore suitable for online reduction/classification.

| Tech. | Wall clock time (s) | Norm. time | Actual mem. | Norm. mem. |
|-------|---------------------|------------|-------------|------------|
| Read | – | – | 233 Kb | – |
| PCA | 0.15 | – | 233 Kb | 1 |
| kPCA | 55.62 | – | 21.45 Mb | – |
| LDA | – | – | – | – |
| t-SNE | 44.96 | – | 56.46 Mb | – |

[perf]
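Figures such as these can be obtained along the following lines: a sketch using only the standard library's timer and tracemalloc, where the dataset size is a placeholder and tracemalloc only approximates native allocations.

```python
# Sketch of measuring wall-clock time and peak memory of one reduction run.
# Dataset size is a placeholder, not the 691k measurement points.
import time
import tracemalloc
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(6).normal(size=(50_000, 6))

tracemalloc.start()
t0 = time.perf_counter()
PCA(n_components=2).fit_transform(X)
wall = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()   # (current, peak) in bytes
tracemalloc.stop()

print(f"wall clock: {wall:.2f} s, peak memory: {peak / 1e6:.1f} MB")
```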

Related work {#related}

Machine Learning. Dimensionality reduction survey articles include [JMLR:v16:cunningham15a, Maaten08:dimred, sorzano2014survey], one of which is by the author of t-SNE. An approach that attempts to address the problems of PCA-like approaches is Sammon mapping @sammon1969nonlinear. It alters the cost function of classical scaling by dividing the squared error in the representation of each pairwise Euclidean distance by the original Euclidean distance in the HD space. Performance improvements of t-SNE have been suggested in [VanDerMaaten:2014].

Measurements. Internet measurements by Crovella and Krishnamurthy consider embedding network measurements into high dimensional metric spaces for analysis [opac-b1119782]. RIPE Atlas is a project measuring European Internet quality as part of its traffic and network interaction initiative; 9400 probes had been deployed as of Jan. 2020. The Archipelago Measurement Infrastructure uses 150 globally deployed Raspberry Pis to quantify connectivity and congestion using IPv4/IPv6 active measurements [claffyARK]; this is achieved by a crafted sequence of pings along selected network paths.

Discussion

Before we delve into details of this work, it is important to remember that dimensionality reduction is often used as a precursor to classification.

  1. Method comparison: The distance metrics for PCA and t-SNE are not the same; PCA uses a Euclidean space, whereas t-SNE does not in the visualisation space. PCA deals with labelled data less well, and users have to parameterise their data. PCA is an unsupervised learning technique, it does not use class information, whilst LDA is a supervised technique and uses class information. One can therefore expect LDA to perform better, at higher complexity.
  2. Size Vs. Dimensionality: Networked measurements often gather many data points ($N$). To some degree they can be handled by the computing system (see Volume earlier). Dimensionality ($D$) mandates methods such as those here. ($N$, $D$) can, and should, be chosen to make the tradeoffs for the computational resources available, using values such as those in Table [complexity-table].
  3. Performance: t-SNE produces good visualisations and does not need a post-classification step, although it is slow for large data sets. We have not tried the later algorithmic developments (Barnes-Hut). For fast grouping we suggest using PCA, or one of its derivatives.
  4. Parameterisation: PCA is parameter-free given the data, and with it we have seen acceptable visualisation and performance. t-SNE relies on some parameterisation; different perplexities produce different visualisations. We did not experiment extensively, however we found a perplexity of 20 (from the range 5-50) to be acceptable. Recent visualisation tools^4 instruct users on how to use t-SNE and how to choose this important parameter.
  5. Model performance: Can one use kPCA? If we want 'good data separation' then we need to measure it on the training data and use that to find the best kernel, so we want another, class-unrelated, measure of model performance.

Conclusions

We have motivated, explained and compared four dimensionality reduction techniques for the analysis of network data in this relatively short paper. These techniques have thus far been largely overlooked in the networking community, which tends to resort to multiple 2D plots or summary statistics; in other communities, however, such visualisation is commonplace and useful. The techniques preserve the measurement correlations, but project them down into two dimensions that are easier to comprehend. They even help resolve complex dependencies, as shown in the example of coupled delays.