Artist attribution results
The results of ensemble ML of 100 different trained networks for attribution using patches of side-length 200 pixels (10 mm) are shown in Fig. 2. Each patch is color coded according to the most likely artist (highest probability), with the opacity of the shading proportional to the magnitude of that probability (i.e., more transparent shadings correspond to more uncertain attributions). Out of 180 patches for each artist in the test painting, we found 12, 0, 2, and 14 patches attributed incorrectly for artists 1 through 4, respectively. This amounts to 28 errors among 720 patches, an overall accuracy of 96.1%, compared with the 25% accuracy expected by random choice among four artists. Further, we find that most of the patches were attributed with high confidence (more opaque shading) for all four artists. The accuracy of ML prediction from the height data is remarkable, particularly given the similarity of the patches in terms of features distinguishable to the human eye (Fig. 1B), as well as its success in broad monochrome areas of the painted background.
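The ensemble step and the accuracy arithmetic above can be illustrated with a minimal sketch. It assumes that the ensemble combines networks by averaging their per-patch class probabilities, which the text does not specify, and it reads the reported error counts as the number of each artist's test patches that were misattributed. The function name `ensemble_attribution` and the array layout are hypothetical.

```python
import numpy as np

def ensemble_attribution(probs):
    """Combine per-network predictions by averaging probabilities (an assumption).

    probs: array of shape (n_networks, n_patches, n_artists) holding each
    network's class probabilities for each patch.
    Returns the most likely artist per patch and the associated probability.
    """
    mean_probs = probs.mean(axis=0)        # average over the networks
    labels = mean_probs.argmax(axis=1)     # most likely artist per patch
    confidence = mean_probs.max(axis=1)    # could drive the shading opacity
    return labels, confidence

# Accuracy arithmetic from the counts reported in the text.
errors = np.array([12, 0, 2, 14])          # misattributed patches, artists 1-4
patches_per_artist = 180
total_patches = patches_per_artist * len(errors)       # 720
overall_accuracy = 1 - errors.sum() / total_patches    # 692 / 720, about 0.961
per_artist_fraction_correct = 1 - errors / patches_per_artist
```

Under this reading, the fraction of correctly attributed patches is lowest for artists 1 and 4 (168/180 and 166/180) and highest for artist 2 (180/180).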
Exploring the effect of patch size on attribution accuracy
The surprisingly accurate attribution of 10 mm patches leads to a natural question: how does the size of the patch affect the machine’s ability to attribute correctly? In other words, can we make the patch size smaller than 10 mm and still reliably attribute the hand? Fig. 3 presents results for networks trained on patches with different side-lengths ranging from 10 pixels (0.5 mm) to 1200 pixels (60 mm). The predictions are quantified in terms of overall accuracy for all four artists (solid thick curve) and individual artist F1 score (thin colored curves). We also calculated precision and recall; results are shown in Additional file 1: Fig. S3. To check the self-consistency of the predictions, we conducted repeat training/testing trials at each patch size (details in Additional file 1). The data points and error bars in Fig. 3 represent the mean and standard deviation over those trials.
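As a rough illustration of the two ingredients of this experiment, the sketch below cuts a height map into square patches of a chosen side-length and computes a per-artist F1 score from true and predicted labels. It assumes non-overlapping tiling, which the text does not state, and the helper names `tile_patches` and `f1_per_class` are hypothetical. The pixel pitch follows from the text (200 pixels correspond to 10 mm).

```python
import numpy as np

MM_PER_PIXEL = 10.0 / 200.0  # from the text: 200 pixels correspond to 10 mm

def tile_patches(height_map, side):
    """Cut a 2D height map into non-overlapping side x side patches.

    Rows and columns that do not fill a whole patch are discarded
    (an assumption; the original tiling scheme is not described).
    """
    rows = height_map.shape[0] // side
    cols = height_map.shape[1] // side
    trimmed = height_map[:rows * side, :cols * side]
    blocks = trimmed.reshape(rows, side, cols, side).swapaxes(1, 2)
    return blocks.reshape(-1, side, side)

def f1_per_class(y_true, y_pred, n_classes=4):
    """F1 = 2*TP / (2*TP + FP + FN) for each class; 0.0 if undefined."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return np.array(scores)
```

With non-overlapping tiling, the number of patches per painting falls roughly as the inverse square of the side-length, which is the reason very large patches leave few training examples.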
The accuracy exhibits a broad plateau around 95% for patches between 100 and 300 pixels (5 and 15 mm). Below 100 pixels there is a gradual drop-off in accuracy, as each individual patch contains fewer of the distinctive features that facilitate attribution. The F1 scores allow us to separate out the network performance for each artist. Consistent with the results in Fig. 2, the attribution is generally better for artists 2 and 3 versus 1 and 4 across patch sizes less than 300 pixels (15 mm). Nonetheless, the F1 scores for all artists are above 90% near the optimal patch size (around 200 pixels or 10 mm).
On the other end of the patch size spectrum, the ML approach faces a different challenge. Because fewer patches can be cut from each painting, the training sets become quite small, even though each individual patch contains many informative features. The single-network accuracy drops off quickly for patch sizes above 300 pixels (15 mm), decreasing to about 75% at the largest sizes.
Predictions using single-pixel information versus spatial correlations
One of the hallmarks of CNNs is their ability to harness spatial correlations at various scales in an input image in order to make a prediction. However, there is also information present at the single-pixel level, since each artist’s height data will have a characteristic distribution relative to the mean. The probability densities for these distributions are shown in Fig. 4, calculated from the two paintings in the training sets of each artist. The height distributions are all single-peaked and similar in width, except for Artist 1, who exhibits a broader tail at heights below the mean than the others. In order to determine how important spatial correlations are, we can compare the CNN results to an alternative attribution method that is blind to the correlations: maximum likelihood estimation (MLE). For a given patch in the testing set, we calculate the total likelihood for the height values of every pixel in the patch belonging to each of the four distributions in Fig. 4. The patch is attributed to the artist with the highest likelihood. The predictive accuracy of the MLE approach versus patch size is shown as a dashed line in Fig. 3. We expect MLE to perform best at the largest patch sizes, since each patch then gives a larger sampling of the height distribution, and hence is easier to assign. Indeed, at the patch size of 1200 pixels (60 mm), representing nearly a fifth of the area of a single painting, the MLE accuracy approaches 70%, comparable to the CNN accuracy. In this limit the size of the training set is likely too small for the CNN to effectively learn correlation features. As the patch size decreases, the gap between the CNN and MLE performance grows dramatically. In the range of 100–300 pixels (5–15 mm) where the CNN performs optimally (~95%), the MLE accuracy is only around 40%. These small patches are an insufficient sample of the distribution to make accurate attributions based on single-pixel height data alone.
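The MLE baseline described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it estimates each artist's height density with a histogram, treats the pixels of a patch as independent draws, and adds a small floor to avoid taking the logarithm of zero. The function names, the binning, and the floor value are all hypothetical.

```python
import numpy as np

def fit_height_densities(training_heights, bin_edges, floor=1e-12):
    """Estimate a log-density of mean-subtracted heights for each artist.

    training_heights: list of 1D arrays, one per artist.
    Returns an array of shape (n_artists, n_bins) of log-densities.
    The histogram estimator and the floor are assumptions of this sketch.
    """
    log_densities = []
    for heights in training_heights:
        density, _ = np.histogram(heights, bins=bin_edges, density=True)
        log_densities.append(np.log(density + floor))
    return np.stack(log_densities)

def mle_attribute(patch, log_densities, bin_edges):
    """Attribute a patch to the artist with the highest total log-likelihood.

    Pixels are treated as independent, so spatial correlations are ignored.
    """
    idx = np.digitize(patch.ravel(), bin_edges) - 1      # bin index per pixel
    idx = np.clip(idx, 0, log_densities.shape[1] - 1)    # clamp out-of-range heights
    total_log_likelihood = log_densities[:, idx].sum(axis=1)
    return int(total_log_likelihood.argmax()), total_log_likelihood
```

Because the total log-likelihood is a sum over pixels, a larger patch samples the height distribution more thoroughly, which matches the expectation that this baseline improves with patch size.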
Clearly the CNN is taking advantage of spatial correlations in the surface heights. This leads to a natural next question: what correlation length scales are involved in the attribution decision?
Using empirical mode decomposition to determine the length-scales of the brushstroke topography
In order to examine the spatial frequency (length) scales most important in the ML analysis, we employed a preprocessing technique used historically in time-series signal analysis called empirical mode decomposition (EMD) [17], which has recently been extended into the spatial domain [18,19,20]. Its versatility is derived from its data-driven methodology, relying on unbiased techniques for filtering data into intrinsic mode functions (IMFs) that characterize the signal’s innate frequency composition [21]. In our case, we have used a bi-directional multivariate EMD [22] to split our 3D reconstruction of each painting’s complex surface structure into IMFs that characterize the various spatial scales present.
The first IMF contains the smallest length scale textures, and subsequent IMFs contain larger and larger features until the sifting procedure is halted and a residual is all that remains. This process is lossless in the sense that by adding each IMF and the resulting residual together, the entire signal is preserved [17, 21]. It is also unbiased in the sense that when compared to standard Fourier analysis techniques, there are no spatial frequency boundaries to define, and no edge effects introduced from defining those boundaries.
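The lossless property can be written compactly. In the notation below, which is introduced here for illustration and does not appear in the original text, $h(x, y)$ is the measured height map, $c_k$ is the $k$-th IMF, $K$ is the number of IMFs extracted before sifting is halted, and $r$ is the residual:

```latex
h(x, y) = \sum_{k=1}^{K} c_k(x, y) + r(x, y)
```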
By investigating each series of IMFs individually, we can estimate the length scale for each as follows. We use a standard 2D fast-Fourier transform on the IMF and calculate a weighted average frequency for the modes. The length scale is the inverse of the average frequency and is plotted versus IMF number in Fig. 5B. Among the four artists, the typical scale increases from about 0.2 mm for IMF 1 to 0.8 mm for IMF 5. Figure 5A shows a sample patch and the corresponding IMFs, illustrating the progressive coarsening for the higher numbered IMFs. To see how the length scale affects the attribution results, we repeated the CNN training using each IMF separately, rather than the height data. The resulting mean accuracies for each IMF using three different patch sizes are shown in Fig. 5C. Individual IMFs are by construction less informative than the full height data (which is a sum of all the IMFs), and hence we do not reach the 95% level of accuracy seen in the earlier CNN results. However, IMFs 1 and 2 (the smallest length scales) achieve accuracies of above 80% at patch size 10 mm (200 pixels). There is a drop-off in accuracy as we go to larger length scales (IMFs 3–5), indicating that the salient information used for attribution is present at length scales of 0.2–0.4 mm. These are comparable to the dimensions of a single bristle in the two types of brushes used by the artists, 0.25 and 0.65 mm, shown as dashed lines in Fig. 5B. This strongly suggests that the key to this attribution using height data lies at scales that are small enough to reflect the unintended (physiological) style of the artist. This result is consistent with the scale-dependent ML results depicted in Fig. 3, which indicate that even below a patch size of 5 mm, all accuracies remain well above that expected for random attribution (25%). Remarkably, even at the scale of 0.5 mm, that is, the scale of 1–2 bristle widths, ML was able to attribute to 60% accuracy.
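The length-scale estimate described above (a 2D FFT of an IMF followed by a weighted average frequency) can be sketched as below. The text does not say which weights are used, so this sketch assumes the weights are the spectral power at each mode and that frequency means the radial spatial frequency. The function name is hypothetical, and the default pixel pitch of 0.05 mm follows from 200 pixels corresponding to 10 mm.

```python
import numpy as np

def characteristic_length_mm(imf, mm_per_pixel=0.05):
    """Estimate a length scale (mm) as the inverse of the mean spatial frequency.

    Assumption of this sketch: each Fourier mode is weighted by its power.
    """
    imf = imf - imf.mean()                          # remove the zero-frequency offset
    power = np.abs(np.fft.fft2(imf)) ** 2
    if power.sum() == 0:
        raise ValueError("IMF is constant; no length scale can be defined")
    fy = np.fft.fftfreq(imf.shape[0], d=mm_per_pixel)   # cycles per mm along rows
    fx = np.fft.fftfreq(imf.shape[1], d=mm_per_pixel)   # cycles per mm along columns
    radial = np.hypot(fy[:, None], fx[None, :])         # radial frequency of each mode
    mean_frequency = (radial * power).sum() / power.sum()
    return 1.0 / mean_frequency

# Usage: a pure sinusoid with an 8-pixel period has a 0.4 mm wavelength.
x = np.arange(64)
stripes = np.tile(np.sin(2 * np.pi * x / 8), (64, 1))
length = characteristic_length_mm(stripes)
```

For a single sinusoid the estimate recovers the wavelength; for a broadband IMF it gives one representative scale, and a different choice of weights would shift the value somewhat.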
Comparing topography versus photography when testing on data with novel characteristics
Image recognition by ML is most often performed on photographic images of the subject, represented by arrays corresponding to the RGB channels of the entire image. The aim of this test is to determine how well CNNs perform at attributing the patches of the images depicted in row A of Fig. 2 as compared to the profilometry data. We were particularly interested in how well ML of the two types of data (photo and height-based) would perform if the testing set had novel colors and subject matter, absent in the training set. This approach better approximates the challenges of real-world attribution, where we would not necessarily have extensive well-attributed training data matching the palette and content of the regions of interest in a painting where the algorithm would be applied. To generate qualitatively distinct training and testing sets, we divided each painting into patches of side-length 100 pixels (5 mm) and then sorted the patches into three categories (background, foreground, and border) depending on the color composition of each patch (see Fig. 6A for an example). Among our training set, 25% of the patches are assigned to background, 50% count as foreground, and the remaining 25% are border patches (Fig. 6B). The latter include regions of both background and foreground and were excluded from both training and testing to make it more challenging for the algorithm to generalize from one category to the other. The mostly dark green and black color palette and lack of defined subjects in the background distinguish it from the foreground, which is dominated by the painted flower, with various shades of yellows and reds. Could a network trained on only background patches still accurately attribute foreground patches, or vice versa? The mean accuracy results are shown in Fig. 6C, with the left two bars corresponding to training on the background and testing on the foreground, and the right two bars to the reverse scenario.
Because the training sets are significantly smaller (and less representative of the test sets) than in our earlier analysis, we expect lower attribution accuracies. Despite this, networks trained on the height data (blue bars) perform reasonably well, achieving 60% accuracy when trained on background, and 80% when trained on foreground. (We note that the background training set is about half the size of the foreground one.) In contrast, networks trained on the photo data (red bars) performed significantly worse, achieving 27% and 43% accuracies, respectively. Clearly, in this context the color and subject information in the photo data, which was likely the focus of the ML training, was a hindrance, since the test set confronted the network with novel colors and subject matter. The height data, by contrast, capture a significant, small-scale, stylistic component that is present whether the artist is painting the foreground or background, and that can therefore be harnessed for attribution.
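The sorting of patches into background, foreground, and border can be illustrated with a short sketch. The actual color criterion is not given in the text, so everything specific here is an assumption: the rule that a pixel counts as background when its red channel is dim (dark greens and blacks have little red, whereas yellows and reds have a lot), the thresholds, and the function names are all hypothetical.

```python
import numpy as np

def is_background_pixel(rgb_patch, max_red=0.3):
    """Hypothetical rule: a pixel is background if its red channel is dim.

    rgb_patch: array of shape (height, width, 3) with values in [0, 1].
    """
    return rgb_patch[..., 0] < max_red

def categorize_patch(rgb_patch, pure_fraction=0.95):
    """Label a patch by the fraction of its pixels that look like background.

    Patches that are almost entirely one kind are 'background' or
    'foreground'; mixed patches are 'border'. The 0.95 cut-off is an
    assumption of this sketch.
    """
    fraction = is_background_pixel(rgb_patch).mean()
    if fraction >= pure_fraction:
        return "background"
    if fraction <= 1.0 - pure_fraction:
        return "foreground"
    return "border"

def cross_category_split(patches, labels, train_on="background"):
    """Train on one category, test on the other; border patches are dropped."""
    test_on = "foreground" if train_on == "background" else "background"
    train = [p for p, c in zip(patches, labels) if c == train_on]
    test = [p for p, c in zip(patches, labels) if c == test_on]
    return train, test
```

The same category labels would be applied to the co-registered height patches, so that the photo-based and height-based networks are trained and tested on identical regions.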