Image-based metric heritage modeling in the near-infrared spectrum

Digital photogrammetry and spectral imaging are widely used in heritage sciences towards the comprehensive recording, understanding, and protection of historical artifacts and artworks. The availability of consumer-grade modified cameras for spectral acquisition, as an alternative to expensive multispectral sensors and multi-sensor apparatuses, along with semi-automatic software implementations of Structure-from-Motion (SfM) and Multiple-View-Stereo (MVS) algorithms, has made the combination of those techniques more feasible than ever. In the research presented here, the authors assess image-based modeling from near-infrared (NIR) imagery acquired with modified consumer-grade cameras, with applications to tangible heritage. Three-dimensional (3D) meshes, textured with the non-visible data, are produced and evaluated. Specifically, metric evaluations are conducted through extensive comparisons with models produced with image-based modeling from visible (VIS) imagery and with structured light scanning, to check the accuracy of results. Furthermore, the authors observe and discuss how the implemented NIR modeling approach affects the surface of the reconstructed models and may counteract specific problems that arise from lighting conditions during VIS acquisition. The radiometric properties of the produced results are evaluated, in comparison to the respective results in the visible spectrum, for their capacity to enhance observation towards the characterization of the surface and under-surface state of preservation and, consequently, to support conservation interventions.


Multi-view image recording
Over the last decade, rapid advancements have taken place in passive sensors for 3D recording, workflows for swift data acquisition, automatic or semi-automatic software implementing image-based reconstruction algorithms, and computational systems for processing large datasets. As a result, the use of 3D image-based modeling technologies has become common for various aspects of heritage science.
Multi-view image recording [1,2] has prevailed as a low-cost, efficient, and easily implementable technique for cultural heritage documentation, interpretation, dissemination, and protection [3][4][5]. Multi-view image approaches facilitate the digitization of tangible heritage with reduced needs for supervision and expertise. At the same time, they enable the production of accurate and high-resolution results with images from relatively low-cost digital cameras. These approaches differ from traditional close-range photogrammetry in that they allow the use of oblique imagery and the simultaneous estimation of the internal and external orientation camera parameters, without the need for a prior definition of control points of known coordinates on a reference system [6,7]. However, the implementation of control points is recommended during orientation for more accurate results and is mandatory for georeferencing. For applications that only require scaling in a local coordinate system, a simple, measured distance can be adequate. The processing pipeline starts with feature detection and description on every image of the dataset. Then follows a Structure-from-Motion (SfM) implementation to determine camera positions in 3D space and the coordinates of the scene, producing a sparse point cloud. Given the camera orientations, dense image matching algorithms enable further densification of the point cloud, as almost every pixel of the scene is reconstructed in 3D, a procedure typically named Multiple-View-Stereo (MVS) or dense stereo-matching. Later, these dense point clouds can be transformed into textured models via surfacing algorithms and texture mapping.
Heritage applications of terrestrial multi-view image-based modeling and texturing vary considerably [8,9]. This technique has been implemented towards high-resolution recording and evaluation of the state of preservation of stone [10], wood [11], and painted artworks [12]. However, unless measurable geometrical changes have occurred on the object, close-range photogrammetry with visible imagery has little to offer for the assessment of degradation on its own.

Near-infrared imaging and modeling
In the context of many heritage analyses, and especially for the study of polychromatic artworks and the investigation of surface or undersurface deterioration of historical artifacts, the use of visible-spectrum textures is often not adequate. Near-infrared imaging has often been explored towards this direction [13,14], with sensors that employ complementary metal-oxide-semiconductors (CMOS) based on InGaAs (indium gallium arsenide, 750-1700 nm) or PtSi (platinum silicide, 750-5000 nm) detectors, developed in the 1990s [15][16][17].
Recently, the use of commercial digital cameras employing CCD (charge-coupled device) and CMOS sensors, modified for near-infrared or full-spectrum acquisition (combined with external NIR filters), has become common in heritage science. They constitute lower-cost, high-resolution alternatives with spectral imaging capabilities, which retain user-friendly features and interfaces to a broad variety of photographic accessories and software [18][19][20][21][22].
The availability of high-resolution, easily operated digital cameras for NIR acquisition, in combination with SfM/MVS image-based modeling techniques, has made 3D spectral modeling feasible for heritage applications. Contemporary research, which showed promising results [22][23][24][25][26][27][28], motivated us to experiment further.

Research aims
In this work, we evaluate the use of imagery from consumer-grade digital reflex cameras modified for near-infrared imaging, in combination with image-based 3D reconstruction techniques, to produce high-fidelity models of tangible heritage assets textured with spectral information. To perform a thorough evaluation of this combined approach, we acquired rigid datasets of images for case studies of archaeological importance with varied geometry and characteristics, and kept the acquisition parameters constant for the visible and non-visible spectra involved. We produced 3D meshes utilizing software of different algorithmic implementations and compared the data produced by each method in terms of surface deviation. Furthermore, we explored the usability of the produced spectral textures. The principal aims of our research were to objectively quantify the quality of the 3D models produced by this integrated spectral modeling method, and to assess its applicability for different heritage case studies. In addition, we attempted to understand how this method can potentially enhance the preservation of surface detail on the reconstructed 3D digital models. Finally, we aimed to evaluate the capacity of the produced near-infrared textures to augment observations towards the characterization of the state of preservation.

Case studies
To evaluate the presented approach, which combined multi-view photogrammetric modeling and near-infrared spectral imaging (from modified consumer cameras), we applied it to four historical artwork case studies. The objects to be studied were chosen due to their dissimilar dimensions, shapes, surface roughness, reflectivity and environmental conditions of preservation (Fig. 1). The fourth, and last, case study was a small 19th century religious stone sculpture of Christ crucified (0.31 × 0.22 m²) from Castello di Casotto (Garessio, Province of Cuneo, Piedmont) owned by the Region of Piedmont. The Casotto Castle was originally a Carthusian monastery, later acquired by the Savoy and transformed into a castle and hunting lodge by Carlo Alberto.

Instrumentation, software and tools
NIR images were collected using a Nikon D810 professional camera modified for near-infrared acquisition ($1650) and a low-cost Canon EOS Rebel SL1 camera modified for full-spectrum acquisition ($500). The D810 is a Digital Single-Lens Reflex (DSLR) camera that employs a full-frame CMOS sensor (35.9 × 24 mm², 4.88 μm pixel size), with a maximum resolution of 36.3 effective Megapixels. It was modified by the company MaxMax LDP, which replaced the internal NIR-cut filter (that would not allow acquisition in the near-infrared spectrum) with an internal NIR-pass filter, essentially transforming the camera into an imaging sensor sensitive only in the near-infrared spectrum. The Rebel SL1 is a DSLR camera that employs an APS-C CMOS sensor (22.3 × 14.9 mm², 4.38 μm pixel size), with a maximum resolution of 18.0 effective Megapixels. It was modified by the company LifePixel, which removed the internal NIR-cut filter, thus making the camera sensitive to a spectrum wider than the visible, approximately between 280 nm and 1400 nm. For the Rebel SL1 camera, we used an external NIR-pass (700-1400 nm) filter to capture the near-infrared images. VIS images were captured using a Canon EOS 1200D compact DSLR that employs an APS-C CMOS sensor (22.3 × 14.9 mm², 4.40 μm pixel size), with a maximum resolution of 18.0 effective Megapixels. The modified Canon EOS Rebel SL1, described above, was additionally used in combination with an external VIS-pass filter. For the Nikon camera, a 24 mm prime lens ($160) was used. For the Canon cameras, an 18-55 mm zoom lens ($50) was used. In order to avoid camera microshake effects and to produce better quality photos, the cameras were mounted on a tripod. For all photo-editing operations, we used Adobe Lightroom Classic.
Multi-view image-based reconstruction was conducted with Agisoft Metashape Professional 1.5.1 (AMP) for the first case study, and additionally with 3DFlow Zephyr Aerial 4.519 (FZA) [29] for the rest of the case studies, in order to also compare the performance of different algorithmic implementations with NIR imagery. Both commercial software packages are based on the SfM/MVS approach. Agisoft Metashape Professional employs SIFT-like detection and description, then calculates approximate camera locations and uses global bundle adjustment to refine them, followed by a type of MVS disparity calculation for dense reconstruction and Screened Poisson surface reconstruction. 3DFlow Zephyr Aerial employs a modified Difference-of-Gaussian (DoG) detector and a combination of Approximate Nearest Neighbor Searching, M-estimator Sample Consensus and Geometric Robust Information Criterion for matching. It then uses hierarchical SfM and incremental adjustment, and dense MVS reconstruction with fast visibility integration and tight disparity bounding. Finally, in FZA, meshing with an edge-preserving algorithmic approach was selected, to differentiate from AMP.
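To illustrate the idea behind M-estimator Sample Consensus (MSAC), named above among the matching components, the following minimal sketch contrasts it with plain RANSAC scoring. It is a generic illustration and not the implementation used in Zephyr Aerial; the residual values, the threshold, and the function names are hypothetical. RANSAC only counts inliers, whereas MSAC also penalizes each inlier by its squared residual, so it can rank two hypotheses that have the same inlier count.

```python
def ransac_inliers(residuals, threshold):
    """RANSAC score: number of residuals below the threshold (higher is better)."""
    return sum(1 for r in residuals if abs(r) < threshold)


def msac_cost(residuals, threshold):
    """MSAC cost: squared residual for inliers, constant penalty threshold^2
    for outliers (lower is better)."""
    t2 = threshold * threshold
    return sum(min(r * r, t2) for r in residuals)


# Hypothetical matching residuals (in pixels) for two candidate models.
threshold = 1.0
hypothesis_a = [0.1, -0.2, 0.1, 5.0]
hypothesis_b = [0.8, -0.9, 0.7, 5.0]

for name, res in (("A", hypothesis_a), ("B", hypothesis_b)):
    print(name, ransac_inliers(res, threshold), round(msac_cost(res, threshold), 2))
```

Both hypotheses have three inliers, so RANSAC cannot separate them, while the MSAC cost favors hypothesis A because its inliers fit more tightly.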

Data acquisition
For the first two case studies, lighting conditions were not optimal for acquiring photogrammetric datasets. The marble statue was illuminated non-uniformly by various spotlights. Thus, it was decided not to use flash, as it resulted in more shadows at the occluded areas instead of reducing them. The Coromandel lacquerware was also under non-uniform lighting, with a glaring effect visible on optical photos at the upper part, and flash could not be used because of the highly reflective nature of the lacquer. For the rest of the case studies, we were able to appropriately utilize the built-in flash of the cameras, thus eliminating shadows, to acquire imagery datasets with as homogeneous radiometric characteristics as possible. An X-Rite ColorChecker® Classic with 24 colors was used for color balancing, utilizing middle gray for the visible-spectrum datasets. Scaling was performed with an invar scale bar of 1.000165 m (± 15 nm), except for the small stone sculpture, where we used the dimensions of the wooden cross as reference. Additionally, small targets were placed on the base and body of the marble statue to facilitate further metric comparisons.
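Scaling a reconstruction with a scale bar amounts to multiplying all model coordinates by the ratio between the known bar length and the length measured between the bar's end points in the unscaled model. The sketch below illustrates this under stated assumptions: the end-point coordinates are hypothetical model-space values, and the function names are ours. In practice, this step is performed inside the SfM/MVS software.

```python
import math


def scale_factor(known_length_m, end_a, end_b):
    """Ratio between the real bar length and its length in model units."""
    return known_length_m / math.dist(end_a, end_b)


def apply_scale(points, factor):
    """Uniformly scale a list of 3D points about the origin."""
    return [tuple(factor * c for c in p) for p in points]


# Known scale bar length from the text; end points are hypothetical.
bar_length_m = 1.000165
end_a, end_b = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)

s = scale_factor(bar_length_m, end_a, end_b)
scaled = apply_scale([end_a, end_b, (1.0, 1.0, 1.0)], s)
print(s, math.dist(scaled[0], scaled[1]))
```

After scaling, the distance between the two bar end points equals the known bar length, which is a simple consistency check.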
Dense acquisition of images was planned in such a manner as to acquire rigid datasets with large overlaps (> 80%) from a very close range. Furthermore, we maintained the capturing conditions (internal parameters, external orientation parameters, and lighting) constant between VIS and NIR spectra, taking into consideration the different sensors employed and the surrounding conditions to obtain comparable results. For the first case study, we were not able to use exactly the same camera positions for VIS and NIR imaging, but we marked and maintained a constant distance from the object in both cases and approximately the same angles between each position. For case study 4, we used a turntable during the acquisition, and we rotated the object four times after placing different sides towards the camera. Very low ISO values were used to prevent sensor luminance noise, simultaneously maintaining exposure under the clipping limit value. All images were acquired in RAW format. Acquisition conditions are summarized in Table 1. The Ground-Sampling Distances (GSDs) that we present in Table 1 refer to an approximate average of the size of the pixel on each object, which was calculated inside the SfM/MVS based software.
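As a rough illustration of how the GSD and the image overlap relate to acquisition geometry, the sketch below uses the standard pinhole approximation (GSD = pixel size × distance / focal length) for a flat object facing the camera. The 500 mm capturing distance is a hypothetical value chosen for illustration and is not taken from Table 1; the pixel size, sensor width, and focal length are those of the Nikon D810 setup described earlier. The GSDs reported in Table 1 were calculated inside the SfM/MVS software, so this is only an approximate cross-check.

```python
def gsd_mm(pixel_size_um, distance_mm, focal_length_mm):
    """Approximate size of one pixel on the object, in mm (pinhole model)."""
    return (pixel_size_um / 1000.0) * distance_mm / focal_length_mm


def station_spacing_mm(sensor_width_mm, distance_mm, focal_length_mm, overlap):
    """Largest sideways shift between neighbouring camera stations that
    still gives the requested overlap fraction between the two images."""
    footprint_mm = sensor_width_mm * distance_mm / focal_length_mm
    return footprint_mm * (1.0 - overlap)


distance_mm = 500.0  # hypothetical capturing distance

print(round(gsd_mm(4.88, distance_mm, 24.0), 4), "mm per pixel")
print(round(station_spacing_mm(35.9, distance_mm, 24.0, 0.80), 1), "mm between stations")
```

Under these assumptions, a higher overlap requirement or a shorter capturing distance directly reduces the allowed spacing between stations, which is why close-range datasets with more than 80% overlap contain many images.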
A STONEX F6 SR structured light scanner was used to perform control surveys of the case studies, thus creating reference models of the objects to perform metric comparisons. The F6 SR scanner has an accuracy of 90 μm at 250 mm distance, an effective range of 250-500 mm and a resolution of 0.4 mm at 250 mm distance. All models produced by scanning were down-sampled to match the density of the photogrammetric models.
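The text does not state which down-sampling method was applied to the scans. One common option is voxel-grid down-sampling, sketched below as an assumption rather than as the procedure actually used: all points falling into the same cubic cell are replaced by their centroid, so the cell size controls the resulting point density. The function name and the sample coordinates are hypothetical.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same cubic cell by their centroid."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        cells[key].append(p)
    return [
        tuple(sum(coords) / len(group) for coords in zip(*group))
        for group in cells.values()
    ]


# Hypothetical scan points in mm; a 0.4 mm cell merges the first two points.
scan = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (1.2, 0.0, 0.0)]
reduced = voxel_downsample(scan, voxel_size=0.4)
print(len(scan), "->", len(reduced), "points")
```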

Data processing
Image-based reconstruction from visible and near-infrared imagery followed a semi-automatic SfM/MVS workflow, standard for large-scale archaeological photogrammetric applications [30,31], which was very similar in both software packages, despite considerable differences in the algorithmic implementations. The images were masked to exclude the unwanted areas of each scene. In the last case study, out-of-focus (blurry) areas on the stone sculpture were also masked to increase the quality of imagery, thus reducing noise levels and processing times. Between each stage of the reconstruction, thorough visual checks were performed to determine quality. Then, denoising procedures were followed in an identical manner for all the produced meshes, so as not to reduce the comparability of results.
The 3D models were semi-automatically generated in a four-step process. The first step included a sparse reconstruction of each scene, with a simultaneous approximate calculation of the cameras' relative orientation at the moment of image acquisition, and autocalibration, with SfM approaches. For this step, the selected accuracy and density parameters were the highest available in both software packages. The sparse point clouds were cleaned according to reprojection errors and local cluster distances with statistical filtering. During the second step, the results were densified by employing MVS stereo-matching algorithms. The third step consisted of meshing the dense 3D point clouds into triangular surfaces (3D Delaunay algorithm). The produced meshes were then cleaned of small unconnected components and spikes. The final step was the application of texture mapping to obtain single-file high-resolution textures from the original photographs. Given the high quality of the original imagery, we limited color balancing and blending between images to reduce the possibility of radiometric errors. The processing parameters in both software packages were selected to minimize surface noise and to optimize the final textured results. However, at the same time, we attempted to keep parameter values as similar as possible between software and spectra, to ascertain the metrological validity of the conducted research. In choosing the resolution of textures, we considered sampling distances to be at least two or three times higher than the original pixel sizes.
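Statistical filtering of a point cloud by local distances can be sketched as follows. This is a generic brute-force illustration, not the filter implemented in either software package: each point's mean distance to its k nearest neighbours is computed, and points whose value exceeds the global mean by more than a chosen number of standard deviations are discarded. The parameters k and n_sigma and the sample coordinates are hypothetical.

```python
import math
import statistics


def statistical_outlier_filter(points, k=3, n_sigma=2.0):
    """Keep points whose mean distance to their k nearest neighbours is not
    more than n_sigma standard deviations above the global mean."""
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    limit = statistics.mean(mean_knn) + n_sigma * statistics.pstdev(mean_knn)
    return [p for p, m in zip(points, mean_knn) if m <= limit]


# Hypothetical sparse cloud: a 3 x 3 planar grid plus one stray point.
grid = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
stray = (50.0, 50.0, 50.0)
cleaned = statistical_outlier_filter(grid + [stray])
print(len(grid) + 1, "->", len(cleaned), "points")
```

The brute-force neighbour search is quadratic in the number of points, so a spatial index would be needed for real sparse clouds.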

Results and discussion
With the intention of evaluating the photogrammetric results, we recorded in detail the processing times, reconstruction errors and volumes of the results, catalogued in Table 2. Detailed geometric comparisons were performed to assess surface deviations among visible and near-infrared models, as well as among software, by computing distances between the vertices of the final meshes. Comparisons were also performed with models produced with the F6 SR scanner. These tasks were performed in the open-source software CloudCompare with the Cloud-to-Cloud Hausdorff distances tool.
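The principle of a cloud-to-cloud comparison can be sketched in a few lines: for every vertex of the compared model, the distance to the nearest vertex of the reference model is computed, and the distances are summarized as mean ± standard deviation, with the maximum giving the one-sided Hausdorff distance. The brute-force search below is only an illustration of the principle and not CloudCompare's implementation; the vertex coordinates and function names are hypothetical.

```python
import math
import statistics


def cloud_to_cloud(compared, reference):
    """Nearest-neighbour distance from each compared vertex to the reference cloud."""
    return [min(math.dist(p, q) for q in reference) for p in compared]


def summarize(distances):
    """Mean, standard deviation and maximum (one-sided Hausdorff distance)."""
    return statistics.mean(distances), statistics.pstdev(distances), max(distances)


# Hypothetical vertices in mm: the compared model sits slightly above the reference.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
compared = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.3), (2.0, 0.0, 0.2)]

mean_d, std_d, max_d = summarize(cloud_to_cloud(compared, reference))
print(f"{mean_d:.2f} mm ± {std_d:.2f} mm (max {max_d:.2f} mm)")
```

Because the distance is measured to the nearest vertex rather than to the underlying surface, the result depends on the density of the reference cloud, which is one reason for comparing models of matched density.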
In the case study of the marble statue (Fig. 2), both spectral scenarios resulted in full digital reconstructions of the scene. Near-infrared image-based modeling produced sparser initial results than the visible spectrum modeling; however, dense results were of the same volumes. The NIR mesh contained a very low level of surface noise, especially in areas with fewer overlapping images, although the surface details had been equally preserved. In terms of geometric differences, the two models produced with Metashape Pro differed by 0.7 mm ± 0.6 mm (Fig. 3), meaning a differentiation of less than four times the GSD (1 RMS) or less than 2.5% of the smallest dimension of the object. In both cases, the distances between photogrammetric models and the mesh produced from the F6 SR scanner were in the range of 0.8 ± 0.7 mm. Thus, considering the evaluated precision of the method, the models can be considered metrically identical and could be interchangeable for geometric recording and visualization on 1:5-1:10 scales, which are very common for detailed applications of decay mapping. Texturing results were of the same quality. As is visible, shadows could not be effectively eliminated in either scenario.
Observations on the 3D textured results for the marble statue (Fig. 4) showed that, generally, higher near-infrared intensities correspond to healthier areas of the marble's surface, whereas lower intensities correspond to more deteriorated areas [32]. Thus, the NIR three-dimensional textured mesh could be used to roughly visualize in 3D the levels of decay on the statue. Undeniably, complete elimination of shadows would help optimize these results. We should further underline that under no circumstance could this visualization substitute for detailed chemical characterization. However, for small, homogeneously shadowed areas of the same material, it can provide a fast approximation of the state of preservation and quantification of decay.
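A rough decay visualization of this kind can be thought of as a classification of NIR texture intensities into a few levels. The sketch below is purely illustrative: the thresholds, the class labels, and the 8-bit intensity values are hypothetical and would have to be set per material and lighting condition. It does not reproduce the procedure used for the figures.

```python
def decay_level(nir_intensity, thresholds=(85, 170)):
    """Map an 8-bit NIR intensity to a rough decay class.
    Lower intensity is treated as more deteriorated; thresholds are hypothetical."""
    low, high = thresholds
    if nir_intensity < low:
        return "high decay"
    if nir_intensity < high:
        return "moderate decay"
    return "low decay"


def decay_map(texture_rows):
    """Classify every texel of a small grayscale texture given as rows of intensities."""
    return [[decay_level(v) for v in row] for row in texture_rows]


# Hypothetical 2 x 3 patch of NIR texture intensities.
patch = [[40, 120, 200], [90, 180, 60]]
for row in decay_map(patch):
    print(row)
```

Such a classification is only meaningful within homogeneously lit areas of the same material, since shadows lower the recorded intensity independently of the state of preservation.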
Regarding the case study of the lacquered screen panel, the mesh produced from NIR imagery by Metashape Pro was the only complete one and had a higher quality of surface detail. Metashape Pro produced about three times denser results than Zephyr Aerial, but the imaging spectrum did not seem to have a significant effect on the density. However, the geometry of the 3D VIS models was problematic, highlighting the consequences of the glaring effect during visible spectrum data capturing (Fig. 5). Zephyr Aerial models were noisier than those produced by Metashape Pro. Distances between VIS and NIR geometries recorded with Metashape Pro were 1.0 mm ± 0.8 mm, meaning a differentiation of less than three times the GSD (1 RMS), which is again metrically very similar, considering the precision of the method. Still, distances computed for Zephyr Aerial were 2.2 mm ± 1.6 mm, which should be considered significant. Additionally, differences between the VIS models produced with different software ranged between 2 and 2.5 mm, the same as for the NIR models. As observed, the differences calculated between the 'perforated' Metashape Pro VIS model and the respective NIR model corresponded roughly to deficiencies and noise of the digitization in the visible spectrum (Fig. 6).
High-resolution modeling and texturing with near-infrared imagery not only produced better topological results for the case study of the Chinese lacquerware but also enabled the identification of previous conservation interventions that are not recognizable on the visible textures and cannot be easily observed otherwise (Fig. 7). In both cases, the distances between photogrammetric models and the meshes produced from the F6 SR scanner were in the range of 1.0 ± 1.0 mm.
For the case study of the wooden furniture part, all scenarios produced high-resolution results, with reconstruction errors of the same order of magnitude (0.35-0.50 pixels). However, final point clouds for Zephyr Aerial were approximately five times sparser.
Distances between VIS and NIR geometries recorded with Metashape Pro were 2.2 mm ± 1.4 mm, and 2.4 mm ± 2.0 mm with Zephyr Aerial. These differences can be interpreted as significant inaccuracies relative to the size of the pixel on the object (0.02 mm). However, if we consider the 2 ± 2 mm differences between the photogrammetric models and the models produced with F6 SR as the precision of the methodology, then they can be considered acceptable (Fig. 8). Part of this discrepancy, though, can be attributed to the flat, almost two-dimensional geometry of the object, which does not favor the reconstruction. In addition, both 3D models from Zephyr Aerial included medium levels of noise. In contrast, the respective 3D models from Metashape Pro had a very smooth surface while preserving most of the detail (Fig. 9).
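To make the scale of these differences explicit, the mean VIS-NIR distances can be expressed as multiples of the pixel size on the object. The short sketch below uses only the values reported above (2.2 mm and 2.4 mm mean distances, 0.02 mm pixel size); the function name is ours.

```python
def deviation_in_pixels(deviation_mm, pixel_on_object_mm):
    """Express a surface deviation as a multiple of the pixel size on the object."""
    return deviation_mm / pixel_on_object_mm


pixel_on_object_mm = 0.02
for software, mean_dev_mm in (("Metashape Pro", 2.2), ("Zephyr Aerial", 2.4)):
    ratio = deviation_in_pixels(mean_dev_mm, pixel_on_object_mm)
    print(software, round(ratio), "pixel sizes")
```

The mean deviations thus correspond to roughly 110 and 120 pixel sizes, respectively, which is why they appear large against the image resolution even though they are close to the 2 ± 2 mm agreement with the scanner models.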
The level of detail in the texturing results, both visible and near-infrared, was adequate for conservation-oriented observations. With the near-infrared textures we were able to successfully locate and measure cracks, paint defects and retouchings (Fig. 10).
Regarding the last case study, that of the stone sculpture of Christ crucified, all processing scenarios resulted in highly detailed models with high-quality textures (Fig. 11). Only the triangular 3D mesh produced with the NIR imagery from Zephyr Aerial contained a low magnitude of noise. Unlike in the other case studies, Zephyr Aerial produced sparser results than Metashape Pro both for the NIR and the VIS images. The initial reconstructions of the object were sparser in near-infrared, but dense reconstruction results were of very similar volume for VIS and NIR imagery in both software packages. The variation between VIS and NIR vertices ranged below 0.15 mm (0.08 mm ± 0.07 mm) for both software packages, a very accurate result considering the GSD of the original photos, the precision of the method, the type of sensor, and the distances between the visible models, which were 0.11 mm ± 0.14 mm (Fig. 12). The variation between both VIS and NIR photogrammetric models and the scanning model ranged below 0.27 mm. The very-high-resolution NIR textures made the identification of deterioration feasible and were further exploited to roughly map the different levels of state of preservation (Fig. 13). This mapping could assist the precise work required of conservators in demanding restoration applications.

Conclusions
In this work, we constructed detailed models of historical artwork with high-resolution near-infrared textures, combining multi-view image 3D recording and near-infrared images captured with modified DSLR cameras, thus proving the feasibility of this integrated technique. Apart from the obvious advantage of reduced cost, compared to previously applied near-infrared three-dimensional acquisition methods, we also observed an increased versatility and convenience of implementation. We used two commercial software packages to compare the behavior of different Structure-from-Motion, Multiple-View-Stereo, and meshing algorithmic implementations in near-infrared. Metashape Pro generally provided complete meshes that were less noisy than those of Zephyr Aerial, with higher density and slightly smaller statistical reconstruction errors. Furthermore, we constructed, as a reference, detailed models of the same heritage case studies from visible imagery, acquired under the same conditions and processed with similar parameters, to ensure comparability. We observed that the initial reconstruction results from all case studies were sparser in near-infrared, with slightly heightened reconstruction errors, but the dense results were of very similar volume and produced in comparable times. For the case study of a Chinese screen panel, which was made of highly reflective material, digitization from near-infrared imagery even improved the reconstruction results, compensating for the glaring effects under visible light. Metric checks for all case studies proved that the combined 3D spectral technique is also very accurate compared to visible 3D image-based modeling, taking into consideration sensors, lenses, capturing distances, and the precision of SfM/MVS based approaches.
The high-resolution infrared textures improved the visibility of deterioration on the sculptures, potentially providing a valuable low-cost tool for three-dimensional decay mapping if certain light conditions are met during the data acquisition. In addition, they assisted the identification of defects and previous restoration works on the lacquerware and painted surfaces case studies, thus ascertaining the significance of the discussed approach towards heritage restoration and protection.