Editor’s note: we are excited to present this work, the winner of Novi’s inaugural Student Blog Contest. Many thanks to Esben and to everyone else who submitted!
This blog post highlights features and functionality of an early developmental version of CorePy, a collaborative effort with Toti Larson and other research team members of the Mudrock Systems Research Laboratory consortium at the University of Texas’ Bureau of Economic Geology (BEG).
IN THIS POST
Why care about pXRF data?
Thanks to relatively short scanning times and low operating cost, portable X-ray Fluorescence (pXRF) scanning of geologic core samples has become a burgeoning industry, with the technique capable of generating high-resolution datasets containing comprehensive major and trace element abundances present in the rock. When applied to continuous slabbed core, this translates to high-resolution, multivariate, depth-registered elemental data that can be used in conjunction with sedimentological observations and lithological core descriptions to deepen our understanding of subsurface geology, and it adds a correlation toolkit for stratigraphic comparison between core control. Integrated with petrophysical, mineralogical, and organic matter measurements, this approach has the potential to unlock a broad new understanding of geological systems. This is particularly relevant in mudrock systems, where geochemical processes such as surface water biologic productivity, bottom water redox conditions, and diagenesis are paramount for two of the things that matter most in defining a good shale reservoir: total organic content and permeability.
We will explain the thought process behind the underlying preprocessing, computation, and visualization of pXRF data by using a case study from a well that penetrates a proximal portion of the Pearsall Formation, a mixed carbonate-siliciclastic system of Cretaceous age deposited in South and Central Texas with potential as a shale play (with potential co-development of the Eagle Ford!) that is currently developed in the Indio Tanks and Pena Creek fields. We will be using Scikit-learn, an open source machine learning library that supports supervised and unsupervised learning for a variety of applications. Here we will be utilizing principal component analysis and k-means clustering algorithms to manipulate and segment the data before visualization. The process outlined in this blog post develops labeled datasets that can be tied to the core, ahead of the implementation of a neural network.
The Dataset
The scans in this dataset were accomplished using a Bruker Tracer IV ED-XRF portable X-ray fluorescence (pXRF) instrument, measured at 10 cm (4 in) spacing along the face of slabbed core. Major elements included are: Na, Mg, Al, Si, P, S, K, Ca, Ti, Mn, Fe, Ba, Cr, and V. Trace elements included are: Co, Ni, Cu, Zn, Ga, As, Pb, Se, Th, Rb, U, Sr, Y, Zr, Nb, and Mo. Raw output data (% for major elements, and ppm for trace elements) was calibrated following the semiquantitative methods outlined in Rowe et al. (2012).

Let us start with Guadalupe River Damsite 7-1, cored in Kendall County, TX. Guadalupe River Damsite 7-1 has a cored interval of 286 feet in length, with 926 scan points along the length of the core. Each depth-registered scan point contains elemental abundance data for 14 major and 16 trace elements. This leaves us with 27,780 individual measurements. In the time since this core was scanned, we have decreased pXRF scan spacing to 5 cm (2 in), effectively doubling both the resolution and size of our XRF datasets per core.
Here is a single element biplot of Si against Al. I want you to look carefully.

Do you have it stored in your mind? I hope so, because next I would like you to make note of how that compares to looking at Si against Mn.

Great!
But now what about the rest of the data? With a system composed of 30 elements, we would need to look at 435 individual cross plots simultaneously. Unfortunately, it is quite challenging (impossible) for the human brain to keep track of 30 dimensions (at least for me). Fortunately, by taking advantage of machine learning and some visualization techniques, we can extract meaning from the data no matter the number of features.
Data Preprocessing & Standardization
First, we will want to remove outliers from the dataset. In this example we apply a simple multiplier of four times the standard deviation as our outlier detection limit.

These flagged data points are then excluded from subsequent analysis and segmentation processes. Next, the data goes through a process of standardization, which transforms the values of our variables to all fall within the same numerical range, having a mean value of zero and a standard deviation of one. This helps account for any errors in our segmentation that may otherwise arise from the weighting of attributes within our model. As an example, if a dataset contained Sr values ranging from 20 to 5,000 ppm and Cu values ranging from 5 to 25 ppm, a non-standardized model would treat Cu as insignificant relative to Sr purely because of its smaller magnitude.
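Scikit-learn provides this transform directly as `StandardScaler`. A short sketch using synthetic Sr and Cu columns with the ranges mentioned above (the data themselves are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the ranges cited in the text:
# Sr spanning thousands of ppm, Cu only tens of ppm
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20, 5000, size=200),  # Sr (ppm)
    rng.uniform(5, 25, size=200),     # Cu (ppm)
])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has mean 0, std 1
```

After scaling, both elements contribute on equal footing, regardless of their raw concentration ranges.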
Dimensionality Reduction with PCA
Principal component analysis is a method used to take datasets with a large number of variables and transform those original variables into “principal components.” PCA generates as many of these principal components as there are variables in the dataset, ordered 1 to n by the amount of total variance each describes. This means that principal component one will account for the largest amount of variability, principal component two will account for the second largest amount of variability, and so on. The power of PCA comes from the ability to directly choose the principal components to carry forward into further analysis and interpretation, thereby enabling us to simplify the system and remove background noise while keeping the components that carry the greatest amount of variance unchanged.

Here we see that for this dataset, principal components one and two account for 60% of the cumulative variance (roughly 47% and 13% respectively). This method follows the premise that mineralogy is the fundamental control of major elements in sedimentary rock, and therefore, mineralogy is captured by principal component one, and principal components two and onwards involve trace element variability. This trace element variability can be affected by things such as surface water productivity, bottom water redox conditions, and any detrital-derived trace element enrichment. Thus, we can achieve dimensionality reduction from 30 components to less than 5 while still effectively representing the mineralogy and trace element interactions critical to capturing the variance of the system.
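The same reduction can be sketched with Scikit-learn's `PCA`: fit on standardized data, keep a handful of components, and read the captured variance from `explained_variance_ratio_`. The random matrix here is a stand-in for the real 926 × 30 element table:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in for the element table: ~900 scan points x 30 elements
X = rng.normal(size=(900, 30))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=900)  # correlated pair for PC1 to capture

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)          # keep only the first five components
scores = pca.fit_transform(X_std)  # (900, 5) coordinates in PC space

cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)  # cumulative fraction of total variance through PC1..PC5
```

On the real dataset, the first two entries of `cum_var` would be the ~47% and ~60% figures quoted above.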
K-means clustering to visualize PCA chemofacies
For simplicity, let us select the first two principal components for our segmentation process. We will use k-means clustering, which is a simple segmentation method that divides the dataset into a selected number of groups. After choosing the number of clusters, each cluster is given a seed value, or starting point. Once the initial seeds are plotted, each data point is assigned to the nearest seed. The next step is to calculate the centroid for each cluster, defined as the geometrical center of the data points assigned to it. The previous steps are then repeated until the system converges and we have reached a clustering solution where the cluster centroids can no longer be adjusted. These final groupings are then defined as our chemofacies, which are effectively chemomineral end members based on genetically similar bulk and trace element distributions.
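The seed/assign/recenter loop described above is exactly what Scikit-learn's `KMeans` runs under the hood. A hedged sketch on synthetic PC1/PC2 scores (three well-separated point clouds rather than the real five-cluster solution):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic PC1/PC2 scores: three loose point clouds
pc_scores = np.vstack([
    rng.normal(loc=(-3, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(3, 2), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, -3), scale=0.5, size=(100, 2)),
])

# n_init reruns the seed/assign/recenter loop from several random
# starting points and keeps the best converged solution
km = KMeans(n_clusters=3, n_init=10, random_state=0)
chemofacies = km.fit_predict(pc_scores)  # integer label per data point
centroids = km.cluster_centers_          # final converged centroids
```

On the real data, `n_clusters=5` reproduces the five chemofacies shown in the biplot below, and `centroids` gives the black ‘X’ positions.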
Now let us visualize the results of our principal component analysis and clustering on a 2-D plane. To accomplish that we will choose the first two principal components as axes.

Principal component 1 is represented by the x-axis, while principal component 2 is represented by the y-axis. The centroids of each of the 5 clusters are denoted with the black ‘X’ symbols. The arrows that extend outward from the origin are eigenvectors, which indicate the loading that each element has on each principal component. The degree of influence that an element has on a principal component is visualized by the angle and length of its arrow relative to the axes. Understanding this, you can see that each chemofacies group is ultimately defined by the set of elements that load it with respect to PC1 and PC2.
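The arrow endpoints in a biplot like this can be computed from the fitted model's `components_` matrix. A sketch with a hypothetical five-element subset (the anticorrelated Si/Ca pair is built in so the loadings mimic the opposing arrows described below):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

elements = ["Si", "Al", "Ca", "Mg", "Fe"]  # hypothetical subset
rng = np.random.default_rng(7)
X = rng.normal(size=(500, len(elements)))
X[:, 2] = -X[:, 0] + 0.2 * rng.normal(size=500)  # Ca roughly opposite Si

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Loadings: each element's weight on PC1 and PC2. Rows of
# components_ are PCs, so transpose to get one (PC1, PC2) arrow
# endpoint per element, scaled by sqrt of explained variance.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

for name, (pc1, pc2) in zip(elements, loadings):
    print(f"{name}: PC1={pc1:+.2f}, PC2={pc2:+.2f}")
```

Drawing each `loadings` row as an arrow from the origin over the scatter of PC scores reproduces the biplot layout; the Si and Ca arrows come out with opposite signs along PC1, as in the figure.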
How do you begin to interpret PCA biplots?
While this is a much more digestible format for XRF data than cross plots, it can still be tricky to interpret at first. So, what do these groupings really mean? Let us start simple. For example, the blue cluster (chemofacies 2) is strongly loaded by calcium. In fact, that entire group resides within negative PC1 values. Roughly opposite calcium is the arrow for silicon. All other groups experience some degree of loading by silicon, with red (chemofacies 5) the most strongly correlated. From a first order mineralogic context, we would then expect our blue chemofacies to be associated with our cleanest carbonate buildups, and red to be associated with our sandier to mixed siliciclastic intervals. These distributions support our premise that minerals control the concentrations of major elements in sedimentary systems. In a moment we will be able to compare this observation to a lithological core description over the same interval.
Principal component two is more heavily loaded by trace elements. Mo and Co are commonly used as indicators of bottom water redox conditions, while Ni, Cu, and Zn are associated with surface water productivity. Production of organic matter and its subsequent preservation or oxidization is critical in understanding mudrock depositional systems. Additionally, the trace elements Th, Ti, Zr, and Rb are proxies of detrital influence. With these proxies in mind, we can see that our purple cluster (chemofacies 3) is strongly loaded by Rb, Zr, Th, and Ti, indicating detrital flux, as well as by Mo and Co, indicating a sustained state of bottom water dysoxia during deposition. These data points will be found within the most shale-rich, heightened-TOC intervals of the core.
This effort of establishing a workflow to combine evaluation of both bulk mineralogic trends and trace element oceanographic trends on the same plot enables a path for the integrated interpretation of a number of key depositional system drivers, including sediment sourcing and evolving ocean water chemistry, which in the end, drive the organic matter production and preservation that is key in shale plays.
Putting it all together & Ground Truthing
After we have defined our chemofacies groupings, we can go back and reapply our learnings as a 1-D depth series along the core.
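Since each scan point is depth-registered, this step amounts to joining the cluster labels back onto depth and collapsing consecutive identical labels into intervals for the log track. A small sketch with made-up depths and labels (the real log uses the ~10 cm scan spacing):

```python
import numpy as np
import pandas as pd

# Hypothetical output: depth (ft) and a chemofacies label per scan point
log = pd.DataFrame({
    "depth_ft": np.arange(100.0, 105.0, 0.33),  # ~10 cm (4 in) spacing
    "chemofacies": [1, 1, 2, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 3, 3, 2],
})

# Collapse runs of identical labels into top/base intervals --
# the form needed to draw a facies log beside the core description
runs = (log["chemofacies"] != log["chemofacies"].shift()).cumsum()
intervals = log.groupby(runs).agg(
    top=("depth_ft", "min"),
    base=("depth_ft", "max"),
    facies=("chemofacies", "first"),
)
print(intervals)
```

Each row of `intervals` becomes one colored bar on the depth track, ready to hang next to the lithologic description.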

Immediately, we can see that our blue chemofacies does indeed correlate with clean carbonate intervals as interpreted in the lithologic core description, ranging from skeletal grain dominated packstones and grainstones to a reefal bioherm near the top of the core. In comparison, our red chemofacies is associated with a reduction in calcium concentration and an increase in silicon, indicating a substantial increase in sand proportion during deposition. Our purple chemofacies is indeed associated with the most organic-rich shale intervals (near the middle of the core), despite data gaps caused by missing core sections, bagged intervals of crumbled core, and the overall fissile nature of that facies.
As with any unsupervised clustering method, it is important to be able to test and ground truth the model. To accomplish this, we process raw photographs of core boxes and plot our chemofacies along the core, next to the location the core was scanned.

This allows us to do a quick visual check against the physical rock samples the pXRF points are associated with. This section of core ranges from depth 32.7 to 45.5 ft. The blue chemofacies is plotting on the clean carbonate biohermal buildup as expected, and below that, we transition into the less calcium rich and mixed carbonate-siliciclastic chemofacies loaded with Mg, Na, and V instead. With this visualization, we can both test the validity of our model, and provide tuning based on comparison to the rock data.
What comes next?
Classification of chemofacies is the beginning, not the end goal.
Ultimately, our goal is to take the PCA-clustered chemofacies classification schema and develop a trained neural network for fully deployed chemofacies identification across core control. This approach of applying a self-consistent chemofacies framework across available core control represents a large step forward in the quantification, comparison, and applied usage of pXRF data as a tool for the characterization of highly complex systems.
Once chemofacies classifications have been established, we can use these groupings to explore correlation with other rock and fluid properties, such as rock strength, TOC, and porosity-permeability.
A future expansion of our methodology could then include exporting our 1-D logs to well visualization software such as Petrel for the development and population of 3D geologic models with chemofacies and associated rock properties. This would allow both new and existing 3D subsurface models to incorporate high-resolution chemofacies trends and relationships that may help inform important source and reservoir property characteristics, including things that could drastically influence well performance, such as landing zone selection. Within this workflow, the assignment and classification of rock intervals by chemofacies is the starting point rather than the end goal; the end goal is to build upon this classification scheme by correlating rock and fluid properties to chemofacies, developing fully integrative subsurface models capable of tying micro-scale geochemical observations back into macro-scale reservoir properties and the overarching depositional model.
Conclusions
- pXRF scanning is a relatively fast way to create a high resolution, labeled chemomineral dataset
- Principal components extracted via Python toolkits allow for comprehensive preprocessing, segmentation, and visualization of pXRF data—distilling the data to useful trends like bulk mineralogy and trace element enrichment.
- pXRF data can be a useful supplementary dataset when used in conjunction with sedimentological and lithologic core interpretations
- Providing a deep neural network with PCA segmented training datasets from type-core could allow these chemofacies to be deployed and normalized across field scale applications