Chapter 3: Materials & Methods
In this chapter, an overview of the dataset is first presented in Section 3.1, covering agronomic, geospatial, and meteorological data. Subsequently, in Section 3.2, in conjunction with the foundational information presented in the preceding section, our models and estimation strategy are derived.
3.1 Data
3.1.1 Agronomic Data
We have access to historical milk sampling data from three major Swiss cow breeding associations (Swissherdbook, 2024; Braunvieh Schweiz, 2024; Holstein Switzerland, 2024). The aims of the three associations are intimately linked to the breeding and promotion of particular cow breeds within Switzerland. Swissherdbook focuses mainly on Holstein, Simmental, and Swiss Fleckvieh. Holstein Switzerland is dedicated exclusively to Holstein and Red Holstein cattle. Braunvieh Schweiz is centered around Braunvieh and Original Braunvieh cattle.
The data pool contains more than test-day milk sample records. Along with these, the associations store breeding data, such as udder and other physical traits, insemination records, and calving data. The data exchange standard is defined by ASR (2024) and maintained by Qualitas AG (2024). This guarantees a common database format among the different breeding associations.
Generally, breeding associations collect monthly test-day milk samples on dairy farms whose farmers participate in the breeding program. The exact sampling guidelines can be consulted in the milk inspector manual (Swissherdbook, 2019). Not all dairy farmers are part of a breeding association: some conduct their own herd and breeding management and do not actively participate in the breeding market with their cows. However, the available data should be representative of Switzerland when considering its geospatial distribution, which is further discussed in Section 3.1.2 and visualized in Figure 3.3.
Data Processing
For each of the three associations, the milk sample data is provided in a slightly different raw tabular database format. We write custom parsers on a best-effort basis to process the individual datasets and merge the data from the three providers into a single dataset. Table 3.1 summarizes a selection of available variables (ASR, 2024). Each farm identifier can be linked with the corresponding farm metadata, which allows us to determine the farm location at the ZIP code level. The ZIP code associated with each sample is important for matching with the weather data. A finer granularity at the exact postal address level would be technically feasible but is not applied for data protection and privacy reasons.
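As an illustration, the harmonisation step can be sketched as below. The per-association raw column names (`tier_id`, `milch_kg`, etc.) and the two-field schema are hypothetical placeholders; the actual parsers cover the full ASR variable set.

```python
import csv

# Hypothetical per-association column mappings onto the common ASR-style schema.
COLUMN_MAPS = {
    "swissherdbook": {"tier_id": "animalId", "milch_kg": "milk"},
    "braunvieh":     {"animal":  "animalId", "milk_kg":  "milk"},
    "holstein":      {"id_tier": "animalId", "lait_kg":  "milk"},
}

def parse_association(path, association):
    """Best-effort parser: read one provider's CSV and rename its
    columns to the shared schema, keeping only mapped fields."""
    mapping = COLUMN_MAPS[association]
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            yield {mapping[k]: v for k, v in row.items() if k in mapping}

def merge_datasets(sources):
    """Concatenate the harmonised records from all providers."""
    merged = []
    for path, association in sources:
        merged.extend(parse_association(path, association))
    return merged
```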
Table 3.1: Selection of available variables (ASR, 2024).

Milk Performance Variable | Description / Unit / Type |
---|---|
milk | Day milk yield \( [ \text{kg} ] \) |
fat | Fat \( [ \text{\%} ] \) |
protein | Crude protein \( [ \text{\%} ] \) |
ecm_yield | Energy-corrected milk yield \( [ \text{kg} ] \) |
lactose | Lactose \( [ \text{\%} ] \) |
samplePersistence | Sample persistence \( [ \text{\%} ] \) |
somaticCellCount | Somatic cell count \( [ \text{x 1000/ml} ] \) |
milkUreaNitrogen | Milk urea nitrogen \( [ \text{mg/dl} ] \) |
citrate | Citrate \( [ \text{mg/dl} ] \) |
aceton | Acetone \( [ \text{mg/l} ] \) |
caseinMeasured | Casein \( [ \text{\%} ] \) |
acetonMmol | Acetone \( [ \text{mmol/l} ] \) |
acetonIr | Acetone IR (infrared spectroscopy) \( [ \text{mmol/l} ] \) |
bhbConcentration | BHB (beta-hydroxybutyrate) \( [ \text{mmol/l} ] \) |
Metadata Variable | |
animalId | Animal identification number \( [ \text{categorical} ] \) |
animalBreedCode | Animal breed code \( [ \text{categorical} ] \) |
farmIdLocationSample | Farm identification \( [ \text{categorical} ] \) |
calvingDate | Last calving date \( [ \text{date} ] \) |
lactationNumber | Lactation number \( [ \text{integer} ] \) |
sampleWeighingDate | Sample date \( [ \text{date} ] \) |
days_in_milk | sampleWeighingDate \( - \) calvingDate \( [ \text{days} ] \) |
zip | ZIP matching the farm associated with farmIdLocationSample \( [ \text{categorical} ] \) |
Multi-Stage Data Cleaning Strategy
We clean the raw data with the following multi-stage strategy: First, we apply an interquartile range filter to milk, fat, and protein per breed and year. A sample is removed from the dataset if any of the three variables is an outlier. An outlier is a value below the lower bound \( Q_1 - 1.5 \times \text{IQR} \) or above the upper bound \( Q_3 + 1.5 \times \text{IQR} \), where \( Q_1 \) is the first quartile, \( Q_3 \) the third quartile, and \( \text{IQR} = Q_3 - Q_1 \) the interquartile range. Second, only those samples are considered where milk, fat, and protein measurements are all simultaneously available. Third, any samples linked to research farms, educational farm facilities, breeding associations, or research institutions are excluded as non-representative. Fourth, as there are some samples from outside the country, only Swiss farms, identifiable by their ZIP codes, are included. Fifth, samples where days_in_milk exceeds 600 days are discarded. Sixth, farms with fewer than 100 samples are excluded. Lastly, any animals contributing fewer than 5 samples are eliminated.
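The first cleaning stage can be sketched as follows, assuming the samples have already been grouped by breed and year upstream (a minimal illustration, not the production pipeline):

```python
from statistics import quantiles

def iqr_bounds(values):
    """Lower/upper outlier bounds via the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def filter_samples(samples, keys=("milk", "fat", "protein")):
    """Drop a sample if ANY of the three variables is an outlier.
    `samples` is a list of dicts belonging to ONE breed-year group."""
    bounds = {k: iqr_bounds([s[k] for s in samples]) for k in keys}
    return [s for s in samples
            if all(bounds[k][0] <= s[k] <= bounds[k][1] for k in keys)]
```

Note that the quartile convention (here Python's default "exclusive" method) slightly shifts the bounds for small groups; per-breed-and-year groups in this dataset are large enough that the choice is immaterial.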
Table 3.2: Breed code mapping.

Breed Code | Breed Name | Mapped Code | Mapped Breed Name |
---|---|---|---|
RH | Red Holstein | HO | Holstein |
HO | Holstein | HO | Holstein |
RF | Holstein | HO | Holstein |
SF | Swiss Fleckvieh | SF | Swiss Fleckvieh |
BS | Brown Swiss | BS | Brown Swiss |
60 | Simmental | SI | Simmental |
70 | Simmental | SI | Simmental |
SI | Simmental | SI | Simmental |
OB | Original Braunvieh | OB | Original Braunvieh |
JE | Jersey | JE | Jersey |
Breeds
In the refined dataset, the animalBreedCode variable encompasses 10 distinct breed codes. In accordance with the Swissherdbook regulations (Swissherdbook, 2022), certain breed codes correspond to identical breeds. Breeding objectives are determined at the breed level; thus, codes 60, 70, and SI are unified under the same breeding goals. The same applies to the Holstein codes RH, HO, and RF. A comprehensive overview is presented in Table 3.2. This mapping is employed to streamline our dataset, enabling us to focus on 6 breeds in total: Holstein (HO), Jersey (JE), Brown Swiss (BS), Original Braunvieh (OB), Simmental (SI), and Swiss Fleckvieh (SF). Bloodline-specific breed information is available in the Swissherdbook regulations.
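The mapping in Table 3.2 translates directly into a lookup table; a minimal sketch:

```python
# Breed-code mapping from Table 3.2 (Swissherdbook herdbook regulations).
BREED_MAP = {
    "RH": "HO", "HO": "HO", "RF": "HO",  # Holstein codes
    "SF": "SF",                          # Swiss Fleckvieh
    "BS": "BS",                          # Brown Swiss
    "60": "SI", "70": "SI", "SI": "SI",  # Simmental codes
    "OB": "OB",                          # Original Braunvieh
    "JE": "JE",                          # Jersey
}

def map_breed(code: str) -> str:
    """Collapse the 10 raw animalBreedCode values to the 6 analysis breeds."""
    return BREED_MAP[code]
```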
Table 3.3: National average dairy performance per breed for 2023.

Symbol | Breed | Purpose | Milk \( [ \text{kg/d} ] \) | Protein \( [ \text{\%} ] \) | Fat \( [ \text{\%} ] \) | ECM \( [ \text{kg/d} ] \) |
---|---|---|---|---|---|---|
\( \circ \) | Holstein | Milk | 27.20 (±8.58) | 3.42 (±0.42) | 4.22 (±0.67) | 30.13 (±8.90) |
\( \times \) | Swiss Fleckvieh | Milk/Meat | 22.45 (±7.51) | 3.46 (±0.42) | 4.27 (±0.69) | 25.02 (±7.88) |
\( \bigtriangleup \) | Brown Swiss | Milk | 23.04 (±7.55) | 3.56 (±0.43) | 4.13 (±0.61) | 25.56 (±8.02) |
\( \diamondsuit \) | Simmental | Milk/Meat | 19.39 (±6.50) | 3.44 (±0.37) | 4.04 (±0.60) | 21.14 (±6.89) |
\( + \) | Original Braunvieh | Milk/Meat | 19.33 (±6.48) | 3.42 (±0.42) | 4.01 (±0.57) | 20.95 (±6.77) |
\( \square \) | Jersey | Milk | 18.88 (±6.10) | 3.97 (±0.51) | 5.30 (±0.94) | 24.17 (±7.15) |
Table 3.3 summarizes the national average dairy performance metrics for 2023. These values match the data reported by BLW (2023) and Schweizer Bauernverband (2024). In terms of milk yield, the Holstein breed is distinguished by its superior milk and ECM performance, whereas the Jersey breed is recognized for its elevated protein and fat content. The latter characteristic manifests in a comparatively greater gain after ECM correction relative to other breeds. These values also reflect the major characteristics of Swiss dairy farming presented earlier.
Table 3.4: Sample distribution across breeds, farms, and animals.

Symbol | Breed | # Samples | # Farms | # Animals | Timespan |
---|---|---|---|---|---|
\( \circ \) | Holstein | 27'536'089 | 24'963 | 971'198 | 1985-2023 |
\( \times \) | Swiss Fleckvieh | 31'484'784 | 27'392 | 1'038'291 | 1984-2023 |
\( \bigtriangleup \) | Brown Swiss | 56'695'579 | 26'585 | 1'719'156 | 1982-2023 |
\( \diamondsuit \) | Simmental | 8'731'867 | 19'411 | 299'698 | 1984-2023 |
\( + \) | Original Braunvieh | 4'996'060 | 18'613 | 149'478 | 1982-2023 |
\( \square \) | Jersey | 734'685 | 4'302 | 23'675 | 1998-2023 |
| Total | 130'179'064 | 46'975 | 4'201'494 | 1982-2023 |
In Table 3.4, we summarize the sample distributions across breeds, farms, and animals. The over 130 million samples are not equally distributed among the breeds. Brown Swiss leads with over 56 million samples, followed by Swiss Fleckvieh with over 31 million. Holstein has about 27 million samples, while Simmental and Original Braunvieh follow with 8.7 million and almost 5 million. For the Jersey breed, over 730'000 samples are available. On average, 30 samples per cow are available. Along with the number of samples, we report the number of farms and animals as well as the time range covered by the available samples. The overall dataset has samples from more than 4 million cows spanning a period of 42 years. The ratios between the numbers of samples, farms, and animals reveal different farm structures for the different breeds; therefore, data subsampling strategies, if required, need to be applied with care. Moreover, farms can host multiple dairy cow breeds.
Data Structure: Sparsity and Dynamics
Figure 3.1 depicts the sparse structure and intrinsic dynamics of the dataset. We describe subfigures top down and from left to right. The initial subfigure presents the distribution of samples across individual farms, indicating that the majority possess several thousand samples. The subsequent distribution demonstrates the longitudinal data availability across years for the farms, highlighting that some farms intermittently enter or exit the panel, while others continuously contribute samples over nearly four decades. The third figure underscores the spatial dynamics of livestock. We observe that most cattle are kept on the same farm. However, some cattle relocate, a phenomenon attributable to seasonal pasture changes as well as commercial activities such as breeding exchanges. The fourth figure elucidates the distribution of the number of samples collected over an animal’s lifespan or its tenure in the panel. The subsequent subfigure indicates that the majority of animals are sampled over a period ranging from 2 to 5 years. Consequently, similar to farms, animals enter and exit the panel. The final figure reveals that each year, only a limited number of samples are available from the animals.


In conclusion, the dataset reflects actual agricultural practices and farm structures in Switzerland, rather than being designed for controlled experimental studies measuring biological responses of dairy cows. The data is primarily collected to monitor long-term breeding progress. Hence, the dataset is notably sparse, in stark contrast to experimental research, where data collection from farms and cows occurs with greater frequency and regularity. Addressing our research question necessitates a diverse set of THI data points. When breeding associations conduct sampling on a farm, they sample all cows on the test day. Consequently, on average, there are 7 weather data points recorded per cow, farm, and year. Therefore, a representative sample covering a broad spectrum of THI data points requires data from at least hundreds of farms and thousands of animals.
Data Structure: Time and Seasonality
The average milk yield per cow shows a continuous annual increase. Figure 3.2 illustrates the average daily milk yield of cows recorded on the corresponding days in the years 1990, 2000, 2010, and 2020 in Switzerland across all breeds. There is a notable increase in daily milk production over the years attributable to advancements in breeding. Additionally, seasonal fluctuations are apparent throughout each year. The minor spikes observed are a consequence of the sampling strategies employed by breeding associations, as farms are sampled sequentially rather than simultaneously. The peak of lactation occurs approximately in April and May, influenced by farm management practices. Seasonal calving during winter months results in lactation peaks during these months, supported by the availability of high-quality grasslands in the spring season. Consequently, the reduced milk yield observed during the summer months can be attributed not solely to heat stress, but also to the biological cycle, geospatial factors, farm management practices, and the availability of grassland. These long-term and seasonal factors must be appropriately modeled to accurately assess the impact of the THI on dairy performance.

3.1.2 Geospatial Data
The milk data described in Section 3.1.1 is provided with anonymized farm metadata and associated ZIP codes. In Switzerland, each ZIP code can be mapped to a locality: a geographically defined area designated by an unambiguous place name and a unique ZIP code. A ZIP code belongs to one or more political municipalities, which carry a unique BFS number assigned by the Swiss Federal Statistics Office. Since the locality distribution is finer-grained than the political municipalities, we map each farm to the closest geographical locality center. We count a total of \( 3194 \) localities. The geospatial data and altitude raster grids are retrieved from swisstopo (2024). By matching the locality data with the milk samples' ZIP codes, we delineate the geospatial distribution of samples per breed, as illustrated in Figure 3.3. The Simmental and Swiss Fleckvieh populations are predominantly located in the Bernese Alps, whereas the Holstein breed is primarily found in Western Switzerland. In contrast, the Brown Swiss and Original Braunvieh breeds are more prevalent in eastern Switzerland. The Jersey breed demonstrates a more scattered geographical distribution.

3.1.3 Hydroclimatic Data
Meteorological variables such as daily mean, minimum, and maximum temperature \( [ \text{°C} ] \), precipitation \( [ \text{mm} ] \), and relative sunshine duration \( [ \text{\%} ] \) are available as a gridded dataset from the Chair of Hydrology and Water Resources Management, ETH Zurich (2024) at a \( 1 \text{ km} \times 1 \text{ km} \) resolution. We allocate each ZIP code locality center to the nearest point within the gridded dataset. Relative humidity, which the THI metric requires, is not available in gridded form. However, MeteoSchweiz (2024) offers daily mean, minimum, and maximum relative humidity for some weather stations. We follow Bucheli et al. (2022) and map available stations to the closest locality centers with the following height-corrected distance function:
\[ d_{s,c} = \sqrt{(\text{lat}_s - \text{lat}_c)^2 + (\text{long}_s - \text{long}_c)^2 + \left(\psi \cdot (\text{alt}_s - \text{alt}_c)\right)^2} \]
The height correction is integrated in \( d_{s,c} \) as a factor \( \psi=100 \). This correction is applied because relative humidity is more likely to vary with elevation changes than with horizontal movements: temperature varies with altitude and changes the saturation capacity of the air.
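A minimal sketch of the matching step, assuming station and locality-center coordinates are given as `(lat, long, alt)` tuples in the same units and scaling convention as the distance function above:

```python
import math

PSI = 100.0  # altitude weighting factor (psi) from the distance function

def station_distance(lat_s, long_s, alt_s, lat_c, long_c, alt_c, psi=PSI):
    """Height-weighted Euclidean distance d_{s,c} between a weather
    station s and a locality center c, following Bucheli et al. (2022)."""
    return math.sqrt((lat_s - lat_c) ** 2
                     + (long_s - long_c) ** 2
                     + (psi * (alt_s - alt_c)) ** 2)

def nearest_station(stations, centre):
    """Pick the station minimising d_{s,c} for one locality center."""
    return min(stations, key=lambda s: station_distance(*s, *centre))
```

The altitude penalty means a nearby station at a very different elevation can lose to a horizontally more distant station at a similar elevation, which is exactly the intended behaviour for humidity matching.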


We retrieve all daily mean, maximum, and minimum relative humidity data available from the MeteoSchweiz portal from 1982 until the end of 2023. For a station to be considered, it must not have more than 500 missing samples during its lifetime. We treat a station as a single unit for as long as it exists as a fixed measurement point and do not control for technical changes such as hardware upgrades, which increases the risk of instrumentation bias. Under these constraints, \( 159 \) stations are available in 1982 and \( 540 \) by the end of 2023. Figure 3.4 shows the weather station distribution in these two years. Station density increases more in the alpine regions than in the Swiss Plateau. Throughout our study period from 1982 to 2023, all major regions in Switzerland are equipped with weather stations, and the stations increase in density over time.
3.2 Models
Informed by modeling elements derived from prior research, background information on dairy farming in Switzerland, biological aspects, and the structure of our data, we establish an empirical strategy. We then derive two model classes: single-breed models, estimated separately with the data of each breed, and multi-breed models, estimated with data from all breeds together. Single-breed models have lower complexity and are easier to estimate, but this comes with statistical challenges for model comparison. A multi-breed model that explicitly models breed aspects introduces more complexity and might be harder to estimate; however, in certain circumstances, it is more appropriate for valid statistical comparisons between breeds.
3.2.1 Empirical Strategy
Our goal is to estimate the average effect of THI on cow-level dairy production performance variables while controlling for breed, animal-physiological aspects, and other environmental effects such as farm management practices, regulatory changes, and breeding progress. Hence, we need an additive statistical approach to separate the THI effect from these other factors. This requirement leads us to a model with the following structure:
$$ \textit{Variable of Interest} = \textit{Weather} + \textit{Controls} + \textit{Unobserved Heterogeneity} + \textit{Error}. $$
The variables of interest are the daily milk yield and daily ECM yield. To estimate the average effect of weather, we need to model the THI as a component. We know from prior research that the effect of THI on milk and milk component performance is non-linear. In our models, we will use the three-day mean THI, i.e., the average THI value over the milk sample date \( t \) and the two preceding days \( t-1 \) and \( t-2 \):
$$ \textit{3 Day Mean THI} = \frac{THI_t + THI_{t-1} + THI_{t-2}}{3}, $$
where \( t \) is the reported milk sample date and
$$ \text{THI}_t = (1.8 \times \text{T}_t) + 32 - \left[ (0.55 - 0.0055 \times \text{RH}_t) \times (1.8 \times \text{T}_t - 26) \right]. $$
\( T_t \) and \( RH_t \) are the mean temperature and mean relative humidity of day \( t \).
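The THI definition and its three-day mean translate directly into code; a minimal sketch:

```python
def thi(temp_c, rh_percent):
    """Daily THI from mean temperature (deg C) and mean relative humidity (%)."""
    return (1.8 * temp_c) + 32 - (0.55 - 0.0055 * rh_percent) * (1.8 * temp_c - 26)

def three_day_mean_thi(temps, hums):
    """Three-day mean THI for sample day t; inputs ordered [t-2, t-1, t]."""
    assert len(temps) == len(hums) == 3
    return sum(thi(t, h) for t, h in zip(temps, hums)) / 3
```

For example, a day with a mean temperature of 25 °C and 50 % mean relative humidity yields a THI of 71.775.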
We opt for the three-day THI mean based on several considerations. Firstly, when a farm is sampled, cows are sampled in the morning and in the evening. These samples may either be collected on the same calendar day or across two consecutive days, when the samples are taken in the evening and the morning of the following day. In the latter case, the sample date is recorded as the morning of the subsequent day, irrespective of the first sample having been taken the day prior. The specific sequence of sampling for each set is not discernible within the dataset. Although such details are inconsequential for practical breeding operations, they become critical when aligning milk sample data with meteorological information. Consequently, it is imperative to account for weather conditions on the sampling day as well as one day prior. Secondly, we incorporate an additional preceding day to address the lagged effect of the THI on dairy cows. There is currently no standardized consensus on the number of lag days to be considered; we have aligned our methodology with the approach delineated in Bryant et al. (2007).
The model needs to control for the non-linear lactation curve of the cow. Right after calving, the milk performance increases until it reaches a peak before it drops again until the next dry period starts. A viable proxy for this purpose is days in milk (DIM). Moreover, dairy performance also depends on the parity; we make a binary distinction between primi- and multiparous cows. Primiparous cows have not yet developed their full biological potential and have a lower milk performance than multiparous cows. We also need to control for within-year seasonality, geospatial differences, and long-term effects such as breeding progress or regulatory changes. These aspects are motivated by the analysis of effects observed in Figure 3.2.
The specific characteristics of the farm location, including the barn infrastructure, the topographical features and the management strategies, cannot be explicitly measured. We assume farmers constantly adapt and change their best practices. Similarly, the fundamental performance attributed to the genetic and phenotypic attributes of individual cows cannot be directly observed. It is imperative to account for this latent heterogeneity.
3.2.2 Single Breed Model
We first introduce a single-breed model before expanding it in the next section. Let \( y_{ijkt} \) be a dairy performance variable of a sample of animal \( i \), at farm location \( j \), at farm \( k \), and at sampling date \( t \). Equation 3.5 defines an additive mixed model to estimate the average effect of THI on milk performance variables. The coefficient \( \beta_0 \) is the intercept, representing the mean performance of the given breed data, excluding all other factors included in the model. \( \mathbb{I}(\text{Primiparous}_i) \) indicates whether cow \( i \) is primiparous or multiparous. \( \textit{THI}_{kt} \) is the three-day mean THI at the location of farm \( k \) at sampling date \( t \). THI is modelled as separate non-linear functions \( f_{1,1} \) and \( f_{1,2} \) for primi- and multiparous cows. \( \textit{DIM}_{it} \) is the days in milk of cow \( i \) sampled on day \( t \). Following the same analogy as for THI, DIM is also modelled as separate non-linear functions \( f_{2,1} \) and \( f_{2,2} \) for primi- and multiparous cows. \( \beta_1 \) is the fixed effect coefficient for being primiparous compared to multiparous. \( \mathbb{I}(\text{Year}_t = m) \) indicates whether the sampling date \( t \) falls in year \( m \), where \( \mathcal{M} \) is the set of years covered by the included samples. \( \beta_{2,m} \) is the effect of year \( m \) compared to the reference year. \( \iota_{kt} \) is the random intercept for the ZIP code of farm \( k \) and the month of sample date \( t \). This random effect models the annual seasonality while accounting for geospatial differences; due to the topographic heterogeneity of Switzerland, timing and farming practices might vary. For example, in high-altitude regions the grazing season might start later than in the Swiss Plateau. Another aspect covered by this random intercept are spatial spillovers.
Neighbouring farmers might apply similar practices simply due to geospatial constraints but also due to networking effects (Young & Coleman, 1959). The factor levels for this random intercept are created with a ZIP code - month interaction term. The ZIP code is a natural choice originating from our farm-ZIP code matching introduced in Section 3.1, and the monthly granularity is justified by the monthly sampling frequency of the breeding associations. \( \phi_{k} \) is the random intercept for the location of farm \( k \). Since different farms have unique characteristics such as soil, land, or barns, we capture the variability between farms. Moreover, Figure 3.1 shows the hierarchical structure of the data: grouping within entities such as farms needs to be taken into account in the model. Our data covers very long periods of time, and farmers constantly adapt and change their practices. Some of the long-term national average aspects are captured by the \( \textit{Year} \) fixed effect; the constant individual adaptation by each farmer is not explicitly modelled. Consequently, the effects \( f(\text{THI}) \) and \( f(\text{DIM}) \) are, at least in part, implicitly reflective of adaptive changes. Additionally, each individual cow \( i \) is incorporated as a random intercept \( \alpha_i \) to account for repeated sampling per cow and distinctive animal characteristics. The residual error term is denoted by \( \epsilon_{ijkt} \).
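Assembled from the term definitions above, the single-breed specification takes the following form (a reconstruction from the component descriptions in this section, using the indicator notation defined above):

$$ y_{ijkt} = \beta_0 + \beta_1\,\mathbb{I}(\text{Primiparous}_i) + f_{1,1}(\textit{THI}_{kt})\,\mathbb{I}(\text{Primiparous}_i) + f_{1,2}(\textit{THI}_{kt})\,\big(1 - \mathbb{I}(\text{Primiparous}_i)\big) + f_{2,1}(\textit{DIM}_{it})\,\mathbb{I}(\text{Primiparous}_i) + f_{2,2}(\textit{DIM}_{it})\,\big(1 - \mathbb{I}(\text{Primiparous}_i)\big) + \sum_{m \in \mathcal{M}} \beta_{2,m}\,\mathbb{I}(\text{Year}_t = m) + \iota_{kt} + \phi_k + \alpha_i + \epsilon_{ijkt} \tag{3.5} $$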
In Equation 3.6 we provide a simplified notation for the model introduced in Equation 3.5, solely focusing on the covariates and not their coefficients.
The variable \( y \) denotes the dependent dairy performance variable. The parameter \( \mu \) serves as the overall intercept of the model. The function \( f(\text{THI}) \) is a non-linear function of the THI. Likewise, \( f(\text{DIM}) \) is a non-linear function that models the lactation curve. The variable \( \text{P} \) signifies parity, representing the binary distinction between primiparous and multiparous cows. The term \( \text{Y} \) denotes the year as a categorical variable, capturing potential year-to-year variations. The random intercept \( \iota \) accounts for variation linked to the ZIP code by month, reflecting geographic and seasonal effects. The random intercept \( \phi \) pertains to the farm location, capturing variability across farms and sampling sites. The random intercept \( \alpha \) addresses the variability among individual cows. Lastly, \( \epsilon \) denotes the residual error term of the model.
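Collecting the terms described above, the simplified notation reads:

$$ y = \mu + f(\text{THI}) + f(\text{DIM}) + \text{P} + \text{Y} + \iota + \phi + \alpha + \epsilon. \tag{3.6} $$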
3.2.3 Multi-Breed Model
Our hypothesis posits varying effects of THI on dairy performance across breeds. Consequently, our aim is to determine whether there are statistically significant heterogeneous treatment effects. The treatment under consideration is the exposure of cows to weather (THI), while the heterogeneity arises from the different dairy cow breeds. Splitting the data by breed and estimating separate models is appealing for reducing modeling complexity; however, challenges arise if not all covariates are interacted with breed in a multi-breed model, since statistical significance is contingent upon all covariates (Le, 2020). With biological reasoning, it is justifiable to consider separate intercepts as well as interaction terms involving \( f(\text{THI}) \), \( f(\text{DIM}) \), parity, and year. Whether the random effects for seasonality and location should also be interacted with breed requires further investigation; the central question is whether we should assume a common response across breeds with respect to location and farm. Given that each cow is associated with a single breed, it should not matter whether the random intercept for cow \( \alpha_i \) is interacted with breed or not. An example multi-breed model following these arguments interacts the fixed-effect components and smooth terms with breed, where \( \mathcal{B} \) is the set of available breeds and \( B_i \) is the breed indicator for cow \( i \).
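In the simplified notation, one way to write such a specification, interacting intercept, parity, year, and the smooth terms with breed while sharing the random effects across breeds, is the following sketch (the exact random-effect interaction structure is deliberately left open, as discussed above):

$$ y = \sum_{b \in \mathcal{B}} \mathbb{I}(B_i = b)\,\big[\mu_b + f_b(\text{THI}) + f_b(\text{DIM}) + \text{P}_b + \text{Y}_b\big] + \iota + \phi + \alpha + \epsilon. \tag{3.7} $$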
3.2.4 Model Estimation Strategy
Given the presence of repeated unbalanced measurements for cows and farms, as illustrated in Table 3.4 and Figure 3.1, the application of mixed models with maximum likelihood estimation is appropriate for our model estimation (Wood, 2017, pp. 61, 74). Various methodologies exist to model the non-linear components \( f \), such as binning, piecewise regression (broken stick), or splines. The contemporary approach for estimating non-linearities is through smooth terms, which can be implemented within a Generalized Additive Model framework (Wood, 2017, Chapters 5 and 6). Smooth terms, being splines with penalized coefficients, facilitate a more accurate representation of the data. Consequently, we characterize our non-linear functions \( f \) as smooth terms. The components of our models are a combination of smooth terms, fixed effects, and random effects, thereby constituting Generalized Additive Mixed Models (Wood, 2017, p. 288). To address our research question with an estimated model, we numerically derive the turning points of the smooth terms \( f \). We use mgcv (Wood, 2003; Wood, 2011; Wood et al., 2016; Wood, 2017; Wood & Scheipl, 2020) for estimation and gratia (Simpson, 2024) for the turning point search. This purely data-driven determination of turning points is a considerable improvement over previous works, which make preliminary assumptions about the non-linearities. Implementation details and library modifications required for our data are discussed in the following sections.
3.3 From Linear Mixed Models to Generalized Additive Mixed Models
To link our model definitions in Section 3.2 with the GAMM framework, we provide the basics around mixed models, GAMs and GAMMs and explain how they are linked. Understanding the relationship between these three frameworks is vital for a successful and computationally efficient estimation of Equation 3.2.1 and Equation 3.2.3.
3.3.1 Generalized Linear Mixed Models
We use the definitions from Citation: Wood, 2017 Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2). Chapman; Hall/CRC. https://doi.org/10.1201/9781315370279 ; Citation: Bates et al., 2015 Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. , 67. 1–48. https://doi.org/10.18637/jss.v067.i01 to introduce mixed models. Let \( \bm{y} \in \mathbb{R}^{n} \) be the response data of \( n \) samples from a random variable \( \mathcal{Y} \) whose distribution belongs to the exponential family. Let \( g(\mu_i)=\mathbf{X}_i \bm{\beta} \), where \( \mu_i \equiv \mathbb{E}(Y_i) \), \( g \) is the link function, \( \mathbf{X}_i \) is the \( i \)-th row of the model matrix \( \mathbf{X} \in \mathbb{R}^{n \times p} \), and \( \bm{\beta} \in \mathbb{R}^{p} \) is the vector of the \( p \) fixed effect parameters. The response variable samples \( y_i \) are assumed to be independent. Let \( \mathbf{Z}_i \) be the \( i \)-th row of the random effect model matrix \( \mathbf{Z} \in \mathbb{R}^{n \times q} \), and let \( \bm{b} \in \mathbb{R}^{q} \) be the \( q \)-dimensional random vector from a random variable \( \mathcal{B}\sim \mathcal N(\bm{0},\bm{\psi_{\theta}}) \) containing random effects with zero expected value and covariance matrix \( \bm{\psi_{\theta}} \in \mathbb{R}^{q \times q} \) with unknown parameters \( \bm{\theta} \). Let \( \bm{\Lambda_{\theta}} \in \mathbb{R}^{n \times n} \) be the residual covariance matrix, which can model residual autocorrelation. We assume that the residuals are independently and identically distributed; hence, \( \bm{\Lambda_{\theta}} = \mathbf{I}\sigma^2 \). This simplification comes with computational advantages, but also with modelling drawbacks regarding the spatial autocorrelation in our data.
For the Gaussian case, a Linear Mixed Model (LMM) is defined as
\[ \bm{y} = \mathbf{X}\bm{\beta} + \mathbf{Z}\bm{b} + \bm{\epsilon}, \qquad \bm{\epsilon} \sim \mathcal{N}(\bm{0}, \mathbf{I}\sigma^2), \tag{3.3.1} \]
so that marginally \( \bm{y}\sim \mathcal{N}(\mathbf{X}\bm{\beta}, \mathbf{Z}\bm{\psi_\theta}\mathbf{Z}^\top + \mathbf{I}\sigma^2) \) and conditionally \( (\mathcal{Y} \mid \mathcal{B} = \bm{b}) \sim \mathcal{N}(\mathbf{X}\bm{\beta} + \mathbf{Z}\bm{b}, \mathbf{I}\sigma^2) \).
The extension of an LMM to non-normal data is given as the Generalized Linear Mixed Model (GLMM) with
\[ g(\mu_i) = \mathbf{X}_i\bm{\beta} + \mathbf{Z}_i\bm{b}, \tag{3.3.2} \]
where \( (\mathcal{Y} \mid \mathcal{B} = \bm{b}) \sim \mathcal{D}(g^{-1}(\mathbf{X}\bm{\beta} + \mathbf{Z}\bm{b}), \phi) \) and \( \mathcal{B} \sim \mathcal{N}(\bm{0}, \bm{\psi_{\theta}}) \). \( \mathcal{D} \) is a distribution of the exponential family, and \( \phi \) is a common scale parameter. The error term \( \bm{\epsilon} \) in Equation 3.3.1 stems from the normality of the response variable \( \mathcal{Y} \). In the generalized case, the remaining variability is captured by the assumed distribution \( \mathcal{D} \); therefore, no explicit error term \( \bm{\epsilon} \) appears.
3.3.2 Generalized Additive Models
In contrast to the fixed effect model matrix \( \mathbf{X} \) in Section 3.3.1, we introduce a model matrix \( \mathbf{X^*} \) for the strictly parametric part, where \( \mathbf{X^*_i} \) is the \( i \)-th row of this design matrix. The corresponding parametric coefficient vector is defined as \( \bm{\beta^*} \). Let \( f_j \) be a smooth function of the covariates \( \bm{x}_j \), where a smooth function \( f_j \) may be a function of one or more covariates (Citation: Wood, 2017 Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2). Chapman; Hall/CRC. https://doi.org/10.1201/9781315370279 ; Citation: Wood, 2004 Wood, S. (2004). Stable and Efficient Multiple Smoothing Parameter Estimation for Generalized Additive Models. , 99(467). 673–686. https://doi.org/10.1198/016214504000000980 ). The matrix \( \mathbf{x_j} \in \mathbb{R}^{n \times |f_j|} \) comprises the samples of the covariates assigned to the smooth \( f_j \); the smooth components may therefore be uni- or multivariate. This leads to the definition of a GAM as a sum of smooth functions,
\[ g(\mu_i) = \mathbf{X^*_i}\bm{\beta^*} + \sum_{j} f_j(\bm{x}_{ji}), \tag{3.3.3} \]
where some or all covariates are represented with a smooth term. The smooth terms can be understood as basis expansions of the associated covariates, where the smoothness is controlled by penalty terms. A common way to fit GAMs and choose the appropriate smoothness is likelihood maximization, such as the Restricted Maximum Likelihood (REML) method (Citation: Corbeil & Searle, 1976 Corbeil, R. & Searle, S. (1976). Restricted Maximum Likelihood (REML) Estimation of Variance Components in the Mixed Model. , 18(1). 31–38. https://doi.org/10.2307/1267913 ).
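To make the penalized basis expansion idea concrete, the following sketch fits a univariate smooth by penalized least squares, \( \hat{\bm{\beta}} = (\mathbf{B}^\top\mathbf{B} + \lambda \mathbf{S})^{-1}\mathbf{B}^\top\bm{y} \). It is a minimal illustration with a hand-rolled truncated cubic basis and a ridge penalty on the wiggly coefficients only, not the thin-plate basis or REML smoothness selection that mgcv actually uses:

```python
import numpy as np

def truncated_cubic_basis(x, knots):
    # Columns: global cubic polynomial plus one truncated cubic
    # term per knot -- a simple regression spline basis.
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_penalized_spline(x, y, knots, lam):
    # Penalized least squares: min ||y - B beta||^2 + lam * beta' S beta,
    # where S penalizes only the truncated ("wiggly") coefficients.
    B = truncated_cubic_basis(x, knots)
    S = np.zeros((B.shape[1], B.shape[1]))
    S[4:, 4:] = np.eye(len(knots))   # cubic part = unpenalized null space
    beta = np.linalg.solve(B.T @ B + lam * S, B.T @ y)
    return B @ beta

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 400))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)
knots = np.linspace(0.1, 0.9, 9)
fit_flexible = fit_penalized_spline(x, y, knots, lam=1e-4)
fit_shrunk = fit_penalized_spline(x, y, knots, lam=1e6)
```

As \( \lambda \to \infty \), the fit shrinks towards the unpenalized null space (here, a cubic polynomial); this null space is exactly the part that the GAMM reformulation below treats as fixed effects.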
3.3.3 Generalized Additive Mixed Models
There is a mathematical equivalence between smooth terms and random effects (Citation: Wood, 2017, p. 239 Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2). Chapman; Hall/CRC. https://doi.org/10.1201/9781315370279 ). Simple random effect structures \( \mathbf{Z} \) with random intercepts and random slopes can be defined as smooth terms \( f_j(z_j) \) and are fitted with gam by default or with bam for large datasets (Citation: Wood et al., 2015 Wood, S., Goude, Y. & Shaw, S. (2015). Generalized Additive Models for Large Data Sets. , 64(1). 139–155. https://doi.org/10.1111/rssc.12068 ; Citation: Wood, 2023 Wood, S. (2023). Mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. Retrieved from https://cran.r-project.org/web/packages/mgcv/index.html ). Equation 3.3.4 illustrates this equivalence: a Gaussian random effect \( \mathbf{Z}\bm{b} \) with \( \bm{b} \sim \mathcal{N}(\bm{0}, \mathbf{I}\sigma_b^2) \) is equivalent to a smooth term whose coefficients carry an identity (ridge) penalty \( \sigma_b^{-2}\,\bm{b}^\top\bm{b} \). Hence, a GAM with random effect structures, a generalized additive mixed model (GAMM), can be represented and fitted in the form of a GAM defined in Equation 3.3.3. Further implementation details for random effects as smooth terms are available in (Citation: Wood, 2024 Wood, S. (2024). Simple Random Effects in GAMs. https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.construct.re.smooth.spec.html. ). These are not discussed further in the scope of this work.
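This equivalence can be verified numerically: for a Gaussian random intercept, the BLUP obtained from the mixed model formula coincides with a ridge (penalized) estimate using \( \lambda = \sigma^2/\sigma_b^2 \). A small numpy sketch with synthetic data (the group sizes and variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, q = 200, 12                       # 200 samples, 12 group levels
groups = rng.integers(0, q, n)
Z = np.eye(q)[groups]                # indicator (random intercept) matrix
sigma2, sigma2_b = 0.5, 2.0          # residual and random effect variances
b = rng.normal(0, np.sqrt(sigma2_b), q)
y = Z @ b + rng.normal(0, np.sqrt(sigma2), n)

# BLUP from the mixed model: E[b | y] = psi Z' V^{-1} y,
# with V = Z psi Z' + I sigma^2 and psi = I sigma_b^2.
V = sigma2_b * Z @ Z.T + sigma2 * np.eye(n)
blup = sigma2_b * Z.T @ np.linalg.solve(V, y)

# Ridge/penalized estimate with lambda = sigma^2 / sigma_b^2:
lam = sigma2 / sigma2_b
ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(q), Z.T @ y)
```

The two estimates agree to machine precision, which is the practical content of the smooth/random-effect duality.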
Random effect structures often come with sparsity. The aforementioned default implementations do not take advantage of this sparse structure. For a modest number of random effect levels, these methods are still computationally appealing. However, mgcv is not optimized for a growing number of group factor levels in random effect structures. Citation: Brooks et al., 2017 Brooks, M., Kristensen, K., Benthem, K., Magnusson, A., Berg, C., Nielsen, A., Skaug, H., Mächler, M. & Bolker, B. (2017). glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. , 9. 378–400. https://doi.org/10.32614/RJ-2017-066 highlight the exponentially growing estimation time with a growing number of random effect levels when using the default routines of this library. For a large number of random effect factor levels, the duality principle between GAMs and GLMMs permits the reformulation of a GAMM as a GLMM. Let \( \bm{f}_j \) be the vector of evaluations of the smooth \( f_j \), with elements \( f_{ij} = f_j(x_{ij}) \) for samples \( i = 1, \dots, n \). Each smooth term has one or more basis functions, and the collection of basis function evaluations within a smooth can thus be interpreted as a vector.
In the mixed model form \( \bm{f}_j = \mathbf{X'}_j\bm{\beta'}_j + \mathbf{Z'}_j\bm{b'}_j \), where \( \bm{b'}_j \sim \mathcal{N}(\bm{0}, \bm{I}\sigma^2_{b_{j}}) \). The columns of \( \mathbf{X'}_j \) form a basis of the null space of the smoothing penalty; equivalently, the columns of \( \mathbf{Z'}_j \) form a basis of its range space. More specifically, each smooth term corresponds to a fixed effect part \( \mathbf{X'}_j \) and a random effect part \( \mathbf{Z'}_{j} \). Let \( \mathbf{X'_\Sigma} \) be the combined fixed effect matrix concatenated from the individual smooth fixed effect parts \( \mathbf{X'}_j \) and \( \mathbf{Z'_\Sigma} \) the combined smooth term random effect matrix concatenated from the individual \( \mathbf{Z'}_{j} \). Hence, generalizing to an arbitrary number of smooths, a GAMM expressed as a GLMM is defined as
\[ g(\mu_i) = \mathbf{X^*_i}\bm{\beta^*} + \mathbf{X'_{i \Sigma}}\bm{\beta'_{\Sigma}} + \mathbf{Z'_{i \Sigma}}\bm{b'_{\Sigma}} + \mathbf{Z_i}\bm{b}, \tag{3.3.5} \]
where \( \mathbf{Z}_i\bm{b} \) collects the remaining, non-smooth random effects.
The individual fixed effect terms \( \mathbf{X_i^*}\bm{\beta^*} + \mathbf{X'_{i \Sigma}}\bm{\beta'_{\Sigma}} \) and the individual random effect terms \( \mathbf{Z'_{i \Sigma}}\bm{b'_{\Sigma}} + \mathbf{Z_i}\bm{b} \) in Equation 3.3.5 can be concatenated into one fixed effect and one random effect structure as in Equation 3.3.2. This concludes our definition of GAMMs as mixed models. In R, gamm (Citation: Wood, 2023 Wood, S. (2023). Mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. Retrieved from https://cran.r-project.org/web/packages/mgcv/index.html ) and gamm4 (Citation: Wood & Scheipl, 2020 Wood, S. & Scheipl, F. (2020). gamm4: Generalized Additive Mixed Models using ’mgcv’ and ’lme4’. Retrieved from https://cran.r-project.org/web/packages/gamm4/index.html ) are the available routines for estimating GAMs with mixed model engines. gamm uses nlme (Citation: Pinheiro et al., 2017 Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., Heisterkamp, S., Van Willigen, B. & Maintainer, R. (2017). Package “nlme”. , 3(1). 274. ) and gamm4 uses lme4 (Citation: Bates et al., 2015 Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. , 67. 1–48. https://doi.org/10.18637/jss.v067.i01 ) as the underlying engine.
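The null space/range space split used above can be illustrated with the standard eigen-decomposition of the penalty matrix \( \mathbf{S} = \mathbf{U}\mathbf{D}\mathbf{U}^\top \): eigenvectors with zero eigenvalue span the unpenalized (fixed) part \( \mathbf{X}' \), while the rest are rescaled into a random effect part \( \mathbf{Z}' \) with an identity penalty. A numpy sketch with a second-difference penalty (the basis, sizes, and penalty are illustrative, not our model):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 8
# A rank-deficient penalty matrix S: a second-difference penalty
# with a two-dimensional (linear) null space, rank p - 2.
D2 = np.diff(np.eye(p), n=2, axis=0)
S = D2.T @ D2

eigval, U = np.linalg.eigh(S)
null = eigval < 1e-10
U0, Up, dp = U[:, null], U[:, ~null], eigval[~null]

B = rng.normal(size=(50, p))          # some smooth basis design matrix
beta = rng.normal(size=p)             # coefficients in the original basis

# Reparametrize: unpenalized (fixed) part X' and penalized (random) part Z'
Xf = B @ U0                            # null-space columns -> fixed effects
Zr = B @ Up / np.sqrt(dp)              # range-space columns -> random effects
beta_f = U0.T @ beta
b_r = np.sqrt(dp) * (Up.T @ beta)
```

The reparametrization leaves the fitted values unchanged and turns the penalty into a plain sum of squares \( \bm{b}'^\top\bm{b}' \), which is why \( \bm{b}' \sim \mathcal{N}(\bm{0}, \mathbf{I}\sigma^2_{b}) \) in the mixed model form.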
3.4 Estimating Generalized Additive Mixed Models
In Section 3.1.1 we discuss the sparsity of our data. Estimating our model with cows and farms as random intercepts, for representative subsamples covering a broad range of weather data points, results in tens of thousands of random effect factor levels. Therefore, the default methods gam and bam do not allow estimations with reasonable model fitting times and memory consumption. This is because the sparse structure of random effects is not leveraged when using the random-effect-as-smooth approach given in Equation 3.3.4. In the following Section 3.4.1, we briefly discuss some subsampling strategies with which we experiment to potentially circumvent these limitations, before discussing other off-the-shelf approaches. Since these are not successful, we discuss the development of an appropriate alternative method in detail in the last subsection. All experiments in this study are performed on a 32-core machine with 512 GB of RAM.
3.4.1 gam and bam
We explore various variants of the multi-breed model as delineated in Equation 3.4.1 by employing different subsampling methodologies.
Model 3.4.1 represents a simplification of the multi-breed model proposed in Section 3.2. The term “different variants” refers to experiments involving diverse breed interaction terms, varying smooth terms, and the interaction of farm intercepts with ZIP codes \( \phi \). We refrain from providing an exhaustive list of options due to the recurrent challenges we encounter: Firstly, with a limited number of samples, the smooth terms, upon visual inspection, appear biologically nonsensical. Secondly, increasing the sample size leads to a substantial rise in model estimation time and in the computational load of the summary function. Thirdly, memory constraints become apparent when the sample size is excessively large. Fourthly, while simplifying the model by removing random effect terms such as the animal intercepts permits the estimation of models with larger sample sizes, the resulting smooth terms remain biologically unintelligible. Fifthly, we implement several subsampling techniques to curtail the number of farms and animals considered, while retaining geospatial diversity and ensuring adequate breed representation, yet the challenges persist.
In all scenarios, the models either fail to converge, encounter memory issues, or yield meaningless outcomes. A sampling strategy suggested by the ETH Statistical Consulting Service involves selecting a single sample per farm and omitting the modeling of animal random intercepts. Theoretically, given the extensive number of farms in our dataset (Table ), the quantity of data points should suffice for modeling the smooth terms. Additionally, this method could potentially alleviate violations of the i.i.d. error term assumption and spatial autocorrelation challenges. Nevertheless, even with this approach, obtaining biologically meaningful smooths remains elusive. While adhering to the research objective of estimating smooth terms to ascertain the non-linear impact of THI, we resolve to investigate alternative methodologies such as gamm4.
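The one-sample-per-farm strategy can be sketched as follows; the record fields (`farm`, `cow`, `milk_kg`) are hypothetical stand-ins for our test-day data, not the actual pipeline:

```python
import random
from collections import defaultdict

def one_sample_per_farm(records, seed=42):
    # records: iterable of dicts with at least a 'farm' key.
    # Returns one randomly chosen test-day record per farm, which
    # removes repeated measurements per farm (and hence the need
    # for farm random intercepts) at the cost of sample size.
    rng = random.Random(seed)
    by_farm = defaultdict(list)
    for rec in records:
        by_farm[rec["farm"]].append(rec)
    return [rng.choice(group) for group in by_farm.values()]

# Synthetic example: 5 farms with 3 cows each -> 5 retained records.
records = [{"farm": f, "cow": c, "milk_kg": 20 + c}
           for f in range(5) for c in range(3)]
subsample = one_sample_per_farm(records)
```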
3.4.2 gamm4
gamm4 estimates GAMs as mixed models with lme4 as the underlying engine, which has sparse matrix support and is known to be efficient. The gamm4 routine performs three steps: First, in the pre-processing phase, the GAMM is converted into a GLMM. Second, in the fitting phase, the parameter estimation is executed with the lme4 engine. Third, the estimated parameters are transformed back into a GAM specification, and the confidence intervals for the smooth terms are estimated. This last step is crucial for practical purposes and allows the fitted model to be treated as a regular GAM for downstream model analysis, diagnostics, and inference.
We run several experiments to assess the performance of gamm4 for a single breed model with the Jersey data:
Table 3.5 illustrates that the estimation times rise as the number of random effect factor levels increases. Unlike the subsampling experiments and modeling trade-offs using gam and bam outlined earlier, gamm4 enables us to achieve biologically meaningful outcomes, evidenced by the visual credibility of the smooth terms for THI, especially for the last experiment listed. This observation provides some indication of the number of samples required to build accurate multi-breed models.
The model used for this experiment does not correspond to a mature single-breed model and is designed primarily for performance assessment. For example, modelling the DIM as a fixed effect instead of as a smooth term, or interacting farms with the month of the year, are limitations. The main goal of this comparison is to emphasize the performance implications of varying factor levels. This analysis reveals the computational difficulties encountered as the number of samples and factor levels grows, particularly with larger and more representative data subsets or multi-breed models. Simple experiments with multiple breeds and over a million samples explode in execution time or fail with memory errors during a Cholesky decomposition step.
# Samples | # Cows \( \alpha \) | # Farm \(\times\) Month \( \phi \) | Fitting Time \( [s] \) |
---|---|---|---|
175'346 | 24'170 | 10'000 | 342 |
259'088 | 25'433 | 15'000 | 822 |
426'397 | 26'201 | 25'000 | 2'332 |
660'110 | 26'641 | 39'223 | 5'447 |
3.4.3 MixedModels.jl
An alternative mixed model library to lme4 is MixedModels.jl, which is implemented in Julia, in contrast to the previously discussed libraries, which are implemented in R. MixedModels.jl is notably faster than lme4 (Citation: Markwick, 2022 Markwick, D. (2022). Mixed Models Benchmarking. https://dm13450.github.io/2022/01/06/Mixed-Models-Benchmarking.html. ). GAMs cannot be estimated with the default MixedModels.jl interface. However, the mechanisms implemented in gamm4 can be modified to execute the model fitting step with this high-performance and memory-efficient Julia library. To assess whether MixedModels.jl is suitable and scales with our data, we perform an experiment with the following model, inspired by (Citation: Bryant et al., 2007 Bryant, J., López‐Villalobos, N., Pryce, J., Holmes, C. & Johnson, D. (2007). Quantifying the effect of thermal environment on production traits in three breeds of dairy cattle in New Zealand. , 50(3). 327–338. https://doi.org/10.1080/00288230709510301 ):
where \( n_B \) is the number of breeds and \( n_Y \) the number of years.
# Samples | Breeds | # Cows | # Farm \( \times \) Month | Fitting Time \( [s] \) |
---|---|---|---|---|
660'110 | JE | 26'641 | 39'223 | 580 |
7'381'552 | HO, JE, SF, SI, BS, OB | 1'833'470 | 106'642 | 15'231 |
Table 3.6 shows the model estimation times for two data subsets. These results confirm the scalability assumption for MixedModels.jl: millions of factor levels are handled with acceptable execution times and low memory consumption. With the goal of fitting a multi-breed GAM with millions of samples, we decide to modify gamm4 and replace the lme4 estimation engine with MixedModels.jl.
3.4.4 Putting it all together: gamm4 Modifications
In order to overcome challenges described in previous subsections, we study the libraries and identify opportunities for their enhancement and combination to achieve our goals.
gamm4 implements the GAMM mixed model equivalence described in Equation 3.3.5. During our source code analysis, undertaken to understand how to translate a mixed model prepared for lme4 into a MixedModels.jl format, we discover bottlenecks in the pre- and post-processing phases of gamm4. Therefore, we develop two distinct modified library versions: a first version, gamm4b, which fixes the bottlenecks in the pre- and post-processing phases and still estimates the model parameters with lme4; and a second version, gammJ, which modifies the pre-processing to estimate the parameters with a modified version of MixedModels.jl and further optimizes the post-processing steps. A modified MixedModels.jl is required to support the mixed model structures generated by Equation 3.3.5. To summarize, the artifacts of our effort to accommodate a larger number of factor levels in GAMMs are threefold: First, the bug-fixed and accelerated gamm4 library, which we refer to as gamm4b. Second, the gammJ prototype, which executes the model fitting in a modified MixedModels.jl. Third, the modified MixedModels.jl, which supports a broader set of mixed models than the default version. These three contributions must be considered prototypes.
3.4.4.1 gamm4b
In this section, we summarize the three major modifications to gamm4 while still using lme4 as the underlying mixed model parameter estimation engine. We plan to share these modifications with the authors of (Citation: Wood & Scheipl, 2020 Wood, S. & Scheipl, F. (2020). gamm4: Generalized Additive Mixed Models using ’mgcv’ and ’lme4’. Retrieved from https://cran.r-project.org/web/packages/gamm4/index.html ), since these are relatively simple but powerful changes to the existing gamm4 source code, assuming a setup with R and Python is available.
Random Effect Matrix Substitution Process When a smooth term \( f_j \) is converted to its fixed effect part \( \mathbf{X'_j} \) and its random effect part \( \mathbf{Z'_j} \), the number of columns in the random effect matrix depends on the number of basis functions of the smooth \( j \). lme4’s regular user interface does not support arbitrary random effect matrices; they are generated by R’s formula interface for particular random intercept and random slope structures. Hence, the random effect matrices \( \mathbf{Z} \) generated with this interface only cover a subset of the models that could be fitted with the underlying engine. The latter supports a broader set of random effect matrices, including the \( \mathbf{Z'_j} \) generated by the GAMM to GLMM model conversion. In gamm4 the following trick is applied: Consider each smooth term as a random intercept with \( k \) factor levels, where \( k \) is the number of basis functions of the given smooth. The lme4 model building process then allocates the corresponding columns in the \( \mathbf{Z} \) matrix with a regular binary random intercept structure. Then the corresponding columns in the generated \( \mathbf{Z} \) are replaced with the \( \mathbf{Z'_j} \) from the smooth-to-random-effect conversion. This replacement process, a sparse matrix substitution, is implemented very inefficiently, a limitation of the Matrix library (Citation: Bates et al., 2024 Bates, D., Mächler, M. & Jagan, M. (2024). Matrix: Sparse and Dense Matrix Classes and Methods. Retrieved from https://cran.r-project.org/web/packages/Matrix/index.html ). The performance impact becomes noticeable when fitting models with hundreds of thousands of samples and tens of thousands of random factor levels, where this step takes up to thousands of seconds in our debugging analysis. Our first contribution to gamm4b is a more efficient implementation of this sparse matrix substitution process. More specifically, lines 225–235 of the gamm4 source code are replaced.
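The idea behind a faster substitution, rebuilding the sparse matrix by column concatenation in CSC form rather than item-wise sparse assignment, can be sketched in Python with SciPy (our actual fix is in R against the Matrix library; this is an illustration of the technique, not the gamm4b code):

```python
import numpy as np
from scipy import sparse

def replace_columns(Z, cols, Znew):
    # Replace the columns `cols` of the sparse matrix Z with Znew by
    # reassembling in CSC form, instead of item-wise sparse indexing
    # (Z[:, cols] = Znew), which is very slow for large matrices.
    Z = Z.tocsc()
    keep = np.setdiff1d(np.arange(Z.shape[1]), cols)
    parts = sparse.hstack([Z[:, keep], sparse.csc_matrix(Znew)], format="csc")
    # Restore the original column order after concatenation.
    order = np.argsort(np.concatenate([keep, cols]))
    return parts[:, order]

Z = sparse.random(100, 20, density=0.05, format="csc", random_state=0)
Znew = np.ones((100, 3))          # stand-in for a smooth's Z'_j block
out = replace_columns(Z, np.array([2, 7, 11]), Znew)
```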
Default Optimizer Parameter Bug Fix During our gamm4 dissection process, we discover a parameter error on line 237 of the gamm4 source code: a custom optimizer parameter passed to the gamm4 routine is not properly forwarded to the underlying fitting engine. We discover this because, during the development of gammJ (Section 3.4.4.2), we use gamm4 as a reference implementation to compare its parameter estimates with those of MixedModels.jl. Generally, we use the BOBYQA optimizer (Citation: Powell, 2009 Powell, M. (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Retrieved from https://www.semanticscholar.org/paper/The-BOBYQA-algorithm-for-bound-constrained-without-Powell/0d2edc46f81f9a0b0b62937507ad977b46729f64 ) in all our experiments. Presumably, this is also the default optimizer in lme4. However, we observe differences in the estimated parameters and debugging outputs when we do not explicitly pass “bobyqa” as the optimizer argument in lmerControl.
Post-Processing Cholesky Decomposition Using our GAMM definition from Equation 3.3.5, let \( \mathbf{Z} \) be the random effects model matrix of the non-smooth terms and let \( \bm{\psi_\theta} \) be the corresponding covariance matrix associated with the estimated random effect parameters. The post-processing step to fully represent the fitted GLMM as a GAM includes the computation of the data covariance matrix
\[ \mathbf{V} = \mathbf{Z}\bm{\psi_{\theta}}\mathbf{Z}^\top + \mathbf{I}\sigma^2. \]
The inverse \( \mathbf{V}^{-1} \) is used for further downstream computations. \( \mathbf{V} \in \mathbb{R}^{n \times n} \) grows quadratically with the number of samples. (Citation: Wood, 2017, p. 289 Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2). Chapman; Hall/CRC. https://doi.org/10.1201/9781315370279 ) warns about the computational complexity and advises leveraging the special structure of \( \mathbf{V} \). Since \( \mathbf{V} \) is symmetric and positive definite, a Cholesky decomposition can be applied. gamm4 implements a Cholesky decomposition on line 374 with functionality from the Matrix library (Citation: Bates et al., 2024 Bates, D., Mächler, M. & Jagan, M. (2024). Matrix: Sparse and Dense Matrix Classes and Methods. Retrieved from https://cran.r-project.org/web/packages/Matrix/index.html ). The construction of \( \mathbf{V} \) and the computation of its Cholesky decomposition do not scale well with our data. We replace the Cholesky decomposition with a Python call (Citation: Ushey et al., 2024 Ushey, K., Allaire, J. & Tang, Y. (2024). reticulate: Interface to ’Python’. https://rstudio.github.io/reticulate/. ) to the CHOLMOD (Citation: Chen et al., 2008 Chen, Y., Davis, T., Hager, W. & Rajamanickam, S. (2008). Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate. , 35(3). 22:1–22:14. https://doi.org/10.1145/1391989.1391995 ) routines of scikit-sparse to overcome this limitation. In our experiments, this modification allows us to estimate GAMMs up to a sample size of approximately 500'000 to 750'000, depending on the exact random effect structure. Larger sizes lead to memory allocation problems in R because the sparse matrix indexing in Matrix (Citation: Bates et al., 2024 Bates, D., Mächler, M. & Jagan, M. (2024). Matrix: Sparse and Dense Matrix Classes and Methods. Retrieved from https://cran.r-project.org/web/packages/Matrix/index.html ) only supports 32-bit indexes, which is not sufficient for our size of \( \mathbf{V} \). This further motivates the exploration of Julia software packages to avoid the limitations of R libraries.
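The factorize-once-then-reuse pattern behind this modification can be sketched as follows. SciPy's sparse LU serves here as a self-contained stand-in for the CHOLMOD Cholesky factorization used in our pipeline, and all matrices are synthetic:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

rng = np.random.default_rng(0)
n, q = 500, 40
groups = rng.integers(0, q, n)
# Sparse indicator matrix for a single random intercept term.
Z = sparse.csc_matrix((np.ones(n), (np.arange(n), groups)), shape=(n, q))
sigma2, sigma2_b = 1.0, 3.0

# V = Z psi Z' + I sigma^2 is sparse, symmetric, positive definite.
V = (sigma2_b * (Z @ Z.T) + sigma2 * sparse.identity(n)).tocsc()

# Factorize once, then reuse the factor for every solve against V,
# as needed repeatedly in the GAM post-processing.
lu = splu(V)
y = rng.normal(size=n)
x = lu.solve(y)
```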
3.4.4.2 gammJ
With gammJ we fit GAMMs as mixed models in a modified MixedModels.jl and apply further optimizations to the operations related to the data covariance matrix \( \mathbf{V} \) introduced in Equation 3.4.1. These modifications further accelerate the estimation process and remove the 32-bit matrix indexing issues described in Section 3.4.4.1. In Section 3.4.4.3 we describe our modifications to MixedModels.jl to accommodate GAMMs as GLMMs. In Section 3.4.4.4 we specify implementation details for the operations related to \( \mathbf{V} \).
3.4.4.3 Extended MixedModels.jl to support GAMMs as GLMMs
MixedModels.jl is very similar to lme4 and is partially developed by the same authors (Citation: Bates et al., 2015 Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. , 67. 1–48. https://doi.org/10.18637/jss.v067.i01 ; Citation: JuliaStats, 2024 JuliaStats (2024). MixedModels.jl. https://github.com/JuliaStats/MixedModels.jl. ). However, the library is faster in many scenarios because the implementation leverages particular matrix structures and optimizes the corresponding operations. A key equation for estimating the mixed model problem given in Equation 3.3.1 and Equation 3.3.2 is the Cholesky decomposition given in (Citation: Bates et al., 2015 Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. , 67. 1–48. https://doi.org/10.18637/jss.v067.i01 ) [page 14, equation 18]:
\[ \begin{bmatrix} \Lambda_{\theta}^\top \mathbf{Z}^\top \mathbf{W} \mathbf{Z} \Lambda_{\theta} + \mathbf{I} & \Lambda_{\theta}^\top \mathbf{Z}^\top \mathbf{W} \mathbf{X} \\ \mathbf{X}^\top \mathbf{W} \mathbf{Z} \Lambda_{\theta} & \mathbf{X}^\top \mathbf{W} \mathbf{X} \end{bmatrix} = \mathbf{L}\mathbf{L}^\top. \tag{3.4.2} \]
As previously defined in Equation 3.3.1 and Equation 3.3.2, the matrix \( \mathbf{Z} \) is the random effects design matrix, while \( \mathbf{X} \) is the design matrix for the fixed effects. The relative covariance factor \( \Lambda_{\theta} \) is a block-diagonal matrix whose blocks are associated with the variance components of the random effects. Each block \( \Lambda_i \) corresponds to a specific random effect term and is a function of the parameter vector \( \bm{\theta} \), which defines the random effects’ variance structure. \( \mathbf{W} \) is a diagonal weight matrix, typically \( \mathbf{I} \). \( \mathbf{L}_{\theta} \) is the lower triangular Cholesky factor of \( \Lambda_{\theta}^\top \mathbf{Z}^\top \mathbf{W} \mathbf{Z} \Lambda_{\theta} + \mathbf{I} \) and plays a crucial role in simplifying the computation of the variance components during the estimation process. The matrix \( \mathbf{R}_{ZX} \) describes the interaction between the fixed and random effects and is obtained from the cross-product \( \mathbf{X}^\top \mathbf{W} \mathbf{Z} \Lambda_{\theta} \). Finally, \( \mathbf{R}_{X} \) is the Cholesky factor associated with the fixed-effects cross-product matrix \( \mathbf{X}^\top \mathbf{W} \mathbf{X} \). Together, these matrices enable the estimation of both the fixed and random effects of the model. The curious reader is invited to consult Citation: Bates et al., 2015 Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. , 67. 1–48. https://doi.org/10.18637/jss.v067.i01 for more details on the meaning of the individual matrices in Equation 3.4.2.
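The block structure of the decomposition in Equation 3.4.2 can be checked numerically on a small dense example (synthetic matrices with \( \mathbf{W} = \mathbf{I} \); a verification sketch, not the MixedModels.jl implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, q, p = 60, 8, 3
Z = rng.normal(size=(n, q))                              # random effects design
X = rng.normal(size=(n, p))                              # fixed effects design
Lam = np.tril(rng.normal(size=(q, q))) + 4 * np.eye(q)   # relative cov. factor
W = np.eye(n)                                            # unit weights

# Blocks of the cross-product system matrix from Equation 3.4.2:
A = Lam.T @ Z.T @ W @ Z @ Lam + np.eye(q)
B = Lam.T @ Z.T @ W @ X
C = X.T @ W @ X
M = np.block([[A, B], [B.T, C]])

# Cholesky of the full system; its blocks reproduce the stated factors.
L = np.linalg.cholesky(M)
L_theta = L[:q, :q]          # lower Cholesky factor of the RE block
R_ZX_t = L[q:, :q]           # fixed/random interaction block
R_X = L[q:, q:]              # fixed effect block factor
```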
The random effects matrix \( \mathbf{Z} \) has a blocked structure if there are multiple random effects, such as cows and farms in our case. Moreover, depending on the nature of the individual random intercepts or slopes, the entries of the individual matrix blocks can be as simple as binary indicators for random intercepts or real numbers for slopes. In many cases, the blocks are sparse. MixedModels.jl optimizes for many of these aspects. Let us consider a model with \( q \) random effects, resulting in a random effects matrix \( \mathbf{Z}=\begin{bmatrix} \mathbf{Z}_1 & \mathbf{Z}_2 & \dots & \mathbf{Z}_q \end{bmatrix} \) and the corresponding relative covariance factor \( \mathbf{\Lambda}_{\theta} = \text{diag}\left( \Lambda_{\theta_1}, \Lambda_{\theta_2}, \dots, \Lambda_{\theta_q} \right) \), both with \( q \) blocks. Expanding the left-hand side of Equation 3.4.2 leads to
\[ \left[ \begin{array}{cc} \begin{bmatrix} \Lambda_{\theta_1}^\top \mathbf{Z}_1^\top \mathbf{W} \mathbf{Z}_1 \Lambda_{\theta_1} + \mathbf{I}_1 & \dots & \Lambda_{\theta_1}^\top \mathbf{Z}_1^\top \mathbf{W} \mathbf{Z}_q \Lambda_{\theta_q} \\ \vdots & \ddots & \vdots \\ \Lambda_{\theta_q}^\top \mathbf{Z}_q^\top \mathbf{W} \mathbf{Z}_1 \Lambda_{\theta_1} & \dots & \Lambda_{\theta_q}^\top \mathbf{Z}_q^\top \mathbf{W} \mathbf{Z}_q \Lambda_{\theta_q} + \mathbf{I}_q \end{bmatrix} & \begin{bmatrix} \Lambda_{\theta_1}^\top \mathbf{Z}_1^\top \mathbf{W} \mathbf{X} \\ \vdots \\ \Lambda_{\theta_q}^\top \mathbf{Z}_q^\top \mathbf{W} \mathbf{X} \end{bmatrix} \\ \begin{bmatrix} \mathbf{X}^\top \mathbf{W} \mathbf{Z}_1 \Lambda_{\theta_1} & \dots & \mathbf{X}^\top \mathbf{W} \mathbf{Z}_q \Lambda_{\theta_q} \end{bmatrix} & \mathbf{X}^\top \mathbf{W} \mathbf{X} \end{array} \right]. \tag{3.4.3} \]
\( \mathbf{L} \) of the right-hand side of Equation 3.4.2 expands to
\[ \mathbf{L} = \left[ \begin{array}{cc} \begin{bmatrix} \mathbf{L}_{\theta_1} & 0 & \dots & 0 \\ \mathbf{L}_{21} & \mathbf{L}_{\theta_2} & \dots & 0 \\ \vdots & \vdots & \ddots & 0 \\ \mathbf{L}_{q1} & \mathbf{L}_{q2} & \dots & \mathbf{L}_{\theta_q} \end{bmatrix} & 0 \\ \begin{bmatrix} \mathbf{X}^\top \mathbf{W} \mathbf{Z}_1 \Lambda_{\theta_1} & \dots & \mathbf{X}^\top \mathbf{W} \mathbf{Z}_q \Lambda_{\theta_q} \end{bmatrix} & \mathbf{R}_{X} \end{array} \right], \]where the top-left block is the lower triangular Cholesky factor \( \mathbf{L}_{\theta} \) of the blocked random effects matrix. Its diagonal blocks \( \mathbf{L}_{\theta_i} \) are associated with the blocks \( \Lambda_{\theta_i}^\top \mathbf{Z}_i^\top \mathbf{W} \mathbf{Z}_i \Lambda_{\theta_i} + \mathbf{I}_i \), and its off-diagonal blocks \( \mathbf{L}_{ij} \) arise from the cross terms \( \Lambda_{\theta_i}^\top \mathbf{Z}_i^\top \mathbf{W} \mathbf{Z}_j \Lambda_{\theta_j} \) between different random effect blocks. The top-right block is \( 0 \) because a lower triangular matrix has zeros above the diagonal. The bottom-left block contains the interactions between the fixed-effects design matrix and the random effects blocks, i.e., \( \mathbf{R}_{ZX}^\top = \begin{bmatrix} \mathbf{X}^\top \mathbf{W} \mathbf{Z}_1 \Lambda_{\theta_1} & \dots & \mathbf{X}^\top \mathbf{W} \mathbf{Z}_q \Lambda_{\theta_q} \end{bmatrix} \). The bottom-right block \( \mathbf{R}_{X} \) is the Cholesky factor associated with the fixed-effects cross-product matrix \( \mathbf{X}^\top \mathbf{W} \mathbf{X} \).
The different \( \mathbf{Z}_q \) have different structures depending on the model definition. For example, if a block \( \mathbf{Z}_q \) corresponds to a random intercept, the columns of the matrix are indicator columns and the operation \( \mathbf{Z}_q^{\top}\mathbf{Z}_q \) yields a diagonal matrix. MixedModels.jl supports different \( \mathbf{Z}_q \) block structures and implements optimized matrix operations between them, accounting for all possible term interactions resulting from Equation 3.4.3. However, the block structure \( \mathbf{Z}'_j \) resulting from a smooth term after the GAMM to GLMM conversion, and the corresponding matrix multiplications, are not supported in MixedModels.jl. Our extension adds this block type, referred to as the FlexReMat type in our source code, and the associated matrix operations. Some sparse matrix operations are not fully optimized in our implementation, leaving opportunities for additional improvements. Moreover, at the current stage, tests have only been executed for REML and the Gaussian family with the identity link function.
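The diagonal reduction for a random intercept block can be verified directly; a numpy sketch with a synthetic grouping factor:

```python
import numpy as np

rng = np.random.default_rng(11)
n, q = 1000, 50
groups = rng.integers(0, q, n)
Z = np.eye(q)[groups]               # indicator columns: exactly one 1 per row

# Z'Z collapses to a diagonal matrix whose entries are the
# per-level sample counts -- the structure MixedModels.jl exploits.
ZtZ = Z.T @ Z
counts = np.bincount(groups, minlength=q)
```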
3.4.4.4 Scalable Cholesky Decomposition of Data Covariance Matrix
To overcome the computational limitations of the data covariance matrix \( \mathbf{V} \) operations presented in Section 3.4.4.1, we implement a scalable Julia version with the high-performance sparse matrix library SuiteSparseGraphBLAS.jl (Citation: Davis, 2022 Davis, T. (2022). Algorithm 10xx: SuiteSparse:GraphBLAS: Graph algorithms in the language of sparse linear algebra. . ; Citation: Davis, 2019 Davis, T. (2019). Algorithm 1000: SuiteSparse:GraphBLAS: Graph algorithms in the language of sparse linear algebra. , 45(4). https://doi.org/10.1145/3322125 ; Citation: Pelletier et al., 2021 Pelletier, M., Kimmerer, W., Davis, T. & Mattson, T. (2021). The GraphBLAS in julia and python: The PageRank and triangle centralities. https://doi.org/10.1109/HPEC49654.2021.9622789 ) and CHOLMOD operations (Citation: Davis & Hager, 2005 Davis, T. & Hager, W. (2005). Row Modifications of a Sparse Cholesky Factorization. , 26(3). 621–639. https://doi.org/10.1137/S089547980343641X ; Citation: Davis & Hager, 1999 Davis, T. & Hager, W. (1999). Modifying a Sparse Cholesky Factorization. , 20(3). 606–627. https://doi.org/10.1137/S0895479897321076 ; Citation: Chen et al., 2008 Chen, Y., Davis, T., Hager, W. & Rajamanickam, S. (2008). Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate. , 35(3). 22:1–22:14. https://doi.org/10.1145/1391989.1391995 ; Citation: Davis & Hager, 2009 Davis, T. & Hager, W. (2009). Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves. , 35(4). 27:1–27:23. https://doi.org/10.1145/1462173.1462176 ). The operations are fast, but memory intensive.
With sample sizes above 1'100'000–1'500'000, depending on the complexity of the random effect structures, we reach the memory limits of our computational resources during the Cholesky decomposition: the sparse matrix \( \mathbf{V} \) has billions of non-zero entries. This prevents us from estimating multi-breed models with meaningful sample sizes and remains an interesting area for future work.
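Why \( \mathbf{V} \) densifies so quickly can be illustrated with a small sketch (Python/SciPy for illustration; the thesis implementation relies on CHOLMOD and GraphBLAS in Julia, and the simplified \( \mathbf{V} = \mathbf{Z}\mathbf{Z}^{\top} + \mathbf{I} \) used here is an assumption standing in for the full covariance structure): under a single random intercept, every group of \( m \) observations contributes a dense \( m \times m \) block, so the non-zero count grows quadratically in the group sizes.

```python
import numpy as np
from scipy import sparse

def v_nnz(group_sizes):
    """Non-zeros of V = Z Z' + I for a single random intercept,
    where Z is the indicator matrix of the grouping factor."""
    n = sum(group_sizes)
    groups = np.repeat(np.arange(len(group_sizes)), group_sizes)
    Z = sparse.csc_matrix(
        (np.ones(n), (np.arange(n), groups)),
        shape=(n, len(group_sizes)),
    )
    V = Z @ Z.T + sparse.identity(n, format="csc")
    return V.nnz

# Each group of size m contributes a dense m x m block to V:
print(v_nnz([5] * 4))   # 4 * 5^2  = 100 non-zeros
print(v_nnz([50] * 4))  # 4 * 50^2 = 10'000 non-zeros
```

Scaling the group sizes by 10 multiplies the non-zero count by 100, which is consistent with \( \mathbf{V} \) reaching billions of non-zeros at around a million samples.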
Table 3.5: Example datasets used for the performance comparison.

Dataset | Samples | Breed | # Cows | # ZIP \( \times \) month | # Farms | Total Factor Levels |
---|---|---|---|---|---|---|
1 | 500,027 | JE | 17,555 | 13,818 | 2,966 | 34,339 |
2 | 1,000,000 | HO | 50,111 | 7,666 | 1,064 | 58,841 |
3.4.4.5 Performance Comparison – gamm4, gamm4b, gammJ
Finally, to compare the performance of gamm4, gamm4b, and gammJ, we use the single-breed model given in Equation 3.2.2 with the two example datasets described in Table 3.5. The differences in the number of factor levels between the two datasets reflect structural differences between farms keeping the corresponding breeds. For example, farms with Holsteins have more cows and, therefore, more samples per farm than farms with Jerseys.
Table 3.6: Estimation times of the three implementations for the two datasets.

Dataset | gamm4 | gamm4b | gammJ |
---|---|---|---|
1 | 15'097 | 4'758 | 1'863 |
2 | Crash | Crash | 1'646 |
Table 3.6 presents the estimation times within our environment. On the first dataset, we observe an 8x speed-up when comparing gamm4 with gammJ; the performance gain of gamm4b is considerable as well. Furthermore, with the second example dataset, we are unable to retrieve a model with gamm4 and gamm4b. In both cases, the underlying process can fit the model, but the post-processing operations related to the computation and Cholesky decomposition of \( \mathbf{V} \) fail because of the limitations discussed in Section 3.4.4. Despite the larger number of samples, the gammJ execution time is shorter for the second dataset because its random intercept structure is less complex than that of the first dataset. However, as mentioned in Section 3.4.4.3, some sparse matrix operations for certain block structures could be further optimized, which would boost gammJ's performance even more.
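The speed-up factors quoted above follow directly from the dataset-1 times in Table 3.6 (a quick arithmetic check in Python; the time unit is left unspecified, as in the table):

```python
# Dataset-1 estimation times from Table 3.6 (same unit for all three).
times = {"gamm4": 15_097, "gamm4b": 4_758, "gammJ": 1_863}

# Slowdown of each implementation relative to gammJ.
ratios = {pkg: round(t / times["gammJ"], 1) for pkg, t in times.items()}
print(ratios)  # {'gamm4': 8.1, 'gamm4b': 2.6, 'gammJ': 1.0}
```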
With this framework and our available computational resources, we are able to estimate single-breed models with reasonable execution times and memory consumption, and with more random effect factor levels and more samples than is possible with the default gamm4 implementation.
3.5 Bibliography
- Davis (2022)
- Davis, T. (2022). Algorithm 10xx: SuiteSparse:GraphBLAS: Graph algorithms in the language of sparse linear algebra.
- JuliaStats (2024)
- JuliaStats (2024). MixedModels.jl. https://github.com/JuliaStats/MixedModels.jl.
- BLW (2023)
- BLW (2023). Agrarbericht 2023.
- ASR (2024)
- ASR (2024). ASR - Arbeitsgemeinschaft Schweizerischer Rinderzüchter. https://asr-ch.ch/de/Willkommen.
- Bates, Mächler, Bolker & Walker (2015)
- Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. , 67. 1–48. https://doi.org/10.18637/jss.v067.i01
- Bates, Mächler & Jagan (2024)
- Bates, D., Mächler, M. & Jagan, M. (2024). Matrix: Sparse and Dense Matrix Classes and Methods. Retrieved from https://cran.r-project.org/web/packages/Matrix/index.html
- Braunvieh Schweiz (2024)
- Braunvieh Schweiz (2024). Braunvieh Schweiz - Swiss Brown Cattle Breeding Association. https://homepage.braunvieh.ch/.
- Brooks, Kristensen, Benthem, Magnusson, Berg, Nielsen, Skaug, Mächler & Bolker (2017)
- Brooks, M., Kristensen, K., Benthem, K., Magnusson, A., Berg, C., Nielsen, A., Skaug, H., Mächler, M. & Bolker, B. (2017). glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. , 9. 378–400. https://doi.org/10.32614/RJ-2017-066
- Bryant, López‐Villalobos, Pryce, Holmes & Johnson (2007)
- Bryant, J., López‐Villalobos, N., Pryce, J., Holmes, C. & Johnson, D. (2007). Quantifying the effect of thermal environment on production traits in three breeds of dairy cattle in New Zealand. , 50(3). 327–338. https://doi.org/10.1080/00288230709510301
- Bucheli, Dalhaus & Finger (2022)
- Bucheli, J., Dalhaus, T. & Finger, R. (2022). Temperature effects on crop yields in heat index insurance. , 107. 102214. https://doi.org/10.1016/j.foodpol.2021.102214
- Chen, Davis, Hager & Rajamanickam (2008)
- Chen, Y., Davis, T., Hager, W. & Rajamanickam, S. (2008). Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate. , 35(3). 22:1–22:14. https://doi.org/10.1145/1391989.1391995
- Corbeil & Searle (1976)
- Corbeil, R. & Searle, S. (1976). Restricted Maximum Likelihood (REML) Estimation of Variance Components in the Mixed Model. , 18(1). 31–38. https://doi.org/10.2307/1267913
- Davis & Hager (2009)
- Davis, T. & Hager, W. (2009). Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves. , 35(4). 27:1–27:23. https://doi.org/10.1145/1462173.1462176
- Davis & Hager (1999)
- Davis, T. & Hager, W. (1999). Modifying a Sparse Cholesky Factorization. , 20(3). 606–627. https://doi.org/10.1137/S0895479897321076
- Davis & Hager (2005)
- Davis, T. & Hager, W. (2005). Row Modifications of a Sparse Cholesky Factorization. , 26(3). 621–639. https://doi.org/10.1137/S089547980343641X
- Chair of Hydrology and Water Resources Management, ETH Zurich (2024)
- Chair of Hydrology and Water Resources Management, ETH Zurich (2024). MeteoSwiss Data Access Point for ETH Researchers. https://hyd.ifu.ethz.ch/research-data-models/meteoswiss.html.
- Davis (2019)
- Davis, T. (2019). Algorithm 1000: SuiteSparse:GraphBLAS: Graph algorithms in the language of sparse linear algebra. , 45(4). https://doi.org/10.1145/3322125
- Pelletier, Kimmerer, Davis & Mattson (2021)
- Pelletier, M., Kimmerer, W., Davis, T. & Mattson, T. (2021). The GraphBLAS in julia and python: The PageRank and triangle centralities. https://doi.org/10.1109/HPEC49654.2021.9622789
- Simpson (2024)
- Simpson, G. (2024). gratia: Graceful ggplot-based graphics and other functions for GAMs fitted using mgcv. https://gavinsimpson.github.io/gratia/.
- Hall (2023)
- Hall, M. (2023). Invited review: Corrected milk: Reconsideration of common equations and milk energy estimates. , 106(4). 2230–2246. https://doi.org/10.3168/jds.2022-22219
- Holstein Switzerland (2024)
- Holstein Switzerland (2024). Holstein Switzerland - The Swiss Holstein Breeding Association. https://www.holstein.ch/de/.
- Le (2020)
- Le, A. (2020). Interaction Term vs Split Sample. https://anhqle.github.io/interaction-term/.
- Markwick (2022)
- Markwick, D. (2022). Mixed Models Benchmarking. https://dm13450.github.io/2022/01/06/Mixed-Models-Benchmarking.html.
- MeteoSchweiz (2024)
- MeteoSchweiz (2024). Datenportal für lehre und forschung (IDAweb). https://www.meteoschweiz.admin.ch/service-und-publikationen/service/wetter-und-klimaprodukte/datenportal-fuer-lehre-und-forschung.html.
- Wood (2011)
- Wood, S. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. , 73(1). 3–36.
- Wood, Pya & Säfken (2016)
- Wood, S., Pya, N. & Säfken, B. (2016). Smoothing parameter and model selection for general smooth models (with discussion). , 111. 1548–1575.
- Wood (2003)
- Wood, S. (2003). Thin-plate regression splines. , 65(1). 95–114.
- Wood (2024)
- Wood, S. (2024). Simple Random Effects in GAMs. https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.construct.re.smooth.spec.html.
- Pinheiro, Bates, DebRoy, Sarkar, Heisterkamp, Van Willigen & Maintainer (2017)
- Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., Heisterkamp, S., Van Willigen, B. & Maintainer, R. (2017). Package “nlme”. , 3(1). 274.
- Powell (2009)
- Powell, M. (2009). The BOBYQA algorithm for bound constrained optimization without derivatives. Retrieved from https://www.semanticscholar.org/paper/The-BOBYQA-algorithm-for-bound-constrained-without-Powell/0d2edc46f81f9a0b0b62937507ad977b46729f64
- Qualitas AG (2024)
- Qualitas AG (2024). Qualitas AG - IT and Quantitative Genetics for Livestock. https://qualitasag.ch/.
- Ushey, Allaire & Tang (2024)
- Ushey, K., Allaire, J. & Tang, Y. (2024). reticulate: Interface to ’Python’. https://rstudio.github.io/reticulate/.
- Schweizer Bauernverband (2024)
- Schweizer Bauernverband (2024). Milchstatistik der Schweiz 2023. https://www.sbv-usp.ch/fileadmin/user_upload/MISTA2023_def_online.pdf.
- Swissherdbook (2024)
- Swissherdbook (2024). Swiss herdbook - swiss cattle breeding association. https://www.swissherdbook.ch/de/.
- Swissherdbook (2019)
- Swissherdbook (2019). Handbuch für Milchkontrolleure. https://www.swissherdbook.ch/fileadmin/02_Dienstleistungen/02.3_Leistungspruefungen/1_Milchleistungspruefung/2019-07-02_Handbuch_Milchkontrolleure.pdf.
- Swissherdbook (2022)
- Swissherdbook (2022). Herdebuchreglement. https://www.swissherdbook.ch/fileadmin/02_Dienstleistungen/02.2_Herdebuch/2022-08-18_Herdebuchreglement.pdf.
- swisstopo (2024)
- swisstopo (2024). Amtliches Ortschaftenverzeichnis der Schweiz und des Fürstentums Liechtenstein. https://www.swisstopo.admin.ch/de/amtliches-ortschaftenverzeichnis.
- swisstopo (2024)
- swisstopo (2024). DHM25 / 200m - Digitales Höhenmodell der Schweiz. https://www.swisstopo.admin.ch/de/hoehenmodell-dhm25-200m.
- Wood & Scheipl (2020)
- Wood, S. & Scheipl, F. (2020). gamm4: Generalized Additive Mixed Models using ’mgcv’ and ’lme4’. Retrieved from https://cran.r-project.org/web/packages/gamm4/index.html
- Wood, Goude & Shaw (2015)
- Wood, S., Goude, Y. & Shaw, S. (2015). Generalized Additive Models for Large Data Sets. , 64(1). 139–155. https://doi.org/10.1111/rssc.12068
- Wood (2017)
- Wood, S. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2). Chapman; Hall/CRC. https://doi.org/10.1201/9781315370279
- Wood (2023)
- Wood, S. (2023). Mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. Retrieved from https://cran.r-project.org/web/packages/mgcv/index.html
- Wood (2004)
- Wood, S. (2004). Stable and Efficient Multiple Smoothing Parameter Estimation for Generalized Additive Models. , 99(467). 673–686. https://doi.org/10.1198/016214504000000980
- Young & Coleman (1959)
- Young, J. & Coleman, A. (1959). Neighborhood norms and the adoption of farm practices. , 24(4). 372.
In practice, some subtle differences between the different databases and schema versions exist.
According to Holstein Switzerland, some data is irrecoverably lost.
Bundesamt für Statistik - Federal Statistical Office.
We can also refer to them as the "typical" mountain cow breeds.
Both of these hyperparameters are somewhat arbitrary and could be optimized in a model selection process.
According to ETH Statistical Consulting, up to 30 times faster.
Memory consumption in the gigabyte range for the process, based on htop observations during runtime; no systematic profiling was performed.
Further specifications about potential constraints are out of scope for this work.