## Abstract

Despite its extensive use in physiological and clinical research, the analysis of HRV (heart rate variability) is still poorly supported by rigorous reliability studies. The main aim of the present study was to perform an in-depth assessment of absolute and relative reliability of standard indexes of HRV from short-term laboratory recordings. In 39 healthy subjects [mean age (min–max): 38 (26–56) years; 18 men and 21 women] we recorded 5 min of supine ECG during spontaneous and paced (15 breaths/min) breathing. The test was repeated on the next day under the same conditions. From the RR intervals we computed standard indexes of HRV: SDNN (S.D. of RR interval values), RMSSD (root-mean-square of successive RR interval differences), LF (low frequency) and HF (high frequency) power (absolute and normalized units) and LF/HF. Absolute reliability was assessed by 95% limits of random variation; relative reliability was assessed by the ICC (intraclass correlation coefficient). The sample size needed to detect a mean difference ≥30% of the between-subject S.D. was also estimated. Although there was no significant mean change between the two tests, we found that in individual subjects the second measurement was as high/low as 1.9/0.5 times (SDNN, best case) and 3.5/0.3 times (LF/HF, worst case) the first measurement, due to pure random variation. For most parameters the ICC was >0.8 (range 0.65–0.88). The estimated sample size ranged from 24 to 98 subjects. Reliability indexes tended to improve during paced breathing. We conclude that short-term HRV parameters are subject to large day-to-day random variations. Random error, however, represents a limited part of the between-subject variability; therefore observed differences between individuals mostly reflect differences in the subjects' error-free value rather than random error. Overall, paced breathing improves reliability.

**Keywords:**

- ECG
- heart rate variability
- high-frequency power
- low-frequency power
- paced breathing
- RR interval

## INTRODUCTION

Almost 25 years after the publication of the pioneering paper by Akselrod and co-workers [1], HRV (heart rate variability) is still a matter of great interest in clinical and physiological research. According to the PubMed database (www.pubmed.com), from 2000 to 2006 the number of yearly publications related to HRV increased steadily from 391 to 584, and this trend appears set to continue in 2007. Although such broad use of this methodology implicitly assumes that the reliability of HRV measurements has been thoroughly evaluated, an unprejudiced look at the available reliability studies clearly shows that this evaluation has often been inadequate [2]. This is particularly true for HRV indexes derived from short-term laboratory experiments, which are those most commonly used in non-invasive investigations of autonomic cardiovascular control. Major methodological limitations of published reliability studies include: (i) inadequate protocol (e.g. replicate measurements were taken too far from each other); (ii) insufficient sample size; (iii) too short or too long recordings according to current guidelines [3]; (iv) limited selection of studied HRV parameters; (v) inadequate assessment of reliability due to the use of inappropriate reliability indexes or to misapplication/incomplete inclusion of appropriate indexes; and (vi) lack of indications on the practical implications of computed reliability indexes (e.g. for the assessment of individual responses or for sample size estimation). A detailed review of these issues can be found in papers published previously [2,4,5].

In the present study we accurately and comprehensively assessed the reliability of short-term HRV measurements in healthy individuals in order to overcome major methodological limitations of previous investigations. Both absolute and relative reliability have been considered and implications for sample size calculation and assessment of individual changes are presented. As laboratory recordings are mostly carried out during spontaneous and/or paced breathing, both experimental conditions have been considered.

## MATERIALS AND METHODS

### Subjects

We studied 39 healthy volunteers [mean age (min–max): 38 (26–56) years, 18 men and 21 women]. None of the subjects were on medication or suffered from chronic or acute disease. The study was approved by the local Ethics Committee, and all subjects gave their written informed consent before participation.

### Study protocol

All tests were performed between 08.00 and 13.00 hours, with the subjects in the supine position in a quiet and dimmed room at a comfortable temperature. Before the study all subjects fasted for more than 2 h and refrained from smoking, alcohol or coffee for 24 h.

After instrumentation, subjects carried out a session of familiarization with the paced breathing protocol [6]. They followed a recorded voice instruction to breathe in and out at a frequency of 0.25 Hz. Stabilization of the signals was then performed for 15 min, following which subjects breathed spontaneously for 8 min, and then breathed at the paced breathing frequency for another 8 min. During these two sessions ECG recordings were performed. An identical session was repeated on the next day at the same time of day.

### Signal analysis and measurements

Beat-by-beat RR interval values (resolution 1 ms) were obtained from the ECG using a software package developed in-house [7]. The RR interval time series was then re-sampled at 2 Hz by cubic spline interpolation. The analysis was carried out on the central 5-min window of each recording. This analysis window has become the standard in short-term HRV [3].
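The re-sampling step can be illustrated with a minimal sketch in plain Python. The in-house software [7] is not described here, so several details are assumptions made only for illustration: each RR value is placed at the time of the beat that ends the interval, the spline uses natural boundary conditions (zero second derivative at both ends), and the function names `natural_cubic_spline` and `resample_rr` are hypothetical.

```python
import bisect


def natural_cubic_spline(x, y):
    """Return a function evaluating the natural cubic spline through (x, y)."""
    n = len(x) - 1  # number of intervals between knots
    if n < 2:
        raise ValueError("need at least three knots")
    h = [x[i + 1] - x[i] for i in range(n)]
    # Tridiagonal system for the interior second derivatives
    # (natural boundary conditions: second derivative is 0 at both ends).
    diag = [2.0 * (h[j] + h[j + 1]) for j in range(n - 1)]
    rhs = [6.0 * ((y[j + 2] - y[j + 1]) / h[j + 1] - (y[j + 1] - y[j]) / h[j])
           for j in range(n - 1)]
    # Thomas algorithm: forward elimination ...
    for j in range(1, n - 1):
        w = h[j] / diag[j - 1]
        diag[j] -= w * h[j]
        rhs[j] -= w * rhs[j - 1]
    # ... followed by back substitution.
    inner = [0.0] * (n - 1)
    inner[-1] = rhs[-1] / diag[-1]
    for j in range(n - 3, -1, -1):
        inner[j] = (rhs[j] - h[j + 1] * inner[j + 1]) / diag[j]
    m = [0.0] + inner + [0.0]

    def evaluate(t):
        # Locate the interval containing t (clamped to the knot range).
        i = min(max(bisect.bisect_right(x, t) - 1, 0), n - 1)
        a, b = x[i + 1] - t, t - x[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (y[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (y[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def resample_rr(rr_ms, fs=2.0):
    """Resample a beat-by-beat RR series (ms) on an even grid of fs Hz."""
    t, beat_times = 0.0, []
    for rr in rr_ms:
        t += rr / 1000.0  # assumption: RR value sits at the end of its interval
        beat_times.append(t)
    spline = natural_cubic_spline(beat_times, rr_ms)
    n_samples = int((beat_times[-1] - beat_times[0]) * fs) + 1
    return [spline(beat_times[0] + k / fs) for k in range(n_samples)]


rr = [812, 798, 805, 820, 790, 801, 815]  # hypothetical RR intervals (ms)
print(resample_rr(rr))
```

In practice a library routine for cubic spline interpolation would normally replace the hand-written solver; it is spelled out here only to keep the sketch self-contained.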

After detrending via least-squares second-order polynomial fitting, the power spectral density of the RR interval time series was estimated by the Blackman–Tukey method using a Parzen window with a spectral bandwidth of 0.015 Hz [8]. The power in the LF (low frequency; 0.04–0.15 Hz) and HF (high frequency; 0.15–0.45 Hz) bands was obtained by numerical integration. These spectral indexes will be referred to as LF_BT and HF_BT. Spectral analysis was also performed using the autoregressive method (Burg algorithm) with spectral decomposition (Johnsen and Andersen algorithm). The autoregressive model order was set at 26, but was interactively increased when negative components appeared in the spectral decomposition table [8]. Spectral powers of the LF and HF bands (termed LF_AR and HF_AR respectively) were computed by summing the respective spectral components. Components showing <10% of the overall power in the band were ignored as they probably represented pure noise contributions. The LFnu (LF power in normalized units) was computed as LF_AR/(LF_AR+HF_AR). To avoid redundancy, the HF power in normalized units, being equal to 1−LFnu, was not analysed.
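As an illustration of the band definitions and of the derived ratios, the following sketch integrates a power spectral density over the LF and HF bands and then forms LFnu and LF/HF. The integration rule is not specified in the text, so the trapezoidal rule is an assumption, as are the function name `band_power` and the flat spectrum used as input. In the study LFnu was computed from the autoregressive components; here the same ratio is simply applied to generic band powers.

```python
def band_power(freqs, psd, f_lo, f_hi):
    """Trapezoidal integral of psd over the grid segments inside [f_lo, f_hi]."""
    total = 0.0
    for i in range(len(freqs) - 1):
        if freqs[i] >= f_lo and freqs[i + 1] <= f_hi:
            total += 0.5 * (psd[i] + psd[i + 1]) * (freqs[i + 1] - freqs[i])
    return total


# Hypothetical flat spectrum of 1000 ms^2/Hz on a 0.01 Hz grid from 0 to 0.5 Hz.
freqs = [k / 100 for k in range(51)]
psd = [1000.0] * len(freqs)

lf = band_power(freqs, psd, 0.04, 0.15)  # LF band, 0.04-0.15 Hz
hf = band_power(freqs, psd, 0.15, 0.45)  # HF band, 0.15-0.45 Hz
lfnu = lf / (lf + hf)                    # LF power in normalized units
lf_hf = lf / hf                          # LF/HF ratio
print(lf, hf, lfnu, lf_hf)
```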

We also computed time-domain HRV parameters suitable for short-term analysis, including SDNN (S.D. of RR interval values) and RMSSD (root-mean-square of successive RR interval differences) [3]. Although the mean RR interval is not, strictly speaking, an HRV parameter, it was included in the analysis as a major index of cardiac autonomic control.
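The two time-domain indexes follow directly from their definitions, as the short sketch below shows. The RR values are hypothetical, and the use of the sample S.D. (n−1 denominator) for SDNN is an assumption, since the text does not state which denominator was used.

```python
import math
import statistics


def sdnn(rr_ms):
    """S.D. of RR interval values (sample S.D., n-1 denominator assumed)."""
    return statistics.stdev(rr_ms)


def rmssd(rr_ms):
    """Root-mean-square of successive RR interval differences."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


rr = [800, 810, 790, 805, 795]  # hypothetical RR intervals (ms)
print(sdnn(rr), rmssd(rr))
```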

### Statistical analysis

A short introduction to the statistical basis for the assessment of reliability is given in the Appendix. The key point is that observed measurements were assumed to be the sum of (i) a fixed quantity, which represents the true or error-free value of the characteristic being measured in each subject, and (ii) a random quantity, commonly referred to as random error of measurement, which accounts for within-subject variability. According to these concepts, we first examined the distribution of the measurements obtained in the two tests, as well as of their difference, to detect and discard possible outliers. An observed value was deemed to be an outlier if it was greater than the upper quartile plus 1.5 times the interquartile range, or less than the lower quartile minus 1.5 times the interquartile range [9]. This method required a preliminary log-transformation of skewed variables (skewness being assessed by the Shapiro–Wilk test for normality).
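The outlier rule can be sketched as follows. The quartile definition is an assumption (several conventions exist, and the one used in the study is not stated); the function name `quartile_rule_outliers` and the data are hypothetical. For skewed variables the rule would be applied to the log-transformed values.

```python
import math
import statistics


def quartile_rule_outliers(values, k=1.5):
    """Values above Q3 + k*IQR or below Q1 - k*IQR."""
    # 'inclusive' is one of several quartile conventions (an assumption here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical measurements
print(quartile_rule_outliers(data))

# For a right-skewed variable, apply the rule after log-transformation.
log_data = [math.log(v) for v in data]
print(quartile_rule_outliers(log_data))
```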

Following the recommendations of Bland and Altman [10], we plotted the difference between the two measurements (*X*_{2}−*X*_{1}) against their average. This graphical method makes it possible to observe any systematic change between the tests and to check whether the random error is related to the size of the characteristic being measured (a dependence referred to as heteroscedasticity). This qualitative investigation was followed by a formal verification of the assumptions underlying the assessment of reliability (see Appendix). Specifically, we tested the hypothesis of normality and zero mean of the difference between the two tests (two-sided paired Student's *t* test at the 0.05 significance level); we also verified the homoscedasticity assumption by regressing the absolute differences between the two measurements against their average [5,10].
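A minimal sketch of the quantities behind a Bland–Altman plot, and of the regression of absolute differences on averages, is given below. Only the slope is computed; the significance test on the slope and the normality and *t* tests are omitted. Function names and data are hypothetical. The example is built so that the error grows in proportion to the measurement, which is the situation in which log-transformation removes the trend.

```python
import math


def bland_altman(x1, x2):
    """Per-subject averages and differences (X2 - X1) of paired measurements."""
    means = [(a + b) / 2 for a, b in zip(x1, x2)]
    diffs = [b - a for a, b in zip(x1, x2)]
    return means, diffs


def abs_diff_slope(x1, x2):
    """Least-squares slope of |X2 - X1| regressed on the pair average."""
    means, diffs = bland_altman(x1, x2)
    abs_d = [abs(d) for d in diffs]
    mx = sum(means) / len(means)
    my = sum(abs_d) / len(abs_d)
    sxx = sum((m - mx) ** 2 for m in means)
    sxy = sum((m - mx) * (a - my) for m, a in zip(means, abs_d))
    return sxy / sxx


test1 = [10.0, 20.0, 30.0, 40.0]  # hypothetical first measurements
test2 = [11.0, 22.0, 33.0, 44.0]  # hypothetical second measurements (+10%)

# Raw data: absolute differences grow with the average (positive slope).
print(abs_diff_slope(test1, test2))

# After natural-log transformation the differences no longer depend on size.
log1 = [math.log(v) for v in test1]
log2 = [math.log(v) for v in test2]
print(abs_diff_slope(log1, log2))
```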

Because many of the variables showed non-normality and heteroscedasticity, they were log-transformed using the natural logarithm (ln) [10].

One-way random effects ANOVA was carried out on all variables to estimate the SEM (standard error of measurement), the major index of absolute reliability. This was obtained as the square root of the WMS (within-subject mean square) from the ANOVA table [11,12]. From the SEM value we derived the 95% limits of random variation, i.e. the range of values within which 95% of the differences between two measurements (*X*_{2}−*X*_{1}) are expected to lie due to pure random variation. For log-transformed variables, these limits were back-transformed (antilogarithm), giving the range of values within which 95% of the ratios between the two measurements (*X*_{2}/*X*_{1}) are expected to lie due to pure random variation [5,10].
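For two measurements per subject, the within-subject mean square of a one-way ANOVA reduces to the sum of squared differences divided by twice the number of subjects, so the SEM and the limits of random variation can be sketched in a few lines. The function name `sem_and_limits` and the data are hypothetical; the multiplier 1.96·√2 (approx. 2.77) is the one given in the Appendix.

```python
import math


def sem_and_limits(x1, x2, log_transform=False):
    """SEM and 95% limits of random variation from two tests per subject."""
    if log_transform:
        x1 = [math.log(v) for v in x1]
        x2 = [math.log(v) for v in x2]
    # Within-subject mean square for k = 2 replicates: sum(d^2) / (2 N).
    wms = sum((b - a) ** 2 for a, b in zip(x1, x2)) / (2 * len(x1))
    sem = math.sqrt(wms)
    half_width = 1.96 * math.sqrt(2) * sem
    if log_transform:
        # Back-transformed limits apply to the ratio X2/X1.
        return sem, (math.exp(-half_width), math.exp(half_width))
    # Untransformed limits apply to the difference X2 - X1.
    return sem, (-half_width, half_width)


test1 = [120.0, 300.0, 80.0, 510.0]  # hypothetical values, first test
test2 = [150.0, 240.0, 95.0, 430.0]  # hypothetical values, second test
print(sem_and_limits(test1, test2))
print(sem_and_limits(test1, test2, log_transform=True))
```

The back-transformed limits are reciprocal (lower × upper = 1), which is why an interval that is symmetric in log units becomes asymmetric on the original scale.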

From mean square values of the ANOVA table we also computed the ICC (intraclass correlation coefficient) [11,12] as:

$$\mathrm{ICC} = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + \mathrm{WMS}}$$

where BMS is the between-subject mean square. 95% Confidence intervals for ICC were calculated [13].
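The calculation of the ICC from the mean squares can be sketched as below. The sketch assumes the usual one-way random effects estimator for two replicates per subject, ICC = (BMS − WMS)/(BMS + WMS); function names and data are hypothetical, and confidence intervals are not computed.

```python
def mean_squares(x1, x2):
    """Between- and within-subject mean squares for two tests per subject."""
    n = len(x1)
    subject_means = [(a + b) / 2 for a, b in zip(x1, x2)]
    grand_mean = sum(subject_means) / n
    # BMS: k * sum of squared deviations of subject means / (N - 1), with k = 2.
    bms = 2 * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # WMS: sum of squared within-subject deviations / (N (k - 1)).
    wms = sum((b - a) ** 2 for a, b in zip(x1, x2)) / (2 * n)
    return bms, wms


def icc_two_replicates(x1, x2):
    """One-way random effects ICC for two replicates (assumed estimator)."""
    bms, wms = mean_squares(x1, x2)
    return (bms - wms) / (bms + wms)


test1 = [1.0, 2.0, 3.0]  # hypothetical values, first test
test2 = [2.0, 2.0, 4.0]  # hypothetical values, second test
print(icc_two_replicates(test1, test2))
```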

As the magnitude of the random error affects the sample size of an experiment [14], we estimated the sample size needed to detect a relevant change in the mean of HRV parameters after a treatment (test–retest experiment). Conventionally, we considered a change of ≥30% of between-subject S.D. as ‘relevant’. We also assumed a two-sided test with a significance level of 5% and a power of 80%.
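How random error drives sample size can be illustrated with a rough sketch. The exact procedure followed in [14] is not reproduced in the text, so the formula below is an assumption: a normal approximation for a paired (test–retest) design, in which the S.D. of the within-subject difference is √2·SEM. The function name `paired_sample_size` and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def paired_sample_size(sem, sd_between, fraction=0.30, alpha=0.05, power=0.80):
    """Approximate number of subjects for a test-retest comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    delta = fraction * sd_between                  # smallest 'relevant' change
    sd_diff = math.sqrt(2) * sem                   # S.D. of X2 - X1
    return math.ceil(((z_alpha + z_beta) * sd_diff / delta) ** 2)


# Hypothetical example: random error as large as the between-subject S.D.
print(paired_sample_size(sem=1.0, sd_between=1.0))
# Halving the random error cuts the required sample size roughly fourfold.
print(paired_sample_size(sem=0.5, sd_between=1.0))
```

A calculation based on the *t* distribution adds a small-sample correction, so figures obtained with this approximation may differ slightly from those of an exact method.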

All analyses were carried out using the SAS/STAT statistical package version 8.02 (SAS Institute).

## RESULTS

### Spontaneous breathing

All variables with the exception of mean RR interval and LFnu had a marked right-skewed distribution (*P*<0.01, as determined by the Shapiro–Wilk test). One subject had an outlier in RMSSD, HF_BT and HF_AR, whereas another one had an outlier only in RMSSD. These measurements and those derived from them (i.e. LFnu and LF/HF) were ignored in the subsequent analyses. Descriptive statistics of parameters measured in the two tests are shown in Table 1.

Bland–Altman plots of the difference between the two measurements against their average displayed a symmetrical distribution of points around the zero line in all parameters, indicating the absence of a systematic change. The width of the scatter around the same line, however, showed a clear increasing trend in SDNN, RMSSD, LF_BT, LF_AR, HF_BT, HF_AR and LF/HF, indicating heteroscedasticity. Two representative examples of a homoscedastic variable (mean RR interval) and a heteroscedastic variable (LF_BT) are shown in Figures 1(a) and 1(b) respectively. Visual findings were confirmed by regression analysis.

Heteroscedastic variables were successfully log-transformed, at the same time obtaining homoscedasticity and normality. A representative Bland–Altman plot is shown in Figure 1(c).

For all HRV parameters the difference between the two tests was negligible and non-significant, indicating the absence of a systematic change.

Reliability indexes for homoscedastic variables (mean RR interval and LFnu) are shown in Table 2. If we consider LFnu, the limits of random variation indicated that, in order to be 95% confident that a real change has occurred in an individual, the observed difference between two measurements had to be >0.29 or <−0.29. For the same parameter, the ICC indicated that 65% of the variability in LFnu measurements across the population studied was due to variability in the true value of the subjects, whereas the remaining 35% was due to random error. The last column of Table 2 shows that 98 subjects were needed in a test–retest experiment to detect a change in mean LFnu of ≥0.04 (30% of between-subject S.D.), with a significance level of 5% and a power of 80%.

Reliability indexes for heteroscedastic HRV parameters are shown in Table 3. Since the statistical analysis of these variables was carried out after log-transformation, the SEM and the change in the mean for the estimation of the sample size were expressed in log units. The limits of random variation indicated that, in order to be 95% confident that a real change had occurred in an individual, the ratio between two measurements (*X*_{2}/*X*_{1}) had to lie outside the indicated interval. The asymmetry of this interval was simply the result of the antilogarithmic transformation. It can be seen that for all spectral parameters the second measurement can be as large/small as approx. 3.5/0.3 times the first measurement due to pure random variation. To relate these inferential figures to the data of the present study, Figure 2 shows the plots of the ratio *X*_{2}/*X*_{1} against the mean of the two measurements. Note the close agreement between theoretical 95% limits of variation and the raw data.

The ICC values in Table 3 show that the proportion of total measurement variability of heteroscedastic HRV parameters explained by the variability of the subjects' error-free value ranged from 70% (LF/HF) to 86% (HF_BT and HF_AR); therefore random error accounted for 30% to 14% of total measurement variability.

### Paced breathing

Descriptive statistics of HRV parameters derived from paced breathing recordings are shown in Table 4. There was only one outlier in RMSSD. As for spontaneous breathing, SDNN, RMSSD, LF_BT, LF_AR, HF_BT, HF_AR and LF/HF exhibited heteroscedastic behaviour and were successfully log-transformed. The difference between transformed measurements was largely non-significant.

Reliability indexes for homoscedastic and heteroscedastic variables are shown in Tables 2 and 3 respectively. Compared with spontaneous breathing we found: (i) an improvement of all reliability indexes (SEM, limits of random variation, ICC and sample size) for LF_BT, LF_AR and LF/HF; (ii) an increased ICC and reduced sample size for LFnu; and (iii) a slight worsening of all reliability indexes for SDNN. Reliability indexes of the other HRV parameters were almost unchanged.

## DISCUSSION

Despite its extensive use in physiological and clinical research, the analysis of HRV is still poorly supported by sound reliability studies, and further investigations in this field have recently been advocated [2]. In the present study we performed an in-depth assessment of absolute and relative reliability of standard indexes of HRV from short-term laboratory recordings in healthy subjects during spontaneous and paced breathing. The experimental protocol and the analysis of collected data were performed according to state-of-the-art methodology and best-practice criteria. We found that HRV parameters were characterized by large random variations within individuals, thus showing low absolute reliability. Random error, however, in most parameters represented a limited portion of total measurement variability across individuals, thus indicating good relative reliability. Overall, paced breathing improved the reliability of spectral parameters, particularly those derived from ratios of raw quantities (LFnu and LF/HF).

### Reliability of HRV parameters: absolute reliability

The present study reveals the presence of a large random error (SEM) in all HRV parameters, particularly in those computed in the frequency domain. Indeed, from one measurement to the next, spectral parameters may rise to as much as approx. 3.5 times, or fall to as little as approx. 0.3 times, the first value due to pure random variation. Limits of random variation are narrower for time-domain indexes, being approx. 1.9–2.4 times and 50–60% for SDNN and RMSSD respectively. These results question the use of HRV indexes in assessing treatment effects in individual subjects. Of note, in all HRV parameters, except LFnu, random error increases as the magnitude of the parameter increases, which is the hallmark of heteroscedasticity.

Random error of HRV parameters is in part due to sampling variability of the estimated parameters, as they are nothing but statistics computed on a finite number of RR intervals. Therefore they are subject to random changes from one sample to another [15]. Part of intra-subject variability is also due to an intrinsic lability of HRV parameters, probably because they are under the influence of such factors as mood, alertness and mental activity, which are very difficult to control for in any study. Changes associated with frequency and depth of respiration also play an important role [6].

In the two previous reliability studies that provide estimates of the SEM (or equivalently of the within-subject S.D.), slightly greater values of this index were found [16,17]. This might be due to differences in the experimental protocol.

### Reliability of HRV parameters: relative reliability

The ICC of HRV parameters ranged from 0.65 to 0.88. Although the definition of a categorical rating of relative reliability based on ICC is still controversial, these values can reasonably be considered to indicate substantial to good reliability [5,11,18]. For most measurements the ICC was >0.80, indicating that they reflect mostly the true value of HRV parameters relative to random error. From the mathematical definition of ICC (see Appendix), it clearly appears that such high values of relative reliability are the consequence of a large between-subject variability, a fact well-known to investigators involved in HRV analysis. The lowest values of ICC were found in the LFnu and LF/HF parameters during spontaneous breathing (0.65 and 0.70 respectively). This result probably derives from a relatively greater random error, as LFnu and LF/HF ‘carry’ the error of both the LF and HF power. A further insight into the practical meaning of the observed ICCs can be gained by remembering that in the context of our test–retest reliability study the ICC equals the correlation coefficient between paired measurements [19].

Estimated ICCs for time-domain HRV parameters are very close to those reported by Sinnreich et al. [20] and similar to those found by Schroeder et al. [4]. Lower ICCs were obtained in the same parameters by Pitzalis et al. [16] and Gerritsen et al. [21]; their estimates, however, were based on raw (i.e. not transformed) data. The ICCs of spectral parameters were similar to or higher than those found by others [4,16,20]. A comparatively reduced LFnu during spontaneous breathing was also observed by Sandercock et al. [22].

### Implications of reliability in sample size estimation

A major implication of measurement reliability is the size of the sample needed to test a scientific hypothesis with a preset significance level and power. We explored this point by simulating a simple test–retest study to investigate the effect of a treatment on the mean value of HRV parameters. Since reference values as to what change in HRV parameters would be clinically relevant are lacking, we conventionally adopted the criterion of 30% of between-subject S.D. The rationale is that the more the subjects have dispersed values, the larger the shift in mean value should be to be clinically relevant. We found that the sample size can vary widely from parameter to parameter and that it is inversely related to the ICC. The latter result also applies to the estimation of the sample size in group comparison studies [11].

### Spontaneous compared with paced breathing

Paced breathing substantially improved the reliability of LFnu and LF/HF, with a consequent dramatic reduction (>50%) in the estimated sample size. This result is in agreement with the findings of Pitzalis et al. [16]. Moreover, voluntary control of breathing moderately improved the reliability of LF power, while leaving substantially unchanged that of HF power and RMSSD, and slightly decreasing the reliability of SDNN. We argue that the improvements observed during paced breathing might be due to a better stabilization of LF oscillations brought about by the virtual abolition of respiratory-related frequency components within the LF band [6].

### Classical compared with autoregressive spectral estimation

Our present study suggests that the reliability of HRV parameters is not affected by the method used for spectral estimation (classical Blackman–Tukey or autoregressive). A similar result has been found by other investigators [16]. It should be stressed, however, that spectral estimates depend to a certain extent on the design criteria adopted in the analysis. For instance, autoregressive measurements depend on the criterion used for model order selection [8]. Therefore the use of algorithms markedly different from those adopted in the present study may yield different reliability figures.

### Conclusions

In healthy subjects, short-term HRV parameters are subject to large day-to-day random variations which would make the detection of treatment effects in individual subjects difficult. For most indexes, however, random variation represents a limited portion of the between-subject variability; therefore observed differences between individuals mostly reflect differences in the subjects' true value rather than random error. The sample size for an experiment depends markedly on the reliability of the HRV index considered; this implies that the design of experiments based on the measurement of a set of HRV parameters should be tuned to the indexes with lowest reliability. Paced breathing appears to provide more reliable HRV measurements, particularly those related to the spectral content of the LF band.

## APPENDIX

### Statistical basis for the assessment of reliability

A measurement is said to be reliable when the values obtained under identical conditions on the same individuals at different times closely agree with each other. Therefore reliability is synonymous with intra-subject reproducibility, repeatability or consistency. Reliability has classically been investigated assuming the following statistical equation [11,12]:
$$X = \tau + \epsilon \qquad (1)$$
In this equation, the measurement *X* made on a given subject is assumed to be the sum of a fixed quantity τ, the ‘true value’ or ‘error-free value’ of the characteristic being measured in that subject, and a random quantity ϵ, commonly referred to as random error of measurement, which accounts for intra-subject variability.

There are some basic assumptions underlying eqn (1): the random error ϵ is normally distributed, has zero mean, is uncorrelated within and between subjects, and has a fixed S.D. independent of τ, the latter property being commonly referred to as homoscedasticity [5]. Moreover, for a population of subjects, τ is assumed to be normally distributed. Before analysing reliability, all these assumptions should be carefully verified or confidently assumed on the basis of a properly conducted experiment. Failure to satisfy homoscedasticity and normality assumptions is commonly dealt with by variable transformation [10].

If we take two replicate measurements on the same individual under identical conditions, from eqn (1) we have that the difference Δ*X* between them will be:
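$$\Delta X = X_{2} - X_{1} = (\tau + \epsilon_{2}) - (\tau + \epsilon_{1}) = \epsilon_{2} - \epsilon_{1} = \delta \qquad (2)$$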
where the suffixes 2 and 1 indicate the measurements at the two occasions. Eqn (2) clearly shows that the repeatability of the measurement depends on the random quantity δ: the lower the δ, the closer the two measurements will be to each other. The magnitude of δ, as expressed by its S.D., is √2·σ_{ϵ}, where σ_{ϵ}, the so-called SEM, is the S.D. of ϵ. Therefore the repeatability of a measurement ultimately depends on the magnitude of the SEM. Accordingly, the SEM is considered the key statistical indicator of absolute reliability [5,11,19].
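The factor √2 follows from the assumptions stated for eqn (1). A short derivation, taking δ to be the difference between the random errors on the two occasions and using the assumption that these errors are uncorrelated and share the same S.D. σ_{ϵ}:

```latex
\begin{aligned}
\delta &= \epsilon_{2} - \epsilon_{1} \\
\operatorname{Var}(\delta)
  &= \operatorname{Var}(\epsilon_{2}) + \operatorname{Var}(\epsilon_{1})
     - 2\,\operatorname{Cov}(\epsilon_{2}, \epsilon_{1}) \\
  &= \sigma_{\epsilon}^{2} + \sigma_{\epsilon}^{2} - 0
   = 2\,\sigma_{\epsilon}^{2} \\
\sigma_{\delta} &= \sqrt{2}\,\sigma_{\epsilon} = \sqrt{2}\cdot\mathrm{SEM}
\end{aligned}
```

Because δ also has zero mean and is normally distributed, 95% of its values are expected to fall within ±1.96·σ_{δ}, i.e. within ±1.96·√2·SEM.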

The SEM has the following two major uses. First, if we take two measurements on the same individual before and after a treatment and want to be 95% confident that a real change has occurred, the observed difference has to lie outside the interval (−1.96·√2·SEM, +1.96·√2·SEM), i.e. (−2.77·SEM, +2.77·SEM) [5,10]. Therefore the extremes of this interval can be viewed as the 95% limits of random variation. The value 2.77·SEM is called the repeatability coefficient [10]. Secondly, the SEM is a crucial parameter in determining the sample size for an experiment [5,14]. This is because the higher the random error of a measurement, the greater the ‘noise’ that will tend to obscure a possible treatment effect.

Another classical way of assessing reliability is through the ICC, which is also commonly referred to as the reliability coefficient. It expresses the proportion of the variability of observed measurements that is explained by the variability of the subjects' error-free value. The ICC is then defined as [11,12]:
$$\mathrm{ICC} = \frac{\sigma^{2}_{\tau}}{\sigma^{2}_{x}} = \frac{\sigma^{2}_{\tau}}{\sigma^{2}_{\tau} + \sigma^{2}_{\epsilon}} \qquad (3)$$
where σ^{2}_{x} is the total variance of observed measurements among the subjects of the considered population, σ^{2}_{τ} is the part of total variance due to differences in the subjects' true value and σ^{2}_{ϵ} is the part due to random error (i.e. the square of the SEM).

The ICC ranges from 0 to 1: the lower the random error relative to subject-to-subject variability, the closer the ICC will be to 1; conversely, the greater the random error relative to subject-to-subject variability, the closer the ICC will be to 0. Therefore the higher the ICC, the more a measurement will reflect the true value and the more likely it is that differences between individuals will be detected. For these reasons the ICC is considered the key statistical indicator of relative reliability [5]. When reliability is assessed taking two replicate measurements per subject, the ICC turns out to be mathematically equivalent to the correlation coefficient between paired measurements [19]. Various categories of reliability based on ICC have been proposed so far [5,11,18]. Although these criteria do not fully agree with each other, an ICC >0.8 is usually regarded as indicating good to excellent reliability, whereas an ICC between 0.6 and 0.8 may be taken to represent substantial reliability. From eqn (3) it clearly appears that the ICC, besides being dependent on the random variability of the measurement, depends on the variability (i.e. the heterogeneity) of the population being studied. Therefore results obtained in one population cannot be extrapolated to a new and possibly more/less homogeneous population [5,19].
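The dependence of the ICC on population heterogeneity can be made concrete with a small numerical sketch. It uses the definition of the ICC as the ratio of true-value variance to total variance (true-value variance plus squared SEM); the function name `icc_from_components` and the numbers are hypothetical.

```python
def icc_from_components(sd_true, sem):
    """ICC as true-value variance over total (true-value + error) variance."""
    var_true = sd_true ** 2
    var_error = sem ** 2
    return var_true / (var_true + var_error)


# Same measurement error (SEM = 0.5) in two hypothetical populations.
print(icc_from_components(sd_true=1.0, sem=0.5))  # heterogeneous population
print(icc_from_components(sd_true=0.5, sem=0.5))  # more homogeneous population
```

With an identical SEM, the ICC falls from 0.8 to 0.5 when the S.D. of the true values is halved, which is why an ICC estimated in one population does not carry over to a more homogeneous one.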

### Assessment of reliability

To assess reliability, the measurement of interest is taken on a sample of individuals during two or more replicated experiments under as close to uniform conditions as possible. To fulfill the statistical assumptions for analysis, replicated experiments should be performed neither too far apart in time, to ensure constancy of the characteristic being measured, nor too close, to avoid potential carry-over or learning effects. Collected data should be carefully checked to detect outliers and to verify the required statistical assumptions (see above) [12]. Graphical methods such as Bland–Altman plots as well as formal hypothesis testing are used for this purpose [5,10,12]. Whenever the data do not satisfy distributional assumptions and/or the homoscedasticity requirement, variable transformation is applied [5,10]. Indexes of absolute and relative reliability (e.g. SEM and ICC) and related parameters (e.g. 95% limits of random variation and sample size) are finally estimated.

**Abbreviations:**
BMS, between-subject mean square;
HF, high frequency;
HF_AR, HF power according to the autoregressive method;
HF_BT, HF power according to the Blackman–Tukey method;
HRV, heart rate variability;
ICC, intraclass correlation coefficient;
LF, low frequency;
LF_AR, LF power according to the autoregressive method;
LF_BT, LF power according to the Blackman–Tukey method;
LFnu, LF power in normalized units;
RMSSD, root-mean-square of successive RR interval differences;
SDNN, S.D. of RR interval values;
SEM, standard error of measurement;
WMS, within-subject mean square

- © The Authors Journal compilation © 2007 Biochemical Society