f2a Analysis Report

lerobot/roboturk

Analysis Time: 2026-03-16T23:59:57+00:00 — Duration: 26.8s
Total: 187,507 rows across 1 subsets / splits

default / train

Overview

Rows: 187,507
Columns: 9
Numeric: 4
Categorical: 1
Text: 2
Datetime: 0
Memory (MB): 50

Descriptive Statistics

column type count missing missing_% unique mean median std se cv mad min max range p5 q1 q3 p95 iqr skewness kurtosis top freq
observation.state text 187507 0 0.0000 187507 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
action text 187507 0 0.0000 187507 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
timestamp numeric 187507 0 0.0000 352 6.3121 5.3000 4.8659 0.0112 0.7709 3.3000 0.0000 35.1000 35.1000 0.4000 2.4000 9.3000 15.0000 6.9000 1.0530 1.4028 nan nan
episode_index numeric 187507 0 0.0000 1995 1004.9688 998.0000 573.2126 1.3238 0.5704 500.0000 0.0000 1994.0000 1994.0000 108.0000 501.0000 1503.0000 1890.0000 1002.0000 -0.0093 -1.2072 nan nan
frame_index numeric 187507 0 0.0000 352 63.1209 53.0000 48.6586 0.1124 0.7709 33.0000 0.0000 351.0000 351.0000 4.0000 24.0000 93.0000 150.0000 69.0000 1.0530 1.4028 nan nan
next.reward boolean 187507 0 0.0000 1 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0.0 187507.0000
next.done boolean 187507 0 0.0000 2 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan False 185512.0000
index numeric 187507 0 0.0000 187507 93753.0000 93753.0000 54128.7528 125.0027 0.5774 46877.0000 0.0000 187506.0000 187506.0000 9375.3000 46876.5000 140629.5000 178130.7000 93753.0000 -0.0000 -1.2000 nan nan
task_index categorical 187507 0 0.0000 3 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2 67436.0000

[Figure: Distribution Histograms]

[Figure: Boxplots]

Distribution Analysis

Normality Tests & Shape

column n skewness skew_type kurtosis kurt_type normality_test normality_p is_normal_0.05 shapiro_p dagostino_p ks_p anderson_stat anderson_5pct_cv
timestamp 187507 1.0530 high skew 1.4028 leptokurtic dagostino 0.0000 False NaN 0.0000 0.0000 3223.8222 0.7520
episode_index 187507 -0.0093 symmetric -1.2072 platykurtic dagostino 0.0000 False NaN 0.0000 0.0000 2148.6268 0.7520
frame_index 187507 1.0530 high skew 1.4028 leptokurtic dagostino 0.0000 False NaN 0.0000 0.0000 3223.8205 0.7520
index 187507 -0.0000 symmetric -1.2000 platykurtic dagostino 0.0000 False NaN 0.0000 0.0000 2084.9274 0.7520

[Figure: Violin Plots]

[Figure: Q-Q Plots]

Correlation Analysis

[Figure: Correlation Heatmap (Pearson)]

[Figure: Correlation Heatmap (Spearman)]

Variance Inflation Factor (VIF)

column VIF multicollinearity
episode_index 266.2900 severe
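index -0.0000 low
frame_index -57054521769846057730048.0000 low
timestamp -57054521769871039004672.0000 low

Note: VIF is at least 1 by definition, so the zero and negative values above are numerical artifacts, not evidence of low multicollinearity. They are consistent with the exact linear dependence between timestamp and frame_index (r=1.0), which makes the underlying regression singular.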
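
The VIF figures are easier to read against the definition VIF = 1 / (1 - R²), where R² comes from regressing one column on the others. The sketch below is a minimal illustration of that definition, not the report's own implementation; the function name `vif_from_r_squared` is hypothetical.

```python
import math

def vif_from_r_squared(r_squared: float) -> float:
    """Return VIF = 1 / (1 - R^2); infinite for an exact linear dependence."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie in [0, 1]")
    if r_squared == 1.0:
        return math.inf
    return 1.0 / (1.0 - r_squared)

# With only two columns, R^2 equals the squared Pearson correlation.
print(vif_from_r_squared(0.9999 ** 2))  # large but finite
print(vif_from_r_squared(1.0))          # inf: exact collinearity
```

A valid VIF therefore never falls below 1, and a column that is an exact multiple of another has no finite VIF at all.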

Missing Data

column missing_count missing_ratio missing_% dtype
observation.state 0 0.0000 0.0000 object
action 0 0.0000 0.0000 object
timestamp 0 0.0000 0.0000 float32
episode_index 0 0.0000 0.0000 int64
frame_index 0 0.0000 0.0000 int64
next.reward 0 0.0000 0.0000 float32
next.done 0 0.0000 0.0000 bool
index 0 0.0000 0.0000 int64
task_index 0 0.0000 0.0000 int64

[Figure: Missing Data]

[Figure: Missing Data Matrix]

Outlier Detection

column q1 q3 iqr lower_bound upper_bound outlier_count outlier_% min_outlier max_outlier
timestamp 2.4000 9.3000 6.9000 -7.9500 19.6500 2712.0000 1.4500 19.7000 35.1000
episode_index 501.0000 1503.0000 1002.0000 -1002.0000 3006.0000 0.0000 0.0000 nan nan
frame_index 24.0000 93.0000 69.0000 -79.5000 196.5000 2712.0000 1.4500 197.0000 351.0000
index 46876.5000 140629.5000 93753.0000 -93753.0000 281259.0000 0.0000 0.0000 nan nan
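
The bounds in this table match the common 1.5 × IQR rule (Tukey fences). A minimal sketch, assuming that rule is what was applied; `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower_bound, upper_bound) for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles for the timestamp column, taken from the table above.
lower, upper = iqr_fences(2.4, 9.3)
print(round(lower, 4), round(upper, 4))
```

Values outside the two fences are counted as outliers, which is why episode_index and index, whose ranges sit inside their fences, report zero.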

[Figure: Outlier Detection]

Categorical Analysis

Summary

column count unique top_value top_frequency top_% entropy norm_entropy
task_index 187507 3 2 67436 35.9600 1.5806 0.9973
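
The norm_entropy value appears to be the Shannon entropy in bits divided by its maximum, log2 of the number of categories (1.5806 / log2(3) is about 0.997). The sketch below assumes that definition. The per-category counts are hypothetical, because the report lists only the top category's frequency.

```python
import math

def normalized_entropy(counts):
    """Return (entropy_bits, entropy / log2(k)) for a list of category counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = len(counts)
    max_entropy = math.log2(k) if k > 1 else 1.0
    return entropy, entropy / max_entropy

# Hypothetical counts for three task categories (not taken from the report).
entropy_bits, norm = normalized_entropy([60, 30, 10])
print(entropy_bits, norm)
```

A normalized value near 1 means the categories are close to evenly represented.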

Feature Importance

column variance std cv range
index 2929921879.6667 54128.7528 0.5774 187506.0000
episode_index 328572.6540 573.2126 0.5704 1994.0000
frame_index 2367.6604 48.6586 0.7709 351.0000
timestamp 23.6766 4.8659 0.7709 35.1000

[Figure: Feature Importance]

PCA Analysis

N Components: 4
Total Variance Explained: 100.0%
Components for 90%: 2
Top Component Variance: 0.5058 (PC1)

Variance Explained

component variance_ratio cumulative_ratio eigenvalue
PC1 0.5058 0.5058 2.0232
PC2 0.4942 1.0000 1.9768
PC3 0.0000 1.0000 0.0001
PC4 0.0000 1.0000 0.0000

Loadings

PC1 PC2 PC3 PC4
timestamp 0.5003 -0.4997 -0.0001 0.7071
episode_index 0.4996 0.5004 -0.7071 0.0000
frame_index 0.5003 -0.4997 -0.0001 -0.7071
index 0.4998 0.5002 0.7071 -0.0000

[Figure: PCA Scree Plot]

[Figure: PCA Loadings]

Warnings

  • High correlation: timestamp <-> frame_index (r=1.0)
  • High correlation: episode_index <-> index (r=0.9999)

Auto-Generated Insights (ADV)

Executive Summary

Dataset contains 187,507 rows and 9 columns (4 numeric, 1 categorical). 4 high-priority finding(s) detected. 5 moderate observations noted.

Key highlights:

  1. 2 column pair(s) with |r| > 0.9
  2. 2 likely confounded correlation(s) detected
  3. 4/4 numeric columns are non-normal

Total Insights: 10
Critical: 0
High: 4
Medium: 5
Low: 1

Insight Details

2 column pair(s) with |r| > 0.9 [HIGH · 0.8]
correlation

Near-perfect linear relationships detected. Top pair: 'timestamp' ↔ 'frame_index' (r=1.000).

  • Consider dropping one column from each pair to reduce redundancy
  • Verify these are not data leakage or duplicate columns

2 likely confounded correlation(s) detected [HIGH · 0.7]
correlation

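Raw correlation differs significantly from partial correlation, suggesting confounding variables. Top: 'episode_index' ↔ 'index' (raw r=1.00, partial r=8183723376125764.00). A partial correlation must lie in [-1, 1], so this value is a numerical artifact, most likely from inverting a near-singular correlation matrix, and the confounding flag should be treated with caution.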

  • Do not assume causal relationship from raw correlation for these pairs
  • Investigate which variables are confounders

4/4 numeric columns are non-normal [MEDIUM · 0.7]
distribution

Most numeric columns fail normality tests (α=0.05). Non-parametric methods may be more appropriate.

  • Prefer non-parametric tests (Kruskal-Wallis, Mann-Whitney) over t-tests/ANOVA
  • Consider power transforms if normality is needed for downstream models

4 column(s) best fit by non-normal distributions [MEDIUM · 0.7]
distribution

Distribution fitting reveals non-Normal best fits. Most common: beta (2 columns). Others: {'beta': 2, 'lognorm': 1, 'uniform': 1}.

  • Use the identified distributions for parametric modeling or simulation
  • Transform columns toward normality if Gaussian assumptions are needed

Clear cluster structure found (k=3, silhouette=0.40) [HIGH · 0.7]
cluster

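K-Means identifies 3 clusters with moderate separation (silhouette=0.40). Cluster sizes: {'cluster_0': 1895, 'cluster_1': 1882, 'cluster_2': 1223}. The sizes sum to 5,000, so clustering appears to have run on a sample, not on all 187,507 rows.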

  • Profile each cluster to understand segment characteristics
  • Use cluster labels as a feature for downstream modelling

1 column(s) with severe multicollinearity (VIF>10) [HIGH · 0.6]
correlation

VIF > 10 detected for: ['episode_index']. Worst: 'episode_index' (VIF=266.3). Redundant information may cause model instability.

  • Remove one column from each highly correlated pair
  • Apply PCA or regularization (Ridge/Lasso) to handle collinearity

4 strong feature interaction(s) detected [MEDIUM · 0.6]
feature

Top interaction: 'timestamp' × 'episode_index' (strength=0.73). Product features may improve model performance.

  • Create interaction (product) features for the top pairs

2 column(s) benefit from power transformation [MEDIUM · 0.6]
distribution

Box-Cox / Yeo-Johnson transforms can significantly reduce skewness for columns: ['timestamp', 'frame_index'].

  • Apply the recommended transform (Box-Cox or Yeo-Johnson) in preprocessing

Multi-method anomalies: 120 rows (2.4%) [MEDIUM · 0.5]
anomaly

A small fraction of rows are flagged by multiple anomaly detection methods.

  • Review flagged rows for data entry errors or special cases

No missing values detected in any column [LOW · 0.3]
missing

All columns are fully populated — no imputation needed.

[Figure: Insight Severity Distribution]

[Figure: Top Insights]

Advanced Distribution Analysis (ADV)

Best-Fit Distribution

column best_distribution aic bic ks_statistic ks_p_value fit_quality
timestamp beta 28275.5600 28301.6300 0.0363 0.0000 poor
episode_index beta 75977.6500 76003.7200 0.0150 0.2102 good
frame_index lognorm 20261.0000 20280.5500 0.5098 0.0000 poor
index uniform 121419.0200 121432.0600 0.0136 0.3124 good

Jarque-Bera Normality Test

column jb_statistic p_value is_normal_0.05 skewness kurtosis
timestamp 50024.3984 0.0000 False 1.0530 1.4028
episode_index 11389.0254 0.0000 False -0.0093 -1.2072
frame_index 50024.3684 0.0000 False 1.0530 1.4028
index 11250.4200 0.0000 False -0.0000 -1.2000
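
The jb_statistic column is consistent with the standard Jarque-Bera formula JB = n/6 · (S² + K²/4), where S is skewness and K is excess kurtosis. A minimal sketch that recomputes it from the rounded table values; the function name is hypothetical, and small differences from the table come from that rounding.

```python
def jarque_bera(n: int, skewness: float, excess_kurtosis: float) -> float:
    """Jarque-Bera statistic from sample size, skewness and excess kurtosis."""
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# Inputs copied from the table above (rounded to four decimals there).
print(jarque_bera(187507, 0.0, -1.2))        # index column
print(jarque_bera(187507, 1.0530, 1.4028))   # timestamp column
```

Because the statistic scales linearly with n, even modest departures from normal shape give very large values at this sample size.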

Power Transform Recommendation

column original_skewness recommended_method optimal_lambda transformed_skewness needs_transform improvement
timestamp 1.0530 yeo-johnson 0.2569 -0.0496 True 1.0034
episode_index -0.0093 yeo-johnson 0.7184 -0.2834 False -0.2741
frame_index 1.0530 yeo-johnson 0.3990 -0.0965 True 0.9565
index -0.0000 yeo-johnson 0.7071 -0.2916 False -0.2916
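
For reference, the Yeo-Johnson transform can be written in a few lines. The sketch below implements the standard piecewise definition for a single value and uses the optimal_lambda reported for timestamp. It is an illustration only: the function name is hypothetical, and any standardization the report may apply after transforming is not reproduced.

```python
import math

def yeo_johnson(x: float, lam: float) -> float:
    """Apply the Yeo-Johnson power transform to one value."""
    if x >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -(((-x + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)

# lambda = 0.2569 is the value reported for the timestamp column.
for value in (0.0, 5.3, 35.1):
    print(value, yeo_johnson(value, 0.2569))
```

With a lambda below 1 the transform compresses large values more than small ones, which is what pulls in the right tail of timestamp and frame_index.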

KDE Bandwidth Analysis

column n std iqr silverman_bandwidth scotts_bandwidth
timestamp 187507.0000 4.8659 6.9000 0.3862 0.2967
episode_index 187507.0000 573.2126 1002.0000 45.4941 34.9517
frame_index 187507.0000 48.6586 69.0000 3.8619 2.9670
index 187507.0000 54128.7528 93753.0000 4296.0273 3300.5092
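
The silverman_bandwidth column is consistent with Silverman's rule of thumb, h = 0.9 · min(std, IQR / 1.34) · n^(-1/5). A minimal sketch assuming that form; the function name is hypothetical, and the scotts_bandwidth column is not reproduced because its exact formula is not stated in the report.

```python
def silverman_bandwidth(n: int, std: float, iqr: float) -> float:
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel."""
    spread = min(std, iqr / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

# Inputs copied from the timestamp row of the table above.
print(silverman_bandwidth(187507, 4.8659, 6.9))
```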

[Figure: Best-Fit Distribution Overlay]

[Figure: ECDF Plot]

[Figure: Power Transform Comparison]

[Figure: Jarque-Bera Normality Test]

Advanced Correlation Analysis (ADV)

Partial Correlation Matrix

timestamp episode_index frame_index index
timestamp 1.0000 6.0442 -1.0000 -99974086307298640.0000
episode_index 6.0442 1.0000 -6.0442 8183723376125764.0000
frame_index -1.0000 -6.0442 1.0000 99974086307314160.0000
index -99974086307298656.0000 8183723376125764.0000 99974086307314160.0000 1.0000
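
Several entries in this matrix fall outside [-1, 1], which a valid partial correlation cannot do. The sketch below uses the first-order formula (controlling for a single variable) to show where the breakdown comes from: the denominator goes to zero when one variable is perfectly correlated with the control. This is a simplification, since the report presumably conditions on all remaining columns at once, and `partial_corr` is a hypothetical helper.

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float, tol: float = 1e-12):
    """First-order partial correlation of x and y given z, or None if undefined."""
    denom_sq = (1.0 - r_xz ** 2) * (1.0 - r_yz ** 2)
    if denom_sq <= tol:
        return None  # x or y is (numerically) collinear with z
    return (r_xy - r_xz * r_yz) / math.sqrt(denom_sq)

# timestamp vs episode_index, controlling for frame_index (r values from the report).
print(partial_corr(0.0221, 1.0, 0.0221))     # None: timestamp and frame_index have r = 1.0
# episode_index vs index, controlling for timestamp.
print(partial_corr(0.9999, 0.0221, 0.0225))
```

Dropping one column from each perfectly correlated pair before computing partial correlations avoids the singular case.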

Mutual Information Matrix

timestamp episode_index frame_index index
timestamp 0.0000 0.0362 5.0693 0.0353
episode_index 0.0362 0.0000 0.0000 6.3049
frame_index 5.0693 0.0000 0.0000 0.0000
index 0.0353 6.3049 0.0000 0.0000

Bootstrap Correlation 95% CI

col_a col_b pearson_r ci_lower ci_upper ci_width significant
timestamp episode_index 0.0221 -0.0039 0.0485 0.0524 False
timestamp frame_index 1.0000 1.0000 1.0000 0.0000 True
timestamp index 0.0225 -0.0048 0.0508 0.0556 False
episode_index frame_index 0.0221 -0.0045 0.0496 0.0541 False
episode_index index 0.9999 0.9999 1.0000 0.0000 True
frame_index index 0.0225 -0.0054 0.0500 0.0554 False

Distance Correlation Matrix

timestamp episode_index frame_index index
timestamp 1.0000 0.0360 1.0000 0.0361
episode_index 0.0360 1.0000 0.0360 0.9999
frame_index 1.0000 0.0360 1.0000 0.0361
index 0.0361 0.9999 0.0361 1.0000

[Figure: Partial Correlation Heatmap]

[Figure: Mutual Information Heatmap]

[Figure: Bootstrap Correlation CI]

[Figure: Correlation Network]

[Figure: Distance Correlation Heatmap]

Clustering Analysis (ADV)

K-Means Summary

Optimal K: 3
Best Silhouette: 0.40
Largest Cluster: 1,895

DBSCAN Summary

N Clusters (DBSCAN): 1
Noise Ratio: 0.0%
Eps: 1

Hierarchical Clustering

Optimal K: 3
Best Silhouette: 0 (as displayed; possibly rounded)

Cluster Profiles

timestamp episode_index frame_index index
cluster_0 4.1473 1508.9013 41.4728 141229.8786
cluster_1 4.3556 481.7476 43.5563 44448.0032
cluster_2 13.2153 1064.5127 132.1529 99369.0989

[Figure: Elbow & Silhouette]

[Figure: Cluster Scatter]

[Figure: Dendrogram]

[Figure: Cluster Profiles]

Dimensionality Reduction (ADV)

t-SNE Embedding

KL Divergence: 1
N Points: 5,000

Factor Analysis

N Factors: 2

Factor Loadings

factor_1 factor_2
timestamp 1.0000 -0.0000
episode_index 0.0221 -0.9997
frame_index 1.0000 0.0000
index 0.0225 -0.9997

PCA-Weighted Feature Contribution

column contribution_score rank
timestamp 0.5000 4.0000
episode_index 0.5000 2.0000
frame_index 0.5000 3.0000
index 0.5000 1.0000

[Figure: PCA Biplot]

[Figure: Explained Variance Curve]

[Figure: Factor Loadings Heatmap]

Feature Engineering Insights (ADV)

Interaction Detection

col_a col_b interaction_strength corr_product_a corr_product_b corr_a_b recommendation
timestamp episode_index 0.7259 0.7480 0.5459 0.0221 Strong interaction
episode_index frame_index 0.7259 0.5459 0.7480 0.0221 Strong interaction
timestamp index 0.7217 0.7442 0.5495 0.0225 Strong interaction
frame_index index 0.7217 0.7442 0.5495 0.0225 Strong interaction

Binning Analysis

column n_bins equal_width_entropy equal_freq_entropy max_entropy recommended_method skewness
timestamp 10 2.2300 3.3211 3.3219 equal_frequency 1.0530
episode_index 10 3.3208 3.3219 3.3219 equal_width -0.0093
frame_index 10 2.2300 3.3211 3.3219 equal_frequency 1.0530
index 10 3.3219 3.3219 3.3219 equal_width -0.0000

Advanced Anomaly Detection (ADV)

Isolation Forest

Anomaly Count: 250
Anomaly Ratio: 5.0%

Local Outlier Factor

Anomaly Count: 250
Anomaly Ratio: 5.0%

Consensus (>=2/3 agree)

Consensus Count: 120
Consensus Ratio: 2.4%

[Figure: Anomaly Scatter]

[Figure: Consensus Anomaly Comparison]

Statistical Tests (ADV)

Levene's Test (Equality of Variances)

col_a col_b levene_stat p_value log_var_ratio adjusted_p significant_0.05 stars
timestamp episode_index 564897.5197 0.0000 9.5380 0.0000 True ***
timestamp frame_index 222079.0332 0.0000 4.6052 0.0000 True ***
timestamp index 562425.8915 0.0000 18.6338 0.0000 True ***
episode_index frame_index 482754.0326 0.0000 4.9329 0.0000 True ***
episode_index index 550576.4333 0.0000 9.0957 0.0000 True ***
frame_index index 561596.5666 0.0000 14.0286 0.0000 True ***

Kruskal-Wallis Test

grouping_col numeric_col n_groups h_statistic p_value eta_squared effect_magnitude adjusted_p reject_h0_0.05 stars interpretation
task_index timestamp 3 10663.7432 0.0000 0.0569 small 0.0000 True *** Significant (η²=0.0569, small)
task_index episode_index 3 625.0241 0.0000 0.0033 small 0.0000 True *** Significant (η²=0.0033, small)
task_index frame_index 3 10663.6852 0.0000 0.0569 small 0.0000 True *** Significant (η²=0.0569, small)
task_index index 3 625.0238 0.0000 0.0033 small 0.0000 True *** Significant (η²=0.0033, small)
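
The eta_squared values are consistent with the usual effect-size estimate for Kruskal-Wallis, η² = (H − k + 1) / (n − k), with k groups and n observations. A minimal sketch assuming that formula; the function name is hypothetical.

```python
def kruskal_eta_squared(h_statistic: float, n_groups: int, n_obs: int) -> float:
    """Eta-squared effect size derived from a Kruskal-Wallis H statistic."""
    return (h_statistic - n_groups + 1) / (n_obs - n_groups)

# H statistics copied from the table above; n is the dataset's row count.
print(kruskal_eta_squared(10663.7432, 3, 187507))  # timestamp
print(kruskal_eta_squared(625.0241, 3, 187507))    # episode_index
```

The example shows why every p-value is 0.0000 while the effects stay small: with 187,507 rows, even a 5.7% share of explained rank variance produces a very large H.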

Mann-Whitney U Test

col_a col_b u_statistic p_value rank_biserial_r effect_magnitude adjusted_p significant_0.05 stars
timestamp episode_index 95185168.0000 0.0000 0.9946 large 0.0000 True ***
timestamp frame_index 2516704372.0000 0.0000 0.8568 large 0.0000 True ***
timestamp index 1278510.0000 0.0000 0.9999 large 0.0000 True ***
episode_index frame_index 34126631417.0000 0.0000 -0.9413 large 0.0000 True ***
episode_index index 188532431.5000 0.0000 0.9893 large 0.0000 True ***
frame_index index 11929357.5000 0.0000 0.9993 large 0.0000 True ***

Chi-Square Goodness of Fit

column n_categories chi2_stat p_value cramers_v effect_magnitude uniform_0.05 interpretation
task_index 3 1112.7892 0.0000 0.0545 small False Non-uniform distribution
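
The cramers_v value can be recovered from the chi-square statistic with the goodness-of-fit form V = sqrt(χ² / (n · (k − 1))). A minimal sketch assuming that form; the function name is hypothetical.

```python
import math

def cramers_v_goodness_of_fit(chi2: float, n_obs: int, n_categories: int) -> float:
    """Cramér's V effect size for a chi-square goodness-of-fit test."""
    return math.sqrt(chi2 / (n_obs * (n_categories - 1)))

# chi2_stat and n_categories from the table above; n is the dataset's row count.
print(cramers_v_goodness_of_fit(1112.7892, 187507, 3))
```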

Grubbs Outlier Test

column suspect_value grubbs_statistic critical_value is_outlier n
timestamp 35.1000 5.9163 5.1454 True 187507
episode_index 0.0000 1.7532 5.1454 False 187507
frame_index 351.0000 5.9163 5.1454 True 187507
index 0.0000 1.7320 5.1454 False 187507
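
The grubbs_statistic column matches G = |suspect − mean| / std, where the suspect is the value farthest from the mean. The sketch below computes only that statistic. The critical value needs a t-distribution quantile and is left out, and the sample list is hypothetical.

```python
import statistics

def grubbs_statistic(values):
    """Return (suspect_value, G) where G = |suspect - mean| / sample std."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    suspect = max(values, key=lambda v: abs(v - mean))
    return suspect, abs(suspect - mean) / std

# Hypothetical sample with one obvious extreme value.
print(grubbs_statistic([1.0, 2.0, 3.0, 4.0, 100.0]))

# Reproducing the timestamp row from the report's summary statistics.
g_timestamp = abs(35.1 - 6.3121) / 4.8659
print(round(g_timestamp, 4))
```

At this sample size the choice between sample and population standard deviation makes no visible difference to G.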

Data Profiling Summary (ADV)

Column Roles

column primary_role confidence secondary_role properties
observation.state id 0.8500 NaN {'unique_ratio': 1.0}
action id 0.8500 NaN {'unique_ratio': 1.0}
timestamp timestamp 0.7000 NaN {'dtype': 'float32', 'hint': 'monotonic numeric with time-like name'}
episode_index numeric_feature 0.8500 NaN {'dtype': 'int64'}
frame_index numeric_feature 0.8500 NaN {'dtype': 'int64'}
next.reward constant 1.0000 NaN {'n_unique': 1}
next.done binary 0.9000 NaN {'n_unique': 2, 'values': [False, True]}
index id 0.9000 NaN {'unique_ratio': 1.0}
task_index categorical_feature 0.8500 NaN {'n_unique': 3, 'unique_ratio': 0.0}

ML Readiness

Overall Score: 97/100 (A+)
completeness: 100.0
consistency: 97.8
balance: 100.0
informativeness: 100.0
independence: 80.0
scale: 100.0

Blocking Issues

  • 1 constant column(s) — remove before modelling
  • Extreme multicollinearity: VIF=266 for 'episode_index' — remove or combine

Suggestions

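  • Remove 3 ID-like column(s) before modelling: observation.state, action, index. Note that observation.state and action are flagged only because every value is unique (unique_ratio 1.0, dtype object); confirm they are not feature or target arrays before dropping them.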