f2a Report - lerobot/roboturk

Total: 187,507 rows across 1 subset / split

default / train

Overview

Rows: 187,507
Columns: 9
Numeric: 4
Categorical: 1
Text
Datetime
Memory Mb

Descriptive Statistics

column	type	count	unique	mean	median	std	se	cv	mad	min	max	range	p5	q1	q3	p95	iqr	skewness	kurtosis	top	freq
observation.state	text	187507	187507	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
action	text	187507	187507	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan
timestamp	numeric	187507	352	6.3121	5.3000	4.8659	0.0112	0.7709	3.3000	0.0000	35.1000	35.1000	0.4000	2.4000	9.3000	15.0000	6.9000	1.0530	1.4028	nan	nan
episode_index	numeric	187507	1995	1004.9688	998.0000	573.2126	1.3238	0.5704	500.0000	0.0000	1994.0000	1994.0000	108.0000	501.0000	1503.0000	1890.0000	1002.0000	-0.0093	-1.2072	nan	nan
frame_index	numeric	187507	352	63.1209	53.0000	48.6586	0.1124	0.7709	33.0000	0.0000	351.0000	351.0000	4.0000	24.0000	93.0000	150.0000	69.0000	1.0530	1.4028	nan	nan
next.reward	boolean	187507	1	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	0.0	187507.0000
next.done	boolean	187507	2	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	False	185512.0000
index	numeric	187507	187507	93753.0000	93753.0000	54128.7528	125.0027	0.5774	46877.0000	0.0000	187506.0000	187506.0000	9375.3000	46876.5000	140629.5000	178130.7000	93753.0000	-0.0000	-1.2000	nan	nan
task_index	categorical	187507	3	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	2	67436.0000

Distribution Histograms

Boxplots

Distribution Analysis

Normality Tests & Shape

column	n	skewness	skew_type	kurtosis	kurt_type	normality_test	is_normal_0.05	shapiro_p	anderson_stat	anderson_5pct_cv
timestamp	187507	1.0530	high skew	1.4028	leptokurtic	dagostino	False	NaN	3223.8222	0.7520
episode_index	187507	-0.0093	symmetric	-1.2072	platykurtic	dagostino	False	NaN	2148.6268	0.7520
frame_index	187507	1.0530	high skew	1.4028	leptokurtic	dagostino	False	NaN	3223.8205	0.7520
index	187507	-0.0000	symmetric	-1.2000	platykurtic	dagostino	False	NaN	2084.9274	0.7520

Violin Plots

Q-Q Plots

Correlation Analysis

Correlation Heatmap (Pearson)

Correlation Heatmap (Spearman)

Variance Inflation Factor (VIF)

column	VIF	multicollinearity
episode_index	266.2900	severe
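index	-0.0000	low
frame_index	-57054521769846057730048.0000	low
timestamp	-57054521769871039004672.0000	low

Note: a VIF cannot be below 1, so the zero and negative values for index, frame_index and timestamp are numerical artifacts of the near-perfectly collinear pairs (timestamp/frame_index, r = 1.0; episode_index/index, r = 0.9999). Their "low" labels should not be trusted.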
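For reference, the VIF of column j is usually defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining columns. The following is a minimal NumPy sketch under that textbook definition; it is not f2a's implementation, and the function name `vif` and the synthetic columns are illustrative assumptions. It returns infinity when a column is an exact linear function of the others.

```python
import numpy as np

def vif(X):
    """Return one variance inflation factor per column of X (rows = samples)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        # R^2 == 1 means column j is perfectly predictable: VIF is unbounded.
        factors.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return factors

# Synthetic stand-in (not the real data): 'timestamp' is exactly frame / 10,
# while 'noise' is unrelated to both.
rng = np.random.default_rng(0)
frame = rng.integers(0, 352, size=1000).astype(float)
timestamp = frame / 10.0
noise = rng.normal(size=1000)
factors = vif(np.column_stack([frame, timestamp, noise]))
```

With exactly collinear inputs the first two factors are unbounded (or astronomically large), while the unrelated column stays close to 1.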

Missing Data

column	missing_count	missing_ratio	missing_%	dtype
observation.state	0	0.0000	0.0000	object
action	0	0.0000	0.0000	object
timestamp	0	0.0000	0.0000	float32
episode_index	0	0.0000	0.0000	int64
frame_index	0	0.0000	0.0000	int64
next.reward	0	0.0000	0.0000	float32
next.done	0	0.0000	0.0000	bool
index	0	0.0000	0.0000	int64
task_index	0	0.0000	0.0000	int64

Missing Data Matrix

Outlier Detection

column	q1	q3	iqr	lower_bound	upper_bound	outlier_count	outlier_%	min_outlier	max_outlier
timestamp	2.4000	9.3000	6.9000	-7.9500	19.6500	2712.0000	1.4500	19.7000	35.1000
episode_index	501.0000	1503.0000	1002.0000	-1002.0000	3006.0000	0.0000	0.0000	nan	nan
frame_index	24.0000	93.0000	69.0000	-79.5000	196.5000	2712.0000	1.4500	197.0000	351.0000
index	46876.5000	140629.5000	93753.0000	-93753.0000	281259.0000	0.0000	0.0000	nan	nan
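The lower and upper bounds in this table match Tukey's fences, q1 - 1.5 * IQR and q3 + 1.5 * IQR. A minimal sketch, assuming that rule (the helper name `iqr_fences` is illustrative):

```python
def iqr_fences(q1, q3, k=1.5):
    """Tukey fences: values outside [q1 - k*IQR, q3 + k*IQR] count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the table above.
ts_low, ts_high = iqr_fences(2.4, 9.3)      # timestamp
fr_low, fr_high = iqr_fences(24.0, 93.0)    # frame_index
```

Any timestamp above 19.65 (equivalently, any frame_index above 196.5) is counted as an outlier, which is why both columns report the same 2,712 rows.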

Categorical Analysis

Summary

column	count	unique	top_value	top_frequency	top_%	entropy	norm_entropy
task_index	187507	3	2	67436	35.9600	1.5806	0.9973
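The entropy and norm_entropy columns are consistent with Shannon entropy in bits divided by log2 of the number of categories (1.5806 / log2(3) ≈ 0.9973). A small sketch under that assumption; the per-class counts below are hypothetical, since the report only gives the top class (67,436 rows):

```python
import math

def entropy_bits(counts):
    """Shannon entropy (base 2) of a list of category counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def normalized_entropy(counts):
    """Entropy divided by its maximum, log2(k), for k non-empty categories."""
    k = sum(1 for c in counts if c > 0)
    return entropy_bits(counts) / math.log2(k) if k > 1 else 0.0

# Hypothetical split of the 187,507 rows; only the first count is from the report.
example_counts = [67436, 62000, 58071]
example_norm = normalized_entropy(example_counts)
```

A value near 1 means the three task_index classes are close to balanced.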

Feature Importance

column	variance	std	cv	range
index	2929921879.6667	54128.7528	0.5774	187506.0000
episode_index	328572.6540	573.2126	0.5704	1994.0000
frame_index	2367.6604	48.6586	0.7709	351.0000
timestamp	23.6766	4.8659	0.7709	35.1000

PCA Analysis

N Components: 4
Total Variance Explained: 100.0%
Components For 90Pct: 2
Top Component Variance: 0.5058

Variance Explained

component	variance_ratio	cumulative_ratio	eigenvalue
PC1	0.5058	0.5058	2.0232
PC2	0.4942	1.0000	1.9768
PC3	0.0000	1.0000	0.0001
PC4	0.0000	1.0000	0.0000

Loadings

	PC1	PC2	PC3	PC4
timestamp	0.5003	-0.4997	-0.0001	0.7071
episode_index	0.4996	0.5004	-0.7071	0.0000
frame_index	0.5003	-0.4997	-0.0001	-0.7071
index	0.4998	0.5002	0.7071	-0.0000

PCA Scree Plot

PCA Loadings

Warnings

High correlation: timestamp <-> frame_index (r=1.0)
High correlation: episode_index <-> index (r=0.9999)

Auto-Generated Insights (ADV)

Executive Summary

Dataset contains 187,507 rows and 9 columns (4 numeric, 1 categorical). 4 high-priority findings detected. 5 moderate observations noted. Key highlights:
1. 2 column pairs with |r| > 0.9
2. 2 likely confounded correlations detected
3. 4/4 numeric columns are non-normal

Total Insights: 10
Critical: 0
High: 4
Medium: 5
Low: 1

Insight Details

2 column pair(s) with |r| > 0.9 (HIGH · 0.8)

correlation

Near-perfect linear relationships detected. Top pair: 'timestamp' ↔ 'frame_index' (r=1.000).

Consider dropping one column from each pair to reduce redundancy
Verify these are not data leakage or duplicate columns
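One way to check whether a correlated pair is really a duplicate is to test if one column is an exact linear rescaling of the other. The sketch below is illustrative only: the helper `linear_rescaling` is hypothetical, and the sample rows assume timestamp = frame_index / 10, which the summary statistics suggest (for example max 35.1 versus 351) but which should be confirmed on the real data.

```python
def linear_rescaling(xs, ys, tol=1e-6):
    """Return (scale, offset) if ys == scale * xs + offset on every row, else None."""
    pairs = list(zip(xs, ys))
    x0, y0 = pairs[0]
    # Find a second point with a different x so the slope is defined.
    other = next(((x, y) for x, y in pairs if x != x0), None)
    if other is None:
        return None
    scale = (other[1] - y0) / (other[0] - x0)
    offset = y0 - scale * x0
    if all(abs(scale * x + offset - y) <= tol for x, y in pairs):
        return scale, offset
    return None

# Illustrative rows, not read from the dataset.
frame_index = [0, 1, 2, 53, 351]
timestamp = [0.0, 0.1, 0.2, 5.3, 35.1]
fit = linear_rescaling(frame_index, timestamp)
```

The tolerance matters here because timestamp is stored as float32, so an exact equality test would fail on rounding error alone.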

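2 likely confounded correlation(s) detected (HIGH · 0.7)

correlation

Raw correlation differs significantly from partial correlation, suggesting confounding variables. Top: 'episode_index' ↔ 'index' (raw r=1.00, partial r=8183723376125764.00). A partial correlation must lie in [-1, 1]; a value this large indicates that the correlation matrix is numerically singular (timestamp and frame_index are perfectly collinear), so it should be read as a numerical artifact, not as evidence of confounding.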

Do not assume causal relationship from raw correlation for these pairs
Investigate which variables are confounders

4/4 numeric columns are non-normal (MEDIUM · 0.7)

distribution

All 4 numeric columns fail normality tests (α=0.05). Non-parametric methods may be more appropriate.

Prefer non-parametric tests (Kruskal-Wallis, Mann-Whitney) over t-tests/ANOVA
Consider power transforms if normality is needed for downstream models

4 column(s) best fit by non-normal distributions (MEDIUM · 0.7)

distribution

Distribution fitting reveals non-Normal best fits: beta (2 columns), lognorm (1 column), uniform (1 column).

Use the identified distributions for parametric modeling or simulation
Transform columns toward normality if Gaussian assumptions are needed

Clear cluster structure found (k=3, silhouette=0.40) (HIGH · 0.7)

cluster

K-Means identifies 3 clusters (silhouette=0.40, which indicates moderate rather than strong separation). Cluster sizes: cluster_0: 1,895; cluster_1: 1,882; cluster_2: 1,223 (5,000 sampled rows in total).

Profile each cluster to understand segment characteristics
Use cluster labels as a feature for downstream modelling

1 column(s) with severe multicollinearity (VIF>10) (HIGH · 0.6)

correlation

VIF > 10 detected for: episode_index. Worst: 'episode_index' (VIF=266.3). Redundant information may cause model instability.

Remove one column from each highly correlated pair
Apply PCA or regularization (Ridge/Lasso) to handle collinearity

4 strong feature interaction(s) detected (MEDIUM · 0.6)

feature

Top interaction: 'timestamp' × 'episode_index' (strength=0.73). Product features may improve model performance.

Create interaction (product) features for the top pairs

2 column(s) benefit from power transformation (MEDIUM · 0.6)

distribution

Box-Cox / Yeo-Johnson transforms can significantly reduce skewness for columns: timestamp, frame_index.

Apply the recommended transform (Box-Cox or Yeo-Johnson) in preprocessing
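As a rough illustration of what the Yeo-Johnson transform does for a fixed lambda, here is a plain-Python sketch of the standard piecewise formula. It is not f2a's code; the lambda is the value reported for timestamp in the Power Transform Recommendation table, and the sample inputs are the column's min, q1, median, q3 and max.

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)

# Lambda reported for 'timestamp'; inputs are its min, q1, median, q3, max.
LAMBDA_TIMESTAMP = 0.2569
transformed = [yeo_johnson(t, LAMBDA_TIMESTAMP) for t in (0.0, 2.4, 5.3, 9.3, 35.1)]
```

With lambda below 1 the transform compresses large values more than small ones, which is what pulls in the right tail of a positively skewed column.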

Multi-method anomalies: 120 rows (2.4%) (MEDIUM · 0.5)

anomaly

A small fraction of rows are flagged by multiple anomaly detection methods. The 2.4% is relative to a 5,000-row sample (120 of 5,000), not to the full 187,507 rows.

Review flagged rows for data entry errors or special cases

No missing values detected in any column (LOW · 0.3)

missing

All columns are fully populated — no imputation needed.

Insight Severity Distribution

Top Insights

Advanced Distribution Analysis (ADV)

Best-Fit Distribution

column	best_distribution	aic	bic	ks_statistic	ks_p_value	fit_quality
timestamp	beta	28275.5600	28301.6300	0.0363	0.0000	poor
episode_index	beta	75977.6500	76003.7200	0.0150	0.2102	good
frame_index	lognorm	20261.0000	20280.5500	0.5098	0.0000	poor
index	uniform	121419.0200	121432.0600	0.0136	0.3124	good

Jarque-Bera Normality Test

column	jb_statistic	is_normal_0.05	skewness	kurtosis
timestamp	50024.3984	False	1.0530	1.4028
episode_index	11389.0254	False	-0.0093	-1.2072
frame_index	50024.3684	False	1.0530	1.4028
index	11250.4200	False	-0.0000	-1.2000
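The jb_statistic column can be approximately reproduced from the skewness and kurtosis shown in the same row, assuming the kurtosis is excess kurtosis and the usual formula JB = n/6 * (S^2 + K^2/4). A minimal sketch (the function name is illustrative):

```python
def jarque_bera(n, skewness, excess_kurtosis):
    """Jarque-Bera statistic from sample size, skewness and excess kurtosis."""
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

# Inputs copied from the table above (rounded to four decimals there).
jb_index = jarque_bera(187507, 0.0, -1.2)
jb_timestamp = jarque_bera(187507, 1.0530, 1.4028)
```

Because n multiplies the whole expression, even a modest departure from normality yields a very large statistic at 187,507 rows.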

Power Transform Recommendation

column	original_skewness	recommended_method	optimal_lambda	transformed_skewness	needs_transform	improvement
timestamp	1.0530	yeo-johnson	0.2569	-0.0496	True	1.0034
episode_index	-0.0093	yeo-johnson	0.7184	-0.2834	False	-0.2741
frame_index	1.0530	yeo-johnson	0.3990	-0.0965	True	0.9565
index	-0.0000	yeo-johnson	0.7071	-0.2916	False	-0.2916

KDE Bandwidth Analysis

column	n	std	iqr	silverman_bandwidth	scotts_bandwidth
timestamp	187507.0000	4.8659	6.9000	0.3862	0.2967
episode_index	187507.0000	573.2126	1002.0000	45.4941	34.9517
frame_index	187507.0000	48.6586	69.0000	3.8619	2.9670
index	187507.0000	54128.7528	93753.0000	4296.0273	3300.5092
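The reported bandwidths appear consistent with two rules of thumb: Silverman's rule, 0.9 * min(std, IQR / 1.34) * n^(-1/5), and, for the scotts_bandwidth column, 3.49 * std * n^(-1/3), which is Scott's rule for histogram bin width. This is inferred from the numbers in the table rather than from f2a's source, so treat the sketch below as an assumption:

```python
def silverman_bandwidth(std, iqr, n):
    """Silverman's rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    return 0.9 * min(std, iqr / 1.34) * n ** (-1 / 5)

def scott_bin_width(std, n):
    """Scott's rule for histogram bin width: 3.49 * std * n^(-1/3)."""
    return 3.49 * std * n ** (-1 / 3)

# Inputs copied from the 'timestamp' row of the table above.
h_silverman = silverman_bandwidth(4.8659, 6.9, 187507)
h_scott = scott_bin_width(4.8659, 187507)
```

Both results round to the values shown for timestamp (0.3862 and 0.2967).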

Best-Fit Distribution Overlay

ECDF Plot

Power Transform Comparison

Advanced Correlation Analysis (ADV)

Partial Correlation Matrix

	timestamp	episode_index	frame_index	index
timestamp	1.0000	6.0442	-1.0000	-99974086307298640.0000
episode_index	6.0442	1.0000	-6.0442	8183723376125764.0000
frame_index	-1.0000	-6.0442	1.0000	99974086307314160.0000
index	-99974086307298656.0000	8183723376125764.0000	99974086307314160.0000	1.0000

Mutual Information Matrix

	timestamp	episode_index	frame_index	index
timestamp	0.0000	0.0362	5.0693	0.0353
episode_index	0.0362	0.0000	0.0000	6.3049
frame_index	5.0693	0.0000	0.0000	0.0000
index	0.0353	6.3049	0.0000	0.0000

Bootstrap Correlation 95% CI

	col_a	col_b	pearson_r	ci_lower	ci_upper	ci_width	significant
0	timestamp	episode_index	0.0221	-0.0039	0.0485	0.0524	False
1	timestamp	frame_index	1.0000	1.0000	1.0000	0.0000	True
2	timestamp	index	0.0225	-0.0048	0.0508	0.0556	False
3	episode_index	frame_index	0.0221	-0.0045	0.0496	0.0541	False
4	episode_index	index	0.9999	0.9999	1.0000	0.0000	True
5	frame_index	index	0.0225	-0.0054	0.0500	0.0554	False

Distance Correlation Matrix

	timestamp	episode_index	frame_index	index
timestamp	1.0000	0.0360	1.0000	0.0361
episode_index	0.0360	1.0000	0.0360	0.9999
frame_index	1.0000	0.0360	1.0000	0.0361
index	0.0361	0.9999	0.0361	1.0000

Partial Correlation Heatmap

Mutual Information Heatmap

Bootstrap Correlation CI

Correlation Network

Distance Correlation Heatmap

Clustering Analysis (ADV)

K-Means Summary

Optimal K: 3
Best Silhouette: 0.40
Largest Cluster: 1,895

DBSCAN Summary

N Clusters Dbscan
Noise Ratio: 0.0%
Eps

Hierarchical Clustering

Optimal K
Best Silhouette

Cluster Profiles

	timestamp	episode_index	frame_index	index
cluster_0	4.1473	1508.9013	41.4728	141229.8786
cluster_1	4.3556	481.7476	43.5563	44448.0032
cluster_2	13.2153	1064.5127	132.1529	99369.0989

Elbow & Silhouette

Cluster Scatter

Dendrogram

Dimensionality Reduction (ADV)

t-SNE Embedding

Kl Divergence
N Points: 5,000

Factor Analysis

N Factors: 2

Factor Loadings

	factor_1	factor_2
timestamp	1.0000	-0.0000
episode_index	0.0221	-0.9997
frame_index	1.0000	0.0000
index	0.0225	-0.9997

PCA-Weighted Feature Contribution

column	contribution_score	rank
timestamp	0.5000	4.0000
episode_index	0.5000	2.0000
frame_index	0.5000	3.0000
index	0.5000	1.0000

PCA Biplot

Explained Variance Curve

Factor Loadings Heatmap

Feature Engineering Insights (ADV)

Interaction Detection

	col_a	col_b	interaction_strength	corr_product_a	corr_product_b	corr_a_b	recommendation
0	timestamp	episode_index	0.7259	0.7480	0.5459	0.0221	Strong interaction
1	episode_index	frame_index	0.7259	0.5459	0.7480	0.0221	Strong interaction
2	timestamp	index	0.7217	0.7442	0.5495	0.0225	Strong interaction
3	frame_index	index	0.7217	0.7442	0.5495	0.0225	Strong interaction

Binning Analysis

column	n_bins	equal_width_entropy	equal_freq_entropy	max_entropy	recommended_method	skewness
timestamp	10	2.2300	3.3211	3.3219	equal_frequency	1.0530
episode_index	10	3.3208	3.3219	3.3219	equal_width	-0.0093
frame_index	10	2.2300	3.3211	3.3219	equal_frequency	1.0530
index	10	3.3219	3.3219	3.3219	equal_width	-0.0000

Advanced Anomaly Detection (ADV)

Isolation Forest

Anomaly Count: 250
Anomaly Ratio: 5.0%

Local Outlier Factor

Anomaly Count: 250
Anomaly Ratio: 5.0%

Consensus (>=2/3 agree)

Consensus Count: 120
Consensus Ratio: 2.4%
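The consensus rule (a row counts as anomalous when at least 2 of 3 detectors agree) can be sketched as a simple vote. The report names only Isolation Forest and Local Outlier Factor, so the third detector and all flags below are hypothetical:

```python
def consensus_flags(method_flags, min_votes=2):
    """Flag a row when at least `min_votes` detectors marked it anomalous."""
    n_rows = len(method_flags[0])
    votes = [sum(flags[i] for flags in method_flags) for i in range(n_rows)]
    return [v >= min_votes for v in votes]

# Toy flags from three detectors over six rows (not taken from the dataset).
iso_forest = [True, False, False, True, False, False]
lof = [True, False, True, False, False, False]
third_detector = [False, False, True, True, False, True]  # hypothetical method
consensus = consensus_flags([iso_forest, lof, third_detector])
consensus_ratio = sum(consensus) / len(consensus)
```

Requiring agreement lowers the flagged share below each detector's own rate, as with the 2.4% consensus versus 5.0% per method reported here.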

Anomaly Scatter

Consensus Anomaly Comparison

Statistical Tests (ADV)

Levene's Test (Equality of Variances)

	col_a	col_b	levene_stat	log_var_ratio	significant_0.05	stars
0	timestamp	episode_index	564897.5197	9.5380	True	***
1	timestamp	frame_index	222079.0332	4.6052	True	***
2	timestamp	index	562425.8915	18.6338	True	***
3	episode_index	frame_index	482754.0326	4.9329	True	***
4	episode_index	index	550576.4333	9.0957	True	***
5	frame_index	index	561596.5666	14.0286	True	***

Kruskal-Wallis Test

	grouping_col	numeric_col	n_groups	h_statistic	eta_squared	effect_magnitude	reject_h0_0.05	stars	interpretation
0	task_index	timestamp	3	10663.7432	0.0569	small	True	***	Significant (η²=0.0569, small)
1	task_index	episode_index	3	625.0241	0.0033	small	True	***	Significant (η²=0.0033, small)
2	task_index	frame_index	3	10663.6852	0.0569	small	True	***	Significant (η²=0.0569, small)
3	task_index	index	3	625.0238	0.0033	small	True	***	Significant (η²=0.0033, small)
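The eta_squared column is consistent with the common Kruskal-Wallis effect-size estimate (H - k + 1) / (n - k), with k groups and n observations. A minimal sketch, assuming that is the formula used:

```python
def kruskal_eta_squared(h_statistic, n_groups, n_obs):
    """Eta-squared effect size derived from a Kruskal-Wallis H statistic."""
    return (h_statistic - n_groups + 1) / (n_obs - n_groups)

# H statistics copied from the table above (3 task_index groups, 187,507 rows).
eta_timestamp = kruskal_eta_squared(10663.7432, 3, 187507)
eta_episode = kruskal_eta_squared(625.0241, 3, 187507)
```

This shows why the result can be highly significant yet small in effect: H is large mainly because n is large.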

Mann-Whitney U Test

	col_a	col_b	u_statistic	rank_biserial_r	effect_magnitude	significant_0.05	stars
0	timestamp	episode_index	95185168.0000	0.9946	large	True	***
1	timestamp	frame_index	2516704372.0000	0.8568	large	True	***
2	timestamp	index	1278510.0000	0.9999	large	True	***
3	episode_index	frame_index	34126631417.0000	-0.9413	large	True	***
4	episode_index	index	188532431.5000	0.9893	large	True	***
5	frame_index	index	11929357.5000	0.9993	large	True	***

Chi-Square Goodness of Fit

column	n_categories	chi2_stat	p_value	cramers_v	effect_magnitude	uniform_0.05	interpretation
task_index	3	1112.7892	0.0000	0.0545	small	False	Non-uniform distribution
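For a goodness-of-fit test against a uniform distribution, the statistic and effect size can be sketched as follows. The reported cramers_v matches sqrt(chi2 / (n * (k - 1))); that the tool uses exactly this form is an assumption, and the helper names are illustrative.

```python
import math

def chi2_uniform(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def cramers_v_gof(chi2, n, k):
    """Cramer's V for a goodness-of-fit test with k categories and n rows."""
    return math.sqrt(chi2 / (n * (k - 1)))

# Values copied from the table above.
v_reported = cramers_v_gof(1112.7892, 187507, 3)
```

A V of about 0.05 means the three task_index classes deviate only slightly from equal frequencies, despite the p-value of 0.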

Grubbs Outlier Test

column	suspect_value	grubbs_statistic	critical_value	is_outlier	n
timestamp	35.1000	5.9163	5.1454	True	187507
episode_index	0.0000	1.7532	5.1454	False	187507
frame_index	351.0000	5.9163	5.1454	True	187507
index	0.0000	1.7320	5.1454	False	187507
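The grubbs_statistic column is the distance of the most extreme value from the mean, measured in standard deviations. A short sketch using the summary statistics from the Descriptive Statistics table (the critical value depends on a t-distribution quantile and is not recomputed here):

```python
def grubbs_statistic(suspect, mean, std):
    """Grubbs' G: absolute deviation of the suspect value in units of std."""
    return abs(suspect - mean) / std

# Suspect value, mean and std copied from the tables above.
g_timestamp = grubbs_statistic(35.1, 6.3121, 4.8659)
g_episode = grubbs_statistic(0.0, 1004.9688, 573.2126)

# A value is declared an outlier when G exceeds the critical value (5.1454 here).
timestamp_is_outlier = g_timestamp > 5.1454
episode_is_outlier = g_episode > 5.1454
```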

Data Profiling Summary (ADV)

Column Roles

column	primary_role	confidence	secondary_role	properties
observation.state	id	0.8500	NaN	{'unique_ratio': 1.0}
action	id	0.8500	NaN	{'unique_ratio': 1.0}
timestamp	timestamp	0.7000	NaN	{'dtype': 'float32', 'hint': 'monotonic numeric with time-like name'}
episode_index	numeric_feature	0.8500	NaN	{'dtype': 'int64'}
frame_index	numeric_feature	0.8500	NaN	{'dtype': 'int64'}
next.reward	constant	1.0000	NaN	{'n_unique': 1}
next.done	binary	0.9000	NaN	{'n_unique': 2, 'values': [False, True]}
index	id	0.9000	NaN	{'unique_ratio': 1.0}
task_index	categorical_feature	0.8500	NaN	{'n_unique': 3, 'unique_ratio': 0.0}

ML Readiness

Overall Score: 97/100 (A+)
completeness: 100.0
consistency: 97.8
balance: 100.0
informativeness: 100.0
independence: 80.0
scale: 100.0

Blocking Issues

1 constant column (next.reward): remove before modelling
Extreme multicollinearity: VIF=266 for 'episode_index'; remove or combine

Suggestions

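Remove 3 ID-like column(s) before modelling: observation.state, action, index. Note that observation.state and action are flagged only because every value is unique (unique_ratio = 1.0, dtype object); in a robot-learning dataset they are likely the array-valued state and action vectors, so check them before dropping.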
