This page documents the structure, sources, and limitations of the datasets provided in The Baseball Scholar Data Library. It is intended to help researchers, writers, and analysts understand how the data is compiled and how it should be used.
Dataset Overview
The Baseball Scholar datasets contain individual Major League Baseball player statistics for every season from 1901 to the present. Each row represents one player season and includes standard batting and pitching statistics along with fielding positions.
- Coverage: Major League Baseball
- Seasons: 1901–Present
- Granularity: One row per player per season
- Formats: CSV and Excel
File Structure
Each dataset is organized as a flat table where rows represent individual player seasons and columns represent statistical fields. No aggregation or normalization is applied beyond standardizing column names and formatting.
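As a minimal sketch of working with this layout, the Python snippet below reads a flat player-season table with the standard csv module and checks that no player appears twice in the same season. The sample rows, IDs, and names are made-up placeholders standing in for a downloaded CSV file, and the helper names are hypothetical.

```python
import csv
import io

# Made-up sample rows standing in for a downloaded CSV file.
# IDs, names, and position codes are placeholders, not real data.
SAMPLE_CSV = """fangraphs_player_id,bbref_player_id,season,nameascii,primary_position
1001,examp01,1927,Example Player,RF
1001,examp01,1928,Example Player,RF
1002,sampl01,1927,Sample Player,1B
"""

def load_player_seasons(handle):
    """Read a flat player-season table into a list of dicts, one per row."""
    return list(csv.DictReader(handle))

def duplicate_keys(rows):
    """Return (bbref_player_id, season) pairs that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        key = (row["bbref_player_id"], row["season"])
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return sorted(dupes)

rows = load_player_seasons(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows;", "duplicates:", duplicate_keys(rows))
```

To read a real download instead, pass a file opened with open(path, newline="", encoding="utf-8") to load_player_seasons. The encoding is an assumption and may need adjusting for a given file.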
Column Definitions
Column names are standardized across seasons to allow for consistent analysis. Statistical definitions reflect the official scoring rules in effect during each season.
- fangraphs_player_id – player ID in the FanGraphs database
- bbref_player_id – player ID in the Baseball-Reference database
- season – year of the season
- nameascii – player name
- primary_position – position at which the player appeared most often during the season
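The columns listed above can be used directly as filters and grouping keys. The Python sketch below counts player seasons by primary_position within a range of seasons. The sample rows are made up, the position codes ("RF", "1B") are assumptions about how positions are encoded, and seasons_by_position is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the position codes shown are assumed, not confirmed.
SAMPLE_CSV = """fangraphs_player_id,bbref_player_id,season,nameascii,primary_position
1001,examp01,1927,Example Player,RF
1001,examp01,1928,Example Player,RF
1002,sampl01,1927,Sample Player,1B
"""

def seasons_by_position(rows, first_season, last_season):
    """Count player seasons per primary_position in an inclusive season range."""
    counts = Counter()
    for row in rows:
        # csv yields strings, so convert season before comparing.
        if first_season <= int(row["season"]) <= last_season:
            counts[row["primary_position"]] += 1
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
position_counts = seasons_by_position(rows, 1927, 1927)
print(dict(position_counts))
```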
Data Sources
The Baseball Scholar datasets are compiled from publicly available historical Major League Baseball records. Source data is cleaned and standardized to ensure consistent formatting across eras.
No proprietary or restricted data sources are used.
Methodology
Raw historical records are processed to standardize column names, data types, and formatting. Player seasons are preserved as recorded, without retroactive adjustment or era normalization.
- No park adjustments are applied
- No era normalization is applied
- No rate stats are altered or recalculated
- Missing historical fields are left blank where unavailable
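Because unavailable fields are left blank, a blank is best read as "not recorded" rather than as zero. The Python sketch below shows one way to keep that distinction when parsing. The rows are made up, and sb is a hypothetical column name used only for illustration.

```python
import csv
import io

# Made-up rows; "sb" is a hypothetical column name, and the first row
# leaves it blank to mimic a field that was not recorded.
SAMPLE_CSV = """bbref_player_id,season,sb
examp01,1905,
examp01,1906,12
sampl01,1906,0
"""

def parse_optional_int(value):
    """Return an int, or None when the field is blank (missing, not zero)."""
    text = (value or "").strip()
    return int(text) if text else None

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
stolen_bases = [parse_optional_int(row["sb"]) for row in rows]

# Summaries should use only the values that were actually recorded.
known = [value for value in stolen_bases if value is not None]
print(stolen_bases, sum(known), len(known))
```

Keeping None separate from 0 means averages and totals are computed over recorded seasons only, instead of silently treating gaps as zeros.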
Historical Context & Limitations
Baseball rules, season lengths, and statistical tracking practices have changed significantly since 1901. As a result, direct comparisons across eras should be made with appropriate historical context.
- Early seasons may lack complete statistical coverage
- Rule changes affect the interpretation of certain statistics
- Season length varies across eras
- Official scoring practices have evolved over time
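Since the files contain no era or schedule adjustments, any such scaling is left to the analyst. One simple way to put counting totals from seasons of different lengths on a more comparable footing is to divide by games played. The Python sketch below is an illustration only: the numbers are made up, per_game is a hypothetical helper, and it assumes the dataset exposes a counting stat and a games-played field under whatever names the files actually use.

```python
def per_game(total, games):
    """Return total / games, or None when a value is missing or games is zero."""
    if total is None or not games:
        return None
    return total / games

# Made-up totals from two seasons with different numbers of games played.
longer_season_rate = per_game(30, 150)
shorter_season_rate = per_game(28, 140)

# A blank (missing) games field yields None instead of raising an error.
missing_rate = per_game(10, None)

print(longer_season_rate, shorter_season_rate, missing_rate)
```

A per-game rate does not account for rule changes or scoring practices, so it narrows only the season-length difference.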
Intended Use
These datasets are intended for research, analysis, visualization, and educational use. They are suitable for statistical modeling, historical comparison, and supporting written analysis.
Updates and Versioning
Datasets are updated periodically as new seasons conclude and historical corrections are identified. File names may include version indicators to distinguish updates.
Attribution
If you use these datasets in published work, articles, or public projects, attribution to “The Baseball Scholar” is appreciated.
The Baseball Scholar. MLB Player Statistics Dataset (1901–Present). https://thebaseballscholar.com/data-library