This page documents the structure, sources, and limitations of the datasets provided in The Baseball Scholar Data Library. It is intended to help researchers, writers, and analysts understand how the data is compiled and how it should be used.

Dataset Overview

The Baseball Scholar datasets contain individual Major League Baseball player statistics for every season from 1901 to the present. Each row represents one player season and includes standard batting and pitching statistics along with fielding positions.

  • Coverage: Major League Baseball
  • Seasons: 1901–Present
  • Granularity: One row per player per season
  • Formats: CSV and Excel

File Structure

Each dataset is organized as a flat table where rows represent individual player seasons and columns represent statistical fields. No aggregation or normalization is applied beyond standardizing column names and formatting.
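
Because each file is a flat table, any CSV reader can load it. The sketch below is one possible way to read a file and confirm the one-row-per-player-per-season granularity, using only the Python standard library. The inline sample stands in for a real download. The hr column and every value in it are made-up placeholders, and using bbref_player_id plus season as the row key is an assumption rather than a documented guarantee.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a downloaded CSV file; the "hr" column
# and every value here are placeholders, not real dataset contents.
SAMPLE_CSV = """bbref_player_id,season,nameascii,primary_position,hr
playera01,1927,Player A,RF,10
playera01,1928,Player A,RF,12
playerb01,1927,Player B,1B,7
"""

def load_rows(text):
    """Read a flat CSV table into a list of dicts, one per player season."""
    return list(csv.DictReader(io.StringIO(text)))

def duplicate_player_seasons(rows):
    """Return (player id, season) pairs that appear on more than one row."""
    counts = Counter((r["bbref_player_id"], r["season"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)

rows = load_rows(SAMPLE_CSV)
duplicates = duplicate_player_seasons(rows)
print(len(rows), "rows;", len(duplicates), "duplicate player seasons")
```

For a real file, replacing io.StringIO(text) with an open file handle (opened with newline="") should behave the same way.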

Column Definitions

Column names are standardized across seasons to allow for consistent analysis. Statistical definitions reflect the official scoring rules in effect during each season.

  • fangraphs_player_id – player ID in the FanGraphs database
  • bbref_player_id – player ID in the Baseball-Reference database
  • season – year of the season
  • nameascii – player name
  • primary_position – position at which the player appeared most often during the season
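
As an illustration of how the identifying columns above might be read into typed records, here is a minimal Python sketch. The PlayerSeason class and parse_player_season helper are hypothetical names introduced for this example. Both ID columns are kept as strings because their exact formats are not specified here, and the sample row is invented.

```python
import csv
import io
from dataclasses import dataclass

@dataclass(frozen=True)
class PlayerSeason:
    """Identifying fields for one row (hypothetical helper type)."""
    fangraphs_player_id: str  # kept as text; ID format is an assumption
    bbref_player_id: str
    season: int
    nameascii: str
    primary_position: str

def parse_player_season(row):
    """Build a PlayerSeason from one CSV row, ignoring any other columns."""
    return PlayerSeason(
        fangraphs_player_id=row["fangraphs_player_id"],
        bbref_player_id=row["bbref_player_id"],
        season=int(row["season"]),
        nameascii=row["nameascii"],
        primary_position=row["primary_position"],
    )

# Invented sample row; not real dataset contents.
SAMPLE_CSV = """fangraphs_player_id,bbref_player_id,season,nameascii,primary_position
1001,playera01,1955,Player A,CF
"""

records = [parse_player_season(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0])
```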

Data Sources

The Baseball Scholar datasets are compiled from publicly available historical Major League Baseball records. Source data is cleaned and standardized to ensure consistent formatting across eras.

No proprietary or restricted data sources are used.

Methodology

Raw historical records are processed to standardize column names, data types, and formatting. Player seasons are preserved as recorded, without retroactive adjustment or era normalization.

  • No park adjustments are applied
  • No era normalization is applied
  • No rate stats are altered or recalculated
  • Missing historical fields are left blank where unavailable
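
Since unavailable fields are left blank rather than filled with zero, analysis code should keep "not recorded" distinct from "zero". The following Python sketch shows one way to do that. The sb and cs column names and all values are hypothetical placeholders chosen for illustration.

```python
import csv
import io

# Invented sample: the first row has a blank "cs" cell, meaning the value
# is unavailable. Column names "sb" and "cs" are assumptions.
SAMPLE_CSV = """bbref_player_id,season,sb,cs
playera01,1905,20,
playerb01,1985,30,10
"""

def to_int_or_none(value):
    """Convert a CSV cell to int, mapping blank cells to None (not 0)."""
    value = (value or "").strip()
    return int(value) if value else None

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
caught = [to_int_or_none(r["cs"]) for r in rows]

# Aggregate only over rows where the field was actually recorded.
known = [v for v in caught if v is not None]
print(caught)
print(sum(known), "total from", len(known), "of", len(caught), "rows")
```

Reporting how many rows contributed to a total makes gaps in early-season coverage visible instead of silently counting them as zeros.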

Historical Context & Limitations

Baseball rules, season lengths, and statistical tracking practices have changed significantly since 1901. As a result, direct comparisons across eras should be made with appropriate historical context.

  • Early seasons may lack complete statistical coverage
  • Rule changes affect the interpretation of certain statistics
  • Season length varies across eras
  • Official scoring practices have evolved over time
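
One common way to soften the effect of differing season lengths is to compare counting statistics on a per-game basis. The Python sketch below illustrates the idea on the user's side only; the datasets themselves do not recalculate rate stats. The g and hr column names and all values are invented for illustration, and a per-game rate does not account for rule or scoring changes.

```python
def per_game_rate(count, games):
    """Return count per game played, or None when it cannot be computed."""
    if count is None or not games:
        return None
    return count / games

# Invented rows; "g" (games) and "hr" are assumed column names.
seasons = [
    {"season": 1915, "g": 140, "hr": 7},
    {"season": 2015, "g": 160, "hr": 8},
    {"season": 1905, "g": None, "hr": 3},  # games unavailable
]

for s in seasons:
    s["hr_per_g"] = per_game_rate(s["hr"], s["g"])
    print(s["season"], s["hr_per_g"])
```

In this made-up sample the first two seasons have different raw totals but the same per-game rate, while the row with no recorded games yields None rather than a misleading number.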

Intended Use

These datasets are intended for research, analysis, visualization, and educational use. They are suitable for statistical modeling, historical comparison, and supporting written analysis.

Updates and Versioning

Datasets are updated periodically as new seasons conclude and historical corrections are identified. File names may include version indicators to distinguish updates.

Attribution

If you use these datasets in published work, articles, or public projects, attribution to “The Baseball Scholar” is appreciated.

The Baseball Scholar. MLB Player Statistics Dataset (1901–Present). https://thebaseballscholar.com/data-library