=================================================================
SUPPLEMENTARY DATASET S1
Cryptocurrency Market Trends: A Machine Learning-Driven
Time Series Forecasting Using Hybrid VAR-LSTM, XGBoost-LSTM,
and CNN-LSTM Models with Twitter Sentiment Integration

Author:     Mubariz Khan (w23070111), Northumbria University
Supervisor: Dr Saber Farag
=================================================================

FILES IN THIS SUPPLEMENTARY PACKAGE
─────────────────────────────────────
1. Binance_ETHUSDT_1h.csv       – Hourly ETH/USDT OHLCV data (Binance)
2. Binance_DOGEUSDT_1h.csv      – Hourly DOGE/USDT OHLCV data (Binance)
3. vader_sentiment_hourly.csv  – Hourly VADER compound scores (derived)
4. merged_dataset.csv          – Final merged dataset used for training
5. tweet_ids_S1.txt            – Raw tweet IDs (no text; Twitter ToS compliance)
6. Dataset_S1_README.txt       – This file

DATA SOURCES AND LICENCES
──────────────────────────
• Binance OHLCV data:
    Source:   Kaggle – franoisgeorgesjulien/crypto
    URL:      https://www.kaggle.com/datasets/franoisgeorgesjulien/crypto
    Licence:  CC0 1.0 Public Domain
    Columns:  Date, Symbol, Open, High, Low, Close, Volume, tradecount

• Twitter sentiment data:
    Source:   Kaggle – kaushiksuresh147/bitcoin-tweets
    URL:      https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets
    Licence:  CC0 1.0 Public Domain
    Keywords: Bitcoin, BTC, Ethereum, ETH, Dogecoin, DOGE
    Method:   Twitter Academic Research API v2
    Note:     Raw tweet text is NOT redistributed, in accordance with the
              Twitter Developer Agreement. Tweet IDs are provided in
              tweet_ids_S1.txt; the full text can be rehydrated via the
              Twitter API.
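
Rehydration is usually done in batches. The sketch below only prepares the
batches and makes no network calls. It assumes tweet_ids_S1.txt holds one ID
per line (the file layout is not specified here) and that the Twitter API v2
tweet lookup endpoint is used, which accepts up to 100 comma-separated IDs in
its `ids` parameter. The function names are illustrative.

```python
def read_ids(lines):
    """Return non-empty, stripped tweet IDs (assumes one ID per line)."""
    return [line.strip() for line in lines if line.strip()]


def batch_ids(ids, batch_size=100):
    """Yield consecutive batches of at most `batch_size` IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Hypothetical IDs standing in for the contents of tweet_ids_S1.txt.
sample_lines = [f"{1000 + i}\n" for i in range(250)]
batches = list(batch_ids(read_ids(sample_lines)))

# One comma-separated `ids` value per lookup request.
ids_params = [",".join(batch) for batch in batches]
```

Each string in `ids_params` would be sent as one lookup request by whichever
API client is used; deleted or protected tweets will not be returned.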

VADER SENTIMENT SCORES (vader_sentiment_hourly.csv)
────────────────────────────────────────────────────
Columns:
  Date             – UTC timestamp (hourly, format: YYYY-MM-DD HH:MM:SS)
  vader_compound   – Mean VADER compound score across all tweets in that hour
  tweet_count      – Number of tweets aggregated in that hour
  pos_ratio        – Fraction of tweets with compound > 0
  neg_ratio        – Fraction of tweets with compound < 0
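
The column definitions above can be illustrated with a minimal sketch that
aggregates per-tweet compound scores into hourly rows. It uses only the
Python standard library and assumes each tweet has already been scored; the
input format (a list of (timestamp, compound) pairs) and the function name are
assumptions made for illustration, not the pipeline used to build this file.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(scored_tweets):
    """Group (timestamp, compound) pairs by UTC hour and summarise each hour."""
    buckets = defaultdict(list)
    for ts, compound in scored_tweets:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(compound)

    rows = []
    for hour in sorted(buckets):
        scores = buckets[hour]
        n = len(scores)
        rows.append({
            "Date": hour.strftime("%Y-%m-%d %H:%M:%S"),
            "vader_compound": sum(scores) / n,
            "tweet_count": n,
            "pos_ratio": sum(s > 0 for s in scores) / n,
            "neg_ratio": sum(s < 0 for s in scores) / n,
        })
    return rows


# Hypothetical per-tweet scores, for illustration only.
example = [
    (datetime(2021, 5, 1, 10, 5), 0.5),
    (datetime(2021, 5, 1, 10, 40), -0.25),
    (datetime(2021, 5, 1, 10, 59), 0.0),
    (datetime(2021, 5, 1, 11, 15), 0.75),
]
hourly = aggregate_hourly(example)
```

Because tweets with a compound score of exactly 0 count toward neither ratio,
pos_ratio and neg_ratio need not sum to 1 within an hour.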

VADER scoring: Hutto, C.J. & Gilbert, E. (2014). VADER: A Parsimonious
Rule-based Model for Sentiment Analysis of Social Media Text. In
Proceedings of the International AAAI Conference on Weblogs and Social
Media (ICWSM).

MERGED DATASET (merged_dataset.csv)
─────────────────────────────────────
Columns:
  Date, Open, High, Low, Close_eth, Volume_eth,
  Close_doge, Volume_doge,
  log_ret, vol_24h, RSI, MACD, BB_width,
  vader_compound   (10 features total)

Date range: 2019-07-05 to 2023-10-19 (hourly, n=37,552 rows after cleaning)
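
As a rough illustration of two of the derived columns, the sketch below
computes hourly log returns and a trailing 24-hour standard deviation of
those returns. This README does not define vol_24h, so reading it as rolling
volatility (rather than, say, traded volume) is an assumption, as are the
function names and the choice of sample standard deviation. RSI, MACD and
BB_width are omitted because their parameters are not stated here.

```python
import math
import statistics


def log_returns(close):
    """Hourly log returns; the first element is None (no prior close)."""
    out = [None]
    for prev, cur in zip(close, close[1:]):
        out.append(math.log(cur / prev))
    return out


def rolling_std(values, window=24):
    """Sample std over the trailing `window` values; None until a full,
    gap-free window is available."""
    out = []
    for i in range(len(values)):
        chunk = values[i - window + 1:i + 1] if i >= window - 1 else []
        if len(chunk) == window and all(v is not None for v in chunk):
            out.append(statistics.stdev(chunk))
        else:
            out.append(None)
    return out


# Hypothetical closing prices, for illustration only.
close = [100.0, 101.0, 99.5, 102.0, 103.5]
log_ret = log_returns(close)
vol = rolling_std(log_ret, window=3)  # window=24 would match hourly data
```

The leading None values reflect rows that would be dropped or imputed during
cleaning, since a full window of returns is not yet available for them.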

TRAIN / VALIDATION / TEST SPLIT
─────────────────────────────────
  Training:    rows      0 – 26,285  (70%)  2019-07-05 – 2022-01-19
  Validation:  rows 26,286 – 31,917  (15%)  2022-01-19 – 2022-08-11
  Test:        rows 31,918 – 37,551  (15%)  2022-08-11 – 2023-10-19

NOTE: Split is strictly chronological; no shuffling applied.
      Min-Max scaler fitted on training rows ONLY.
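
The split and scaling rules can be sketched as follows. Integer arithmetic on
the row count (70% and 15%, remainder to test) reproduces the row boundaries
listed above for n = 37,552; whether the original pipeline computed them this
way is an assumption. The helper names are illustrative and only the standard
library is used.

```python
def chronological_split(n_rows, train_pct=70, val_pct=15):
    """Return (train, val, test) index ranges in time order, no shuffling."""
    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100
    return (
        range(0, n_train),
        range(n_train, n_train + n_val),
        range(n_train + n_val, n_rows),
    )


def fit_min_max(train_values):
    """Learn the scaling bounds from the training rows only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Scale any split with bounds learned on the training split."""
    span = hi - lo
    if span == 0:
        raise ValueError("training values are constant; cannot scale")
    return [(v - lo) / span for v in values]


train, val, test = chronological_split(37_552)

# Hypothetical series, for illustration only.
series = [10.0, 12.0, 14.0, 11.0, 13.0, 15.0, 16.0, 9.0, 18.0, 20.0]
tr, va, te = chronological_split(len(series))
lo, hi = fit_min_max([series[i] for i in tr])
scaled = apply_min_max(series, lo, hi)
```

Because the bounds come from the training rows alone, scaled validation and
test values can fall outside [0, 1], as the last three values of `scaled` do.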

CORRELATION STRUCTURE (training split)
────────────────────────────────────────
  ETH Close vs VADER sentiment:  r = 0.42 (p < 0.001)
  DOGE Close vs VADER sentiment: r = 0.34 (p < 0.001)
  ETH Close vs DOGE Close:       r = 0.88 (p < 0.001)
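
For readers who want to recompute these coefficients, a minimal sketch
follows. It assumes r denotes the Pearson correlation coefficient and that
both series are restricted to the training rows (0 – 26,285) before the call.
The function name is illustrative, and p-values are not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical values, for illustration only.
r_up = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

Restricting the inputs to the training split keeps information from the
validation and test periods out of any decision based on these coefficients.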

=================================================================
