STATE-CROSS-PROCESS-RECONCILE-1.1 + 1.4 — Plan-Only

Datum: 2026-06-04 (post-cutover Commit c4651dd) Modus: READ-ONLY · keine Code/State/Env-Änderungen Re-Scope: ersetzt MS-LOCK-STATE-1 (Thread-Lock-Annahme war falsch)


1. Root Cause final

Aspekt Klärung
Bot-interne Concurrency keine Thread-Race auf live_trader.state. Scan-Loop ist single-threaded; MS-Runner synchron innerhalb des Scan-Cycles
Pump.fun-WS-Thread Schreibt eigene State-Datei (tier3/pump_fun_monitor.py::STATE_FILE), nie live_portfolio.json
Wallet-Tracker Eigene Datei — kein Conflict
Echter Race-Window Cross-Process Bot↔Worker auf live_portfolio.json
Bot-Mitigation existierend _reload_state_if_externally_changed (trader_core.py:1288) prüft mtime-Cookie nur 1× pro Tick in update_prices (line 1346)
Lücke Worker kann nach dem _reload-Check und vor einem späteren Bot-_save_state zwischenspeichern → der Bot-Save überschreibt Worker-Updates

Konkrete Race-Sequenz

t=0  Bot _reload_state_if_externally_changed → state X eingelesen, cookie=mtime_X
t=1  Bot mutiert state in RAM (z.B. pos['BTC'].current_price=...)
t=2  Worker update_stop_loss → liest Disk X → setzt pos['BTC'].stop_loss=NEU
                              → _save_state → Disk wird X', mtime=mtime_X'
t=3  Bot _save_state (z.B. nach execute_buy) → schreibt RAM (mit STALE SL!)
                              → Disk wird X'' — Worker's SL VERLOREN
                              → cookie=mtime_X''

2. Betroffene Dateien

Datei Zweck der Plan-Änderung
trading/execution/trader_core.py Erweiterung _save_state um Pre-Save-mtime-Recheck + neue _merge_disk_into_ram_for_save Helper
trading/execution/live_trade.py KEIN direkter Touch (erbt _save_state) — aber STATE_FILE-Pfad bleibt LIVE_STATE_FILE
trading/main.py KEIN Touch — alle bestehenden live_trader._save_state() calls bleiben
trading/strategies/multi_strategy_runner.py Optional: _persist_state_silent ruft Save mit allow_abort=True (cooldown loss ist akzeptabel)
trading/command_worker.py KEIN Touch — Worker liest Disk vor jeder Mutation (LiveTrader fresh per call)
trading/db_emitter.py KEIN Touch — bot_statuses.metadata_json schon vorhanden
trading/tests/test_state_cross_process_reconcile_1_1.py NEU — 20 Tests
trading/tests/test_state_observability_1_4.py NEU — 10 Tests

0× DB-Migration · 0× GUI · 0× env · 0× CommandBus · 0× Strategy-Param · 0× Mainnet.


3. Aktuelle _save_state / _load_state Call-Sites

_save_state (27 Aufrufer im Bot-Prozess + 2 im Worker)

Datei Line Pfad / Kontext Tier
trader_core.py 616 execute_buy nach pos eingefügt T1-must-merge
trader_core.py 753 _record_close_cooldown (REPEAT-COOLDOWN-1) T2-may-abort
trader_core.py 865 execute_sell nach cash+closed_trade T1-must-merge
trader_core.py 966 T3 execute_buy_tier3 T1-must-merge
trader_core.py 1037 T3 mark_pending T2-may-abort
trader_core.py 1104 P0-RUNTIME-SAFETY-FIX-1 (pending cleanup) T2-may-abort
trader_core.py 1188 partial_sell T1-must-merge
trader_core.py 1283 DCA-Rescue path T1-must-merge
trader_core.py 1439 update_prices SL trigger → sell loop T1-must-merge
trader_core.py 1469 update_prices final (post-tick) T1-must-merge
live_trade.py 436 _remove_position_on_exchange_only T1-must-merge
live_trade.py 501 SYNC-BALANCE-SANITY-1 Save T2-may-abort
live_trade.py 525 _sync_balance post-cash-update T1-must-merge
live_trade.py 1229 execute_sell pending=True T1-must-merge
live_trade.py 1238 execute_sell pending=False T1-must-merge
live_trade.py 1335 partial pending cleanup T1-must-merge
main.py 576 Re-entry Cleanup T1-must-merge
main.py 822, 834 Recovery-Skip-Cooldown writes T2-may-abort
main.py 1767 Scan-Cycle end T1-must-merge
multi_strategy_runner.py via _persist_state_silent MS-cooldown record T2-may-abort
command_worker.py 622 update_stop_loss Worker-only
command_worker.py 687 update_take_profit Worker-only
baseline_bootstrap.py 283 boot-time auto-import T2-may-abort

_load_state (1 Aufrufer)

_reload_state_if_externally_changed


4. Minimal-Fixplan 1.1 (PRE-SAVE-MTIME-RECHECK-1)

4.1 Neue Save-API

def _save_state(self, *, allow_abort: bool = False, source: str = 'unknown') -> bool:
    """Returns True on success, False on aborted-save."""

Backward-kompatibel: bestehende live_trader._save_state() calls behalten Default allow_abort=False, source='unknown'. Migration der labels passiert in einem Folge-Commit (P3).

4.2 Pre-Save-Recheck-Logik (pseudo)

def _save_state(allow_abort=False, source='unknown'):
    try:
        disk_mtime_ns = STATE_FILE.stat().st_mtime_ns
    except OSError:
        disk_mtime_ns = None

    if disk_mtime_ns and self._state_mtime_ns and disk_mtime_ns > self._state_mtime_ns:
        # Conflict detected
        self._state_coherence['conflict_count'] += 1
        self._state_coherence['last_conflict_at'] = utcnow()
        self._state_coherence['last_external_mtime_ns'] = disk_mtime_ns

        if allow_abort:
            logger.warning(f"[STATE-COHERENCE] save aborted, source={source}")
            self._state_coherence['retry_count'] += 1
            return False

        # must-merge path
        logger.warning(f"[STATE-COHERENCE] pre-save mtime conflict, source={source}")
        merged = self._merge_disk_into_ram_for_save(disk_mtime_ns)
        if merged is None:
            logger.error(f"[STATE-COHERENCE] merge failed — falling back to blind save")
        else:
            self.state = merged
            self._state_coherence['merged_count'] += 1

    # ── existing atomic write path ─────────────────────────────────
    STATE_FILE.parent.mkdir(exist_ok=True)
    tmp_file = STATE_FILE.with_suffix('.tmp')
    with open(tmp_file, 'w') as f:
        json.dump(self.state, f, indent=2)
    tmp_file.rename(STATE_FILE)
    self._state_mtime_ns = STATE_FILE.stat().st_mtime_ns
    return True

4.3 Section-Ownership-Matrix

State-Section Schreibt Merge-Regel
cash Nur Bot RAM gewinnt
closed_trades Bot (sells), Worker (close_position) UNION by position_id (concat dedupliziert)
positions[sym].stop_loss Bot (update_prices SL-adjust), Worker (update_stop_loss) Disk gewinnt wenn Disk ≠ RAM-load-snapshot
positions[sym].take_profit Bot, Worker (update_take_profit) gleiche Regel wie stop_loss
positions[sym].current_price Bot RAM gewinnt
positions[sym].pending Bot RAM gewinnt
positions[sym] Existenz Bot adds, Worker deletes 3-Wege-Merge (siehe 4.4)
sl_cooldowns Nur Bot RAM gewinnt
repeat_loss_count Nur Bot RAM gewinnt
tier3_used/positions/budget Nur Bot RAM gewinnt
ms_candidate_cooldowns Nur Bot (MS-Runner) RAM gewinnt
starting_capital Bot (boot only) RAM gewinnt
created_at Bot (boot only) RAM gewinnt

4.4 _merge_disk_into_ram_for_save Helper (pseudo)

def _merge_disk_into_ram_for_save(disk_mtime_ns) -> Optional[dict]:
    try:
        with open(STATE_FILE, 'r') as f:
            D = json.load(f)
    except Exception as e:
        return None

    R = self.state
    merged = dict(R)  # Bot-only sections → R wins

    # closed_trades: union by position_id
    seen_ids = set()
    union = []
    for t in D.get('closed_trades', []) + R.get('closed_trades', []):
        tid = t.get('position_id')
        if tid and tid in seen_ids:
            continue
        seen_ids.add(tid)
        union.append(t)
    merged['closed_trades'] = union[-500:]

    # positions: 3-way merge
    R_pos = R.get('positions', {})
    D_pos = D.get('positions', {})
    merged_pos = {}
    for sym in set(R_pos.keys()) | set(D_pos.keys()):
        if sym in R_pos and sym not in D_pos:
            merged_pos[sym] = R_pos[sym]                # Bot added → keep
        elif sym in D_pos and sym not in R_pos:
            continue                                     # Worker removed → drop
        else:
            p = dict(R_pos[sym])
            for field in ('stop_loss', 'take_profit'):
                if D_pos[sym].get(field) != R_pos[sym].get(field):
                    p[field] = D_pos[sym].get(field)
            merged_pos[sym] = p
    merged['positions'] = merged_pos

    return merged

4.5 Antworten auf die 8 Operator-Fragen

# Frage Antwort
1 Wo liegt der aktuelle mtime-Cookie? self._state_mtime_ns in TraderCore.__init__ (line 171); aktualisiert von _load_state (277), _save_state (312), _reload_state_if_externally_changed (1326)
2 Welche _save_state Call-Sites? 27 im Bot, 2 im Worker — Tabelle §3
3 Kritische Sections? positions (3-Wege-Merge), closed_trades (union), cash (Bot), ms_candidate_cooldowns (Bot), sl_cooldowns (Bot), repeat_loss_count (Bot), tier3_* (Bot), pending-Felder in positions[sym] (Bot)
4 Minimal sichere Merge-Strategie? Section-Ownership-Matrix §4.3 + 3-Wege-Position-Merge §4.4
5 Wann abbrechen statt mergen? allow_abort=True für Cooldown/Loss-Count/Pending-Marker-Saves (T2). Default-Pfad (T1) merged immer.
6 Wie verhindern wir Verlust neuer Bot-Positionen? 3-Wege-Merge: in RAM aber nicht in Disk → immer keep RAM
7 Wie verhindern wir Verlust Worker-SL/TP-Updates? Per-Feld-Merge in positions[sym]: wenn Disk-SL/TP ≠ Load-SL/TP → Disk gewinnt
8 Partial-Pending-Marker? pending ist Bot-only-Feld in positions[sym] → RAM gewinnt; Position-Existenz folgt 3-Wege-Regel

5. Observability-Plan 1.4 (REGRESSION-OBSERVABILITY-1)

5.1 In-Memory Counter (auf TraderCore-Instanz)

self._state_coherence = {
    'reload_mtime_jump_count':      0,
    'save_precheck_conflict_count': 0,
    'save_precheck_retry_count':    0,
    'save_precheck_merged_count':   0,
    'last_conflict_at':           None,
    'last_external_mtime_ns':     None,
    'last_merge_source':          None,
    'last_merge_summary':         None,
}

5.2 Log-Messages (level=WARNING/INFO, kein DEBUG)

Trigger Level Format
_reload_state_if_externally_changed returns True INFO (existing) unchanged
Pre-save conflict, allow_abort=False WARNING [STATE-COHERENCE] pre-save mtime conflict, source={source} — merging disk into RAM (old={old_ns}, new={new_ns})
Pre-save conflict, allow_abort=True WARNING [STATE-COHERENCE] save aborted, source={source} — external mtime jumped
Merge executed INFO [STATE-COHERENCE] merged external update, source={source} summary={...}
Merge failed (fallback) ERROR [STATE-COHERENCE] merge failed, source={source} — falling back to blind save

5.3 bot_statuses.metadata_json Erweiterung

trading/main.py:1917 emit_bot_status bekommt zusätzlichen Metadata-Key:

metadata={
    ...existing keys...,
    'state_coherence': {
        'reload_mtime_jump_count':    live_trader._state_coherence['reload_mtime_jump_count'],
        'save_precheck_conflict_count': live_trader._state_coherence['save_precheck_conflict_count'],
        'save_precheck_retry_count':    live_trader._state_coherence['save_precheck_retry_count'],
        'save_precheck_merged_count':   live_trader._state_coherence['save_precheck_merged_count'],
        'last_conflict_at':           live_trader._state_coherence['last_conflict_at'],
        'last_merge_source':          live_trader._state_coherence['last_merge_source'],
    },
}

Keine DB-Migration nötigmetadata_json ist JSONB.

5.4 Optional: Tagespanik-Schwelle

save_precheck_conflict_count > 10 / 24 h → Telegram-Alert. Für Phase 1.4 erstmal nur Log + bot_statuses, kein Telegram.


6. Tests

test_state_cross_process_reconcile_1_1.py (20 Tests)

# Test Erwartung
1 test_save_without_external_change_writes_normally Cookie sync → normal save
2 test_save_detects_external_mtime_bump Mock stat → conflict_count += 1
3 test_must_merge_preserves_worker_sl_update Worker setzt pos['BTC'].SL=12; Bot merged → final SL=12
4 test_must_merge_preserves_worker_tp_update analog für TP
5 test_must_merge_preserves_bot_new_position RAM hat pos['ETH'] (neu); Disk hat es nicht → final hat ETH
6 test_must_merge_drops_worker_closed_position RAM hat pos['BTC']; Disk hat es NICHT (Worker close) → final hat BTC NICHT
7 test_must_merge_preserves_ms_candidate_cooldowns RAM hat cooldown, Disk hat keine → final state hat cooldown
8 test_must_merge_preserves_sl_cooldowns analog sl_cooldowns
9 test_must_merge_unions_closed_trades Disk hat trade_A; RAM hat trade_B → final hat beide
10 test_must_merge_dedups_closed_trades_by_position_id gleiche position_id → 1×
11 test_must_merge_position_simultaneous_buy_close_collision RAM neue pos['ETH'] + Disk gelöschte pos['BTC'] → final hat ETH, kein BTC
12 test_allow_abort_skips_write_and_increments_retry _save_state(allow_abort=True) mit Conflict → kein write, retry_count += 1
13 test_allow_abort_path_does_not_lose_disk_state Disk-State bleibt unverändert nach abort
14 test_no_conflict_no_merge_call Normalfall: merge-Helper nicht gerufen
15 test_atomic_tmp_rename_preserved tmp + rename Pattern unverändert
16 test_json_validity_post_merge merged JSON ist valides JSON
17 test_merge_failure_falls_back_to_blind_save json.load wirft → log ERROR, blind save
18 test_worker_save_round_trip_unchanged Worker _save_state direkt → kein neuer Code-Path
19 test_partial_pending_marker_preserved RAM pos['BTC'].pending=True, Disk pos['BTC'] ohne pending → final hat pending=True
20 test_source_label_logged source='ms_runner' im Log-Output sichtbar

test_state_observability_1_4.py (10 Tests)

# Test Erwartung
1 test_coherence_dict_initialized _state_coherence dict mit 0-Werten nach __init__
2 test_reload_increments_jump_count _reload_state_if_externally_changed True → counter +1
3 test_conflict_increments_count pre-save conflict → counter +1
4 test_retry_increments_on_abort allow_abort=True conflict → retry +1
5 test_merged_count_increments must-merge path → merged_count +1
6 test_last_conflict_timestamp_updates last_conflict_at sets to UTC ISO
7 test_last_merge_summary_populated merge writes summary {"positions_merged": n, ...}
8 test_bot_status_metadata_contains_state_coherence main.py emit_bot_status payload check (Mock)
9 test_no_log_spam_when_no_conflict Normaler save → 0 STATE-COHERENCE log lines
10 test_log_message_format_includes_source WARNING log enthält source=

Regression-Sweep (P0-Tests aus History)

Ziel: 0 neue Failures (pre-existing N8-test bleibt erwartet failing).


7. Cutover-Plan (SOT-1d-Pattern)

Step Action
1 Crontab bot_watchdog.sh UND worker_watchdog.sh freeze via Marker # CUTOVER_FREEZE_RECON1
2 Backup: /root/recon-1-backup-{TS}/ mit live_portfolio.json + Code-Files
3 docker compose build clawbot + docker compose build clawbot-worker (beide!)
4 docker compose up -d --force-recreate clawbot clawbot-worker
5 3-Way MD5: trader_core.py Repo == Image (clawbot) == Image (worker) == Container (clawbot) == Container (worker)
6 Bot host-PID + Worker host-PID neu, beide healthy
7 Crontab freeze entfernen
8 30-min Window: [STATE-COHERENCE] Log + bot_statuses.metadata_json.state_coherence Counter beobachten
9 Operator-Functional-Test: GUI close_position während Bot-Scan läuft → erwarteter Pfad: merged, Counter +1
10 24-h Monitor: state_save_precheck_conflict_count > 0 belegt echte Race-Hits

Rollback


8. Risiken

# Risiko Severity Mitigation
1 Merge-Helper liefert beschädigten State (false-positive Field-Konflikt) Mittel Per-Feld-Merge nur für SL/TP; Tests #19 (pending) + #11 (collision)
2 Worker schreibt zwischen Bot-Merge und Bot-Save (2.-Generation Race) Niedrig Counter zählt; 2 Conflicts in Folge → Log ERROR. Phase 1.2 (FCNTL-FLOCK) wäre der echte Fix
3 Aggressive Abort verhindert legitime Saves Niedrig allow_abort=True nur für Cooldown/Loss-Marker (T2). T1 nie abort.
4 Counter-Inflation (Bot saved sich selbst inkrementiert) Niedrig Cookie wird nach jedem eigenen Save resync'd
5 Performance-Cost (extra stat() pro Save) sehr niedrig 1 stat() ≤ 50 µs
6 Worker-Side bekommt Recreate Niedrig Sauberer Recreate, keine offenen Long-Running Commands
7 Falsche Reihenfolge der Merge-Helper (Bot-positions vs Worker-deletes) Mittel Tests #5 + #6 + #11 spezifisch
8 closed_trades Union ohne position_id → Duplicates Niedrig Test #10 dedupliziert; legacy ohne position_id by-index akzeptabel
9 Pre-existing N8-Test failure nicht-related bleibt unverändert

9. Bundle vs Split — Empfehlung

→ Empfehlung: BUNDLE 1.1+1.4 in EINEM Commit + EINEM Cutover.

Begründung

Argument Wirkung
1.4 ist nur sinnvoll mit 1.1 (Counter zählen ohne neue Save-Logik wäre sinnlos) Split erzwingt 1.1 vor 1.4 → 1.4 alleine bringt nichts
Beide treffen trader_core._save_state Doppel-Cutover = doppeltes Risiko-Fenster auf clawbot+clawbot-worker
1.4 ist strikt additiv (Counter + Log + 1 metadata-Key) Low-Risk-Add
Tests sind disjunkt aber Code-Touch überlappt Bundle vermeidet 2-fache Edit-Rounds
Observability AB Tag 1 ist wertvoller als nachgereicht Operator sieht Conflicts sofort beim ersten echten Race

Alternative — Split (nicht empfohlen)


10. MS-Live-Readiness-Update

Bedingung Status
REPEAT-CANDIDATE-DEDUP-1 live (commit c4651dd)
MS-STABLECOIN-BLOCK-1 live (commit c4651dd)
STATE-CROSS-PROCESS-RECONCILE-1.1+1.4 dieser Plan — HARTE Vorbedingung vor MS-Live
MS-LIVE-OHLCV-BACKTEST-1 offen P2

Sobald 1.1+1.4 live + 24 h Counter sauber → MS-Live darf in Testnet-Execution geschaltet werden.


11. Boundaries dieses Plans

0× Code-Touch · 0× State-Edit · 0× Bot-Recreate · 0× Worker-Recreate · 0× MS-Live-Aktivierung · 0× Env-Änderung · 0× DB-Migration · 0× Order-Logik · 0× Mainnet · 0× CommandBus · 0× Push.


12. Operator-Entscheidungen vor GO EXECUTE

Q Frage Default-Empfehlung
Q1 Bundle 1.1+1.4 oder Split? A Bundle
Q2 Default allow_abort für MS-cooldown saves? A True (cooldown loss akzeptabel)
Q3 Telegram-Alert bei > 10 conflicts/24h? B Nein in dieser Phase — nur Log + bot_statuses
Q4 source-Label-Migration aller 27 Save-Sites? B Out-of-Scope für 1.1 — source='unknown' default, später separater P3-Commit
Q5 closed_trades ohne position_id → dedup? B Akzeptieren als-ist (legacy entries)
Q6 Worker-Recreate Teil dieses Cutovers? A Ja — beide brauchen neuen Code (trader_core.py ist shared)
Q7 Backup-Pfad? /root/recon-1-backup-{TS}/
Q8 Observability-Dashboard (GUI-Widget) als Follow-up? A Ja als separater P3-Commit (außer Scope)

STOP

Plan abgeschlossen. Kein Code geändert. Warte auf Operator-Entscheidungen Q1–Q8 + GO EXECUTE STATE-CROSS-PROCESS-RECONCILE-1 für Bundle 1.1+1.4.