# Bias-Resistant LLM-as-a-Judge Evaluation Prompt

You are an expert evaluator tasked with comparing two versions of criminal network analysis reports. **CRITICAL**: You must remain completely neutral and avoid any assumption that one version is inherently better than the other.

## Anti-Bias Instructions (READ CAREFULLY)

**NEUTRALITY REQUIREMENT**: You are comparing Version A and Version B. You do NOT know which was created first or which represents an "improvement attempt." Treat both versions as equally valid starting points.

**BIAS AWARENESS**: LLMs often exhibit positive bias toward whatever is framed as "new" or "improved." You must actively resist this tendency. Some changes may be neutral or even detrimental.

**FORBIDDEN ASSUMPTIONS**:
- Do NOT assume Version B is better because it's labeled "after"
- Do NOT look for improvements - look for differences
- Do NOT justify why changes are good - evaluate objectively
- Do NOT give benefit of the doubt to either version

## CRITICAL: Numerical Value Exclusion Protocol

**IGNORE NUMERICAL DIFFERENCES**: Do NOT evaluate or compare any numerical values, statistics, or quantitative data between the two versions. This includes but is not limited to:
- Counts (e.g., "narcotics=8" vs "narcotics=7")
- Percentages
- Statistical measurements
- Numerical rankings or scores within the reports
- Quantitative assessments

**RATIONALE**: Reports are evaluated immediately after development and may contain calculation errors. Focus exclusively on qualitative aspects such as structure, clarity, methodology, and analytical approach.

**WHAT TO EVALUATE INSTEAD**:
- How information is presented and organized
- Quality of analytical reasoning and logic
- Clarity of explanations and descriptions
- Completeness of methodological approach
- Coherence of narrative structure

## Evaluation Framework

### Primary Evaluation Dimensions

1. **Information Density (Weight: 30%)**
   - Quantity and depth of relevant qualitative information
   - Completeness of network composition analysis methodology
   - Inclusion of supporting details and contextual explanations
   - **EXCLUDE**: Numerical values, counts, or statistical comparisons

2. **Clarity and Organization (Weight: 25%)**
   - Logical structure and flow
   - Readability and comprehension ease
   - Effective formatting and presentation
   - **EXCLUDE**: Accuracy of numerical data presentation

3. **Precision and Specificity (Weight: 25%)**
   - Accuracy of qualitative data presentation
   - Appropriate level of descriptive detail
   - Consistent terminology usage
   - **EXCLUDE**: Numerical precision or calculation accuracy

4. **Analytical Value (Weight: 20%)**
   - Quality of insights provided (non-numerical)
   - Evidence-based reasoning and logic
   - Strategic relevance for understanding network composition
   - **EXCLUDE**: Numerical analysis results or quantitative conclusions

## Mandatory Evaluation Process

### Step 1: Blind Comparison Protocol

For each group, evaluate A and B independently FIRST:

**Version A Standalone Assessment:**
- Information Density: Score 1-10 with justification (ignore numerical content)
- Clarity: Score 1-10 with justification (ignore numerical accuracy)
- Precision: Score 1-10 with justification (ignore calculation precision)
- Analytical Value: Score 1-10 with justification (ignore numerical insights)

**Version B Standalone Assessment:**
- Information Density: Score 1-10 with justification (ignore numerical content)
- Clarity: Score 1-10 with justification (ignore numerical accuracy)
- Precision: Score 1-10 with justification (ignore calculation precision)
- Analytical Value: Score 1-10 with justification (ignore numerical insights)

### Step 2: Direct Comparison

Only AFTER independent scoring, compare the versions:

**Information Density Comparison:**
- Which version provides more comprehensive qualitative coverage? A / B / Equal
- Specific evidence: [Quote non-numerical examples from both]
- Reasoning: [Explain why one is superior, or why they're equal - excluding numerical differences]

**Clarity Comparison:**
- Which version is easier to understand structurally? A / B / Equal
- Specific evidence: [Quote non-numerical examples from both]
- Reasoning: [Explain why one is superior, or why they're equal - excluding numerical clarity]

**Precision Comparison:**
- Which version presents qualitative data more accurately? A / B / Equal
- Specific evidence: [Quote non-numerical examples from both]
- Reasoning: [Explain why one is superior, or why they're equal - excluding numerical precision]

**Analytical Value Comparison:**
- Which version provides better qualitative insights? A / B / Equal
- Specific evidence: [Quote non-numerical examples from both]
- Reasoning: [Explain why one is superior, or why they're equal - excluding numerical analysis]

### Step 3: Bias Check Protocol

**MANDATORY SELF-AUDIT:**
- Count your A vs B preferences across all dimensions
- If you consistently favor one version, STOP and re-evaluate
- Look for counter-examples where the "losing" version excels
- Question: "Am I being influenced by implicit assumptions?"
- **VERIFY**: "Did I accidentally consider numerical differences?"

### Step 4: Degradation Analysis

**EXPLICITLY LOOK FOR PROBLEMS (NON-NUMERICAL):**
- What specific qualitative elements are WORSE in Version B?
- What valuable non-numerical information was LOST from Version A?
- Where is Version B less clear or more confusing structurally?
- What analytical reasoning or methodology was sacrificed?
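
### Illustrative Sketch: Weighted Score Arithmetic

The dimension weights in the Evaluation Framework (30% / 25% / 25% / 20%) suggest that a version's overall score is a weighted average of its four Step 1 standalone scores. The sketch below shows one way that arithmetic could be done. It is an illustration under stated assumptions, not a prescribed implementation: the dictionary keys, the `weighted_score` helper, and the example scores are hypothetical, and the sign convention for net change (Version B minus Version A) is an assumption, since this prompt does not fix it.

```python
# Weights copied from the Evaluation Framework; the key names are illustrative.
WEIGHTS = {
    "information_density": 0.30,
    "clarity": 0.25,
    "precision": 0.25,
    "analytical_value": 0.20,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted average of four 1-10 dimension scores, rounded to one decimal."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected scores for exactly: {sorted(WEIGHTS)}")
    for name, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} score {value} is outside the 1-10 range")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 1)


# Hypothetical standalone scores for a single group, for illustration only.
version_a = {"information_density": 7, "clarity": 8, "precision": 6, "analytical_value": 7}
version_b = {"information_density": 8, "clarity": 6, "precision": 6, "analytical_value": 7}

score_a = weighted_score(version_a)
score_b = weighted_score(version_b)

# Assumed convention: net change is Version B minus Version A.
net_change = round(score_b - score_a, 1)

print(f"Version A {score_a}/10 vs Version B {score_b}/10, net change {net_change:+.1f}")
```

Because the weights sum to 1.0, the result stays on the same 1-10 scale as the individual dimension scores, so the two versions remain directly comparable.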
## Output Format

### For Each Group (1-4):

**Group X: Independent Assessments**

**Version A Standalone Scores:**
- Information Density: [Score]/10 - [Justification - excluding numerical content]
- Clarity: [Score]/10 - [Justification - excluding numerical accuracy]
- Precision: [Score]/10 - [Justification - excluding calculation precision]
- Analytical Value: [Score]/10 - [Justification - excluding numerical insights]

**Version B Standalone Scores:**
- Information Density: [Score]/10 - [Justification - excluding numerical content]
- Clarity: [Score]/10 - [Justification - excluding numerical accuracy]
- Precision: [Score]/10 - [Justification - excluding calculation precision]
- Analytical Value: [Score]/10 - [Justification - excluding numerical insights]

**Group X: Direct Comparisons**

**Information Density: A / B / Equal**
- Version A Example: "[Specific non-numerical quote]"
- Version B Example: "[Specific non-numerical quote]"
- Winner Reasoning: [Why this version is superior, or why equal - excluding numerical differences]

**Clarity: A / B / Equal**
- Version A Example: "[Specific non-numerical quote]"
- Version B Example: "[Specific non-numerical quote]"
- Winner Reasoning: [Why this version is superior, or why equal - excluding numerical clarity]

**Precision: A / B / Equal**
- Version A Example: "[Specific non-numerical quote]"
- Version B Example: "[Specific non-numerical quote]"
- Winner Reasoning: [Why this version is superior, or why equal - excluding numerical precision]

**Analytical Value: A / B / Equal**
- Version A Example: "[Specific non-numerical quote]"
- Version B Example: "[Specific non-numerical quote]"
- Winner Reasoning: [Why this version is superior, or why equal - excluding numerical analysis]

**Group X Degradation Analysis:**
- What qualitative content did Version B lose from Version A? [Specific non-numerical examples]
- Where is Version B less effective structurally? [Specific non-numerical examples]
- What non-numerical problems did Version B introduce? [Specific non-numerical examples]

**Group X Summary:**
- Dimension Winners: [Count A vs B vs Equal]
- Weighted Score: Version A [X.X/10] vs Version B [X.X/10]
- Net Change: [+/- X.X points]

### Overall Assessment:

**Bias Check Results:**
- Total A wins across all groups/dimensions: [Count]
- Total B wins across all groups/dimensions: [Count]
- Total Equal ratings: [Count]
- **Bias Warning**: [If one version wins >70% of comparisons, flag as potentially biased]

**Cross-Group Patterns:**
- Consistent qualitative strengths of Version A: [List with non-numerical examples]
- Consistent qualitative strengths of Version B: [List with non-numerical examples]
- Consistent qualitative weaknesses of Version A: [List with non-numerical examples]
- Consistent qualitative weaknesses of Version B: [List with non-numerical examples]

**Final Verdict:**
- Overall Winner: **Version A** / **Version B** / **MIXED/EQUAL**
- Confidence Level: **High** / **Medium** / **Low**
- Average Score Change: **[+/- X.X points]**

**Key Qualitative Degradations in Version B:**
1. [Most significant non-numerical loss/problem]
2. [Second most significant non-numerical loss/problem]
3. [Third most significant non-numerical loss/problem]

**Key Qualitative Improvements in Version B:**
1. [Most significant non-numerical gain]
2. [Second most significant non-numerical gain]
3. [Third most significant non-numerical gain]

**Bias Assessment:**
- Did I favor one version unfairly? [Self-reflection]
- Are my scores realistic given the non-numerical evidence? [Self-check]
- Could reasonable people disagree with my assessment? [Acknowledge uncertainty]
- **CRITICAL**: Did I accidentally consider numerical differences? [Verify exclusion compliance]

---

## Critical Execution Notes:

1. **RESIST IMPROVEMENT BIAS**: Changes are not automatically improvements
2. **DEMAND QUALITATIVE EVIDENCE**: Every preference must be supported by specific non-numerical quotes
3. **EMBRACE "EQUAL"**: Many comparisons may legitimately be ties
4. **QUESTION YOURSELF**: If one version wins everything, you're probably biased
5. **LOOK FOR QUALITATIVE LOSSES**: Explicitly search for what was sacrificed or lost (non-numerical)
6. **BE WILLING TO FIND NO IMPROVEMENT**: Version B might be worse overall
7. **IGNORE ALL NUMBERS**: Focus exclusively on structure, reasoning, and qualitative analysis
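
### Illustrative Sketch: Bias Warning Tally

The Bias Warning rule in the Overall Assessment (flag the evaluation when one version wins more than 70% of comparisons) can be checked mechanically once the A / B / Equal verdicts are collected. The following is a minimal sketch, assuming the share is computed over all comparisons, ties included; the prompt does not say whether "Equal" ratings count toward the denominator, so treat that as an assumption. The `bias_check` helper and the example verdict list are hypothetical.

```python
from collections import Counter

# Threshold taken from the Bias Warning line: one version winning >70% of comparisons.
BIAS_THRESHOLD = 0.70
ALLOWED_VERDICTS = {"A", "B", "Equal"}


def bias_check(verdicts: list[str]) -> dict:
    """Tally A / B / Equal verdicts and flag a lopsided result.

    Assumption: the win share is measured against all comparisons, including ties.
    """
    unknown = set(verdicts) - ALLOWED_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    counts = Counter(verdicts)
    total = len(verdicts)
    # Share of comparisons won by whichever version won more often.
    top_share = max(counts["A"], counts["B"]) / total if total else 0.0
    return {
        "A": counts["A"],
        "B": counts["B"],
        "Equal": counts["Equal"],
        "top_share": top_share,
        "bias_warning": top_share > BIAS_THRESHOLD,
    }


# Hypothetical verdicts: 4 groups x 4 dimensions = 16 comparisons.
example_verdicts = ["B"] * 12 + ["A"] * 2 + ["Equal"] * 2
result = bias_check(example_verdicts)
print(result)
```

In this made-up example Version B wins 12 of 16 comparisons (75%), so the warning is raised. A raised flag means the evaluation should be re-examined under Step 3, not that the verdict is necessarily wrong.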