Cross-Model Prompting Workflow for a Systematic Literature Review

Structured prompts, comparative evaluation, refinements, and synthesis strategy for an SLR on real-world applications of data mining and machine learning.

1. Initial Prompt Creation (Baseline)

Baseline Prompt

Conduct a 2,000-word structured systematic literature review on the applications of data mining and machine learning in real-world domains. Include a methodology section, synthesize key findings, identify major research trends and gaps, and propose one testable hypothesis. Use an academic tone and follow systematic literature review standards.
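Operationally, the same baseline prompt is sent unchanged to every model so that the resulting drafts stay directly comparable. A minimal sketch of that fan-out step, where `query_model` is a hypothetical stand-in for whichever vendor client each model actually requires (it is not a real API):

```python
BASELINE_PROMPT = (
    "Conduct a 2,000-word structured systematic literature review on the "
    "applications of data mining and machine learning in real-world domains. "
    "Include a methodology section, synthesize key findings, identify major "
    "research trends and gaps, and propose one testable hypothesis."
)

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real client call (e.g. an OpenAI, Copilot,
    or xAI SDK); replace the body with the vendor-specific request."""
    return f"[{model} draft for prompt of {len(prompt)} chars]"

def fan_out(models: list[str], prompt: str) -> dict[str, str]:
    """Send the identical prompt to every model and collect one draft each."""
    return {m: query_model(m, prompt) for m in models}
```

Keeping the prompt identical across models is what makes the comparison in the next step meaningful: any divergence in the drafts is attributable to the model, not the instructions.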

2. Analyze Model Responses

| Evaluation Dimension | Copilot | ChatGPT | Grok 3 |
|---|---|---|---|
| Structure | Strongest adherence to SLR standards; explicit RQs, detailed methodology, clear search strategy, and numerical study selection following PRISMA and Kitchenham. | Clear SLR structure with defined methodology and synthesis, but lacks numerical screening counts and formal quality appraisal. | Includes standard sections and screening statistics, but structure is uneven and closer to a survey. |
| Methodological Rigor | High rigor with transparent procedures; minor omissions include no PRISMA flow diagram or formal quality assessment tool. | Moderate to high rigor; suitable for a narrative SLR but less procedurally detailed. | Moderate rigor; reports inter-rater reliability but lacks balanced methodological depth. |
| Synthesis Quality | Well-integrated synthesis across domains; effectively links methods to operational needs. | Strong cross-domain conceptual synthesis with emphasis on real-world deployment challenges. | Largely descriptive; relies heavily on individual studies and metrics. |
| Domains Covered | Healthcare, finance, manufacturing, transportation, education, environment. | Healthcare, finance, manufacturing, cybersecurity, energy, transportation, agriculture. | Healthcare, finance, transportation, agriculture, manufacturing, environment. |
| Trends Identified | Explainable AI, federated learning, multimodal models, ethics. | End-to-end ML pipelines, IoT and edge computing, sustainability, governance and trust. | Explainable AI, federated learning, edge computing, multimodal learning. |
| Research Gaps | Deployment-oriented gaps; not prioritized. | Clearly derived gaps related to validation, benchmarks, lifecycle management, and socio-technical integration. | Gaps mostly listed; limited analytical grounding. |
| Hypothesis Quality | Realistic, relevant, and testable; focuses on trust and decision quality without harming performance. | Conservative, domain-agnostic, and testable; focuses on internal vs external validation performance. | Testable but theoretically weak; overstates accuracy gains from explainability. |
| References | Credible venues cited but incomplete formatting; verification needed. | Credible sources cited but presented as web references; formal formatting required. | References imprecise and inconsistently formatted; substantial verification required. |
| Strengths | Excellent structure, clear methodology, strong alignment with SLR standards, practical and realistic hypothesis. | Strong analytical synthesis, broad domain coverage, conservative and methodologically sound hypothesis. | Rich technical detail, concrete metrics, and broad exposure to recent ML methods. |
| Weaknesses | Minor reporting issues; lack of PRISMA diagram and formal quality scoring. | Missing procedural details and formal reference formatting. | Weak synthesis depth, uneven structure, and questionable hypothesis assumptions. |
| Overall Assessment | Best overall for structural rigor and methodological clarity. | Best for conceptual integration and analytical depth. | Useful as a technical survey but not a strict systematic review. |

Table 1. Comparative Assessment of Copilot, ChatGPT, and Grok 3 Responses
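The qualitative ratings in Table 1 can be made easier to compare by mapping each cell to a simple ordinal score and tallying per dimension. A minimal sketch, where the 1–3 scores are illustrative stand-ins distilled from the prose ratings above, not the result of any formal scoring procedure:

```python
# Hypothetical ordinal scores (1 = weak, 3 = strong) distilled from the
# qualitative ratings in Table 1; the numbers are illustrative, not measured.
rubric = {
    "Structure":            {"Copilot": 3, "ChatGPT": 2, "Grok 3": 1},
    "Methodological Rigor": {"Copilot": 3, "ChatGPT": 2, "Grok 3": 2},
    "Synthesis Quality":    {"Copilot": 2, "ChatGPT": 3, "Grok 3": 1},
    "Hypothesis Quality":   {"Copilot": 3, "ChatGPT": 3, "Grok 3": 1},
    "References":           {"Copilot": 2, "ChatGPT": 2, "Grok 3": 1},
}

def best_per_dimension(rubric):
    """Return the top-scoring model for each evaluation dimension."""
    return {dim: max(scores, key=scores.get) for dim, scores in rubric.items()}

def totals(rubric):
    """Sum each model's scores across all dimensions."""
    out = {}
    for scores in rubric.values():
        for model, score in scores.items():
            out[model] = out.get(model, 0) + score
    return out
```

A per-dimension winner, rather than a single aggregate score, is the more useful output here: it is exactly what drives the model-specific prompt refinements in the next section.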

3. Refine the Prompt

Refined Prompt for ChatGPT

Imagine you are conducting a 2,000-word systematic literature review on the applications of data mining and machine learning in real-world domains. Follow PRISMA-style reporting rigor. Explicitly report numerical screening counts, inclusion and exclusion decisions, and clearly describe the study selection process. Apply at least one formal quality appraisal approach to assess included studies. Synthesize findings across multiple domains, identify key trends and research gaps, and propose one conservative, testable hypothesis focused on real-world deployment challenges. Use fully formatted academic citations and maintain a rigorous academic tone.

Rationale: This prompt addresses ChatGPT’s main limitations by requiring numerical screening details, explicit quality appraisal, and formal referencing, while preserving its strong analytical synthesis and cautious hypothesis formulation.

Refined Prompt for Copilot

Imagine you are writing a 2,000-word PRISMA-compliant systematic literature review on real-world applications of data mining and machine learning. Maintain your current level of methodological detail, including explicit research questions, databases, search strings, and screening counts. In addition, include a concise quality assessment of included studies and briefly prioritize the most critical research gaps. Strengthen cross-domain synthesis by adding limited quantitative summaries where appropriate. Propose one testable hypothesis relevant to high-stakes applications, and provide fully formatted academic references.

Rationale: This prompt builds on Copilot’s strong structural rigor by encouraging deeper synthesis, gap prioritization, and improved reference formatting, without changing its core strengths.

Refined Prompt for Grok 3

Imagine you are a data scientist conducting a 2,000-word systematic literature review on how data mining and machine learning are applied in real-world domains such as healthcare, finance, transportation, manufacturing, and agriculture. Clearly distinguish between descriptive examples and cross-domain synthesis. Reduce reliance on isolated performance metrics and focus on generalizable patterns and operational implications. Ensure the review follows a PRISMA-style structure and that all sections receive balanced coverage. Propose one realistic, theory-grounded, testable hypothesis that aligns with existing evidence. Use verifiable academic citations and maintain a rigorous academic tone.

Rationale: This prompt directly targets Grok 3’s tendency toward descriptive surveys and overambitious hypotheses by emphasizing synthesis, theoretical grounding, structural balance, and citation accuracy.
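The three refined prompts share a common SLR core and differ only in the constraints targeting each model's weaknesses, so they can be composed programmatically. A minimal sketch, where the constraint strings are abbreviated stand-ins for the full refined prompts above, not the exact wording sent to each model:

```python
# Shared SLR core, reused verbatim for every model.
BASE = (
    "Imagine you are conducting a 2,000-word systematic literature review "
    "on applications of data mining and machine learning in real-world "
    "domains. Follow PRISMA-style reporting rigor."
)

# Model-specific constraints derived from the weaknesses in Table 1
# (abbreviated stand-ins for the full refined prompts above).
MODEL_CONSTRAINTS = {
    "ChatGPT": [
        "Report numerical screening counts and inclusion/exclusion decisions.",
        "Apply at least one formal quality appraisal approach.",
        "Use fully formatted academic citations.",
    ],
    "Copilot": [
        "Keep explicit research questions, databases, and search strings.",
        "Prioritize the most critical research gaps.",
        "Add limited quantitative summaries to the cross-domain synthesis.",
    ],
    "Grok 3": [
        "Distinguish descriptive examples from cross-domain synthesis.",
        "Reduce reliance on isolated performance metrics.",
        "Propose one theory-grounded, testable hypothesis.",
    ],
}

def refined_prompt(model: str) -> str:
    """Join the shared core with one model's targeted constraints."""
    return BASE + " " + " ".join(MODEL_CONSTRAINTS[model])
```

Factoring the prompts this way keeps the shared requirements identical across models, so differences in the second-round drafts can be traced to the targeted constraints alone.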

4. Cross-Model Collaboration

New Synthesis Prompt (Preferred Model)

Using the three draft literature reviews produced by different AI models, write one integrated systematic literature review of about 2,000 words on real-world applications of data mining and machine learning.
Follow PRISMA guidelines and clearly report the search strategy, screening process, inclusion and exclusion criteria, and study counts.
Combine the strongest elements from each draft, including methodology, key findings, cross-domain trends, research gaps, and practical deployment challenges.
Remove repeated content, standardize terminology, and ensure a clear and consistent academic structure.
Conclude with one conservative, testable hypothesis that focuses on real-world implementation issues rather than algorithmic novelty.
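The "remove repeated content" step of the synthesis prompt can be partially mechanized before hand-editing: normalize each sentence and keep only the first occurrence across drafts. A minimal sketch (a deliberately naive pass; real near-duplicate detection would need fuzzier matching):

```python
import re

def sentences(text: str) -> list[str]:
    """Naive sentence split on terminal punctuation; adequate for a rough pass."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def normalize(sentence: str) -> str:
    """Lowercase and strip punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", sentence.lower()).strip()

def merge_drafts(drafts: list[str]) -> list[str]:
    """Concatenate drafts, keeping only the first occurrence of each sentence."""
    seen, merged = set(), []
    for draft in drafts:
        for s in sentences(draft):
            key = normalize(s)
            if key not in seen:
                seen.add(key)
                merged.append(s)
    return merged
```

This only removes verbatim and near-verbatim repeats; paraphrased overlap between drafts still has to be resolved editorially, which is where the selection criteria in the justification below apply.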

Justification

I selected and integrated content from multiple AI-generated drafts by prioritizing methodological rigor, clarity, and relevance to real-world deployment. When drafts overlapped, I retained the version that most closely followed PRISMA guidelines and provided clearer reporting of search strategy, screening steps, and quality appraisal. I emphasized cross-domain patterns and practical implementation challenges rather than algorithmic novelty, as these issues were most consistently supported across the literature. Redundant content was removed, terminology was standardized, and domain-specific examples were included only when they contributed to broader analytical insights. The final hypothesis was chosen because it is conservative, testable, and directly grounded in the synthesized evidence. This process ensured that the final review is coherent, transparent, and academically sound.

5. Reflection

Each AI model approached the systematic review task differently. ChatGPT emphasized structure and methodological clarity, closely aligning with PRISMA standards by clearly outlining research questions, inclusion and exclusion criteria, screening stages, and quality appraisal. Grok 3 focused more on cross-domain synthesis and operational patterns, prioritizing conceptual integration and theory-driven insights over detailed methodological reporting. Copilot adopted a more evaluative and quantitative stance, foregrounding research questions, numerical summaries, and issues of trust, calibration, and uncertainty in real-world deployments.

Prompt refinements that explicitly specified structure and reporting standards produced the strongest results for ChatGPT, while synthesis-oriented prompts emphasizing patterns, resilience, and deployment context were most effective for Grok 3. For Copilot, prompts that requested explicit research questions, tables, and quantitative summaries led to clearer and more analytically focused outputs. Across models, vague prompts resulted in descriptive overviews, whereas precise constraints encouraged methodological rigor and academic tone.

Overall, this process demonstrated that AI is most effective for structured academic reviews when used iteratively and comparatively. Rather than relying on a single model, leveraging multiple AI systems allowed complementary strengths to emerge: structure from one, synthesis from another, and analytical precision from a third. The key lesson is that AI supports, but does not replace, scholarly judgment: the quality of the final review depends on the researcher’s ability to refine prompts, evaluate outputs critically, and synthesize results into a coherent academic narrative.