Making large language models reliable data science programming copilots for biomedical research

  • Radenkovic, D., Keogh, S. B. & Maruthappu, M. Data science in modern evidence-based medicine. J. R. Soc. Med. 112, 493–494 (2019).

  • Ellis, L. D. To meet future needs, health care leaders must look at the data (science). Harvard T.H. Chan School of Public Health (accessed 16 September 2024).

  • Meyer, M. A. Healthcare data scientist qualifications, skills, and job focus: a content analysis of job postings. J. Am. Med. Inform. Assoc. 26, 383–391 (2019).

  • Chen, M. et al. Evaluating large language models trained on code. Preprint at (2021).

  • Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).

  • Luo, Z. et al. WizardCoder: empowering code large language models with Evol-Instruct. In The Twelfth International Conference on Learning Representations 1–21 (OpenReview, 2023).

  • Lozhkov, A. et al. StarCoder 2 and The Stack v2: the next generation. Preprint at (2024).

  • Zhang, F. et al. RepoCoder: repository-level code completion through iterative retrieval and generation. In The 2023 Conference on Empirical Methods in Natural Language Processing 2471–2484 (Association for Computational Linguistics, 2023).

  • Parvez, M. R., Ahmad, W., Chakraborty, S., Ray, B. & Chang, K.-W. Retrieval augmented code generation and summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021 2719–2734 (Association for Computational Linguistics, 2021).

  • Wang, Z. Z. et al. CodeRAG-Bench: can retrieval augment code generation? In Findings of the Association for Computational Linguistics: NAACL 2025 3199–3214 (Association for Computational Linguistics, 2025).

  • Chen, X., Lin, M., Schärli, N. & Zhou, D. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations 1–80 (OpenReview, 2024).

  • Austin, J. et al. Program synthesis with large language models. Preprint at (2021).

  • Hendrycks, D. et al. Measuring coding challenge competence with APPS. In Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track 1–11 (OpenReview, 2021).

  • Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Adv. Neural Inf. Process. Syst. 36, 21558–21575 (2023).

  • Jimenez, C. E. et al. SWE-bench: can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations 1–51 (OpenReview, 2024).

  • Huang, J. et al. Execution-based evaluation for data science code generation models. In Proc. Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances) 28–36 (Association for Computational Linguistics, 2022).

  • Lai, Y. et al. DS-1000: a natural and reliable benchmark for data science code generation. In International Conference on Machine Learning 18319–18345 (PMLR, 2023).

  • Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).

  • Tang, X. et al. BioCoder: a benchmark for bioinformatics code generation with large language models. Bioinformatics 40, i266–i276 (2024).

  • Majumder, B. P. et al. DiscoveryBench: towards data-driven discovery with large language models. In The Thirteenth International Conference on Learning Representations 1–34 (OpenReview, 2025).

  • Wang, Z., Danek, B. & Sun, J. BioDSA-1K: benchmarking data science agents for biomedical research. Preprint at (2025).

  • TrialMind Data Science Assistant. Keiji AI (2025).

  • cBioPortal for cancer genomics. cBioPortal (accessed 17 September 2024).

  • Hello GPT-4o. OpenAI (accessed 17 September 2024).

  • GPT-4o mini: advancing cost-efficient intelligence. OpenAI (accessed 17 September 2024).

  • Claude 3.5 Sonnet. Anthropic (accessed 17 September 2024).

  • Introducing the next generation of Claude. Anthropic (accessed 17 September 2024).

  • Reid, M. et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. Preprint at (2024).

  • OpenAI o3-mini: pushing the frontier of cost-effective reasoning. OpenAI (accessed 6 June 2025).

  • Grattafiori, A. et al. The Llama 3 herd of models. Preprint at (2024).

  • Guo, D. et al. DeepSeek-R1 incentivizes reasoning capability in LLMs via reinforcement learning. Nature 645, 633–638 (2025).

  • Rozière, B. et al. Code Llama: open foundation models for code. Preprint at (2024).

  • Hui, B. et al. Qwen2.5-Coder technical report. Preprint at (2024).

  • Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).

  • Brown, T. B. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems 1877–1901 (Curran Associates, 2020).

  • Khattab, O. et al. DSPy: compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations 1–31 (OpenReview, 2024).

  • Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020).

  • Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations 1–33 (OpenReview, 2023).

  • Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23, 703–713 (2017).

  • Welch, J. S. et al. TP53 and decitabine in acute myeloid leukemia and myelodysplastic syndromes. N. Engl. J. Med. 375, 2023–2036 (2016).

  • Mostavi, M., Chiu, Y.-C., Huang, Y. & Chen, Y. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med. Genomics 13, 44 (2020).

  • Yen, P.-Y., Wantland, D. & Bakken, S. Development of a customizable health IT usability evaluation scale. In AMIA Annual Symposium Proceedings Vol. 2010, 917 (American Medical Informatics Association, 2010).

  • Wang, Z. et al. Accelerating clinical evidence synthesis with large language models. npj Digit. Med. 8, 509–523 (2025).

  • Lin, J., Xu, H., Wang, Z., Wang, S. & Sun, J. Panacea: a foundation model for clinical trial search, summarization, design, and recruitment. Preprint at (2024).

  • Jin, Q. et al. Matching patients to clinical trials with large language models. Nat. Commun. 15, 9074 (2024).

  • Wang, X. et al. OpenHands: an open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations 1–8 (OpenReview, 2025).

  • Majumder, B. P. et al. Position: data-driven discovery with large generative models. In Proc. 41st International Conference on Machine Learning 34350–34382 (JMLR, 2024).

  • Grossman, R. L. et al. Toward a shared vision for cancer genomic data. N. Engl. J. Med. 375, 1109–1112 (2016).

  • Jupyter. Jupyter (accessed 23 September 2024).

  • Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  • Nie, F., Chen, M., Zhang, Z. & Cheng, X. Improving few-shot performance of language models via nearest neighbor calibration. Preprint at (2022).

  • New embedding models and API updates. OpenAI (accessed 23 September 2024).

  • Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. AutoPrompt: eliciting knowledge from language models with automatically generated prompts. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing 4222–4235 (Association for Computational Linguistics, 2020).

  • Vertex AI search. Google (accessed 23 September 2024).

  • Madaan, A. et al. Self-refine: iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 36, 46534–46594 (2023).
