Dotmatics
  • Platform

    Scientific Intelligence Platform

    AI-powered data management and workflow automation for multimodal scientific discovery

    Learn More

    Capabilities

    Adaptive Workflows

    Customize, automate, and scale your lab workflows

    Artificial Intelligence

    Leverage AI and ML to accurately predict scientific outcomes

    Material & Ontology Management

    Classify materials and manage entities with full traceability

    Luma Products

    BioGlyph Luma

    Next-gen protein design for complex biologics – integrating molecular modeling, registration, and production with seamless data traceability and precision.

    Geneious Luma

    Accelerated antibody discovery for sequence analysis, construct design, and lab execution—integrating the power of Geneious Prime and Geneious Biologics with Luma’s adaptive workflows.

    Lab Connect

    Automated lab data ingestion and modeling—connect instruments, structure scientific data, and streamline lab operations with seamless integration.

  • Solutions

    The State of Chemicals & Materials

    Uncover key trends shaping the chemicals and materials industry

    Read More

    Solutions

    Antibody & Protein Engineering

    Integrated registration, lab workflow and data management

    Flow Cytometry

    Automated flow data processing and auto-gating

    Industry

    Biology Discovery

    Chemistry R&D

    Chemicals and Materials

  • Products

    R&D Software for Scientists

    Review our comprehensive portfolio of products driving scientific breakthroughs for R&D innovation and collaboration.

    Explore All

    BIOINFORMATICS

    SnapGene

    Geneious Prime

    Geneious Biologics

    CHEMINFORMATICS

    Vortex

    DATA ANALYSIS & VISUALIZATION

    Prism

    ELN

    ELN & Data Discovery Platform

    FLOW CYTOMETRY

    OMIQ

    FCS Express

    MULTIMODAL SCIENCE

    Scientific Intelligence Platform

    PROTEOMICS

    Protein Metrics

  • Resources

    Watch a Demo

    See Dotmatics in action with on-demand product tours and demos.

    View Demos

    Resources

    All Resources

    Explore the resource library

    Blog

    Latest insights and perspectives to lead your R&D

    Case Studies

    How our customers are using Dotmatics

    Ebooks & White Papers

    News and discoveries from industry leaders

    Videos

    On-demand videos from industry topics to product demos

    Events

    Dotmatics Summit

    Upcoming Events & Webinars

  • Company

    COMPANY

    About Us

    Careers

    Contact Us

    COMPANY

    News & Media

    Partners

    Portfolio

    Latest News

Request Demo
The Challenges of Using ChatGPT with Scientific Research Data

Will Bowers · May 11, 2023

ChatGPT and other generative large language models (LLMs) are becoming increasingly pervasive in our personal and professional lives. In the life sciences, for example, the use of AI is nothing new, but it is certainly growing. In fact, McKinsey reports that “The AI-driven drug discovery industry has grown significantly over the past decade, fueled by new entrants in the market, significant capital investment, and technology maturation.”

Data Challenges in Large Language Models 

As the use of LLMs grows, it has become clear that any potential benefits come along with numerous challenges. We must consider factors such as:

  • Data quality and transparency – Is the quality of data going into, and coming out of, generative models sufficient for its intended purpose?

  • Truth dilution – How can models and algorithms avoid perpetuating quality issues and diluting the truth?

  • Complexity management and training sufficiency – Have the models been properly trained using accurate and sufficient data? Are the questions being asked too complex or specific for general algorithms that have been built with broad training datasets? Will results be unreliable or in need of expert scrutiny?

Below, we explore some broad examples that illuminate key considerations to keep in mind as we increase our adoption of AI in scientific R&D and integrate it into our primary workflows.

1. Data Origin and Context Concerns
(As Illustrated through Novel AI-based Apps)

From the text-prompt-to-image app Craiyon to the photo-remixing tool Midjourney, AI-based apps have become increasingly popular. Growing use of such apps feeds developers more and more training data; however, there is generally insufficient assessment of whether those data are inaccurate or proprietary, as evidenced by disputes over Midjourney’s output images in which artists’ signatures were still visible. Similarly, in scientific research, data origin is of key importance: both the quality of results and the ethics of collection must be ensured.

A fun example to illustrate context concerns, specifically in using LLMs, is mixology, which in many ways is analogous to product formulations. A prominent YouTube mixologist used ChatGPT to create cocktail recipes from a preset list of ingredients. Not surprisingly, some results were unpalatable because crafting a cocktail recipe isn’t just a matter of following a defined format, but rather an art that relies heavily upon contextual application of both knowledge and sensory inputs. The mixologist’s assessment was that ChatGPT might best be used as an assistive tool, not a primary recipe generator. The role of LLMs in research must be similarly augmentative, helping to fuel scientists’ creativity, not replace it. 

2. Data Accuracy Challenges
(As Illustrated by AI-based News Articles)

AI-written articles have become more prominent than most of us realize and are a great example of data accuracy challenges. Earlier this year, BuzzFeed News reported that the technology news outlet CNET had generated 70+ articles using AI without initially disclosing it prominently. As a follow-on, BuzzFeed then used ChatGPT to generate its own article on the matter, noting that the process was error-prone and that the prompt had to be rewritten several times to avoid basic factual errors. In the scientific realm, teams go to great lengths to ensure their data are trustworthy; increased use of chat-based AI will present new challenges for doing so.

3. Error Perpetuation Potential
(As Illustrated by Natural Language Processing and AI-based Content Generation)

Lexical analysis, or natural language processing (NLP), has been around for years; there are a number of solutions for scanning papers and building semantic models. In drug discovery, for example, researchers might use such tools to scan publications and quickly uncover potential binding targets for small molecules. While these tools can rapidly parse large volumes of content, they are certainly not foolproof, and manual review is often necessary for final assessments. This is partly due to the inherent challenge of conveying complex information in written publications. What constitutes a “good” paper is a discussion far beyond the scope of this piece; but, certainly, most of us have read papers that left us wondering whether we were missing some assumed knowledge, or whether the paper was simply poorly written. Training models on such papers is bound to be challenging.
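The kind of lexical scan described above can be sketched in a few lines. This is a deliberately naive illustration, not a real NLP pipeline: the keyword lists, sample text, and co-occurrence heuristic are all invented for this example, whereas production tools use trained named-entity recognition models.

```python
import re

# Invented keyword lists -- a real pipeline would use trained NER models.
MOLECULES = {"imatinib", "aspirin"}
PROTEINS = {"BCR-ABL", "COX-1"}

def candidate_pairs(text: str):
    """Yield (molecule, protein) pairs that co-occur in one sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        mols = {m for m in MOLECULES if m.lower() in sentence.lower()}
        prots = {p for p in PROTEINS if p in sentence}
        for m in mols:
            for p in prots:
                yield m, p

text = ("Imatinib inhibits BCR-ABL in chronic myeloid leukemia. "
        "Aspirin acetylates COX-1 irreversibly.")
print(sorted(candidate_pairs(text)))
# -> [('aspirin', 'COX-1'), ('imatinib', 'BCR-ABL')]
```

Exactly the failure modes discussed above apply here: a typo in a paper, a synonym missing from the keyword list, or a negated sentence (“does not inhibit”) would all slip past this heuristic, which is why manual review remains necessary.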

Complicating matters further is the growing popularity of AI algorithms that generate new content from source materials of varying quality. The output may sound factually correct even when it isn’t, or it may be too complex and confusing to interpret. This can amplify quality issues and will likely skew toward poor quality; readers may then feel they need algorithms just to interpret information, and if those algorithms are themselves lacking, the quality problem self-perpetuates, further diluting the content and making the truth increasingly difficult to decipher.

4. Complexity, Specificity, and Training Limitations
(As Illustrated by AI-based Code Writing)

ChatGPT is also being hailed for its ability to write code; but like written language, code is an art form in its own right, and the more complex the code, the greater the chance of error. Say, for example, you ask for an “alignment algorithm” without further specification. You may be given an algorithm that can align peptide sequences but not DNA sequences. Because the letters representing DNA bases (A, C, G, and T) also denote amino acids, you might get output that runs without error yet isn’t actually what you were looking for. This leaves highly skilled people to clean up after the algorithm, when skills acquired through years of computational life sciences work might be better applied to writing and refining the algorithms themselves.
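The DNA/peptide ambiguity is easy to demonstrate. The sketch below is purely illustrative (the scoring values and “conservative substitution” groups are invented, not a real substitution matrix): the point is that every DNA string is also a valid peptide string, so an unspecified “alignment algorithm” can silently apply the wrong scoring model.

```python
DNA_ALPHABET = set("ACGT")

def looks_like_dna(seq: str) -> bool:
    # Every DNA string is ALSO a valid peptide (A=Ala, C=Cys, G=Gly,
    # T=Thr), so this heuristic can misclassify short peptides as DNA.
    return bool(seq) and set(seq.upper()) <= DNA_ALPHABET

def score_pair(a: str, b: str, mode: str) -> int:
    """Score one aligned residue pair under a DNA or protein model."""
    if mode == "dna":
        return 1 if a == b else -1  # simple match/mismatch
    # Toy protein scoring: "conservative" substitutions score higher.
    conservative = [{"I", "L", "V"}, {"S", "T"}, {"D", "E"}]
    if a == b:
        return 2
    return 1 if any({a, b} <= group for group in conservative) else -1

print(looks_like_dna("ACGT"))  # True -- but it could be Ala-Cys-Gly-Thr
print(score_pair("S", "T", "dna"))      # -1: a mismatch under DNA rules
print(score_pair("S", "T", "protein"))  #  1: a conservative substitution
```

The same pair of letters earns opposite scores depending on which model the generated code assumed, which is precisely the kind of silent mismatch described above.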

As this example illustrates, lack of specificity is a fundamental obstacle to keep in mind when employing any AI tool. Generalized models trained on huge, unspecific datasets will undoubtedly struggle in specialist areas. In drug discovery, for example, if a predictive algorithm has been trained on small molecules for protein-drug binding, the trustworthiness of its binding predictions depends on how structurally similar the input molecules are to those in the training set. In such cases, an uncertainty metric can improve transparency by letting users know the limitations of the model. This notion of trustworthiness is of key importance: models, after all, are only as good as the quality of their training data. Without transparency, how are we to know whether models were trained on insufficient, inaccurate, or improperly sourced data? While definite, confident answers like those given by ChatGPT may be attractive, those answers mean little without a trustworthiness score or insight into training-data sourcing and quality.
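As a concrete illustration of such an uncertainty metric, one common approach is an applicability-domain check: compare a query molecule’s fingerprint against the training set and flag predictions for dissimilar structures. The sketch below uses plain Python sets as stand-in fingerprints and an invented 0.5 threshold; a real pipeline would use chemistry-aware fingerprints (e.g. Morgan/ECFP) and a calibrated cutoff.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two set-based fingerprints."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def confidence_flag(query_fp, training_fps, threshold=0.5):
    """Return the max similarity to the training set plus a trust flag."""
    best = max(tanimoto(query_fp, fp) for fp in training_fps)
    return best, ("in-domain" if best >= threshold else "out-of-domain")

training = [{1, 2, 3, 4}, {2, 3, 5}, {1, 4, 6}]
print(confidence_flag({1, 2, 3}, training))  # (0.75, 'in-domain')
print(confidence_flag({7, 8, 9}, training))  # (0.0, 'out-of-domain')
```

Reporting the similarity score alongside each prediction is one simple way to give users the transparency argued for above, rather than a bare, confident answer.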

Not All Models Are Created Equal: AI in Scientific R&D

Ask any scientist and they’ll likely agree that the use of machine learning and artificial intelligence in R&D is nothing new. For more than a decade, researchers have used computational techniques for many purposes, such as finding hits, predicting binding sites, modeling drug-protein interactions, and predicting reaction rates. Most scientists will also likely agree that all models, like all data, are not created equal. To date, AI- and ML-based tools have largely been used supplementally, not exclusively; as they become more of a mainstay in our standard workflows, we must keep in mind the concerns illuminated by the examples above.

Developers of AI tools should aim to build semantic relationships into neatly organized training data and provide interpretable metrics that allow users to gauge confidence and reliability; users should not be expected to blindly take predictions at face value. It’s akin to providing a satellite navigation system that empowers drivers to see where they are and identify the best route to get where they need to be, rather than forcing upon them a self-driving vehicle that requires them to relinquish all knowledge and control. It’s about using AI to augment people’s expertise, not replace it (or them).

It all boils down to this: AI holds incredible potential to speed up work, save costs, inspire innovation, and expand the scope of possibility, but clean data, trustworthy models, and human insight remain indispensable.

Use AI in Your Scientific R&D Workflows

The Dotmatics Platform facilitates easy capture of clean data and enables the integration of AI into more extensive R&D workflows. 

Request a demonstration of Dotmatics to learn how we can help you get AI-ready.

Additional resources

View All Resources

Blog: The Dawn of AI in Drug Discovery

Blog: Lessons Learned for AI in Chemistry

Blog: Are You AI-Ready for Drug Discovery?
