MedCAT + OpenSearch End-to-End Demo

This is a short, practical walkthrough showing how to turn one clinical note into searchable concepts.

It is an end-to-end example of how you can use CogStack to index, annotate, and search your free-text healthcare data.

Overview

Who this is for

This is for developers, data engineers, and analysts who want a practical example of how CogStack, MedCAT, and OpenSearch can be integrated to perform advanced search on clinical notes.

What this notebook does

Index one sample note into the discharge index
Search that note back using free text
Search that note back even when the query contains typos
Perform named entity recognition (NER) and linking by calling MedCAT Service, then index the extracted entities
Search notes by concept

The goal is to show that this process is straightforward: call one API, index results, and query them.

Prerequisites

The best way to run this notebook interactively is to run the CogStack Community Edition with Helm. Look at https://docs.cogstack.org/ to get started.

Initialisation: Define the inputs and services

Input Data

We define a short input for this tutorial. It stands in for your free-text patient data, for example a discharge summary or a long doctor's note.

The sample sentence contains concepts that the demo model packs used by MedCAT Service have been trained to recognise.

Service definitions

We will set up a client for OpenSearch and define the HTTP endpoint for MedCAT Service.

If you are using the CogStack Community Edition Helm chart, these should all be set up for you automatically through Kubernetes services and environment variables. Otherwise, change the values below accordingly.

In [ ]:

import os
from datetime import datetime, timezone
from urllib.parse import urlparse

import pandas as pd
import requests
import urllib3
from IPython.display import display
from opensearchpy import OpenSearch

# The sample note that we will work with
sample_text = "John was diagnosed with Kidney Failure"

# Service URLs from environment variables
medcat_base_url = os.getenv("MEDCAT_SERVICE_URL", "http://cogstack-medcat-service:5000").rstrip("/")
medcat_url = medcat_base_url + "/api/process"

opensearch_url = os.getenv("OPENSEARCH_URL", "https://opensearch-cluster-master:9200")
opensearch_username = os.getenv("OPENSEARCH_USERNAME", "admin")
opensearch_password = os.getenv("OPENSEARCH_PASSWORD", "opensearch-312$A")

parsed = urlparse(opensearch_url)

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

host_cfg = {
    "host": parsed.hostname,
    "port": parsed.port or (443 if parsed.scheme == "https" else 80),
}
if parsed.path and parsed.path != "/":
    host_cfg["url_prefix"] = parsed.path.lstrip("/")

client = OpenSearch(
    hosts=[host_cfg],
    http_auth=(opensearch_username, opensearch_password),
    use_ssl=(parsed.scheme == "https"),
    verify_certs=False,
)

# Hardcoded demo indices
discharge_index = "discharge"
annotations_index = "discharge_annotations"

# Static demo note id used across all steps
note_id = "demo-note-kidney-failure-001"
subject_id = 1

/opt/conda/envs/py311/lib/python3.11/site-packages/opensearchpy/connection/http_urllib3.py:214: UserWarning: Connecting to https://opensearch-cluster-master:9200 using SSL with verify_certs=False is insecure.
  warnings.warn(
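
If you need to point the notebook at a different OpenSearch deployment, it can help to see how a URL is turned into the host settings passed to the client. The sketch below wraps the same parsing logic as the cell above in a function; the name build_host_config and the example.org URL are illustrative assumptions, not part of CogStack.

```python
from urllib.parse import urlparse


def build_host_config(url: str) -> dict:
    """Turn a URL into the host dict passed to the client in hosts=[...]."""
    parsed = urlparse(url)
    cfg = {
        "host": parsed.hostname,
        # Fall back to the scheme's default port when the URL does not name one.
        "port": parsed.port or (443 if parsed.scheme == "https" else 80),
    }
    # A reverse proxy may expose OpenSearch under a sub-path.
    if parsed.path and parsed.path != "/":
        cfg["url_prefix"] = parsed.path.lstrip("/")
    return cfg


examples = [
    "https://opensearch-cluster-master:9200",
    "http://localhost:9200/",
    "https://example.org/opensearch",  # hypothetical proxied deployment
]
for url in examples:
    print(url, "->", build_host_config(url))
```

Setting OPENSEARCH_URL to any of these forms before starting the notebook should then be enough, provided the credentials match.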

1) Index the note into OpenSearch

We write the note into the discharge index with refresh=True, so it is searchable as soon as the call returns. The free-text query in the next step proves this.

In [68]:

note_doc = {
    "note_id": note_id,
    "subject_id": subject_id,
    "text": sample_text,
    "storetime": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
}

client.index(index=discharge_index, id=note_id, body=note_doc, refresh=True)

Out[68]:

{'_index': 'discharge',
 '_id': 'demo-note-kidney-failure-001',
 '_version': 10,
 'result': 'updated',
 'forced_refresh': True,
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 1009,
 '_primary_term': 1}
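
The response above reports 'updated' and version 10 rather than 'created'. That is expected here: the cell reuses one fixed document id, so re-running it overwrites the same note instead of adding duplicates. The helper below is a small sketch for reading such a response; describe_index_response is a hypothetical name, and the shard interpretation (a copy that was neither written nor failed usually means an unallocated replica, which is common on a single-node cluster) is our reading rather than something the notebook states.

```python
def describe_index_response(resp: dict) -> str:
    """Summarise an index response in one line."""
    shards = resp.get("_shards", {})
    line = (
        f"{resp.get('_id')} {resp.get('result', 'unknown')} "
        f"in {resp.get('_index')} (version {resp.get('_version')})"
    )
    # Shard copies that were neither written nor reported as failed.
    pending = shards.get("total", 0) - shards.get("successful", 0) - shards.get("failed", 0)
    if pending > 0:
        line += f", {pending} of {shards['total']} shard copies not written"
    if shards.get("failed", 0) > 0:
        line += f", {shards['failed']} failed"
    return line


# The response shown in the output above.
sample_response = {
    "_index": "discharge",
    "_id": "demo-note-kidney-failure-001",
    "_version": 10,
    "result": "updated",
    "forced_refresh": True,
    "_shards": {"total": 2, "successful": 1, "failed": 0},
    "_seq_no": 1009,
    "_primary_term": 1,
}
print(describe_index_response(sample_response))
```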

2) Search that note back using free text

This step uses a match query, so we can find notes by important words (for example John kidney) without requiring an exact full-string match.

In a traditional relational query, you would usually rely on exact equality or simple wildcard LIKE patterns. Here, OpenSearch handles tokenized full-text search for us.

In [69]:

query_text = "John kidney"
free_text_resp = client.search(
    index=discharge_index,
    body={"query": {"match": {"text": query_text}}},
)
hits = free_text_resp["hits"]["hits"]
print(f"Free-text query used: {query_text}")
print("This still returns the note even though it is not an exact full sentence match.")
print("Results from OpenSearch free-text search:")
display(pd.DataFrame([hits[0]["_source"]]))

Free-text query used: John kidney
This still returns the note even though it is not an exact full sentence match.
Results from OpenSearch free-text search:

	note_id	subject_id	text	storetime
0	demo-note-kidney-failure-001	1	John was diagnosed with Kidney Failure	2026-03-26 17:49:59
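
By default a match query treats the analysed query terms as alternatives, so a note containing only John or only kidney would also be returned. If you want every term to be present, the match query accepts an operator option. The sketch below only builds the request body; build_match_query is a hypothetical helper name.

```python
def build_match_query(field: str, text: str, operator: str = "or") -> dict:
    """Build a match query body; operator="and" requires every term to match."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


strict_body = build_match_query("text", "John kidney", operator="and")
print(strict_body)

# With the client from the setup cell, this could be run as:
# client.search(index=discharge_index, body=strict_body)
```

With the single demo note both variants return the same hit; the difference only shows once the index holds notes that mention just one of the words.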

3) Fuzzy full-text search (not exact matching)

Now we intentionally misspell the query (kidny falur) and still retrieve results.

This demonstrates why OpenSearch is useful for user-entered text and typo-tolerant retrieval.

In [70]:

fuzzy_query = "kidny falur"
fuzzy_resp = client.search(
    index=discharge_index,
    body={
        "query": {
            "match": {
                "text": {
                    "query": fuzzy_query,
                    "fuzziness": "AUTO"
                }
            }
        }
    },
)

fuzzy_hits = fuzzy_resp["hits"]["hits"]
print(f"Fuzzy query: {fuzzy_query}")
print(f"fuzzy_hits={len(fuzzy_hits)}")
# Show the hits from the fuzzy search itself, not the earlier free-text results
display(pd.DataFrame([h["_source"] for h in fuzzy_hits]))

Fuzzy query: kidny falur
fuzzy_hits=1

	note_id	subject_id	text	storetime
0	demo-note-kidney-failure-001	1	John was diagnosed with Kidney Failure	2026-03-26 17:49:59
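
To see why the misspelled query still matches, it helps to compute the edit distances by hand. The sketch below assumes the commonly documented behaviour of fuzziness AUTO: query terms of one or two characters must match exactly, terms of three to five characters may differ by one edit, and longer terms by two. It uses plain Levenshtein distance, whereas the search engine also counts a transposition as a single edit; for these two terms the result is the same. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term: str) -> int:
    """Edits allowed for a query term under the assumed AUTO thresholds (3 and 6)."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


for query_term, indexed_term in [("kidny", "kidney"), ("falur", "failure")]:
    distance = edit_distance(query_term, indexed_term)
    allowed = auto_fuzziness(query_term)
    print(f"{query_term} -> {indexed_term}: distance={distance}, allowed={allowed}, "
          f"within range={distance <= allowed}")
```

Under these assumptions only kidny is within range of its indexed term: falur is two edits from failure but, at five characters, is allowed one. The note is still returned because a match query needs only one term to match by default.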

4) Perform Named Entity Recognition and Linking with MedCAT

We have seen that we can search with free text and with fuzzy matching. However, what happens if we want to search across notes using common terminology?

We can solve this with NLP, specifically named entity recognition (NER) and linking.

To do this, we call MedCAT at /api/process with the same note text.

MedCAT returns structured entities (for example CUI and concept name). This is named entity recognition and linking in one API call.

In [71]:

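medcat_payload = {"content": {"text": sample_text}}
# Fail fast on HTTP errors instead of trying to parse an error page as JSON
medcat_response = requests.post(medcat_url, json=medcat_payload, timeout=30)
medcat_response.raise_for_status()
medcat_result = medcat_response.json()
raw_annotations = medcat_result.get("result", {}).get("annotations", [])

# A single-key dict is treated as {annotation_id: entity} and unwrapped to the entity itself
annotations = [
    next(iter(ann.values())) if len(ann) == 1 else ann
    for ann in raw_annotations
    if isinstance(ann, dict)
]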

print(f"annotations_found={len(annotations)}")
print("Results from MedCAT named entity extraction:")
display(pd.DataFrame(annotations))

annotations_found=1
Results from MedCAT named entity extraction:

	pretty_name	cui	type_ids	source_value	detected_name	acc	context_similarity	start	end	id	meta_anns	context_left	context_center	context_right
0	Kidney Failure	1	[T047]	Kidney Failure	kidney~failure	1	1	24	38	0	{}	[]	[]	[]
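
The parsing step in the cell above handles one entity per list item. If your MedCAT Service version returns several entities inside a single dict keyed by annotation id, a slightly more general flattening may be needed. The sketch below is written under that assumption; the response shapes and the second concept are made up for illustration and are not taken from the service documentation.

```python
def flatten_annotations(medcat_result: dict) -> list:
    """Return a flat list of entity dicts from a MedCAT-style response."""
    raw = medcat_result.get("result", {}).get("annotations", [])
    rows = []
    for item in raw:
        if not isinstance(item, dict) or not item:
            continue
        if all(isinstance(value, dict) for value in item.values()):
            # Looks like {annotation_id: entity, ...}: keep every entity.
            rows.extend(item.values())
        else:
            # Already a flat entity record.
            rows.append(item)
    return rows


# Hypothetical response shapes, for illustration only.
keyed_response = {"result": {"annotations": [{
    "0": {"cui": "1", "pretty_name": "Kidney Failure"},
    "1": {"cui": "2", "pretty_name": "Hypothetical Concept"},
}]}}
flat_response = {"result": {"annotations": [
    {"cui": "1", "pretty_name": "Kidney Failure"},
]}}

print(flatten_annotations(keyed_response))
print(flatten_annotations(flat_response))
```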

4.1) Index MedCAT entities into OpenSearch

Here we take each MedCAT entity and store it in OpenSearch in the discharge_annotations index.

We prefix MedCAT fields with nlp. and add meta.note_id / meta.subject_id so each entity stays linked to its source note.

In [72]:

indexed = 0
now_ts = datetime.now(timezone.utc).isoformat()

for i, ann in enumerate(annotations):
    nlp_fields = {f"nlp.{k}": v for k, v in ann.items()}

    ann_doc = {
        **nlp_fields,
        "meta.note_id": note_id,
        "meta.subject_id": subject_id,
        "timestamp": now_ts,
    }

    client.index(
        index=annotations_index,
        id=f"{note_id}-ann-{i}",
        body=ann_doc,
        refresh=False,
    )
    indexed += 1

client.indices.refresh(index=annotations_index)
print(f"indexed_annotations={indexed}")

indexed_annotations=1
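
Indexing one entity per request is fine for a single note. For larger batches, one option is to build the same documents as bulk actions and send them in a single call. The sketch below only constructs the actions, using the same field prefixing and id scheme as the loop above; build_annotation_actions is a hypothetical helper, and the commented usage assumes the bulk helper that ships with opensearch-py.

```python
def build_annotation_actions(annotations, note_id, subject_id, index_name, timestamp):
    """Yield one bulk action per entity, linked back to its source note."""
    for i, ann in enumerate(annotations):
        yield {
            "_index": index_name,
            "_id": f"{note_id}-ann-{i}",
            "_source": {
                # Prefix every MedCAT field with "nlp." as in the loop above.
                **{f"nlp.{key}": value for key, value in ann.items()},
                "meta.note_id": note_id,
                "meta.subject_id": subject_id,
                "timestamp": timestamp,
            },
        }


demo_actions = list(build_annotation_actions(
    [{"cui": "1", "pretty_name": "Kidney Failure"}],
    note_id="demo-note-kidney-failure-001",
    subject_id=1,
    index_name="discharge_annotations",
    timestamp="2026-03-26T17:50:17+00:00",
))
print(demo_actions[0])

# Possible usage with the client from the setup cell:
# from opensearchpy import helpers
# helpers.bulk(client, demo_actions)
# client.indices.refresh(index="discharge_annotations")
```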

5) Search by concept

Finally, we query discharge_annotations using the extracted concept (nlp.cui / nlp.pretty_name).

This is the main value: instead of searching raw strings, we can retrieve notes by normalized clinical concepts.

In [73]:

concept_cui = str(annotations[0].get("cui", ""))

concept_query = {
    "query": {
        "term": {
            "nlp.cui.keyword": concept_cui
        }
    }
}

concept_resp = client.search(index=annotations_index, body=concept_query)
concept_hits = concept_resp["hits"]["hits"]

print(f"Concept CUI search used: {concept_cui}")
print(f"concept_hits={len(concept_hits)}")

display(pd.DataFrame([h.get("_source", {}) for h in concept_hits]))

Concept CUI search used: 1
concept_hits=1

	nlp.pretty_name	nlp.cui	nlp.type_ids	nlp.source_value	nlp.detected_name	nlp.acc	nlp.context_similarity	nlp.start	nlp.end	nlp.id	nlp.meta_anns	nlp.context_left	nlp.context_center	nlp.context_right	meta.note_id	meta.subject_id	timestamp
0	Kidney Failure	1	[T047]	Kidney Failure	kidney~failure	1	1	24	38	0	{}	[]	[]	[]	demo-note-kidney-failure-001	1	2026-03-26T17:50:17.140165+00:00
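
The concept search returns annotation documents, not the notes themselves. To get back to the original text, one approach is to collect meta.note_id from the hits and look those ids up in the discharge index. This works here because each note was indexed with its note_id as the document id. The helper names below are illustrative, and the sample hit is trimmed to the fields that matter.

```python
def note_ids_from_hits(hits):
    """Collect the distinct source note ids from annotation hits, in order."""
    seen = []
    for hit in hits:
        note_id = hit.get("_source", {}).get("meta.note_id")
        if note_id is not None and note_id not in seen:
            seen.append(note_id)
    return seen


def build_notes_by_id_query(note_ids):
    """Build an ids query that fetches notes by document id."""
    return {"query": {"ids": {"values": list(note_ids)}}}


sample_hits = [
    {"_source": {"nlp.cui": "1", "meta.note_id": "demo-note-kidney-failure-001"}},
    {"_source": {"nlp.cui": "1", "meta.note_id": "demo-note-kidney-failure-001"}},
]
ids = note_ids_from_hits(sample_hits)
print(build_notes_by_id_query(ids))

# Possible usage with the client and hits from the cells above:
# notes = client.search(index=discharge_index,
#                       body=build_notes_by_id_query(note_ids_from_hits(concept_hits)))
```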

Summary

You have now seen the full end-to-end CogStack flow in a few simple steps:

index notes into OpenSearch
run free-text and fuzzy search over clinical text
call MedCAT to perform named entity recognition and linking
index entity outputs
retrieve notes by normalized concept (CUI)

This is the core building block for turning unstructured clinical text into searchable, analysable, and operational data.

What to do next

Visualise the data with OpenSearch Dashboards
If you have set up the CogStack Community Edition and are running on localhost, visit http://localhost:5601/ to see reports and drill down into this data in the UI.
Scale this into production ETL
Reuse these same building blocks in your pipelines: ingest note text -> index to OpenSearch -> call MedCAT -> index annotations -> query/serve downstream applications.
Use a real MedCAT model
Replace the demo model with a domain-appropriate model pack and configuration; see the MedCAT v2 README.
Explore the platform docs and examples
See full docs at docs.cogstack.org and repositories/examples at github.com/CogStack.
Add supervised learning with MedCAT Trainer (MLOps)
Set up a training and feedback loop to improve extraction quality over time using MedCAT Trainer (annotation -> train -> evaluate -> redeploy).