MedCAT + OpenSearch End-to-End Demo¶
This is a short, practical walkthrough showing how to turn one clinical note into searchable concepts.
It is an end-to-end example of how you can use CogStack to unlock the power of your healthcare data.
Overview¶
Who this is for¶
This is for developers, data engineers, and analysts who want a practical example of how CogStack, MedCAT, and OpenSearch can be integrated to run advanced searches over clinical notes.
What this notebook does¶
- Index one sample note into the discharge index
- Search that note back using free text
- Search that note back even when the query contains typos
- Perform Named Entity Resolution (NER) by calling MedCAT Service, then index the resulting entities
- Search notes by concept
The goal is to show that this process is straightforward: call one API, index results, and query them.
Prerequisites¶
The best way to run this notebook interactively is to run the CogStack Community Edition with Helm. Look at https://docs.cogstack.org/ to get started.
Initialisation: Define the inputs and services¶
Input Data¶
We define a short input for this tutorial. It represents your free-text patient data, for example a discharge summary or a long doctor's note.
The sample sentence contains concepts that the example demo model packs used by MedCAT Service have been trained to recognise.
Service definitions¶
We will set up a client for OpenSearch and define the HTTP endpoint for MedCAT Service.
If you are using the CogStack Community Edition Helm chart, these should all be set up for you automatically through Kubernetes services and environment variables. Otherwise, change them to match your environment.
import os
from datetime import datetime, timezone
from urllib.parse import urlparse
import pandas as pd
import requests
import urllib3
from IPython.display import display
from opensearchpy import OpenSearch
# The sample note that we will work with
sample_text = "John was diagnosed with Kidney Failure"
# Service URLs from environment variables
medcat_base_url = os.getenv("MEDCAT_SERVICE_URL", "http://cogstack-medcat-service:5000").rstrip("/")
medcat_url = medcat_base_url + "/api/process"
opensearch_url = os.getenv("OPENSEARCH_URL", "https://opensearch-cluster-master:9200")
opensearch_username = os.getenv("OPENSEARCH_USERNAME", "admin")
opensearch_password = os.getenv("OPENSEARCH_PASSWORD", "opensearch-312$A")
parsed = urlparse(opensearch_url)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
host_cfg = {
    "host": parsed.hostname,
    "port": parsed.port or (443 if parsed.scheme == "https" else 80),
}
if parsed.path and parsed.path != "/":
    host_cfg["url_prefix"] = parsed.path.lstrip("/")
client = OpenSearch(
    hosts=[host_cfg],
    http_auth=(opensearch_username, opensearch_password),
    use_ssl=(parsed.scheme == "https"),
    verify_certs=False,
)
# Hardcoded demo indices
discharge_index = "discharge"
annotations_index = "discharge_annotations"
# Static demo note id used across all steps
note_id = "demo-note-kidney-failure-001"
subject_id = 1
/opt/conda/envs/py311/lib/python3.11/site-packages/opensearchpy/connection/http_urllib3.py:214: UserWarning: Connecting to https://opensearch-cluster-master:9200 using SSL with verify_certs=False is insecure. warnings.warn(
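The warning above appears because the demo disables certificate verification. Outside a demo you would probably want to verify the cluster certificate. The following is a minimal sketch, assuming a CA bundle is available at a path of your choosing. `build_client_kwargs` and its `ca_certs_path` argument are hypothetical names introduced here, while `use_ssl`, `verify_certs`, and `ca_certs` are keyword arguments accepted by `opensearchpy.OpenSearch`.

```python
from urllib.parse import urlparse


def build_client_kwargs(url, username, password, ca_certs_path=None):
    """Build keyword arguments for opensearchpy.OpenSearch from a URL.

    Hypothetical helper: certificates are verified only when a CA bundle
    path is supplied; otherwise it mirrors the demo's insecure settings.
    """
    parsed = urlparse(url)
    use_ssl = parsed.scheme == "https"
    host_cfg = {
        "host": parsed.hostname,
        "port": parsed.port or (443 if use_ssl else 80),
    }
    if parsed.path and parsed.path != "/":
        host_cfg["url_prefix"] = parsed.path.lstrip("/")

    kwargs = {
        "hosts": [host_cfg],
        "http_auth": (username, password),
        "use_ssl": use_ssl,
        "verify_certs": bool(use_ssl and ca_certs_path),
    }
    if kwargs["verify_certs"]:
        kwargs["ca_certs"] = ca_certs_path
    return kwargs


# Example with an assumed CA bundle path and a placeholder password.
kwargs = build_client_kwargs(
    "https://opensearch-cluster-master:9200",
    "admin",
    "example-password",
    ca_certs_path="/path/to/root-ca.pem",
)
print(kwargs["hosts"], kwargs["verify_certs"])
```

The resulting dictionary could then be passed as `OpenSearch(**kwargs)` in place of the inline arguments used in this notebook.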
1) Index the note into OpenSearch¶
We write the note into the discharge index with refresh=True, so it is searchable immediately. The next step runs a free-text query against it to prove this.
note_doc = {
    "note_id": note_id,
    "subject_id": subject_id,
    "text": sample_text,
    "storetime": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
}
client.index(index=discharge_index, id=note_id, body=note_doc, refresh=True)
{'_index': 'discharge',
'_id': 'demo-note-kidney-failure-001',
'_version': 10,
'result': 'updated',
'forced_refresh': True,
'_shards': {'total': 2, 'successful': 1, 'failed': 0},
'_seq_no': 1009,
'_primary_term': 1}
2) Search that note back using free text¶
This query uses match search, so we can find notes by important words (for example John kidney) without requiring an exact full-string match.
In a traditional relational query, you would usually rely on exact equality or simple wildcard LIKE patterns. Here, OpenSearch handles tokenized full-text search for us.
query_text = "John kidney"
free_text_resp = client.search(
    index=discharge_index,
    body={"query": {"match": {"text": query_text}}},
)
hits = free_text_resp["hits"]["hits"]
print(f"Free-text query used: {query_text}")
print("This still returns the note even though it is not an exact full sentence match.")
print("Results from OpenSearch free-text search:")
display(pd.DataFrame([hits[0]["_source"]]))
Free-text query used: John kidney
This still returns the note even though it is not an exact full sentence match.
Results from OpenSearch free-text search:
| | note_id | subject_id | text | storetime |
|---|---|---|---|---|
| 0 | demo-note-kidney-failure-001 | 1 | John was diagnosed with Kidney Failure | 2026-03-26 17:49:59 |
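By default, a match query returns documents that contain any of the query terms. If every term should be present, the match query accepts an `operator` option. The sketch below only builds the request body; `build_match_query` is a hypothetical helper introduced for illustration.

```python
def build_match_query(field, text, operator="or"):
    """Return a search body for a full-text match query on one field.

    operator="or" matches documents containing any term;
    operator="and" requires all terms to be present.
    """
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


# "John kidney" with AND: both terms must appear in the note text.
strict_body = build_match_query("text", "John kidney", operator="and")
print(strict_body)

# Usage with the client defined earlier (not executed here):
# client.search(index=discharge_index, body=strict_body)
```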
3) Fuzzy full-text search (not exact matching)¶
Now we intentionally misspell the query (kidny falur) and still retrieve results.
This demonstrates why OpenSearch is useful for user-entered text and typo-tolerant retrieval.
fuzzy_query = "kidny falur"
fuzzy_resp = client.search(
    index=discharge_index,
    body={
        "query": {
            "match": {
                "text": {
                    "query": fuzzy_query,
                    "fuzziness": "AUTO"
                }
            }
        }
    },
)
fuzzy_hits = fuzzy_resp["hits"]["hits"]
print(f"Fuzzy query: {fuzzy_query}")
print(f"fuzzy_hits={len(fuzzy_hits)}")
# Display the hits of the fuzzy search itself, not those of the earlier free-text search
display(pd.DataFrame([h["_source"] for h in fuzzy_hits]))
Fuzzy query: kidny falur
fuzzy_hits=1
| | note_id | subject_id | text | storetime |
|---|---|---|---|---|
| 0 | demo-note-kidney-failure-001 | 1 | John was diagnosed with Kidney Failure | 2026-03-26 17:49:59 |
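The "AUTO" fuzziness setting picks an allowed edit distance from the length of each query term. Here is a minimal sketch of that idea, assuming the default thresholds (equivalent to AUTO:3,6) and plain Levenshtein distance; the search engine itself may additionally count a transposition as a single edit. `levenshtein` and `auto_fuzziness` are illustrative helpers, not part of any library.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, or substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Edits allowed for a term under the assumed default AUTO:3,6 thresholds."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


for typo, indexed_term in [("kidny", "kidney"), ("falur", "failure")]:
    distance = levenshtein(typo, indexed_term)
    allowed = auto_fuzziness(typo)
    print(typo, indexed_term, distance, allowed, distance <= allowed)
```

Under these assumptions, `kidny` is one edit away from `kidney` and falls within its allowance, while `falur` is two edits away from `failure` and would not match on its own. The note would still be returned, because a match query needs only one term to match by default.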
4) Perform Named Entity Resolution with MedCAT¶
We have seen that we can search with free text and with fuzzy matching. However, what happens if we want to search across notes using common terminology?
We can solve this by using named entity resolution (NER) and NLP.
To do this we will call MedCAT at /api/process with the same note text.
MedCAT returns structured entities (for example CUI and concept name). This is named entity resolution in one API call.
medcat_payload = {"content": {"text": sample_text}}
medcat_result = requests.post(medcat_url, json=medcat_payload, timeout=30).json()
raw_annotations = medcat_result.get("result", {}).get("annotations", [])
annotations = [
    next(iter(ann.values())) if isinstance(ann, dict) and len(ann) == 1 else ann
    for ann in raw_annotations
    if isinstance(ann, dict)
]
print(f"annotations_found={len(annotations)}")
print("Results from MedCAT named entity extraction:")
display(pd.DataFrame(annotations))
annotations_found=1
Results from MedCAT named entity extraction:
| | pretty_name | cui | type_ids | source_value | detected_name | acc | context_similarity | start | end | id | meta_anns | context_left | context_center | context_right |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kidney Failure | 1 | [T047] | Kidney Failure | kidney~failure | 1 | 1 | 24 | 38 | 0 | {} | [] | [] | [] |
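The list comprehension in this step accepts two response shapes: a list of annotation dicts, or a list of single-key wrappers around them. The same logic can be sketched as a small reusable function. The response shape is an assumption inferred from the code in this notebook, `extract_annotations` is a hypothetical helper, and the sample payload is hand-written for illustration rather than captured from a live service.

```python
def extract_annotations(medcat_result):
    """Flatten the annotations of a MedCAT Service response into a list of dicts.

    Assumes the shape used in this notebook: annotations live under
    result.annotations, and each item is either an annotation dict or a
    single-key wrapper such as {"0": {...annotation...}}.
    """
    raw = (medcat_result.get("result") or {}).get("annotations") or []
    flat = []
    for item in raw:
        if not isinstance(item, dict):
            continue  # skip anything that is not a mapping
        if len(item) == 1 and isinstance(next(iter(item.values())), dict):
            flat.append(next(iter(item.values())))  # unwrap {"id": annotation}
        else:
            flat.append(item)
    return flat


# Hand-written sample payload, for illustration only.
sample_response = {
    "result": {
        "annotations": [
            {"0": {"pretty_name": "Kidney Failure", "cui": "1", "start": 24, "end": 38}}
        ]
    }
}
demo_annotations = extract_annotations(sample_response)
print([(a["cui"], a["pretty_name"]) for a in demo_annotations])
```

Unlike the inline comprehension, this version unwraps a single-key dict only when its value is itself a dict, which avoids mistaking a one-field annotation for a wrapper.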
4.1) Index MedCAT entities into OpenSearch¶
Here we take each MedCAT entity and store it in the discharge_annotations index in OpenSearch.
We prefix MedCAT fields with nlp. and add meta.note_id / meta.subject_id so each entity stays linked to its source note.
indexed = 0
now_ts = datetime.now(timezone.utc).isoformat()
for i, ann in enumerate(annotations):
    nlp_fields = {f"nlp.{k}": v for k, v in ann.items()}
    ann_doc = {
        **nlp_fields,
        "meta.note_id": note_id,
        "meta.subject_id": subject_id,
        "timestamp": now_ts,
    }
    client.index(
        index=annotations_index,
        id=f"{note_id}-ann-{i}",
        body=ann_doc,
        refresh=False,
    )
    indexed += 1
client.indices.refresh(index=annotations_index)
print(f"indexed_annotations={indexed}")
indexed_annotations=1
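Indexing one annotation per request is fine for a single note, but with many annotations you may prefer to send them in one bulk request. The sketch below only builds the bulk actions and runs without a cluster. `build_annotation_actions` is a hypothetical helper, the sample annotation is illustrative, and the commented usage assumes the `helpers.bulk` function from `opensearchpy`.

```python
def build_annotation_actions(annotations, note_id, subject_id, index_name, timestamp):
    """Yield one bulk action per annotation, using the same document layout as the loop."""
    for i, ann in enumerate(annotations):
        source = {f"nlp.{key}": value for key, value in ann.items()}
        source["meta.note_id"] = note_id
        source["meta.subject_id"] = subject_id
        source["timestamp"] = timestamp
        yield {
            "_index": index_name,
            "_id": f"{note_id}-ann-{i}",
            "_source": source,
        }


actions = list(build_annotation_actions(
    [{"cui": "1", "pretty_name": "Kidney Failure"}],  # illustrative annotation
    note_id="demo-note-kidney-failure-001",
    subject_id=1,
    index_name="discharge_annotations",
    timestamp="2026-03-26T17:50:17+00:00",
))
print(actions[0]["_id"])

# Usage with the client defined earlier (not executed here):
# from opensearchpy import helpers
# helpers.bulk(client, actions)
# client.indices.refresh(index="discharge_annotations")
```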
5) Search by concept¶
Finally, we query discharge_annotations using the extracted concept (nlp.cui / nlp.pretty_name).
This is the main value: instead of searching raw strings, we can retrieve notes by normalized clinical concepts.
concept_cui = str(annotations[0].get("cui", ""))
concept_query = {
    "query": {
        "term": {
            "nlp.cui.keyword": concept_cui
        }
    }
}
concept_resp = client.search(index=annotations_index, body=concept_query)
concept_hits = concept_resp["hits"]["hits"]
print(f"Concept CUI search used: {concept_cui}")
print(f"concept_hits={len(concept_hits)}")
display(pd.DataFrame([h.get("_source", {}) for h in concept_hits]))
Concept CUI search used: 1
concept_hits=1
| | nlp.pretty_name | nlp.cui | nlp.type_ids | nlp.source_value | nlp.detected_name | nlp.acc | nlp.context_similarity | nlp.start | nlp.end | nlp.id | nlp.meta_anns | nlp.context_left | nlp.context_center | nlp.context_right | meta.note_id | meta.subject_id | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kidney Failure | 1 | [T047] | Kidney Failure | kidney~failure | 1 | 1 | 24 | 38 | 0 | {} | [] | [] | [] | demo-note-kidney-failure-001 | 1 | 2026-03-26T17:50:17.140165+00:00 |
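The concept search returns annotation documents. To get back to the notes themselves, one option is to collect each hit's meta.note_id and fetch those documents from the discharge index, which works here because the notes were indexed with the note id as the document id. This is a minimal sketch; `note_ids_from_annotation_hits` and `build_ids_query` are hypothetical helpers, and the sample hit is hand-written.

```python
def note_ids_from_annotation_hits(hits):
    """Collect distinct source-note ids from annotation hits, preserving order."""
    seen = []
    for hit in hits:
        note_id = hit.get("_source", {}).get("meta.note_id")
        if note_id is not None and note_id not in seen:
            seen.append(note_id)
    return seen


def build_ids_query(note_ids):
    """Return a search body that fetches documents by their _id values."""
    return {"query": {"ids": {"values": list(note_ids)}}}


# Hand-written hits in the shape returned by the concept search.
sample_hits = [
    {"_source": {"nlp.cui": "1", "meta.note_id": "demo-note-kidney-failure-001"}},
    {"_source": {"nlp.cui": "1", "meta.note_id": "demo-note-kidney-failure-001"}},
]
notes_query = build_ids_query(note_ids_from_annotation_hits(sample_hits))
print(notes_query)

# Usage with the client defined earlier (not executed here):
# client.search(index=discharge_index, body=notes_query)
```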
Summary¶
You have now seen the full end-to-end CogStack flow in a few simple steps:
- index notes into OpenSearch
- run free-text and fuzzy search over clinical text
- call MedCAT to perform named entity resolution
- index entity outputs
- retrieve notes by normalized concept (CUI)
This is the core building block for turning unstructured clinical text into searchable, analysable, and operational data.
What to do next¶
- Visualise the data with OpenSearch Dashboards
  If you have set up the CogStack Community Edition and are running on localhost, visit http://localhost:5601/ to see reports and drill down on this data in the UI.
- Scale this into production ETL
  Use these same blocks in your pipelines: ingest note text -> index to OpenSearch -> call MedCAT -> index annotations -> query/serve downstream applications.
- Use a real MedCAT model
  Replace the demo model with a domain-appropriate model pack and configuration: MedCAT v2 README.
- Explore the platform docs and examples
  See full docs at docs.cogstack.org and repositories/examples at github.com/CogStack.
- Add supervised learning with MedCAT Trainer (MLOps)
  Set up a training and feedback loop to improve extraction quality over time using MedCAT Trainer (annotation -> train -> evaluate -> redeploy).
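As a starting point for the ETL flow mentioned in the list above, the steps can be sketched as one function with the service calls injected as callables. This is an illustrative outline rather than a production design; `run_pipeline` and its parameters are hypothetical, and the stand-in callables replace the real OpenSearch and MedCAT calls so the sketch runs without any services.

```python
def run_pipeline(notes, index_note, annotate, index_annotations):
    """Process notes one by one: index text, annotate it, index the annotations.

    The three callables are injected so that the OpenSearch and MedCAT calls
    from this notebook (or test doubles) can be plugged in.
    Returns the number of annotations indexed per note id.
    """
    counts = {}
    for note in notes:
        index_note(note)
        annotations = annotate(note["text"])
        index_annotations(note, annotations)
        counts[note["note_id"]] = len(annotations)
    return counts


# Stand-in callables so the sketch runs without any services.
indexed_notes, indexed_anns = [], []
counts = run_pipeline(
    notes=[{"note_id": "n1", "text": "John was diagnosed with Kidney Failure"}],
    index_note=indexed_notes.append,
    annotate=lambda text: [{"cui": "1", "pretty_name": "Kidney Failure"}],
    index_annotations=lambda note, anns: indexed_anns.extend(anns),
)
print(counts)
```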