GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval

In this tutorial, we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation. 

Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper

import sys, subprocess
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)
import os, re, json, time, getpass
from openai import OpenAI
PROVIDERS = {
   "zai":         {"base_url": "https://api.z.ai/api/paas/v4/",   "model": "glm-5.2",        "env": "ZAI_API_KEY"},
   "openrouter":  {"base_url": "https://openrouter.ai/api/v1",    "model": "z-ai/glm-5.2",   "env": "OPENROUTER_API_KEY"},
   "together":    {"base_url": "https://api.together.xyz/v1",     "model": "zai-org/GLM-5.2","env": "TOGETHER_API_KEY"},
   "requesty":    {"base_url": "https://router.requesty.ai/v1",   "model": "zai/glm-5.2",    "env": "REQUESTY_API_KEY"},
   "huggingface": {"base_url": "https://router.huggingface.co/v1","model": "zai-org/GLM-5.2","env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG   = PROVIDERS[PROVIDER]
MODEL = CFG["model"]
def load_api_key(env_name):
   try:
       from google.colab import userdata
       v = userdata.get(env_name)
       if v: return v
   except Exception:
       pass
   if os.environ.get(env_name):
       return os.environ[env_name]
   return getpass.getpass(f"Enter your {env_name}: ")
client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}
def _track(usage):
   if usage:
       _USAGE["in"]    += getattr(usage, "prompt_tokens", 0) or 0
       _USAGE["out"]   += getattr(usage, "completion_tokens", 0) or 0
       _USAGE["calls"] += 1
def get_reasoning(obj):
   """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
   val = getattr(obj, "reasoning_content", None)
   if val: return val
   extra = getattr(obj, "model_extra", None) or {}
   if extra.get("reasoning_content"): return extra["reasoning_content"]
   try:    return obj.to_dict().get("reasoning_content")
   except Exception: return None
def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
        stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
   """
   effort:   None | "high" | "max"   (GLM-5.2 thinking-effort level; max is the model default)
   thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
   GLM-specific params go through extra_body so any OpenAI client works.
   """
   extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
   if effort and thinking: extra["reasoning_effort"] = effort
   if tool_stream:         extra["tool_stream"] = True
   kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                 temperature=temperature, stream=stream, extra_body=extra)
   if tools:
       kwargs.update(tools=tools, tool_choice=tool_choice)
   if stream:
       kwargs["stream_options"] = {"include_usage": True}
   return client.chat.completions.create(**kwargs)

We set up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define multiple provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
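
To make the wrapper's handling of GLM-specific parameters easier to inspect, the sketch below isolates the `extra_body` construction into a pure function that needs no API call. The helper name `build_extra_body` is hypothetical. The field names mirror the ones used in `chat` above, and whether a given provider honors each of them is an assumption to check against that provider's documentation.

```python
def build_extra_body(thinking=True, effort=None, tool_stream=False):
    """Hypothetical helper mirroring the extra_body logic inside chat()."""
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    # As in chat(), reasoning_effort is attached only when thinking is on.
    if effort and thinking:
        extra["reasoning_effort"] = effort
    if tool_stream:
        extra["tool_stream"] = True
    return extra


# Print the payload for three typical configurations.
for label, kw in [("thinking off", dict(thinking=False, effort="max")),
                  ("effort=high", dict(effort="high")),
                  ("max + tool stream", dict(effort="max", tool_stream=True))]:
    print(f"{label:>18}: {build_extra_body(**kw)}")
```

Note that an effort level passed together with `thinking=False` is silently dropped, which matches the wrapper's behavior.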

Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2

def demo_basic():
   print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
   resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                {"role": "user",   "content": "In one sentence, what is GLM-5.2 best at?"}],
               thinking=False, max_tokens=200)
   _track(resp.usage)
   print(resp.choices[0].message.content.strip())
def demo_effort():
   print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
   problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
              "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
              "At what clock time do they meet? Show the key steps briefly.")
   for label, kw in [("thinking OFF", dict(thinking=False)),
                     ("effort=high",  dict(thinking=True, effort="high")),
                     ("effort=max",   dict(thinking=True, effort="max"))]:
       t0 = time.time()
       resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
       dt = time.time() - t0
       _track(resp.usage)
       msg, u = resp.choices[0].message, resp.usage
       print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
       r = get_reasoning(msg)
       if r:
           print("  [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
       print("  [answer]: " + " ".join((msg.content or '').split())[:350])
def demo_streaming():
   print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
   stream = chat([{"role": "user", "content":
                   "Explain why the sky is blue, then give a one-line TL;DR."}],
                 thinking=True, effort="high", stream=True, max_tokens=1200)
   saw_r = saw_a = False
   usage = None
   for chunk in stream:
       # The final chunk carries usage and may have no choices.
       if getattr(chunk, "usage", None):
           usage = chunk.usage
       if not chunk.choices:
           continue
       delta = chunk.choices[0].delta
       r = get_reasoning(delta)
       if r:
           if not saw_r:
               print("\n[thinking] ", end="", flush=True); saw_r = True
           print(r, end="", flush=True)
       if getattr(delta, "content", None):
           if not saw_a:
               print("\n\n[answer] ", end="", flush=True); saw_a = True
           print(delta.content, end="", flush=True)
   print()
   _track(usage)

We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
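
When comparing effort levels, it helps to know the correct answer to the train problem in advance. The sketch below works it out directly: train A covers 30 km alone before 9:30, and the remaining 390 km then closes at 60 + 90 = 150 km/h. The function name `meeting_time` is hypothetical, and the code assumes train B departs no earlier than train A.

```python
from datetime import datetime, timedelta


def meeting_time(distance_km=420, speed_a=60, speed_b=90,
                 start_a="09:00", start_b="09:30"):
    """Clock time at which two trains heading toward each other meet."""
    t_a = datetime.strptime(start_a, "%H:%M")
    t_b = datetime.strptime(start_b, "%H:%M")
    # Distance train A covers alone before train B departs.
    head_start_h = (t_b - t_a).total_seconds() / 3600
    remaining_km = distance_km - speed_a * head_start_h
    # Once both trains move, the gap closes at the sum of their speeds.
    hours_to_meet = remaining_km / (speed_a + speed_b)
    return t_b + timedelta(hours=hours_to_meet)


print(meeting_time().strftime("%H:%M"))  # 12:06
```

Any mode whose answer differs from 12:06 has made an arithmetic or setup error, regardless of how many output tokens it spent.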

Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent

def tool_calculator(expression: str):
   # The hyphen must be escaped; an unescaped "+-*" is an invalid character range.
   if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
       return {"error": "unsupported characters"}
   try:    return {"result": eval(expression, {"__builtins__": {}}, {})}
   except Exception as e: return {"error": str(e)}
_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
            "sao paulo": 22_400_000, "mexico city": 21_800_000}
def tool_city_population(city: str):
   return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}
TOOLS = [
   {"type": "function", "function": {
       "name": "calculator", "description": "Evaluate basic arithmetic like '37400068/21800000'.",
       "parameters": {"type": "object", "properties": {"expression": {"type": "string"}},
                      "required": ["expression"]}}},
   {"type": "function", "function": {
       "name": "city_population", "description": "Look up the metro population of a city.",
       "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                      "required": ["city"]}}},
]
TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}
def run_tool_loop(messages, max_rounds=6, effort="max"):
   """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
   for _ in range(max_rounds):
       resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                   max_tokens=1500, temperature=0.3)
       _track(resp.usage)
       m = resp.choices[0].message
       if not getattr(m, "tool_calls", None):
           return m.content
       messages.append({
           "role": "assistant", "content": m.content or "",
           "tool_calls": [{"id": tc.id, "type": "function",
                           "function": {"name": tc.function.name,
                                        "arguments": tc.function.arguments}}
                          for tc in m.tool_calls]})
       for tc in m.tool_calls:
           try:    args = json.loads(tc.function.arguments or "{}")
           except json.JSONDecodeError: args = {}
           result = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})(**args)
           print(f"   ↳ {tc.function.name}({args}) -> {result}")
           messages.append({"role": "tool", "tool_call_id": tc.id,
                            "content": json.dumps(result)})
   return "(stopped: max tool rounds reached)"
def demo_tools():
   print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
   q = ("How many times larger is Tokyo's metro population than Mexico City's? "
        "Use the tools, then answer with the ratio to one decimal place.")
   print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))
def demo_agent():
   print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
   task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
           "then compute the combined population of the top two and report it. "
           "Use the tools for every lookup and sum; never guess numbers.")
   ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                        {"role": "user",   "content": task}])
   print("Final:", " ".join((ans or "").split()))

We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
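
The calculator tool above relies on a character whitelist plus `eval`. As an alternative, not the tutorial's own method, the sketch below walks the expression's syntax tree with the standard `ast` module and evaluates only numeric literals and a fixed set of operators. The names `safe_calculator` and `_eval_node` are hypothetical. It returns the same `{"result": ...}` / `{"error": ...}` shape, so it could be swapped into `TOOL_IMPLS` under that assumption.

```python
import ast
import operator

# Only these operators are evaluated; anything else is rejected.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
            ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a whitelisted arithmetic syntax tree."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_calculator(expression):
    """Evaluate basic arithmetic without calling eval()."""
    try:
        tree = ast.parse(expression or "", mode="eval")
        return {"result": _eval_node(tree.body)}
    except (SyntaxError, ValueError, ZeroDivisionError, TypeError) as e:
        return {"error": str(e)}


print(safe_calculator("37400068/21800000"))   # Tokyo / Mexico City ratio, about 1.7
print(safe_calculator("37400068+32900000"))   # Tokyo + Delhi combined
print(safe_calculator("__import__('os')"))    # rejected: function calls are not allowed
```

Exponentiation is deliberately left out of the operator table, so an input such as `9**9**9`, which the character whitelist would accept, is rejected instead of consuming CPU time.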

Structured JSON Output and Long-Context Retrieval with GLM-5.2

We focus on reliable, structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden “needle” and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
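
The following is a minimal sketch of the helpers this description implies, not the original listing. All names (`extract_json`, `ask_json_with_retry`, `build_haystack`) and the sample launch code are hypothetical. The model call is passed in as a plain callable, on the assumption that in the notebook it would be a thin wrapper around `chat` that returns the message content, so the logic can be exercised offline with a fake model.

```python
import json
import re

_FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability


def extract_json(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    text = (text or "").strip()
    # Unwrap a Markdown code fence if the model wrapped its answer in one.
    fenced = re.search(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


def ask_json_with_retry(ask, prompt, retries=1):
    """ask is any callable mapping a prompt string to a reply string."""
    attempt_prompt = prompt
    for _ in range(retries + 1):
        parsed = extract_json(ask(attempt_prompt))
        if parsed is not None:
            return parsed
        # Tighten the instruction before trying again.
        attempt_prompt = prompt + "\nReturn ONLY a valid JSON object, with no prose."
    return None


def build_haystack(needle, n_filler=200, position=0.5):
    """Bury one needle sentence inside n_filler lines of synthetic filler."""
    lines = [f"Log entry {i}: routine status check, nothing to report."
             for i in range(n_filler)]
    lines.insert(int(n_filler * position), needle)
    return "\n".join(lines)


# Offline demo: a fake model that fails once, then returns valid JSON.
replies = iter(["Sure! Here is the data you asked for.",
                '{"name": "GLM-5.2", "ok": true}'])
print(ask_json_with_retry(lambda p: next(replies), "Describe the model as JSON."))

needle = "The launch code is ZEBRA-4417."  # made-up value for illustration
doc = build_haystack(needle)
print(len(doc.splitlines()), "lines; needle present:", needle in doc)
```

For the retrieval check, the haystack would be sent as context with a question about the launch code, and the reply compared against the exact needle value.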

Running All Demos with GLM-5.2 Token and Cost Accounting

def cost_summary():
   print("\n=== 8. TOKEN + COST ACCOUNTING ===========================")
   cost = _USAGE["in"]/1e6*PRICE_IN_PER_M + _USAGE["out"]/1e6*PRICE_OUT_PER_M
   print(f"  calls: {_USAGE['calls']} | input: {_USAGE['in']:,} tok | output: {_USAGE['out']:,} tok")
   print(f"  estimated spend @ ${PRICE_IN_PER_M}/{PRICE_OUT_PER_M} per 1M: ${cost:0.4f}")
DEMOS = [demo_basic, demo_effort, demo_streaming, demo_tools,
        demo_agent, demo_structured, demo_long_context]
print(f"Provider={PROVIDER}   model={MODEL}")
for fn in DEMOS:
   try:    fn()
   except Exception as e:
       print(f"  [skipped {fn.__name__}: {type(e).__name__}: {e}]")
cost_summary()
print("\nDone. Tweak PROVIDER / effort / max_tokens and re-run any demo function.")

We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the entire notebook, making the tutorial easier to run, debug, and reuse.
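
As a worked instance of the cost formula, the sketch below applies the same per-million-token prices the notebook sets in `PRICE_IN_PER_M` and `PRICE_OUT_PER_M`. The function name `estimate_cost` and the token counts are hypothetical, and the prices should be treated as placeholders to check against the chosen provider's current rates.

```python
def estimate_cost(input_tokens, output_tokens,
                  price_in_per_m=1.40, price_out_per_m=4.40):
    """Estimated spend in dollars, given prices per one million tokens."""
    return (input_tokens / 1e6 * price_in_per_m
            + output_tokens / 1e6 * price_out_per_m)


# 12,000 input tokens cost $0.0168 and 3,500 output tokens cost $0.0154.
print(f"${estimate_cost(12_000, 3_500):.4f}")  # $0.0322
```

Because output tokens are priced roughly three times higher than input tokens here, reasoning-heavy runs at higher effort levels tend to dominate the total.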

Conclusion

In conclusion, we have a practical and reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare different thinking modes, connect it to tools, validate structured outputs, test long-context inputs, and monitor token usage with an estimated cost. This gives us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. The resulting setup is lightweight enough for Colab yet close to how we would build with GLM-5.2 in a real project.


The post GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval appeared first on MarkTechPost.
