[R] Seeking benchmark advice: Evaluating Graph-Oriented Generation (GOG) vs. RAG

I’m looking to continue my research into Graph-Oriented Generation (GOG) as a potential alternative to RAG, but I need to establish meaningful benchmarks. I want to build a showcase that proves whether GOG beats RAG or vice versa (honestly, who knows yet!).

So far, my testing has shown a massive reduction in token usage and compute—which is awesome. However, it comes with an extreme lack of creativity and out-of-the-box thinking from the LLM. It essentially trades associative leaps for rigid, deterministic logic. It’s a fascinating byproduct, but it means I need a highly accurate way to evaluate it.

What does it actually mean to “dethrone” or even just rival RAG in a measurable way? In my mind, it comes down to:

  1. Higher Quality Responses: (e.g., lower hallucination rates, higher factual faithfulness, better context precision)
  2. Resource Efficiency: Fewer tokens consumed per query, drastically reducing API costs and hardware needs.
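To make these two axes concrete, here is a minimal sketch of how I imagine scoring both pipelines on the same query set. Everything here is hypothetical (the `QueryResult` shape, the sample numbers); "correctness" would come from exact-match grading or an LLM judge, which isn't shown.

```python
# Sketch: comparing two pipelines (e.g., GOG vs. RAG) on quality and efficiency.
# All data below is made up; replace with real per-query results.

from dataclasses import dataclass

@dataclass
class QueryResult:
    answer_correct: bool   # graded externally (exact match or LLM judge)
    prompt_tokens: int
    completion_tokens: int

def summarize(results: list[QueryResult]) -> dict:
    """Aggregate per-query results into the two headline metrics."""
    n = len(results)
    accuracy = sum(r.answer_correct for r in results) / n
    avg_tokens = sum(r.prompt_tokens + r.completion_tokens for r in results) / n
    return {"accuracy": accuracy, "avg_tokens": avg_tokens}

# Hypothetical runs over the same two queries:
rag = [QueryResult(True, 1200, 80), QueryResult(False, 1500, 90)]
gog = [QueryResult(True, 300, 60), QueryResult(True, 280, 55)]

print("RAG:", summarize(rag))
print("GOG:", summarize(gog))
```

The key design point is running both systems over the *identical* query set so the token and accuracy deltas are directly comparable.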

Are there standard datasets (like MultiHop-RAG or TriviaQA) or specific evaluation frameworks you would recommend using to test this? I want to get more atomic with these two broad categories, but I’m relatively new to formal benchmarking.
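For anyone curious what I mean by "getting more atomic": a bare-bones eval loop over a TriviaQA-style JSONL dataset might look like the sketch below. The `generate` callable is a placeholder for either pipeline, and the normalization follows SQuAD-style exact-match conventions; field names are assumptions about the dataset format.

```python
# Minimal sketch of an exact-match eval loop over QA examples with
# "question" and "answers" fields. `generate` stands in for a GOG or
# RAG pipeline call.

import re

def normalize(text: str) -> str:
    # Lowercase, drop articles and punctuation (SQuAD-style EM normalization).
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text).split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def evaluate(generate, examples: list[dict]) -> float:
    hits = sum(exact_match(generate(ex["question"]), ex["answers"])
               for ex in examples)
    return hits / len(examples)

# Toy example with a stub pipeline:
examples = [{"question": "What is the capital of France?", "answers": ["Paris"]}]
print(evaluate(lambda q: "Paris", examples))  # 1.0
```

Faithfulness and context-precision metrics would need an LLM judge or a framework like RAGAS on top of a loop like this; exact match is just the cheapest starting point.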

For context, GOG is the first applied showcase of a broader theoretical framework I’m working on called a “Symbolic Reasoning Model.” I want to make sure the foundation is solid before building further.

Would love any advice on the best way to structure these tests!

Anyone interested can see the current benchmark code/repo here. It’s evolving and very primordial at the moment, but has potential! https://github.com/dchisholm125/graph-oriented-generation

submitted by /u/BodeMan5280
