[R] Seeking benchmark advice: Evaluating Graph-Oriented Generation (GOG) vs. RAG
I’m looking to continue my research into Graph-Oriented Generation (GOG) as a potential alternative to RAG, but I need to establish meaningful benchmarks. I want to build a showcase that proves whether GOG beats RAG or vice versa (honestly, who knows yet!).
So far, my testing has shown a massive reduction in token usage and compute, which is awesome. However, it comes with an extreme lack of creativity and out-of-the-box thinking from the LLM: it essentially trades associative leaps for rigid, deterministic logic. That's a fascinating byproduct, but it means I need a rigorous way to evaluate the trade-off.
What does it actually mean to “dethrone” or even just rival RAG in a measurable way? In my mind, it comes down to:
- Higher-quality responses: e.g., lower hallucination rates, higher factual faithfulness, better context precision.
- Resource efficiency: fewer tokens per query, drastically reducing API costs and hardware needs.
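To make the two categories above concrete, here's a minimal per-query harness sketch in pure Python. Everything in it is a placeholder assumption, not anything from the GOG repo: `run_gog` / `run_rag` stand in for the real pipelines, and exact-match is just a crude quality proxy you'd swap for faithfulness or context-precision scoring.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryResult:
    answer: str
    prompt_tokens: int
    completion_tokens: int

def exact_match(pred: str, gold: str) -> bool:
    # Crude quality proxy; replace with faithfulness /
    # context-precision metrics (e.g. an LLM judge) later.
    return pred.strip().lower() == gold.strip().lower()

def evaluate(system: Callable[[str], QueryResult],
             dataset: list[tuple[str, str]]) -> dict:
    """Run one system over (question, gold_answer) pairs and
    aggregate the two axes: answer quality and token cost."""
    correct = 0
    total_tokens = 0
    for question, gold in dataset:
        result = system(question)
        correct += exact_match(result.answer, gold)
        total_tokens += result.prompt_tokens + result.completion_tokens
    return {
        "accuracy": correct / len(dataset),
        "avg_tokens": total_tokens / len(dataset),
    }

# Hypothetical stand-ins for the real GOG / RAG pipelines:
def run_gog(q: str) -> QueryResult:
    return QueryResult("paris", prompt_tokens=120, completion_tokens=8)

def run_rag(q: str) -> QueryResult:
    return QueryResult("Paris", prompt_tokens=900, completion_tokens=40)

dataset = [("What is the capital of France?", "Paris")]
print(evaluate(run_gog, dataset))  # {'accuracy': 1.0, 'avg_tokens': 128.0}
print(evaluate(run_rag, dataset))  # {'accuracy': 1.0, 'avg_tokens': 940.0}
```

The point of the shape: keep one `evaluate` function and feed both systems through it, so any metric you add (hallucination rate, retrieval precision, latency) applies identically to GOG and RAG.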
Are there standard datasets (like MultiHop-RAG or TriviaQA) or specific evaluation frameworks you would recommend for testing this? I want to break these two broad categories down into more atomic metrics, but I'm relatively new to formal benchmarking.
For context, GOG is the first applied showcase of a broader theoretical framework I’m working on called a “Symbolic Reasoning Model.” I want to make sure the foundation is solid before building further.
Would love any advice on the best way to structure these tests!
—
Anyone interested can see the current benchmark code/repo here. It's very early and still evolving, but it has potential! https://github.com/dchisholm125/graph-oriented-generation
submitted by /u/BodeMan5280