Hugging Face has launched Community Evals, a feature that enables benchmark datasets on the Hub to host their own ...
OpenClaw integrates VirusTotal Code Insight scanning for ClawHub skills following reports of malicious plugins, prompt injection, and exposed instances.
This repository contains the analysis code and data for METR's time horizon methodology, as described in "Measuring AI Ability to Complete Long Tasks". Repository layout: src/horizon/ # Analysis code (installable ...
Abstract: In this paper, we present CAST-Eval, a novel, comprehensive and domain-specific benchmark designed to assess the knowledge and reasoning capabilities of large language models (LLMs) in the ...
An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the si… ...