Hackathon on Autopilot

Chibi rendering of ED-209, RoboCop's defective predecessor

Last weekend I field-tested a Claude Code plugin at the TwelveLabs hackathon in LA. Here’s what worked and what didn’t.

The plugin

Two days before the TwelveLabs Video Intelligence hackathon (Mar 28 to 29, WeWork Playa Vista, LA), I vibed up a hackathoner Claude Code plugin, fully hoping it might give me enough of an edge for a win. The plugin’s mission: process a hackathon’s rules page and build scaffolding so the project would run itself. CLAUDE.md, skills, slash commands, tracking issues, and an SDLC workflow tuned for a 24-hour window.

The plugin generated 12 project-level skills for this event, some general and some super-specific. The meta-prompt I used for building these skills: ‘research every sponsor tool and technical requirement, study relevant APIs, CLIs, SDKs, MCPs, and documentation, generate a skill for each tool, then generate a skill for each requirement.’

Here’s what the useful skills looked like in practice. Each one is a markdown file that gets loaded into Claude’s context when relevant:

skills/
  hackathon-rules.md      # Parsed Devpost rules, scoring rubric, deadlines
  hackathon-sdlc.md       # Plan-first workflow, branching, quality gates
  twelvelabs.md           # Marengo + Pegasus API patterns, rate limits (see below)
  aws-bedrock.md          # Model access, guardrails config
  broadcast-compliance.md # FCC, OFCOM, BCAP, GARM frameworks
  content-ratings.md      # MPAA, TV Parental Guidelines mappings
  brand-safety.md         # GARM categories, severity classifications

The plugin also generated slash commands: /hack to pick up your next issue, /checkpoint to review progress against the timeline, /hack-bug and /hack-feat for fast issue filing. Every command enforced the same rule: plans before code.

What we built

Track 3: Compliance Guardian. The app scans uploaded videos against broadcast compliance frameworks and generates violation reports with evidence timestamps. We called it 20 Seconds to Comply after RoboCop’s ED-209. (Worth noting: an autonomous weapons platform gunning someone down to a bloody heap is exactly the kind of extreme content our compliance tool was supposed to flag. Also exactly the kind of autonomous kill capability the US ‘Department of War’ blacklisted Anthropic for not building.)

Team of two. Me and my brother Steve Shotwell. His first hackathon, and my first hackathon in 12 years. He works at PlutoTV, where broadcast compliance is a real concern. That should have been our story, but wasn’t. More on that later.

Tech stack: Next.js on Vercel, TypeScript (strict), Terraform for infra, TwelveLabs Marengo for visual search and Pegasus for analysis, AWS Bedrock for guardrails. 7,117 lines of TypeScript. 513 lines of Terraform. 42 production scans. 132 violations detected across 6 compliance frameworks.

The build

UX pivot: wizard to single page

The first UI was a 4-step wizard. Upload video. Select frameworks. Configure thresholds. View report. Too many clicks between “I have a video” and “show me what’s wrong with it.” Mid-Saturday I scrapped it and rebuilt as a single-page flow. 15 commits over 2 hours. Claude executed the migration; a manual rewrite would have been 6 to 8 hours. The speed mattered because this was a direction change, not a feature add. Direction changes at hour 8 of a hackathon are typically considered risky. But with our hackathoner workflow, this one was so minor I would have forgotten to mention it (thanks for retrieving this from the jsonl logs, blog-drafter plugin!).

Terraform’s silent key wipe

Sunday, 10:16 AM. Two hours before deadline. Deploys were failing silently. The TwelveLabs API key was blank in production. Root cause: default = "" on the twelvelabs_api_key Terraform variable. Every terraform apply overwrote the key with an empty string. Commit 337869f fixed it. This is the kind of bug that takes 45 minutes to find and seconds to fix. Claude found it in the Terraform files as soon as I described the symptom. Is Terraform overkill for a hackathon? Not if it lets you turn over ALL your infra chores to Claude, it isn’t.
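For the curious, the shape of the bug looked roughly like this (variable name from the commit; surrounding config simplified):

```hcl
# Before: any `terraform apply` run without -var or TF_VAR_twelvelabs_api_key
# silently resets the key to "" in production.
variable "twelvelabs_api_key" {
  type    = string
  default = ""
}

# After: no default, so Terraform refuses to apply when the key is missing
# instead of wiping it. Marking it sensitive also keeps it out of plan output.
variable "twelvelabs_api_key" {
  type      = string
  sensitive = true
}
```

The general lesson: an empty-string default on a secret turns a missing variable into a silent overwrite.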

The dual API key disaster

I created two TwelveLabs API keys thinking we could get double the free credits. Steve’s Claude agent was using one key, and mine was using the other. Videos indexed under one key were invisible to the other. We spent an hour chasing phantom indexing failures that were actually a key mismatch: Steve’s Claude used its TwelveLabs MCP skill to helpfully index the video with its local key, while our app was using the other one.
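The guard we should have had is trivial: log a short, non-reversible fingerprint of the key at startup, so two processes (or two agents) holding different keys become visibly different. A minimal sketch (the function name and log line are hypothetical, not from our repo):

```typescript
import { createHash } from "node:crypto";

// Hash the key and keep only a short prefix: enough to compare fingerprints
// across terminals, useless for recovering the secret itself.
function keyFingerprint(apiKey: string): string {
  return createHash("sha256").update(apiKey).digest("hex").slice(0, 8);
}

// At startup, each process announces which key it's actually holding, e.g.:
// console.log(`TwelveLabs key fingerprint: ${keyFingerprint(process.env.TWELVELABS_API_KEY ?? "")}`);
```

Had both agents printed their fingerprint, the mismatch would have been a ten-second diagnosis instead of an hour-long ghost hunt.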

This is an agent attribution problem worth calling out. “Steve tested Marengo” and “Steve’s agent tested Marengo autonomously” are different statements. When two people are each running autonomous agents, you get four actors. Four actors with two API keys is chaotic.

Context exemptions (last-minute save)

Sunday, 9 AM. War footage in news clips was triggering violence violations. Michelangelo’s David was flagging for nudity. Both are correct detections and wrong violations. Implemented context-aware exemptions (educational, artistic, journalistic) in under an hour. This was the right call. It turned false positives into explainable decisions, which is exactly what the scoring rubric rewarded under “Explainability” (30% weight).
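The shape of that exemption logic is simple enough to sketch. This is a simplified reconstruction, not our actual code; the type and function names are illustrative:

```typescript
type ExemptionContext = "educational" | "artistic" | "journalistic";

interface Violation {
  category: string;           // e.g. "violence", "nudity"
  timestamp: number;          // seconds into the video
  confidence: number;
  context?: ExemptionContext; // set when analysis recognizes an exempt context
}

interface Decision {
  violation: Violation;
  flagged: boolean;
  rationale: string;
}

// Turn raw detections into explainable decisions: a correct detection in an
// exempt context becomes a documented pass instead of a false positive.
function applyExemptions(violations: Violation[]): Decision[] {
  return violations.map((v) => {
    if (v.context) {
      return {
        violation: v,
        flagged: false,
        rationale: `Detected ${v.category}, exempt as ${v.context} content`,
      };
    }
    return { violation: v, flagged: true, rationale: `Detected ${v.category}` };
  });
}
```

The key design point is that the detection is never discarded; the rationale string carries it into the report, which is what made the feature score as explainability rather than suppression.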

What the plugin actually gave me

The plugin’s main contribution wasn’t speed. It was structure that freed me to think about other problems.

Broadcast compliance is a complicated UX challenge. You’re dealing with many overlapping frameworks, severity levels, evidence timestamps, context exemptions. The obvious UI is a wall of checkboxes and tables. A better UX requires sitting with the problem, trying things and iterating. That’s where I spent most of my conscious effort: exploring how to indicate ‘findings’ on the video timeline and improving the ergonomics of the human review system.

I could do that because I wasn’t also tracking priorities, writing boilerplate plans, triaging bugs, or remembering which API had which rate limit. The plugin handled the project management layer. Claude handled the implementation grunt work. Steve and I handled the design questions: Which violations matter most? What combination of test clips and prompts will demonstrate an actually valuable time-saving workflow? When should the tool explain itself vs. just show the evidence?

Even with multiple autonomous agents working in parallel, someone still needs to be the taste filter. The plugin’s job was to make sure that person wasn’t also the project manager.

Claude matched my energy

The closer I got to the deadline, the worse my engineering habits got. I stopped writing plans. I stopped running tests. I started saying “just fix it and deploy.” Claude accommodated all of it.

The final 67 minutes tell the story. Between 11:00 AM and 12:07 PM on Sunday, Claude made 11 commits. All bug fixes and deploy triggers. No human-initiated features. I was recording the demo video. Claude was polishing the system autonomously, fixing edge cases in the compliance report display, cleaning up failed evidence references, adjusting confidence scores.

I’d spent 22 hours training Claude on how this project worked: the skills, the CLAUDE.md rules, the patterns. In the final hour, the system was taking care of business itself while I scrambled.

This is not entirely a brag. Time pressure pushed me to remove my human self from the loop.

The worst 15 minutes

12:15 PM. Fifteen minutes before the hard submission deadline. I’m solo-recording a screen share with voiceover. Steve is patiently asking how he can help. There’s no way to bring him in. The demo is a private struggle between me and a screen recorder.

Multiple use cases are brittle on camera. The compliance scan works, but the loading states are janky. The report renders, but the evidence thumbnails are broken for half the violations. I’m narrating around the rough edges in real time. We submitted the one take we had, and it was garbage. But on schedule! Thanks, YouTube, for making it possible to grab the published URL before the upload/transcode completes — it got us in under the deadline with a full 45 seconds to spare.

What the winners did differently

All three winning teams had three things in common. Team of three (not two). Polished presentation slides (not just a live demo). Someone with real domain experience who was visible in the presentation (not behind a terminal).

They told stories before they demoed features. They showed you a world where their user’s life gets better. One team walked through a day in the life of a Disney internationalization specialist. Another showed how a video producer could track not just standards and practices, but also talent releases.

We never painted a picture of a compliance officer using our tool. We just showed off a whole bunch of features, and talked about how it all worked.

I made it about me and my clever systems for configuring Claude, and dumb 80s-movie references, while the winning teams made it about the person they were building for.

First live run

The hackathoner plugin had gaps. Skill generation was indiscriminate (12 skills created, but only 8 were used). The checkpoint command tracked code progress but not demo readiness. Agent attribution was unreliable. It tried coaching, but its coaching focus was off.

The plugin was rough. But it did make it possible for 2 of us to use unfamiliar APIs and SDKs to build a gnarly-problem-solving tool in 1 day. Foolhardy to pin my hopes on a pre-alpha plugin, but I’m glad I tried.

What’s next

The hackathoner plugin drew interest at the event. James Le from TwelveLabs asked about it. At least one other participant wanted to try it. The meta-product (the plugin for hacking hackathons) has longer legs than our hackathon creation.

What the plugin really demonstrated: Claude Code’s skill and plugin system works well on process logistics. Not just code generation, but project structure, workflow enforcement, domain-specific context loading, checkpoint discipline. The pieces all work, and you can wire them together for your specific problem.

I’m refreshing the plugin based on what I learned. The changes:

  • Storytelling is a deliverable. The plugin generates technical artifacts. It should also generate a presentation outline tied to the scoring rubric. “Who is your user and what does their day look like?” can be a required field in the project scaffold.
  • Demo prep needs to start earlier. The /checkpoint command tracked code progress but never asked “can you record a clean demo right now?” That question should appear at every checkpoint after the halfway mark.
  • Agent attribution needs work. Claude confidently told me about all the things Steve had tested, when in fact it was Steve’s agent who half-assed the testing. Next version disambiguates human actions from agent actions.

The plugin is open source. It works best if you crack it open and rewire it to taste. If this is going to be your lightsaber, you bring your own crystal.

GitHub: fshot/claude-plugin-hackathoner

How the plugin works

Hackathoner is a meta-plugin. You don’t install it and start coding. You install it, point it at a Devpost rules page, and it builds a custom project repo stocked with everything your team needs: a CLAUDE.md, domain-specific skills, slash commands, a tracking issue, and an SDLC workflow. Contributors clone that repo and they’re ready to go, vibe-sprinting in parallel, with GitHub as the coordination layer, and no plugin install required on their end.

The setup pipeline (/hackprep):

  1. Parse the hackathon rules page. Extract tracks, scoring rubric, deadlines, sponsor tools, submission requirements.
  2. Research every sponsor tool. The plugin dispatches parallel agents that investigate each tool’s API surface, SDKs, auth patterns, free tier limits, Terraform support, and Claude ecosystem integrations. For this hackathon, it researched TwelveLabs, AWS Bedrock, Baseten, LTX, and TrackIt.
  3. Generate a skill for each tool and each domain. Each skill is a standalone markdown reference doc with code examples, rate limits, gotchas, and integration patterns. The TwelveLabs skill alone was 9KB of Marengo and Pegasus API patterns. The broadcast compliance skill covered FCC, OFCOM, BCAP, and GARM frameworks with detection rule templates. Twelve skills total for this event.
  4. Brainstorm constrained by the scoring rubric. The plugin generates 2 to 3 project concepts, pressure-tests the architecture, then auto-generates GitHub issues with acceptance criteria for every work item.
  5. Scaffold the project. Pick the right stack based on the architecture plan, generate stubs and mocks, wire up the test framework, commit Feature Zero (the smallest thing that proves the full stack works end-to-end).

After setup, the team runs /hack in a loop. The command detects who you are from git config, finds your highest-priority unblocked issue, checks that a plan exists (and writes one if it doesn’t), creates a worktree, executes, opens a PR, and loops back for the next issue. State lives in GitHub Issue #1, so you can close your laptop, switch machines, or hand off to a teammate. The system picks up where you left off.
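The heart of that loop is the issue picker. A simplified sketch of what “highest-priority unblocked issue” could mean (the real plugin drives the gh CLI; this pure-logic version and its names are hypothetical):

```typescript
interface Issue {
  number: number;
  priority: number;     // lower = more urgent, e.g. derived from P0/P1/P2 labels
  assignee?: string;
  blockedBy: number[];  // issue numbers this one depends on
  closed: boolean;
}

// Pick the most urgent open issue whose blockers are all closed,
// breaking ties in favor of issues already assigned to you.
function nextIssue(issues: Issue[], me: string): Issue | undefined {
  const closedNums = new Set(issues.filter((i) => i.closed).map((i) => i.number));
  const unblocked = issues.filter(
    (i) => !i.closed && i.blockedBy.every((n) => closedNums.has(n))
  );
  return unblocked.sort(
    (a, b) =>
      a.priority - b.priority ||
      Number(b.assignee === me) - Number(a.assignee === me)
  )[0];
}
```

Because the issue list and its blocking edges live in GitHub rather than in anyone’s head, the same selection runs identically on my machine, Steve’s, or a fresh clone.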

The checkpoint timeline is an 8-stage schedule (C0 through C7) that maps the full hackathon window. /checkpoint audits your progress against it, shows what’s behind, and suggests scope cuts. At this hackathon, the timeline ran from scaffold at hour 0 to submission at hour 25.5, with explicit gates for Feature Zero (hour 3), first real integration (hour 7), and “no new features” (hour 23).
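Using the gates named above, the audit step reduces to a clock comparison. A minimal sketch (only the four named gates plus scaffold are modeled; the full C0–C7 schedule has more):

```typescript
// Gates from this event's timeline, in elapsed hackathon hours.
const gates: { hour: number; name: string }[] = [
  { hour: 0, name: "scaffold" },
  { hour: 3, name: "Feature Zero" },
  { hour: 7, name: "first real integration" },
  { hour: 23, name: "no new features" },
  { hour: 25.5, name: "submission" },
];

// A /checkpoint-style audit: which gates should already be done by now,
// and which one is coming next.
function audit(elapsedHours: number): { due: string[]; next?: string } {
  return {
    due: gates.filter((g) => g.hour <= elapsedHours).map((g) => g.name),
    next: gates.find((g) => g.hour > elapsedHours)?.name,
  };
}
```

Anything in `due` that isn’t actually done is what /checkpoint surfaces as behind, along with suggested scope cuts.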

What the plugin generated for this hackathon:

  • 12 project-level skills (3 workflow, 5 platform integration, 4 domain knowledge)
  • 5 slash commands (/hack, /checkpoint, /hack-bug, /hack-feat, /hack-retro)
  • 31 timestamped plan documents in docs/plans/
  • 41 GitHub issues, 34 closed
  • 25 PRs opened, 18 merged
  • A single tracking issue that served as the live dashboard for both developers across 23 hours

The research alone (APIs, compliance frameworks, content rating systems) would have taken 15 to 20 hours of manual work. The plugin compressed it into a skimmable set of skills in about 30 minutes. More importantly, those skills let our Claude instances answer our feasibility questions accurately and work smoothly with the unfamiliar APIs.
