Making Public Comments Count: How 80 Civic-Technologists Hacked Regulatory Data for a More Transparent Government

Over a single Saturday, we built 13 open-source tools to unlock public feedback and strengthen our democracy. Here's what we built, what we learned, and what's next.


TL;DR: Federal regulations generate millions of public comments annually, but finding and analyzing them is nearly impossible due to technical barriers. Over one Saturday in July, 80 civic-technologists from government, non-profits, and tech companies built 13 open-source tools that make this data truly accessible for the first time. We created everything from simple command-line tools that let researchers download entire regulatory dockets, to AI-powered chat interfaces that answer plain-English questions about complex rules, to infrastructure improvements that make analytical queries 100x faster. More importantly, we proved that sustainable civic technology comes from diverse coalitions of subject matter experts and technologists working on validated problems, not isolated hackathon weekends. The code is live, the community is growing, and Part 2 is already in planning.

Hackathon GitHub Archive, including problem statements, code, photos, and evaluations.


On July 26th, a sweaty DMV Saturday, something remarkable happened in Southeast DC. Eighty people from different professional worlds came together with a shared conviction: that federal regulations, and the comments that shape them, should be more transparent. What emerged from that single day of collaboration was more than just code. It was a blueprint for how civic tech can mature from weekend hackathons into a sustained movement for democratic transparency.

The Problem That Brought Us Together

The story begins with a frustration that anyone who works with federal regulations knows too well. Melanie Kourbage, who leads public health informatics at the Association of Public Health Laboratories, captured it perfectly when she told us about her daily struggle: “It is often difficult to locate comments from sister organizations on Regulations.gov, even when I know they submitted them. As a workaround, I usually reach out to them directly and ask for a copy.”

Think about what that means. Here’s a professional whose job involves tracking how public health organizations engage with federal rulemaking, and she can’t even reliably find the comments she knows exist. She has to resort to calling colleagues and asking them to email her copies of documents that are supposedly public. This isn’t a minor inconvenience; it’s a fundamental breakdown in how our democracy is supposed to work.

Every year, millions of American citizens, non-profits, and businesses submit comments on proposed federal rules. These aren’t just bureaucratic formalities. These comments shape everything from how your medical data is protected to what pollutants can be released into your local waterways. The law requires agencies to read and respond to these comments before finalizing rules, and to cite them in the final rulemaking documents. In theory, this creates a direct line between public input and government action. In practice, that line is severed by technical barriers that make the data, and the process of analyzing it, practically unusable not only for the public but for the government agencies themselves, who often struggle to satisfy their legal obligation to read and respond to all public comments.

The comments exist, scattered across 27 million documents in formats ranging from PDFs to images to ebooks to spreadsheets; users can upload anything. They’re technically public through sites like Regulations.gov, but accessing them at scale requires navigating API rate limits, parsing inconsistent formats and data structures, and somehow making sense of terabytes of unstructured data. For most organizations, especially non-profits and advocacy groups with limited technical resources, this might as well be a locked vault. The data is there, but it’s not accessible.

The Coalition We Built

We knew from the start that this problem couldn’t be solved by technologists alone. As Michael Deeb, one of the directors of Civic Tech DC, likes to say, most problems are not technical problems but people and process problems. At the same time, the best policy expertise can’t overcome technical barriers without engineering support. So we designed Civic Hack DC 2025 as an intentional collision of different worlds.

The room that Saturday morning was unlike any typical hackathon. Yes, we had data engineers, software developers, and data scientists. But sitting next to them were policy analysts from non-profits who spend their days parsing the nuances of healthcare regulations. We had government technologists from the Centers for Medicare and Medicaid Services working alongside civic tech volunteers who maintain open-source projects in their spare time. Academic researchers shared tables with journalists who cover regulatory policy.

This diversity wasn’t decorative; it was essential to our success. When one team started building a sophisticated algorithm for detecting malicious campaigns based on the timing of comment submissions, a policy expert asked a simple question: “But what if those were submitted on behalf of another group? That happens all the time, it doesn’t mean they were bad actors.” That question completely reframed the problem. The team realized their work was not isolated but interdependent with other teams’ efforts, such as the entity resolution team’s tool for identifying organizations even when they’re referenced inconsistently across documents. To get the outcome they wanted, everyone needed to understand the real-world nuance of how these comments are submitted and start connecting the various work streams.

Introduction to the hackathon.

Similarly, when developers discovered they could convert the entire dataset to a more efficient format that would make queries hundreds of times faster, it was the policy experts who helped them understand which types of queries actually mattered. There’s no point optimizing for speed if you’re not enabling the analyses that practitioners actually need to perform.

The Foundation: Understanding Mirrulations

To understand what we built, you first need to understand the remarkable foundation we were building upon. The Mirrulations project, maintained by Professor Ben Coleman and his students at Moravian University, represents years of painstaking work to solve the data access problem at its root.

Regulations.gov, the federal government’s central portal for regulatory documents, has an API, but it’s throttled to 1,000 requests per hour. At that rate, downloading all 27 million documents would take literally years. The Mirrulations team solved this by pooling donated API keys from volunteers and organizations, allowing them to maintain a complete, continuously updated mirror of all the data. They extract text from PDFs, parse metadata from complex JSON structures, and organize everything into a coherent data-lake-style structure on Amazon S3.
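If you want to explore the mirror yourself, the data can be read anonymously from S3. Below is a minimal sketch using boto3; the bucket name and key layout are assumptions for illustration, so check the Mirrulations documentation for the actual structure.

```python
# Minimal sketch: list and download one docket's raw files from the Mirrulations
# S3 mirror using anonymous (unsigned) access. The bucket name and the
# agency/docket key layout below are assumptions, not the documented structure.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "mirrulations"           # assumed public bucket name
PREFIX = "CMS/CMS-2025-0001/"     # hypothetical agency/docket prefix

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_name = key.replace("/", "_")
        s3.download_file(BUCKET, key, local_name)  # one file per S3 object
        print(f"downloaded {key} ({obj['Size']} bytes)")
```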

The result is 2.3 terabytes of regulatory data, constantly updated, freely available to anyone. But even with this incredible resource, significant barriers remained. The data was still in JSON format, which is great for APIs but terrible for analysis. Many of the comments were extracted into plaintext, but many weren’t, and the extraction process left a lot of document structure, images, and context on the cutting room floor. There were no tools for easily downloading specific dockets. And most importantly, there was no way for non-technical users to explore or understand what was in this vast dataset. These were the gaps we set out to fill.

The GitHub repository for the Mirrulations project can be found here.

The Problem Map

Here’s the end-to-end system we organized around, based on feedback from various stakeholders and subject matter experts:

  • Data access and cost: Find and fetch the right materials from a 2.3 TB mirror without burning time or budget.
  • Agency scrapers: Pull in comments from agencies outside Regulations.gov (e.g., FCC, SEC) to complete the picture.
  • Data quality: Extract clean, structured text from PDFs, scans, and Word docs.
  • Integrity checks: Detect spam and coordinated campaigns so signal rises above noise.
  • Entity resolution: Identify who’s commenting (e.g., “American Medical Association” vs. “AMA”) to enable cross-docket insights.
  • Analysis and discovery: Summarize dockets, compare across related rules, and trace influence from comments to final text.
  • Usability: Make all of this explorable for non-technical users through search, dashboards, and chat.

We split teams across these layers so outputs stack and strengthen each other.

More details can be found here.

What We Built: A Portfolio of Innovation

Over the course of eight intensive hours, thirteen teams produced a remarkable portfolio of tools. Rather than competing against each other, they focused on creating solutions that work together as an ecosystem. These are the highlights.

Demos and prototypes from 13 teams.

The Foundation: Making Data Accessible

Teams recognized that all the analysis in the world wouldn’t matter if people couldn’t actually access the data. Each took a different approach to this fundamental challenge.

Mirrulations CLI

The Mirrulations CLI team created what might be the day’s simplest yet most broadly valuable contribution. They recognized that while the Mirrulations data was technically available, actually downloading and working with it required writing custom scripts and understanding the underlying S3 structure. Their solution was elegantly straightforward: package the essential tools into a professional command-line interface that anyone can install with a single command. With pip install mirrulations-cli, researchers can now download entire dockets, convert comments to CSV format, and begin analysis immediately. As one judge noted, “We should highly value contributions like this.” It’s not flashy, but it removes a critical barrier that was preventing many potential users from even getting started.
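For readers curious what “download and convert to CSV” involves under the hood, here is a rough, illustrative sketch; it is not the CLI’s code, and the JSON field names are assumptions based on the Regulations.gov format.

```python
# Illustrative only: flatten per-comment JSON files (Regulations.gov-style, with a
# data.attributes block) into a single CSV. The local layout and field names are
# assumptions; the mirrulations-cli tool handles this for you.
import glob
import json

import pandas as pd

rows = []
for path in glob.glob("CMS-2025-0001/comments/*.json"):   # hypothetical local layout
    with open(path) as f:
        doc = json.load(f)
    attrs = doc.get("data", {}).get("attributes", {})
    rows.append({
        "comment_id": doc.get("data", {}).get("id"),
        "title": attrs.get("title"),
        "comment_text": attrs.get("comment"),
        "posted_date": attrs.get("postedDate"),
    })

pd.DataFrame(rows).to_csv("comments.csv", index=False)
```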

Hive-Partitioned Parquet

While that team focused on making downloads easier, the Hive-Partitioned Parquet team reimagined how the data should be stored in the first place. The original JSON format is flexible but incredibly inefficient for analysis. Querying the full dataset meant downloading and parsing terabytes of text, a slow and expensive process even with modern cloud computing. Their conversion to Hive-partitioned Parquet files created a transformation so dramatic that queries which previously took hours and cost dollars now run in seconds for pennies. One judge, a healthcare data technologist at CMS, said it “totally made me rethink how to use S3 as a database.”
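To make the difference concrete, here is the kind of query the partitioned layout enables, sketched with DuckDB. The file paths and partition columns (agency, docket_id) are assumptions about the team’s output, not a documented schema.

```python
# Count comments per docket for one agency. Because the files are Hive-partitioned
# by agency and docket, the engine can skip partitions the WHERE clause excludes.
# With DuckDB's httpfs extension, the same query can point at s3:// URLs directly.
import duckdb

df = duckdb.sql("""
    SELECT docket_id, COUNT(*) AS n_comments
    FROM read_parquet('comments/agency=*/docket_id=*/*.parquet',
                      hive_partitioning = true)
    WHERE agency = 'CMS'
    GROUP BY docket_id
    ORDER BY n_comments DESC
""").df()
print(df.head())
```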

LLM.gov (CMS Docket Assistant)

The LLM.gov (CMS Docket Assistant) team took an entirely different approach to accessibility. Instead of making the raw data easier to work with, they asked: what if users never had to see the raw data at all? They built a chat interface that lets users ask questions about regulatory dockets in plain English. The system uses Retrieval-Augmented Generation to understand questions like “What does this rule say about patient data access?” and provide highlighted content from hundreds of pages of dense regulatory text. This wasn’t just about technical sophistication; it was about recognizing that the ultimate users aren’t data scientists but nurses, small business owners, and engaged citizens.
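Conceptually, a setup like this boils down to “find the relevant passages, then hand them to the model as context.” The sketch below shows that retrieval step in a deliberately simplified form, with TF-IDF standing in for the team’s embedding model and the actual LLM call left as a placeholder.

```python
# Simplified retrieval-augmented generation (RAG) sketch: chunk the rule text,
# rank chunks against the question, and build a prompt from the top matches.
# TF-IDF is a stand-in for a real embedding model; the LLM call is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, size: int = 800) -> list[str]:
    """Split a long document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

rule_text = open("proposed_rule.txt").read()   # hypothetical extracted rule text
chunks = chunk(rule_text)
question = "What does this rule say about patient data access?"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(chunks + [question])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

top_chunks = [chunks[i] for i in scores.argsort()[::-1][:3]]
prompt = ("Answer using only this context:\n\n"
          + "\n---\n".join(top_chunks)
          + f"\n\nQuestion: {question}")
# response = llm.generate(prompt)  # whichever model the deployment actually uses
print(prompt[:500])
```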

The Pipeline: Improving Data Quality

Teams tackled the unglamorous but essential work of improving the quality of the data itself, recognizing that better input data would improve every analysis built on top of it.

Taskmasters

The Taskmasters team built a comprehensive pipeline for handling the chaos of real-world document formats. Government comments come in everything from pristine PDFs to badly scanned images to Word documents with complex formatting. Their system extracts text from all these formats, handling edge cases like documents embedded within documents and images of handwritten comments. They even integrated keyword extraction to automatically identify the main topics in each comment. One judge praised it as a “cohesive pipeline” that addresses the data quality problem that undermines so many analyses.
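As a flavor of what that pipeline contends with, here is a minimal sketch of its first stage: try native PDF text extraction and fall back to OCR when a page turns out to be a scanned image. This is an illustration, not the team’s code, and the real pipeline handles many more formats and edge cases.

```python
# Extract text from a PDF page by page; if a page has no extractable text (likely a
# scan), render it to an image and OCR it. Requires pypdf, pdf2image, pytesseract,
# and a local Tesseract install.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    pages = []
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if not text:
            # Render just this page to an image and OCR it instead.
            image = convert_from_path(pdf_path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return "\n\n".join(pages)

print(extract_text("comment_attachment.pdf")[:1000])  # hypothetical input file
```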

The Intelligence Layer: Understanding Content

Teams built tools to extract meaning from the raw comment text, each focusing on a different aspect of understanding.

Rules Talk

The Rules Talk team created what they called a “Policy Comment Analyzer” that uses Google’s Gemini API to automatically understand both the structure of a proposed rule and the substance of comments about it. Their system doesn’t just count positive and negative responses; it identifies specific issues raised, maps them to sections of the proposed rule, and generates comprehensive reports showing how different organizations’ critiques and support fit into the broader conversation. The judges were particularly impressed by the thoughtful JSON schemas that made the output immediately useful for downstream analysis.
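To illustrate why structured output matters, here is a hypothetical sketch of the kind of schema such an analyzer might emit; the field names are ours, not the team’s. Tying each issue to a rule section and an evidence quote lets downstream tools aggregate across thousands of comments without re-parsing prose.

```python
# Hypothetical structured-output shape for a policy comment analyzer.
from typing import TypedDict

class IssueMapping(TypedDict):
    rule_section: str     # e.g. "II.B.3 Patient Access Requirements"
    issue: str            # short description of the concern or suggestion raised
    stance: str           # "support", "oppose", or "mixed"
    evidence_quote: str   # excerpt from the comment backing the mapping

class CommentAnalysis(TypedDict):
    comment_id: str
    organization: str
    summary: str
    issues: list[IssueMapping]
```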

Can of Spam

The Can of Spam team focused on a different but equally critical problem: detecting fraudulent and coordinated comment campaigns. They built sophisticated algorithms to identify temporal patterns that indicate bot submissions, such as thousands of identical comments submitted within seconds of each other. Their tool also detects template-based campaigns where slight variations mask coordinated efforts. This isn’t just about data quality; it’s about the integrity of the democratic process itself.
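One of the simpler signals is easy to picture: identical comment text arriving in a tight time window. The sketch below implements just that heuristic; the column names are assumptions about a flattened comments CSV, and the actual tool combines several signals rather than relying on any single one.

```python
# Flag groups of identical comment text posted within a short window.
import hashlib

import pandas as pd

df = pd.read_csv("comments.csv", parse_dates=["posted_at"])  # assumed columns
df["text_hash"] = df["comment_text"].fillna("").map(
    lambda t: hashlib.sha256(t.encode()).hexdigest()
)

flagged = []
for text_hash, group in df.groupby("text_hash"):
    span = group["posted_at"].max() - group["posted_at"].min()
    if len(group) >= 100 and span <= pd.Timedelta(minutes=5):
        flagged.append({"text_hash": text_hash, "count": len(group), "span": span})

print(pd.DataFrame(flagged))
```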

The Expansion: Growing the Dataset

The Scrapers

One team recognized that improving access to Regulations.gov data was only part of the solution. The Scrapers researched and prototyped tools to gather public comments from agencies that maintain their own systems outside the federal portal. They discovered that the FCC provides a reasonable API for accessing comment data, while the SEC actively blocks automated access, requiring more sophisticated approaches. Rather than just building scrapers, they documented the landscape of federal comment systems, creating a roadmap for systematically expanding the dataset to include all federal agencies. As one judge observed, “The possibility that Mirrulations could expand beyond regulations.gov is truly exciting.”
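For the FCC specifically, its Electronic Comment Filing System (ECFS) exposes a public API keyed through api.data.gov. The sketch below shows roughly what fetching filings looks like; treat the endpoint, parameter names, and response fields as assumptions to verify against the ECFS documentation.

```python
# Rough sketch of querying the FCC ECFS API for filings in one proceeding.
# Endpoint, parameters, and response fields are assumptions to double-check.
import requests

API_KEY = "YOUR_DATA_GOV_KEY"                          # free key from api.data.gov
BASE_URL = "https://publicapi.fcc.gov/ecfs/filings"    # assumed ECFS endpoint

params = {
    "api_key": API_KEY,
    "proceedings.name": "17-108",   # example proceeding number
    "limit": 25,
}
resp = requests.get(BASE_URL, params=params, timeout=30)
resp.raise_for_status()
for filing in resp.json().get("filings", []):
    print(filing.get("id_submission"), filing.get("date_received"))
```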

The Analysis: Connecting Comments to Outcomes

Within-Docket Dataset

Finally, the Within-Docket Dataset team tackled perhaps the most ambitious challenge: understanding how public comments actually influence final rules. Their tool links specific comments to changes between proposed and final rules, using a combination of time-window analysis, text similarity, and semantic matching to identify which suggestions were adopted and which concerns were addressed. One judge said that this addresses the fundamental question of whether public commenting actually matters. While the technical implementation was just beginning, the conceptual framework they developed provides a roadmap for a critical tool in the Mirrulations ecosystem.
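As a toy illustration of the matching idea, the sketch below ranks comments by textual similarity to passages that changed in the final rule. A real implementation would start from an actual proposed-versus-final diff and use semantic embeddings; the file names and columns here are hypothetical.

```python
# Rank comments against changed passages of a final rule by TF-IDF similarity.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

changes = open("final_rule_changes.txt").read().split("\n\n")  # one passage per change
comments = pd.read_csv("comments.csv")["comment_text"].fillna("").tolist()

vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(changes + comments)
sims = cosine_similarity(matrix[: len(changes)], matrix[len(changes):])

for i in range(len(changes)):
    best = sims[i].argmax()
    print(f"Change {i}: best-matching comment #{best} (score {sims[i][best]:.2f})")
```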

All the projects can be found here, and their evaluations here.

The Patterns of Success and Failure

Beyond individual projects, clear patterns emerged that distinguished successful efforts from those that struggled. These lessons are invaluable for future civic tech initiatives.

The most successful projects shared three characteristics. First, they maintained a laser focus on solving real, validated problems rather than interesting technical challenges. Second, they produced excellent documentation, recognizing that code without context is nearly useless in the civic tech space where volunteer maintainers come and go. Third, they built with integration in mind, creating tools that enhanced rather than replaced existing infrastructure.

The projects that struggled typically fell into predictable traps. Some teams got lost in technical complexity, building sophisticated solutions without the time to finish them. Others produced code that worked but was impossible for others to understand or build upon, trapped in Jupyter notebooks with hard-coded paths and missing dependencies.

The Human Impact

While we can measure success in lines of code and technical metrics, the real impact of Civic Hack DC 2025 was human. The event created connections that will outlast any individual project.

A data engineer from a tech giant sat next to a non-profit policy analyst and realized they’d been trying to solve the same problem from opposite directions. A government technologist saw a demonstration of modern data tools and immediately recognized how they could transform his agency’s approach to public comment analysis. A graduate student working on regulatory research discovered an entire community of people who cared about making government data accessible.

These connections matter because sustainable civic technology isn’t built by individuals working in isolation. It’s built by communities that combine different types of expertise, share resources, and support each other through the long, often thankless work of maintaining public goods.

The event also served as a proof point for a different model of civic technology. Too often, hackathons in this space follow a predictable pattern: enthusiastic volunteers converge on the same problems, judges award prizes, and then the projects die as volunteers return to their day jobs. We deliberately designed Civic Hack DC 2025 to break this pattern.

By starting with validated problems, mapping out the problem space, creating sustainable documentation, and framing the event as “Part 1” of a longer journey, we set expectations that this work would continue. The GitHub repository we created preserves context, documentation, and connections to upstream development. Projects aren’t abandoned; they’re critical building blocks for the teams that have the resources to continue development.

Full room during project share-outs.

How You Can Help

  • Developers and data scientists: Try the various projects, file issues, and send PRs. Help advance the projects and the community.
  • Government staff and non-profit professionals: Pilot the tools on real dockets. Tell us what workflows matter, where the data breaks, and what would make this usable day-to-day.
  • Organizations: Support public-good infrastructure with cloud credits, funding for maintenance, or staff time for testing and documentation.
  • Everyone: Share this post with colleagues who work with regulatory data and invite them to the repo.

Contact us at team@civictechdc.org and explore the projects on GitHub.

A Note of Gratitude

This event wouldn’t have been possible without the generous support of our sponsors: CareSet, Taoti Creative, Prefect, Thunder Compute, and TealWolf Consulting. Taoti Creative not only provided our venue but has been a longtime supporter of the civic tech community in DC. Their belief that creative and technical expertise should serve social good aligns perfectly with our mission.

Our partners brought essential expertise and resources. Professor Ben Coleman and Moravian University’s years of work on the Mirrulations project provided the foundation that everything else was built upon. DataKindDC’s experience running data-for-good projects helped us design an event that balanced ambition with achievability.

The judges who evaluated projects brought decades of collective experience in government technology, data science, and civic innovation. Their thoughtful feedback helped teams understand not just what they built, but why it matters and how it could be improved.

Most importantly, we’re grateful to every participant who gave up a summer Saturday to build a better government. You proved that civic technology isn’t just about apps and algorithms; it’s about people coming together to strengthen the infrastructure of democracy.

Special thanks to Michael Deeb, Helen Glover, Taylor Wilson, Evan Tung, Alma Trotter, Alex Gurvich, and Fred Trotter for their significant contributions before and during the event.

What’s Next

This was “Part 1”: a working proof that access, quality, integrity, and usability can be tackled together. Next, we’ll work with the participants to iterate on their projects, bring them to life, and expand the community. We hope to host a Part 2 event in the spring to bring these projects together into a single, comprehensive system that will allow citizens, advocates, journalists, and public servants to find, analyze, and understand regulatory feedback and how it influences government decisions.

Sustainable civic tech is infrastructure: it takes maintenance, documentation, and communities that bridge policy and engineering. The code is public. The community is growing. Join us: team@civictechdc.org.


For more information about Civic Hack DC 2025, to access the code, or to get involved in Part 2, visit our GitHub repository or contact us at team@civictechdc.org.