Entity resolution in legal data.

One of the most underappreciated challenges in legal data analytics is entity fragmentation: the same economic actor (a builder, a carrier, a subcontractor) appears across public records under dozens or hundreds of different legal names. Solving this problem is not a manual research task. It is a machine learning task, and it is foundational to any serious attempt at litigation market intelligence.

Why legal entities fragment in the construction industry

Large homebuilders and commercial contractors do not operate through a single legal entity. They structure their operations through networks of subsidiaries, often organized by geography (state, county, or region), by project type (residential, commercial, mixed-use), or by division. This structure reflects legitimate business reasons: liability isolation, financing structures, joint venture arrangements, and operational efficiency.

The result, in the public record, is fragmentation. A national homebuilder with significant Florida operations may appear under a different legal name in Hillsborough County than it does in Manatee County — and under still different names in DBPR licensing records, permit applications, and court filings. Each subsidiary is a distinct legal entity, but they collectively constitute one economic actor whose behavior, in aggregate, is what matters for litigation intelligence.

This fragmentation is not a corner case. It is the norm for any builder operating at scale. And without a solution to the entity-linking problem, any analysis of builder behavior across public records is analyzing the wrong unit: individual subsidiaries rather than builder portfolios.

Why manual matching fails at scale

The straightforward approach to entity linking — an attorney or researcher manually recognizing that “[Builder] of Hillsborough County, LLC” and “[Builder] FL Construction Corp.” are the same economic actor — works at a small scale and breaks at a large one. For a handful of named defendants in a specific matter, manual identification is tractable. For an attempt to systematically track builder behavior across an entire state’s permit and licensing records, it is not.

Manual matching has several failure modes:

  • Name variation. Corporate name conventions are inconsistent across jurisdictions and record types. Abbreviations, punctuation, word order, and entity-type designations (LLC vs. Corp. vs. Inc.) all vary in ways that prevent simple string matching.
  • Completeness gaps. A researcher who knows to look for “[Builder] Homes” may not know to look for the dozen subsidiary entities that share the same registered agent and principal address. The connections are not discoverable by name alone.
  • Scale. Florida has hundreds of thousands of active licensed contractors. Tracking all of them — and their subsidiaries — is not a human-scale task.
  • Maintenance. Corporate registrations change over time. New subsidiaries are formed; old ones are dissolved. A one-time manual exercise starts going stale as soon as it is finished.
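
The name-variation failure mode can be made concrete with a short sketch. The builder name "Acme" and the suffix list below are hypothetical, chosen only for illustration. Exact comparison fails on two spellings of the same name, and simple normalization recovers only the easy case:

```python
import re

# Hypothetical, non-exhaustive list of entity-type designators to strip.
ENTITY_SUFFIXES = {"llc", "inc", "corp", "corporation", "ltd", "lp"}

def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and remove entity-type designators."""
    lowered = name.lower().replace(".", "")          # "L.L.C." -> "llc"
    tokens = re.sub(r"[^a-z0-9 ]", " ", lowered).split()
    return " ".join(t for t in tokens if t not in ENTITY_SUFFIXES)

a = "Acme Homes of Hillsborough County, LLC"
b = "ACME HOMES OF HILLSBOROUGH COUNTY L.L.C."
c = "Acme FL Construction Corp."

print(a == b)                                    # False: raw strings differ
print(normalize_name(a) == normalize_name(b))    # True: same name after cleanup
print(normalize_name(a) == normalize_name(c))    # False: cleanup alone cannot link these
```

The third comparison is the important one: no amount of string cleanup connects two subsidiaries whose names genuinely differ, which is why signals beyond the name are needed.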

How ML entity resolution works

Machine learning entity resolution — also called record linkage or entity matching — addresses these problems by learning to identify when two records in a dataset refer to the same real-world entity, based on multiple signals rather than any single identifier.

In the construction-record context, the signals that matter include:

  • Name similarity. Measured not as exact match but as probabilistic similarity: detecting that “[Builder] of Hillsborough County LLC” and “[Builder] Hillsborough, LLC” are likely the same entity even though the two strings are not identical.
  • Registered agent overlap. Subsidiaries of the same parent company frequently share a registered agent. This is a strong signal that two differently-named entities belong to the same corporate family.
  • Principal address clustering. Subsidiary offices of the same builder often share a principal office address. Address similarity — normalized for formatting variation — provides another independent signal.
  • License relationship. DBPR license records sometimes document qualifying agent overlap — the licensed individual whose credentials are associated with multiple entities — providing another link.
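  • Permit co-occurrence. Entities that appear together on the same permit (contractor and subcontractor) repeatedly across different projects may have an organizational relationship. This is a weaker signal than the others, since unrelated contractors and subcontractors also work together repeatedly.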
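
As a rough illustration of how several of these signals might be computed for one candidate pair, the sketch below uses only the Python standard library. The record fields, the sample values, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, not a description of any particular production pipeline. Permit co-occurrence is omitted because it needs permit-level data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class EntityRecord:
    # Hypothetical fields; real licensing and registry schemas will differ.
    name: str
    registered_agent: str
    principal_address: str
    qualifying_agent: str

def _norm(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())

def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()

def pair_features(x: EntityRecord, y: EntityRecord) -> dict:
    """Compare two records signal by signal."""
    return {
        "name_similarity": name_similarity(x.name, y.name),
        "same_registered_agent": _norm(x.registered_agent) == _norm(y.registered_agent),
        "same_principal_address": _norm(x.principal_address) == _norm(y.principal_address),
        "same_qualifying_agent": _norm(x.qualifying_agent) == _norm(y.qualifying_agent),
    }

# Invented sample records for illustration only.
x = EntityRecord("Acme Homes of Hillsborough County LLC", "Jane Q. Example",
                 "100 Main St., Suite 200, Tampa, FL", "John Doe")
y = EntityRecord("Acme Homes Hillsborough, LLC", "JANE Q EXAMPLE",
                 "100 Main St Suite 200 Tampa FL", "John Doe")

features = pair_features(x, y)
print(features)
```

Real address normalization is considerably harder than this (unit numbers, abbreviations, PO boxes), so the exact-match comparison after cleanup should be read as a placeholder for a proper address parser.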

A well-designed entity resolution model weighs these signals probabilistically, using a framework that assigns match confidence scores to candidate pairs and clusters high-confidence matches into unified entities. The result is not a binary “same or different” decision but a calibrated probability, allowing downstream analysis to apply appropriate confidence thresholds for different use cases.
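
One common way to combine signals probabilistically is a Fellegi-Sunter-style model, sketched below. The source does not say which framework is used, so this is an assumed stand-in. The m and u probabilities and the prior are invented numbers, and the model treats signals as independent, which real data will violate to some degree.

```python
import math

# Hypothetical per-signal probabilities:
#   m = P(signal agrees | same economic actor)
#   u = P(signal agrees | different economic actors)
M_U = {
    "name_similar": (0.90, 0.05),
    "same_registered_agent": (0.80, 0.02),
    "same_principal_address": (0.75, 0.01),
}
PRIOR_MATCH = 0.001  # assumed share of candidate pairs that are true matches

def match_probability(agreements: dict, prior: float = PRIOR_MATCH) -> float:
    """Combine agree/disagree evidence into a posterior match probability."""
    log_odds = math.log(prior / (1.0 - prior))
    for signal, agrees in agreements.items():
        m, u = M_U[signal]
        if agrees:
            log_odds += math.log(m / u)                   # evidence for a match
        else:
            log_odds += math.log((1.0 - m) / (1.0 - u))   # evidence against
    return 1.0 / (1.0 + math.exp(-log_odds))

all_agree = match_probability({
    "name_similar": True,
    "same_registered_agent": True,
    "same_principal_address": True,
})
name_only = match_probability({
    "name_similar": True,
    "same_registered_agent": False,
    "same_principal_address": False,
})
print(f"all signals agree: {all_agree:.3f}")
print(f"name alone agrees: {name_only:.5f}")
```

Under these assumed numbers, a similar name with no other corroboration yields a very low match probability, while agreement on all three signals yields a high one. Whether such scores are actually calibrated would have to be checked against held-out labeled pairs.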

Entity resolution is the technical foundation of builder-portfolio intelligence. Without it, you are analyzing subsidiaries. With it, you are analyzing builders.

What entity resolution makes possible

Once entity resolution is applied and builder portfolios are constructed, the analysis that was previously blocked by fragmentation becomes tractable:

  • Portfolio-level defect patterns. Rather than asking how one subsidiary has performed in Hillsborough County, you can ask how a builder has performed across its entire Florida footprint — which subsidiary entities, which project types, which jurisdictions, over what time periods.
  • Repeat-defendant identification. Linking subsidiary entities to parent portfolios reveals which economic actors are repeatedly appearing in defect claims under different legal names — a pattern that may not be visible if each subsidiary is treated as an independent defendant.
  • Regulatory and litigation correlation. DBPR complaints, permit history, and litigation records can be correlated across the full portfolio rather than within the narrow slice visible for any individual subsidiary.
  • Market benchmarking. Comparing builders on a portfolio basis — normalized for total project volume and jurisdiction mix — requires first constructing the portfolio, which entity resolution enables.
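
A minimal sketch of the portfolio roll-up behind these analyses, assuming a resolved entity-to-cluster mapping already exists. All names and counts are invented, and the normalization covers project volume only, not jurisdiction mix.

```python
from collections import defaultdict

# Hypothetical output of entity resolution: legal entity -> builder cluster.
entity_to_cluster = {
    "Acme Homes of Hillsborough County, LLC": "acme",
    "Acme Homes of Manatee, LLC": "acme",
    "Birch Builders Inc.": "birch",
}

# Hypothetical per-entity counts: (permits, complaints).
entity_counts = {
    "Acme Homes of Hillsborough County, LLC": (400, 6),
    "Acme Homes of Manatee, LLC": (100, 4),
    "Birch Builders Inc.": (200, 2),
}

def portfolio_rates(mapping: dict, counts: dict) -> dict:
    """Complaints per 100 permits, aggregated to the builder-portfolio level."""
    totals = defaultdict(lambda: [0, 0])
    for entity, (permits, complaints) in counts.items():
        cluster = mapping[entity]
        totals[cluster][0] += permits
        totals[cluster][1] += complaints
    return {
        cluster: 100.0 * complaints / permits
        for cluster, (permits, complaints) in totals.items()
        if permits > 0
    }

rates = portfolio_rates(entity_to_cluster, entity_counts)
print(rates)
```

Viewed entity by entity, the two hypothetical Acme subsidiaries look quite different (1.5 and 4.0 complaints per 100 permits); only the roll-up gives a single figure that can be compared against another builder.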

The technical implementation at DAIS

DAIS Analytics’s entity resolution layer applies probabilistic record linkage to Florida’s DBPR contractor licensing database, combined with permit records and corporate registration data. The model is trained on known matches — entities that are documented as related through corporate filings — and applies those learned weights to candidate pairs across the full population of licensed contractors.
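
To illustrate what "learning weights from known matches" can mean in the simplest case, the sketch below estimates per-signal log-likelihood-ratio weights from labeled pairs. It is an assumed, simplified stand-in, not the DAIS implementation. The labeled pairs are invented, and the example additionally assumes a set of known non-matches, which the description above does not mention.

```python
import math

def estimate_weights(labeled_pairs, signals, smoothing=1.0):
    """Estimate log2(m/u) per signal from (agreements, is_match) pairs.

    m = share of known matches on which the signal agrees
    u = share of known non-matches on which the signal agrees
    Additive smoothing keeps both estimates away from 0 and 1.
    """
    n_match = sum(1 for _, is_match in labeled_pairs if is_match)
    n_non = len(labeled_pairs) - n_match
    weights = {}
    for s in signals:
        agree_match = sum(1 for a, is_m in labeled_pairs if is_m and a[s])
        agree_non = sum(1 for a, is_m in labeled_pairs if not is_m and a[s])
        m = (agree_match + smoothing) / (n_match + 2 * smoothing)
        u = (agree_non + smoothing) / (n_non + 2 * smoothing)
        weights[s] = math.log2(m / u)
    return weights

signals = ["name_similar", "same_registered_agent"]

# Invented labeled pairs: four known matches, four known non-matches.
labeled_pairs = [
    ({"name_similar": True, "same_registered_agent": True}, True),
    ({"name_similar": True, "same_registered_agent": True}, True),
    ({"name_similar": True, "same_registered_agent": True}, True),
    ({"name_similar": True, "same_registered_agent": False}, True),
    ({"name_similar": True, "same_registered_agent": False}, False),
    ({"name_similar": True, "same_registered_agent": False}, False),
    ({"name_similar": False, "same_registered_agent": False}, False),
    ({"name_similar": False, "same_registered_agent": False}, False),
]

weights = estimate_weights(labeled_pairs, signals)
print(weights)
```

In this toy data, similar names also occur among non-matches, so the name signal earns a smaller weight than registered-agent agreement, which never occurs among the non-matches.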

The output is an entity cluster database: groups of licensed contractor entities that the model has determined, with high confidence, to represent the same economic actor. Those clusters become the unit of analysis for all downstream intelligence: permit activity, DBPR complaint history, and litigation docket patterns are all analyzed at the cluster level, not the individual entity level.
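
Turning scored pairs into clusters can be sketched with a union-find structure that links every pair above a confidence threshold. The entity names, scores, and the 0.9 threshold are hypothetical, and connected-components clustering is only one of several possible choices.

```python
def cluster_entities(entities, scored_pairs, threshold=0.9):
    """Group entities connected by pairs scoring at or above the threshold."""
    parent = {e: e for e in entities}

    def find(e):
        # Walk to the root, shortening the path along the way.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b, prob in scored_pairs:
        if prob >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for e in entities:
        clusters.setdefault(find(e), []).append(e)
    return sorted(sorted(members) for members in clusters.values())

# Invented entities and match probabilities for illustration only.
entities = [
    "Acme Hillsborough LLC",
    "Acme Manatee LLC",
    "Acme FL Construction Corp",
    "Birch Builders Inc",
]
scored_pairs = [
    ("Acme Hillsborough LLC", "Acme Manatee LLC", 0.97),
    ("Acme Manatee LLC", "Acme FL Construction Corp", 0.93),
    ("Acme FL Construction Corp", "Birch Builders Inc", 0.40),
]

clusters = cluster_entities(entities, scored_pairs)
print(clusters)
```

Because linkage is transitive here, one spurious high-scoring pair can merge two unrelated portfolios, so the threshold choice and some review of unusually large clusters matter in practice.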

For more detail on the data sources and analytic methodology, see the Methodology page and Data Sources overview.
