Cloud Data Connect (#2)
* Add function for cloud connect
* Add API for CSV MatchReport
* Add custom user-agent to API request headers
* Add repository url to package.json
* Update publish script
* Update npmignore
* Update package version
* Update JSDoc
* Update imports/exports
* Update README.md
* CSV & TSV MatchKey reports
  - Add support for TSV.
  - Add request validation.
  - Add support for 'responseFormat' of json, html, or text.
  - Add unit tests.
* CloudDatabaseMatchKeyReports
  - Implement CloudDatabaseMatchKeyReport API
  - Request validation.
  - Rename "CSV..." classes to "DelimitedFile" to make it more clear it supports both CSV and TSV
  - Refactoring.




---------

Co-authored-by: Interzoid <[email protected]>
dvause and interzoid authored Oct 16, 2023
1 parent 3ca94fb commit 437c750
Showing 53 changed files with 3,063 additions and 730 deletions.
247 changes: 233 additions & 14 deletions README.md
@@ -1,6 +1,8 @@
# Interzoid Data Matching Node.js SDK

This is a Node.js SDK for Interzoid's Generative-AI powered data matching, data quality, data cleansing, and data normalization for organization and individual name data. Functions include the generation of similarity keys for identifying and matching inconsistent name data, as well as comparing and scoring data for matching purposes.
**Version: 1.1.0**

This is a Node.js SDK for Interzoid's Generative-AI powered data matching, data quality, data cleansing, and data normalization for organization and individual name data. Functions include the generation of similarity keys (also called match keys) for identifying and matching inconsistent name data, as well as comparing and scoring data for matching purposes. The concept is that the same similarity key will be algorithmically generated for different permutations of the same data content, such as GE, Gen Elec, General Electric all generating the same similarity key. Then, these similarity keys can be used as the basis of matching data, identifying duplicates, and resolving inconsistencies that can otherwise degrade the usefulness and value of data-driven applications, processes, or anything else that makes use of data. These similarity keys form the basis of many of the different functions available in the SDK that make use of Generative AI, Machine Learning, specialized algorithms, and extensive knowledge bases - all in the Cloud - to provide its results. These include functions that generate similarity keys for custom use, functions that score matches for certain use cases, and functions that process and perform matching functions with entire database tables and datasets.

#### Table of Contents
1. [API Key](#api-key)
@@ -13,8 +15,16 @@ This is a Node.js SDK for Interzoid's Generative-AI powered data matching, data
2. [Match Score Functions](#match-score-functions)
1. [Full Name Match Score](#full-name-match-score)
2. [Organization Name Match Score](#organization-name-match-score)
3. [Interzoid Account Information (Remaining Credits)](#account-information)

4. [Interzoid Cloud Data Connect](#cloud-data-connect)
1. [Introduction](#introduction)
2. [Matching Process](#matching-process)
3. [Sources](#source)
4. [Processing Categories](#category)
5. [Connection Strings](#connection-strings)
6. [Match and write keys to a new cloud database table](#match-and-write-results-to-a-new-table)
7. [Match Key Report for a cloud database table](#match-key-report-for-a-cloud-database-table)
8. [Text File Match Key Report](#text-file-match-key-report)
5. [Interzoid Account Information (Remaining Credits)](#account-information)
---

## API Key
@@ -33,6 +43,7 @@ npm install @interzoid/data-matching
---

## Data Matching APIs

Interzoid uses algorithmically generated similarity keys leveraging Generative AI, Large Language Models (LLMs), Machine Learning, specialized algorithms, and extensive knowledge bases to intelligently match data within or across data sources. Match rates can increase significantly when similarity keys are used with important data.

To learn more about the technology behind these APIs and to better understand how to make use of similarity keys, please visit https://docs.interzoid.com/entries/understanding-data-matching
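
Once similarity keys exist for a set of records, finding likely duplicates reduces to grouping on the key. The following is a minimal sketch of that idea, not part of the SDK: the `KeyedRecord` shape, the `findDuplicateGroups` helper, and the placeholder key values are all assumptions made for illustration. Real keys are opaque strings returned by the API.

```typescript
// Hypothetical record shape: `simKey` holds a similarity key that was
// already generated for `name`.
interface KeyedRecord {
  id: number;
  name: string;
  simKey: string;
}

// Group records by similarity key and keep only groups with more than one
// member, since those are the candidate duplicates.
function findDuplicateGroups(records: KeyedRecord[]): KeyedRecord[][] {
  const groups: Record<string, KeyedRecord[]> = Object.create(null);
  for (const record of records) {
    if (!groups[record.simKey]) {
      groups[record.simKey] = [];
    }
    groups[record.simKey].push(record);
  }
  return Object.keys(groups)
    .map((key) => groups[key])
    .filter((group) => group.length > 1);
}

// Placeholder keys for illustration only.
const sampleRecords: KeyedRecord[] = [
  { id: 1, name: 'GE', simKey: 'key-A' },
  { id: 2, name: 'Gen Elec', simKey: 'key-A' },
  { id: 3, name: 'General Electric', simKey: 'key-A' },
  { id: 4, name: 'Microsoft', simKey: 'key-B' },
];

const duplicateGroups = findDuplicateGroups(sampleRecords);
console.log(duplicateGroups.map((group) => group.map((r) => r.name)));
```

The same grouping works across datasets: generate keys for both sides, then join on the key instead of on the raw name.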
@@ -43,7 +54,7 @@ To learn more about the technology behind these APIs and to better understand ho
This API provides a hashed similarity key from the input data used to match with other similar full name data. Use the generated similarity key, rather than the actual data itself, to match and/or sort individual name data by similarity as similar individual names will generate the same similarity key. This avoids the problems of data inconsistency, misspellings, and name variations when matching within a single dataset, and can also help matching across datasets or for more advanced searching.

```typescript
import { getFullNameMatchKey } from 'interzoid';
import { getFullNameMatchKey } from '@interzoid/data-matching';

async function fullNameMatch() {
  const result = await getFullNameMatchKey({apiKey: 'your-interzoid-api-key', fullName: 'John Smith'});
@@ -75,7 +86,7 @@ The optional `algorithm` parameter provides multiple matching algorithms:
- The default value for the optional `algorithm` parameter is `wide`.

```typescript
import { getCompanyNameMatchKey } from 'interzoid';
import { getCompanyNameMatchKey } from '@interzoid/data-matching';

async function companyNameMatch() {
  const result = await getCompanyNameMatchKey({apiKey: 'your-interzoid-api-key', company: 'Microsoft', algorithm: 'medium'});
@@ -94,15 +105,16 @@ async function companyNameMatch() {
---

#### Address Match Key

This API provides a hashed similarity key from the input data used to match with other similar address data. Use the generated similarity key, rather than the actual data itself, to match and/or sort address data by similarity, as similar addresses will generate the same similarity key. This avoids the problems of data inconsistency, misspellings, and address element variations when matching either within a single dataset or across datasets. It also provides for broader searching capabilities.

You can choose from two matching algorithms, `wide` and `narrow`.
You can choose from two matching algorithms, `wide` and `narrow`.
- `narrow` considers a unit number (suite, apartment, unit, etc.) when generating similarity keys. This ensures individual units are identified separately when comparing generated keys.
- `wide` parameter will not consider the unit numbers, generating matching similarity keys based on the primary address only.
- The default value for the optional `algorithm` parameter is `narrow`.

```typescript
import { getAddressMatchKey } from 'interzoid';
import { getAddressMatchKey } from '@interzoid/data-matching';

async function addressMatch() {
  const result = await getAddressMatchKey({apiKey: 'your-interzoid-api-key', address: '500 main street', algorithm: 'narrow'});
@@ -130,7 +142,7 @@ We provide two operations for match scoring: Organization name and Full name. Th
This API provides a match score (likelihood of matching) between two individual names on a scale of 0-100, where 100 is the highest possible match.
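
A 0-100 score usually needs to be turned into a decision. The sketch below shows one way to do that with two cut-offs. The `classifyScore` helper and its default thresholds (90 and 70) are illustrative assumptions, not values recommended by Interzoid, and the helper accepts either a number or a numeric string because the exact type of the score field is not stated here.

```typescript
type MatchDecision = 'match' | 'review' | 'no-match';

// Map a 0-100 match score to a decision.
// The default thresholds are assumptions; tune them against your own data.
function classifyScore(
  score: number | string,
  matchAt: number = 90,
  reviewAt: number = 70
): MatchDecision {
  if (typeof score === 'string' && score.trim() === '') {
    throw new RangeError('Expected a score between 0 and 100, got an empty string');
  }
  const value = typeof score === 'string' ? Number(score) : score;
  if (!isFinite(value) || value < 0 || value > 100) {
    throw new RangeError(`Expected a score between 0 and 100, got: ${score}`);
  }
  if (value >= matchAt) return 'match';
  if (value >= reviewAt) return 'review';
  return 'no-match';
}

console.log(classifyScore(95));   // match
console.log(classifyScore('75')); // review
console.log(classifyScore(10));   // no-match
```

Keeping a middle "review" band lets borderline pairs go to a person instead of being merged or discarded automatically.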

```typescript
import { getFullNameMatchScore } from 'interzoid';
import { getFullNameMatchScore } from '@interzoid/data-matching';

async function fullNameMatchScore() {
  const result = await getFullNameMatchScore({apiKey: 'your-interzoid-api-key', value1: 'John Smith', value2: 'John Smyth'});
@@ -150,10 +162,10 @@ async function fullNameMatchScore() {
---

#### Organization Name Match Score
This API provides a match score (likelihood of matching) from 0-100 between two organization names.
This API provides a match score (likelihood of matching) ranging from 0 to 100 between two organization names.

```typescript
import { getOrganizationMatchScore } from 'interzoid';
import { getOrganizationMatchScore } from '@interzoid/data-matching';

async function organizationNameMatchScore() {
  const result = await getOrganizationMatchScore({apiKey: 'your-interzoid-api-key', value1: 'Apple', value2: 'Apple Inc.'});
@@ -172,26 +184,233 @@ async function organizationNameMatchScore() {

---

#### Account Information
## Cloud Data Connect

### Introduction

Interzoid's Cloud Data Connect is a set of functions that apply Interzoid's data matching algorithms to data in your cloud database or in a delimited text file such as CSV or TSV.


### Matching Process

The `process` parameter determines the type of matching process to run. The package provides an `enum` called [`Process`](src/interfaces/Process.ts) that contains the available options.

| Process | Description |
|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Process.MATCH_REPORT` | Generate a report of all found clusters of similar data that share the same generated similarity key. |
| `Process.CREATE_TABLE` | Creates a new table in the source database with all the similarity keys for each record in the source table, so they can be used for additional queries. |
| `Process.GEN_SQL`      | Generate the SQL INSERT statements that store the similarity keys in a database, so they can be reviewed before execution.                              |
| `Process.KEYS_ONLY` | Output a generated similarity key for every record in the dataset. |


### Source

The `source` parameter determines the type of data source containing the data you are performing matching functions with. The package provides an `enum` called [`Source`](src/interfaces/Source.ts) that contains the available options. Some commonly used examples are:

| Source | Description |
|---------------------|--------------------------------------|
| `Source.MYSQL` | Match data in a MySQL database. |
| `Source.POSTGRES` | Match data in a PostgreSQL database. |
| `Source.MARIADB` | Match data in a MariaDB database. |
| `Source.DATABRICKS` | Match data in a Databricks table. |
| `Source.CSV` | Match data in a CSV file. |

Please see the [source code](src/interfaces/Source.ts) for a complete list of available options.


### Category

The `category` parameter determines the type of data you're matching. The package provides an `enum` called [`Category`](src/interfaces/Category.ts) that contains the available options.

| Category | Description |
|-----------------------|-------------------------|
| `Category.COMPANY` | Match company names. |
| `Category.INDIVIDUAL` | Match individual names. |
| `Category.ADDRESS` | Match addresses. |

### Connection Strings

The `connection` parameter is a connection string for your database. The format of the connection string depends on the database you're connecting to.

Please see [this page](https://connect.interzoid.com/connection-strings) for examples of connection strings for various databases.
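
The MySQL examples in this README use a connection string of the shape `db_user:db_password@tcp(db_host)/database`. As a minimal sketch, assuming that shape, a small helper can assemble the string from its parts so that credentials are not hard-coded. The `MySqlConnectionParts` interface and `buildMySqlConnection` function are hypothetical and not part of the SDK; other databases use different formats, so check the linked page before adapting this.

```typescript
// Hypothetical helper type, not part of the SDK.
interface MySqlConnectionParts {
  user: string;
  password: string;
  host: string;
  database: string;
}

// Assemble a connection string in the shape used by the MySQL examples
// in this README: user:password@tcp(host)/database
function buildMySqlConnection(parts: MySqlConnectionParts): string {
  const fields: Array<keyof MySqlConnectionParts> = ['user', 'password', 'host', 'database'];
  for (const field of fields) {
    if (parts[field].trim() === '') {
      throw new Error(`Connection part "${field}" must not be empty`);
    }
  }
  return `${parts.user}:${parts.password}@tcp(${parts.host})/${parts.database}`;
}

const mysqlConnection = buildMySqlConnection({
  user: 'db_user',
  password: 'db_password',
  host: 'db_host',
  database: 'database',
});
console.log(mysqlConnection); // db_user:db_password@tcp(db_host)/database
```

In practice the parts would typically come from environment variables or a secrets manager rather than literals.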

### Match and write results to a new table

Set the `process` parameter to `CREATE_TABLE` to create a new table in your database with the match keys. The `newTable` parameter is the name of the new table to create. This table will be created by the process, and will contain the original data and the similarity key.

**Do not create the table manually; the process will handle the creation.**

You'll have to grant the user you're connecting with the ability to create a new table in the database in addition to the ability to read from the table you're matching.

```typescript
import { getCloudDatabaseMatchKeyReport, Process, Category, Source } from '@interzoid/data-matching';

async function databaseMatchKeyReport() {
  const result = await getCloudDatabaseMatchKeyReport({
    apiKey: 'your-interzoid-api-key',
    process: Process.CREATE_TABLE,
    category: Category.COMPANY,
    source: Source.MYSQL,
    connection: 'db_user:db_password@tcp(db_host)/database',
    table: 'companies', // table to match
    column: 'companyname', // column to match
    reference: 'id', // optional reference column
    newTable: 'companies_match_keys' // new table to create
  });
  console.log(result);
}
```

#### Response
```
"Creating new table...Table companies_match_keys created successfully."
```

---

### Match Key Report for a cloud database table

#### Response options

* Set `json` to `true` to return a JSON object with arrays of match clusters.
* Set `html` to `true` to return results in plain text with clusters separated by HTML `<br>` tags.
* Leave both unset to return results in plain text with clusters separated by newlines.

```typescript
import { getCloudDatabaseMatchKeyReport, Source, Process, Category } from '@interzoid/data-matching';

async function databaseMatchKeyReport() {
  const result = await getCloudDatabaseMatchKeyReport({
    apiKey: 'your-interzoid-api-key',
    process: Process.MATCH_REPORT,
    category: Category.COMPANY,
    source: Source.MYSQL,
    connection: 'db_user:db_password@tcp(db_host)/database',
    table: 'companies',
    column: 'companyname',
    reference: 'id',
    json: true,
  });
  console.log(JSON.stringify(result, null, 2));
}
```

#### Sample Response

```json
{
  "Status": "success",
  "Message": "",
  "MatchClusters": [
    [
      {
        "Data": "Cisco",
        "Reference": "",
        "SimKey": "3AmCGk2yvEJ7XUxUmB3dFHxRiVzy4Squ89J-4_lDrxQ"
      },
      {
        "Data": "Cisco Systems",
        "Reference": "30",
        "SimKey": "3AmCGk2yvEJ7XUxUmB3dFHxRiVzy4Squ89J-4_lDrxQ"
      }
    ],
    [
      {
        "Data": "Netflix",
        "Reference": "15",
        "SimKey": "8c6BY0KP9MYiDezQaKL3bH3iHfDU2wCMMTD9v0EeZJ8"
      },
      {
        "Data": "\"Netflix, Inc.\"",
        "Reference": "34",
        "SimKey": "8c6BY0KP9MYiDezQaKL3bH3iHfDU2wCMMTD9v0EeZJ8"
      }
    ]
  ]
}
```
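
The JSON report can be given a type and reduced to one summary per cluster. The sketch below derives its `MatchRecord` and `MatchReport` interfaces from the sample response above rather than from the SDK's exported types, so treat them as assumptions. Treating any `Status` other than `"success"` as a failure is also an assumption, since other status values are not documented here.

```typescript
// Shapes inferred from the sample response; the SDK may export its own types.
interface MatchRecord {
  Data: string;
  Reference: string;
  SimKey: string;
}

interface MatchReport {
  Status: string;
  Message: string;
  MatchClusters: MatchRecord[][];
}

interface ClusterSummary {
  simKey: string;
  names: string[];
  references: string[]; // empty Reference values are dropped
}

// Reduce each cluster to its shared key, its name variants, and its references.
function summarizeClusters(report: MatchReport): ClusterSummary[] {
  if (report.Status !== 'success') {
    throw new Error(`Match report failed: ${report.Message}`);
  }
  return report.MatchClusters
    .filter((cluster) => cluster.length > 0)
    .map((cluster) => ({
      simKey: cluster[0].SimKey,
      names: cluster.map((record) => record.Data),
      references: cluster
        .map((record) => record.Reference)
        .filter((reference) => reference !== ''),
    }));
}

// First cluster of the sample response above.
const sampleReport: MatchReport = {
  Status: 'success',
  Message: '',
  MatchClusters: [
    [
      { Data: 'Cisco', Reference: '', SimKey: '3AmCGk2yvEJ7XUxUmB3dFHxRiVzy4Squ89J-4_lDrxQ' },
      { Data: 'Cisco Systems', Reference: '30', SimKey: '3AmCGk2yvEJ7XUxUmB3dFHxRiVzy4Squ89J-4_lDrxQ' },
    ],
  ],
};

const clusterSummaries = summarizeClusters(sampleReport);
console.log(JSON.stringify(clusterSummaries, null, 2));
```

When a `reference` column such as a primary key is supplied, the `references` array gives the row identifiers to review or merge.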

---

### Text File Match Key Report

Provide a URL to a delimited file (CSV or TSV) and the API will return a match key report for the data in the file.

```typescript
import { getDelimitedFileMatchKeyReport, Process, Source, Category } from '@interzoid/data-matching';

async function csvFileMatchReport() {
  const result = await getDelimitedFileMatchKeyReport({
    apiKey: 'your-interzoid-api-key',
    process: Process.MATCH_REPORT,
    category: Category.COMPANY,
    source: Source.CSV,
    table: Source.CSV,
    connection: 'https://dl.interzoid.com/csv/companies.csv',
    column: '1', // column number to match
    json: true,
  });
  console.log(JSON.stringify(result, null, 2));
}
```

#### Result

```json
{
  "Status": "success",
  "Message": "",
  "MatchClusters": [
    [
      {
        "Data": "Good Year Tire & Rubber",
        "Reference": "",
        "SimKey": "140xAiUxvDysV56LZzogzDwLuYLd2U7E5sVAXd1nKd8"
      },
      {
        "Data": "Goodyear Tire Inc",
        "Reference": "Transportaions",
        "SimKey": "140xAiUxvDysV56LZzogzDwLuYLd2U7E5sVAXd1nKd8"
      }
    ],
    [
      {
        "Data": "Pederson Tooling Inc.",
        "Reference": "Transportaions",
        "SimKey": "7oOMieCdoyxjt7_oKbE2xGngnZGdG75CFU5pEfhU5z8"
      },
      {
        "Data": "Peterson Tools",
        "Reference": "Services",
        "SimKey": "7oOMieCdoyxjt7_oKbE2xGngnZGdG75CFU5pEfhU5z8"
      }
    ]
  ]
}
```
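
For review in a spreadsheet, the nested clusters can be flattened back into delimited rows. This is a rough sketch under the assumption that the result has the shape shown above. `clustersToTsv` and `cleanCell` are hypothetical helpers, and the column names in the header row are made up for this example.

```typescript
// Record shape taken from the sample result above.
interface MatchRecord {
  Data: string;
  Reference: string;
  SimKey: string;
}

// Tabs and line breaks inside a value would corrupt the TSV layout,
// so replace them with a single space.
function cleanCell(value: string): string {
  return value.replace(/[\t\r\n]+/g, ' ');
}

// Emit one row per record, prefixed with a 1-based cluster number.
function clustersToTsv(clusters: MatchRecord[][]): string {
  const lines: string[] = ['cluster\tdata\treference\tsimkey'];
  clusters.forEach((cluster, index) => {
    for (const record of cluster) {
      lines.push(
        [String(index + 1), cleanCell(record.Data), cleanCell(record.Reference), record.SimKey].join('\t')
      );
    }
  });
  return lines.join('\n');
}

// Values copied verbatim from the sample result above.
const tsvOutput = clustersToTsv([
  [
    { Data: 'Good Year Tire & Rubber', Reference: '', SimKey: '140xAiUxvDysV56LZzogzDwLuYLd2U7E5sVAXd1nKd8' },
    { Data: 'Goodyear Tire Inc', Reference: 'Transportaions', SimKey: '140xAiUxvDysV56LZzogzDwLuYLd2U7E5sVAXd1nKd8' },
  ],
  [
    { Data: 'Pederson Tooling Inc.', Reference: 'Transportaions', SimKey: '7oOMieCdoyxjt7_oKbE2xGngnZGdG75CFU5pEfhU5z8' },
    { Data: 'Peterson Tools', Reference: 'Services', SimKey: '7oOMieCdoyxjt7_oKbE2xGngnZGdG75CFU5pEfhU5z8' },
  ],
]);
console.log(tsvOutput);
```

Sorting or filtering on the `cluster` column then keeps the name variants of each group together.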

---

## Account Information

This API retrieves the current amount of remaining purchased (or trial) credits for a license key.

Using this function does **not** deduct credits from your account.

```typescript
import { getRemainingCredits } from 'interzoid';
import { getRemainingCredits } from '@interzoid/data-matching';

async function remainingCredits() {
  const result = await getRemainingCredits({apiKey: 'your-interzoid-api-key'});
  console.log(result);
}
```

##### Result
#### Result
```json
{
  "credits": "9998",
  "code": "Success"
}
```
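
Because `credits` comes back as a string in the result above, it has to be parsed before it can be compared with anything. The following sketch checks whether enough credits remain before a large job is started. The `CreditsResponse` interface is inferred from the sample result, `hasEnoughCredits` is a hypothetical helper, and how many credits a given job consumes is left to the caller.

```typescript
// Shape inferred from the sample result; not taken from the SDK's types.
interface CreditsResponse {
  credits: string;
  code: string;
}

// Return true when the remaining credits cover the caller's estimate.
// Treating any code other than "Success" as an error is an assumption.
function hasEnoughCredits(response: CreditsResponse, required: number): boolean {
  if (response.code !== 'Success') {
    throw new Error(`Credit lookup failed with code: ${response.code}`);
  }
  const remaining = parseInt(response.credits, 10);
  if (isNaN(remaining)) {
    throw new Error(`Could not parse remaining credits: ${response.credits}`);
  }
  return remaining >= required;
}

const sampleCredits: CreditsResponse = { credits: '9998', code: 'Success' };
console.log(hasEnoughCredits(sampleCredits, 5000));  // true
console.log(hasEnoughCredits(sampleCredits, 10000)); // false
```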

21 changes: 21 additions & 0 deletions docs/assets/highlight.css
@@ -15,6 +15,12 @@
--dark-hl-6: #4FC1FF;
--light-hl-7: #0451A5;
--dark-hl-7: #9CDCFE;
--light-hl-8: #008000;
--dark-hl-8: #6A9955;
--light-hl-9: #098658;
--dark-hl-9: #B5CEA8;
--light-hl-10: #EE0000;
--dark-hl-10: #D7BA7D;
--light-code-background: #FFFFFF;
--dark-code-background: #1E1E1E;
}
@@ -28,6 +34,9 @@
--hl-5: var(--light-hl-5);
--hl-6: var(--light-hl-6);
--hl-7: var(--light-hl-7);
--hl-8: var(--light-hl-8);
--hl-9: var(--light-hl-9);
--hl-10: var(--light-hl-10);
--code-background: var(--light-code-background);
} }

@@ -40,6 +49,9 @@
--hl-5: var(--dark-hl-5);
--hl-6: var(--dark-hl-6);
--hl-7: var(--dark-hl-7);
--hl-8: var(--dark-hl-8);
--hl-9: var(--dark-hl-9);
--hl-10: var(--dark-hl-10);
--code-background: var(--dark-code-background);
} }

@@ -52,6 +64,9 @@
--hl-5: var(--light-hl-5);
--hl-6: var(--light-hl-6);
--hl-7: var(--light-hl-7);
--hl-8: var(--light-hl-8);
--hl-9: var(--light-hl-9);
--hl-10: var(--light-hl-10);
--code-background: var(--light-code-background);
}

@@ -64,6 +79,9 @@
--hl-5: var(--dark-hl-5);
--hl-6: var(--dark-hl-6);
--hl-7: var(--dark-hl-7);
--hl-8: var(--dark-hl-8);
--hl-9: var(--dark-hl-9);
--hl-10: var(--dark-hl-10);
--code-background: var(--dark-code-background);
}

@@ -75,4 +93,7 @@
.hl-5 { color: var(--hl-5); }
.hl-6 { color: var(--hl-6); }
.hl-7 { color: var(--hl-7); }
.hl-8 { color: var(--hl-8); }
.hl-9 { color: var(--hl-9); }
.hl-10 { color: var(--hl-10); }
pre, code { background: var(--code-background); }
2 changes: 1 addition & 1 deletion docs/assets/navigation.js
