Use Case

Positive impact on business processes with correct data matching

Many companies face the problem of data matching and deduplication. For example, identifying customers who have multiple accounts can be critical for the risk and fraud departments of a financial institution. Similarly, combining internal data with external sources can be the basis for developing new data products, and it is crucial for improving analytics and data science projects. If done correctly, data matching unlocks great potential and can have a positive impact on all downstream processes.

Scigility has implemented deduplication and data matching solutions for customers in several industries, using advanced ML techniques that scale to large datasets.

Challenge

Data matching and deduplication require careful planning and business understanding. When implementing these use cases, we at Scigility consider the following when applying our frameworks:

  • Plan the data engineering steps. How much data is there? How often is it updated? Does the data require cleaning and pre-processing?
  • Understand what the business considers a “match.” Is a labeled dataset available? What is the cost of collecting reliable labeled data?
  • Clarify how the output of the application is used. Is it more important to match as many records as possible, or to match only clear duplicates and avoid edge cases?

Data matching can have a considerable impact on many parts of a business. It is crucial to involve stakeholders throughout the development process and to design a pipeline that can scale and deliver reliable data.
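The last question in the list above, whether to match as many records as possible or only clear duplicates, often comes down to a similarity threshold. The following is a minimal sketch of that trade-off, not Scigility's implementation: the records are invented, `similarity` and `find_matches` are hypothetical helper names, and the standard-library `difflib.SequenceMatcher` stands in for whatever scoring model a real project would use.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records, invented for illustration only.
RECORDS = {
    1: "Anna Meier, Bahnhofstrasse 1, Zurich",
    2: "Anna Meier, Bahnhofstr. 1, Zürich",
    3: "A. Meier, Bahnhofstrasse 1, Zurich",
    4: "Peter Keller, Seestrasse 20, Zug",
}


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold):
    """Return sorted id pairs whose similarity is at or above the threshold."""
    return sorted(
        (i, j)
        for (i, a), (j, b) in combinations(records.items(), 2)
        if similarity(a, b) >= threshold
    )


# A strict threshold keeps only near-identical records;
# a lenient one also accepts abbreviations and spelling variants.
strict = find_matches(RECORDS, threshold=0.95)
lenient = find_matches(RECORDS, threshold=0.60)

print("strict: ", strict)
print("lenient:", lenient)
```

Every pair returned by the strict run is also returned by the lenient run. Lowering the threshold can only add pairs, which raises coverage at the risk of more false matches. Which side to favor is a business decision, so the threshold is best agreed with the stakeholders who consume the output.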

Solution

Scigility supports and enables data science and data engineering by:

  • Building scalable and distributed data pipelines using solutions such as Apache Spark and Dask.
  • Using a data-centric approach to model development that fits the business needs. Labeling data is a time-consuming but extremely valuable process. Using active learning approaches, we can reduce the time needed to acquire new labels and improve the ML model.
  • Deploying models and handling operations (monitoring, versioning, validation, etc.) with MLOps tools such as MLflow.
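The active learning idea mentioned above can be illustrated with uncertainty sampling, one common strategy among several. This is a minimal sketch under stated assumptions: the scores are invented model outputs, `select_for_labeling` is a hypothetical helper, and the source does not say which active learning strategy is actually used.

```python
def select_for_labeling(scored_pairs, batch_size):
    """Pick the candidate pairs whose match probability is closest to 0.5.

    These are the pairs the model is least sure about, so a human label
    for them is expected to be the most informative.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair for pair, _ in ranked[:batch_size]]


# Hypothetical model outputs: (record id pair, predicted match probability).
scored = [
    (("a1", "a2"), 0.98),
    (("a1", "b7"), 0.52),
    (("c3", "c9"), 0.03),
    (("d4", "e5"), 0.41),
    (("f6", "f8"), 0.75),
]

to_label = select_for_labeling(scored, batch_size=2)
print(to_label)  # [('a1', 'b7'), ('d4', 'e5')]
```

Pairs scored near 0 or 1 are skipped because the model is already confident about them. After the selected pairs are labeled, the model is retrained and the selection is repeated, so labeling effort goes where it is expected to help most.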

Methodology Used

Scigility MLOps & AI Industrialization
Scigility Data Driven Enablement
Scigility Use Cases Accelerator

Technology Used

Spark or Dask for distributed analysis
AzureML, AWS SageMaker, Databricks, MLflow, etc. for MLOps
