Without a secure, wide, steady stream of data encompassing a large sample of transactions and records within the systems, little can be done to train machine learning algorithms. Ideally, we can define data infrastructure in which an ongoing population sample can be analyzed by a continuously improving machine learning model.
Thus, our first tasks are to determine:
Where does the data to be analyzed exist today? Who owns it? Will they share it? Is special clearance needed? What data do you already have? What's being done with it? What's missing?
How do we securely retrieve a very large dataset (gigabytes to terabytes), "clean" it, and leverage it en masse as a statistically significant sample?
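As a minimal sketch of the "clean it at scale" task: processing a large file in chunks means the full dataset never has to fit in memory. The column names (`amount`, `timestamp`) and the cleaning rules here are hypothetical, chosen only for illustration.

```python
import pandas as pd

def clean_chunk(chunk):
    # Coerce types so downstream models see consistent data;
    # unparseable values become NaN/NaT and are dropped.
    chunk["amount"] = pd.to_numeric(chunk["amount"], errors="coerce")
    chunk["timestamp"] = pd.to_datetime(chunk["timestamp"], errors="coerce")
    return chunk.dropna(subset=["amount", "timestamp"])

def load_transactions(path, chunksize=100_000):
    # Stream the file in fixed-size chunks so terabyte-scale data
    # never has to be loaded all at once.
    chunks = pd.read_csv(path, chunksize=chunksize)
    return pd.concat((clean_chunk(c) for c in chunks), ignore_index=True)
```

In practice the same pattern extends to database cursors or object-store streams; the key point is that cleaning happens per chunk, not per file.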
Then, the process generally unfolds like this:
We study the problem and general data. At times, convolutional neural networks and complex systems serve as little more than a Rube Goldberg machine for an otherwise simple problem. Not everything is a nail when you have a hammer to swing. Sometimes less intricate math can work magic to detect simple regressions.
We examine the data for general statistical patterns, using random sampling, distribution analysis, basic Bayesian methods, and so on.
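A sketch of that statistical pass, on synthetic data standing in for real transactions: sample randomly to keep exploration cheap, estimate the distribution from the sample, then flag values that sit far outside it. The amounts and the z-score threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical transaction amounts: mostly routine, a few extreme.
amounts = np.concatenate([rng.normal(100, 15, 10_000), [950.0, 1200.0]])

# A random sample keeps exploratory statistics cheap on huge data.
sample = rng.choice(amounts, size=2_000, replace=False)
mean, std = sample.mean(), sample.std()

# Flag records far outside the sampled distribution (|z| > 4).
z = np.abs(amounts - mean) / std
outliers = amounts[z > 4]
```

Even this crude pass often surfaces the records worth a closer look before any model is trained.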
We train and examine algorithmic models. These can leverage if-then-else procedural rules made available through standard language constructs.
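Such rule-based models can be as plain as a function of hand-written conditions. The fields, countries, and thresholds below are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float
    country: str
    hour: int  # hour of day, 0-23

def is_suspicious(tx: Transaction) -> bool:
    # Hand-written if-then-else rules; thresholds are illustrative.
    if tx.amount > 10_000:                            # unusually large transfer
        return True
    if tx.country not in {"US", "CA"} and tx.hour < 5:
        return True                                   # off-hours foreign activity
    return False
```

Rules like these are transparent and auditable, which makes them a useful baseline to compare learned models against.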
We train and examine supervised models. Beyond general patterns, we analyze using various supervised learning methods for anomaly detection and look for hidden bias. Supervised learning is interesting and relevant only if we have a sample of "known" fraud that we can use for training.
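When a labeled sample of known fraud exists, the supervised step can be sketched with an off-the-shelf classifier. The two features and the synthetic labels here are stand-ins; one real consideration the sketch does carry over is weighting classes, since fraud is usually rare relative to legitimate activity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a labeled sample: legitimate records cluster
# low on both features, known-fraud records cluster high.
legit = rng.normal([50, 1], [10, 0.5], size=(500, 2))
fraud = rng.normal([400, 8], [50, 2], size=(50, 2))
X = np.vstack([legit, fraud])
y = np.array([0] * 500 + [1] * 50)

# class_weight="balanced" compensates for fraud being the rare class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # per-record fraud probability
</ ```

Inspecting which features drive the scores is also where hidden bias tends to show up.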
We train and examine unsupervised models. We analyze using various unsupervised learning methods that expose clusters of data, which we can then examine. Our examination will narrow down self-similar categories within the unstructured data. We then put the shoe on the other foot (so to speak) by examining which groups might be more aligned with our definition of fraud than others. We test this against a null hypothesis, giving us a degree of confidence that what we are seeing can't be attributed to randomness alone.
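A minimal clustering sketch on unlabeled data, assuming only that the features are numeric; the feature values are invented. The smaller cluster is the natural first candidate for human review against our working definition of fraud.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Unlabeled transaction features (e.g. amount, frequency); no fraud labels.
routine = rng.normal([100, 5], [20, 1], size=(300, 2))
unusual = rng.normal([900, 40], [50, 5], size=(20, 2))
X = np.vstack([routine, unusual])

# Expose self-similar groups we can then examine by hand.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)
rare_cluster = np.argmin(sizes)  # the rarer group to inspect first
```

The null-hypothesis check mentioned above would follow, e.g. a permutation test asking whether an equally tight minority cluster arises in shuffled data.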
We train and examine neural models and hybrid models. Seldom is one model a fit for every case. Once we have a clearer understanding of the data, we can apply neural network models to analyze unsupervised outcomes. We can also combine various stages of modeling to isolate anomalies, short-circuiting the detection phases by narrowing down with a mix of modeling methods.
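One hybrid shape this can take, sketched on synthetic data: clustering produces pseudo-labels, and a small neural network then learns that structure, yielding a fast scorer for new records. All data and layer sizes here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)

# Unlabeled features; standardized so the network trains stably.
X = np.vstack([
    rng.normal([100, 5], [20, 1], size=(300, 2)),
    rng.normal([900, 40], [50, 5], size=(30, 2)),
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Stage 1 (unsupervised): clustering narrows the data into groups.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Stage 2 (neural): a small network learns the cluster structure.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, pseudo_labels)
```

Chaining stages this way lets each cheap model narrow the candidate set before a more expensive one runs.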
We present our findings in a research report that outlines the more significant findings in a way that both technical and non-technical audiences can understand. This is typically a "notebook" that is presented in digital format. It contains the actual machine learning code along with the narrative explanation.
We evaluate whether the problem is sufficiently solved. We revise and repeat steps 2-5 (statistical examination through unsupervised modeling) until it is.
We develop an automation strategy. With our best machine learning models in hand, it's not feasible to redo all the work "by hand" every time. Instead, we articulate the infrastructure needs that enable the automation necessary for continuous analysis.
We develop an evolutionary technical strategy. Automated, repeated analysis needs to adapt to the ecosystems being scrutinized. Thus, we strategize on which parts of modeling we can retrain automatically and which need continuous human intervention.
We ship code. Everything that's used in the development of this infrastructure can be packaged and shipped to all the stakeholders.
We evaluate the infrastructure's efficiency. The initial automation must be configured for each stakeholder to operate at an efficient cost relative to returns. It makes sense to target an operating cost acceptable to the problem so the technology is not prohibitively expensive to scale.
We scale. Large-scale solutions that do well typically need to incorporate more data at a higher transaction velocity. To that end, we deploy a commodity solution that we can rapidly replicate to accommodate additional customers who need access to the ML pipeline.
This top-level process should give you context so that you understand why we ask certain questions. Our team is very interested in working with you to tackle a variety of ML and AI problems at scale. As we get our questions answered, we will prepare a formal proposal specific to your team for consideration and contract.
pSilent Partners Ltd is hiring. In addition to members of the professional technical community, we are considering graduates of self-directed programs such as Thinkful's data science program for work in our consultancy.
We're eager to partner with you for the betterment of all Life.