Many small tools for working with big data.

How we helped Data Scientists

published March 3, 2020

Many companies building serious products turn to data-science specialists to keep improving their business. One of our clients is a large US company with dozens of Data Scientists working with big data across several major projects. These employees are costly enough that every minute of their working time is worth its weight in gold.
So who is a Data Scientist? The job requires practical knowledge of statistical data analysis, the skill to build mathematical models (from neural networks to clustering, from factor to correlation analysis), experience with large data arrays, and a knack for finding patterns.

Like many other specialists, the customer's employees need a daily "suitcase" of tools for solving specific problems with the data in front of them. Packing that suitcase from scratch every morning before any real work starts is very time-consuming, so our client needed a ready-made, pre-assembled toolset for all occasions. It had to include all the necessary utilities for code analysis, visualization, data normalization, data mining, and the other processes our team was asked to cover.
Parsing
Parsing is the process of collecting data for subsequent processing and analysis. It is used when there is a large array of information that would be too difficult to handle manually.
Instance
An instance is a virtual machine running in the cloud. In object-oriented programming, an instance is a concrete object created from a class.
One of the utilities had to parse data from various websites for analysis. To develop and test it, we practiced on data from the Microsoft website: the task was to collect update information for all products released since 1999. Selenium running on the client's old parser, together with standard Python tools, was not fast enough, and with that many pages the collection could have taken several months. We needed an instance that could parse several streams of pages at once. In Python, however, the GIL (Global Interpreter Lock) prevents threads from executing CPU-bound code truly in parallel, so we built an implementation that launches several processes instead. To do this, we wrote our own set of workers that provide a simple tool for running scripts as background processes.
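A minimal sketch of this approach, assuming a hypothetical parse_page function and a plain list of URLs (the client's real workers were more elaborate): CPU-bound parsing is spread across separate processes with multiprocessing.Pool, which sidesteps the GIL.

```python
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


def parse_page(url):
    """Download one page and pull out illustrative fields (title, table rows)."""
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "rows": [row.get_text(strip=True) for row in soup.find_all("tr")],
    }


def parse_all(urls, workers=8):
    """Fan the URLs out across worker processes, bypassing the GIL."""
    with Pool(processes=workers) as pool:
        return pool.map(parse_page, urls)


if __name__ == "__main__":
    # Placeholder URL pattern, not the real target pages.
    urls = [f"https://example.com/updates?page={n}" for n in range(1, 101)]
    results = parse_all(urls)
    print(f"Parsed {len(results)} pages")
```

In practice the same pool is wrapped in a background worker, so long-running collection jobs do not block the rest of the toolchain.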

Once the set of utilities and tools for Data Scientists was ready, we started developing a full-fledged platform that lets analysts obtain data for analysis in a minimum number of clicks, spin up all the environments needed to process it, and get access to the projects that need this computing power.
The next exciting phase of work involved DC/OS, Kubernetes, and JupyterHub. In a nutshell, this is software for automating the deployment, scaling, and management of containerized applications.

Previously, all of the customer's data workflows ran on Amazon cloud services. The company then acquired its own data centers, which required building and deploying a new architecture. For this we turned to DC/OS, which is a kind of layered "pie" consisting of:

1. Hardware servers
2. Virtualized environment
3. Container environment

Each subsequent layer is more abstract, less dependent on the hardware, more task-oriented, and more convenient for the user.

The main task was to customize and improve the tools of the virtualized environment. Because the customer's analysts work with vast amounts of data, they need servers with different technical characteristics, so we had to find a solution that could assemble a server to an analyst's requirements. JupyterHub offers something similar, but out of the box it was not convenient enough for our client. We customized it so that the required number of cores, storage space, and RAM can be allocated and reserved on the large servers deployed on Kubernetes through JupyterHub. An analyst can create a server right away with all the tools and utilities needed for a given project, as sketched below.
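The flavor of that customization can be illustrated with JupyterHub's KubeSpawner configuration; the profile names, resource figures, and image below are invented for the example, not the client's actual presets.

```python
# jupyterhub_config.py -- sketch of per-analyst server sizing on Kubernetes.
# The `c` object is provided by JupyterHub's config loader.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

c.KubeSpawner.profile_list = [
    {
        "display_name": "Small: exploratory work",
        "kubespawner_override": {
            "cpu_limit": 2,
            "mem_limit": "8G",
            "storage_capacity": "20Gi",
        },
    },
    {
        "display_name": "Large: heavy model training",
        "kubespawner_override": {
            "cpu_limit": 16,
            "mem_limit": "64G",
            "storage_capacity": "200Gi",
            # Hypothetical image with the pre-assembled toolset baked in.
            "image": "registry.example.com/ds-tools:latest",
        },
    },
]
```

The analyst then picks a profile on the spawn page and gets a server with the assembled toolset already in place.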

When working with data, analysts record everything in Jupyter Notebooks, which need to be stored somewhere convenient and, at the same time, remain available to other analysts. GitLab suits this task best. We deployed it and connected it to the platform through an extension written in Python and jQuery.
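As an illustration of the server-side idea only (the actual integration was the Python/jQuery extension mentioned above), a notebook can be pushed into a shared repository with the python-gitlab library; the URL, project ID, and token here are placeholders.

```python
import gitlab  # python-gitlab library


def push_notebook(notebook_path, repo_path, project_id, token,
                  gitlab_url="https://gitlab.example.com"):
    """Commit a local Jupyter Notebook into a shared GitLab repository."""
    gl = gitlab.Gitlab(gitlab_url, private_token=token)
    project = gl.projects.get(project_id)

    with open(notebook_path, "r", encoding="utf-8") as f:
        content = f.read()

    try:
        # Update the file if it already exists in the repository...
        existing = project.files.get(file_path=repo_path, ref="main")
        existing.content = content
        existing.save(branch="main", commit_message=f"Update {repo_path}")
    except gitlab.exceptions.GitlabGetError:
        # ...otherwise create it.
        project.files.create({
            "file_path": repo_path,
            "branch": "main",
            "content": content,
            "commit_message": f"Add {repo_path}",
        })
```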

As a result, we delivered a ready-made platform that gives analysts easy access to the "suitcase" of tools, the computing power, the environment, and the data under analysis. Now every analyst can work with shared data using our utilities in an isolated environment, and all of it lives in the cloud.
Worked on the article:
Aybek Abdykasymov
Middle Full Stack Developer
Maria Ilchenko
PR and Event Manager