Machine learning is a data-driven approach to artificial intelligence. When designing a machine learning framework, there is no universally best framework, only the framework best suited to its own application scenario. What specific factors should be considered when designing a practical and efficient machine learning framework? Can today's open-source frameworks, collectively valued at over 80 million US dollars, meet the needs of enterprises? To answer these questions, we can start from the hard-won experience of artificial intelligence experts who have stepped into these pitfalls before.
From a machine learning system to a "mature business": these 7 thresholds
Threshold 1: Rapid growth of effective data volume
As more and more data is recorded, machine learning operates in a big-data context, and computing efficiency becomes one of its core issues. Machine learning systems must be scalable to cope effectively with data growth.
Threshold 2: Machine Learning Algorithm - No Free Lunch
No Free Lunch is a well-known theorem in supervised learning: no single machine learning model can solve every problem best. Different target scenarios require different machine learning algorithms, so a machine learning framework also needs to be friendly to algorithm development.
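The No Free Lunch idea can be made concrete with a toy experiment. The sketch below (an illustration of the principle, not anything from GDBT) compares two deliberately simple models, a least-squares line and a 1-nearest-neighbour predictor, on two synthetic datasets: on data with a linear trend the line wins, on wavy data the nearest-neighbour model wins, so neither model dominates.

```python
import math
import random

def fit_linear(pts):
    """Least-squares line y = a*x + b (closed form)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    a = sum((x - mx) * (y - my) for x, y in pts) / sxx
    b = my - a * mx
    return lambda x: a * x + b

def fit_1nn(pts):
    """1-nearest-neighbour: predict the y of the closest training x."""
    return lambda x: min(pts, key=lambda p: abs(p[0] - x))[1]

def holdout_mse(fit, pts, split=0.7):
    """Train on the first 70%, report mean squared error on the rest."""
    k = int(len(pts) * split)
    model = fit(pts[:k])
    test = pts[k:]
    return sum((model(x) - y) ** 2 for x, y in test) / len(test)

random.seed(0)
xs = [random.uniform(0, 10) for _ in range(300)]
linear_pts = [(x, 2 * x + random.gauss(0, 0.5)) for x in xs]       # linear trend
wavy_pts = [(x, math.sin(2 * x) + random.gauss(0, 0.05)) for x in xs]  # no linear trend

for name, pts in [("linear data", linear_pts), ("wavy data", wavy_pts)]:
    mse = {"line": holdout_mse(fit_linear, pts),
           "1nn": holdout_mse(fit_1nn, pts)}
    print(name, "->", min(mse, key=mse.get))
```

The same logic is why a framework must make it cheap to develop and swap in new algorithms rather than hard-code one model family.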
Threshold 3: Scarcity of Data Scientists
AI requires data scientists who understand both algorithms and business problems, but good data scientists are scarce, so machine learning solutions need to be as "intelligent" as possible to reduce dependence on data scientists.
Threshold 4: Differences between machine learning calculations and traditional ETL calculations
1. Computation
Compared with the relatively "simple" operations of ETL, machine learning algorithms perform more complex computation on data; some nonlinear models, for example, require intensive computation. In practice, a framework must therefore account for the characteristics of different computing resources and adjust its computation mode to reduce the communication, synchronization, and disaster recovery overhead introduced by distributed computing.
2. Communication
Many machine learning algorithms frequently read global state or information from other nodes during computation, so their requirements on network throughput and communication latency are much higher than those of ETL tasks. At the same time, many machine learning tasks tolerate weaker consistency than ETL tasks, so relaxed consistency models can be used in system design.
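Why relaxed consistency is acceptable for many learning tasks can be shown with a minimal sketch (an illustration of the general idea, not GDBT's actual protocol): a worker computes gradients against a snapshot of the global parameter that is refreshed only every few steps, mimicking the stale reads a parameter server may allow, and training still converges.

```python
import random

class StaleParamSGD:
    """Sketch: SGD where the worker reads a snapshot of the global
    parameter and refreshes it only every `staleness` steps, mimicking
    the relaxed-consistency reads a parameter server may permit."""
    def __init__(self, lr=0.1, staleness=5):
        self.w_global = 0.0      # "server" copy of the parameter
        self.w_local = 0.0       # possibly stale worker snapshot
        self.lr = lr
        self.staleness = staleness

    def step(self, t, x, y):
        if t % self.staleness == 0:              # occasional synchronization
            self.w_local = self.w_global
        grad = 2 * (self.w_local * x - y) * x    # d/dw of (w*x - y)^2, using stale w
        self.w_global -= self.lr * grad          # server applies the update

random.seed(1)
sgd = StaleParamSGD()
for t in range(2000):
    x = random.uniform(-1, 1)
    sgd.step(t, x, 3.0 * x)                      # true weight is 3.0
print(round(sgd.w_global, 2))
```

Despite gradients being computed against parameters up to five steps old, the estimate still converges to the true weight; an ETL join computed against equally stale inputs would simply be wrong.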
3. Storage
ETL deals with data from a variety of sources and involves few iterations, whereas machine learning algorithms iterate many times and continuously produce and discard large amounts of intermediate data. This places higher demands on storage utilization and access efficiency.
4. Disaster Recovery and Efficiency Tradeoffs
Unlike ETL computing tasks, machine learning computing tasks have relatively complex processes and many intermediate states. Performing disaster recovery at too fine a granularity increases runtime overhead. Therefore, machine learning and ETL tasks strike different balances in both disaster recovery strategy and disaster tolerance granularity.
Threshold 5: Resource diversity
The same machine learning algorithm may run on different resources and in different environments. Therefore, a machine learning system must provide good abstraction and design, shielding differences in the underlying resources so that development and deployment are more convenient.
Threshold 6: The Openness of the System
A machine learning system should be easy to integrate and deploy in real business systems. At the same time, because a variety of ETL platforms generate the data needed for machine learning, the system must be able to openly interface with existing business ETL and decision systems.
Threshold 7: Complexity of Large-Scale Distributed Machine Learning Systems
A large-scale distributed machine learning system involves many components, and its computation logic is complex. Therefore, the clarity of the overall architecture design, the understandability and traceability of the execution process, and the maintainability of the production system are all very important. At the same time, distributed overhead and benefit must be weighed at different data scales.
Today's giant technology companies have released open-source machine learning frameworks, which has greatly lowered the barrier to artificial intelligence research. But can these highly sought-after open-source frameworks really meet the challenges of complex business operations? The answer may not be optimistic. Fundamentally, the most popular computing frameworks such as Hadoop and Spark are mostly focused on ETL computing, and as discussed above, machine learning computation differs from ETL computation in many ways. In addition, algorithm frameworks such as TensorFlow pay more attention to ease of use in research and concentrate on the family of deep neural network algorithms, giving up some efficiency. Meanwhile, some other algorithm frameworks aimed at production applications, especially distributed frameworks, neglect the secondary development of algorithms.
How, then, to design a practical machine learning system?
So what exactly is needed to design a practical machine learning system? Here, we take the general-purpose distributed machine learning framework GDBT from 4Paradigm as an example. Its design goals can be summarized as: efficient, intelligent, easy to develop, easy to deploy, easy to operate, easy to extend, and covering a wide range of scenarios.
1. Efficient
Computation
According to the characteristics of different computing hardware, GDBT uses different local-computation implementations to make the best use of acceleration instructions. At the same time, since all tasks need distributed execution, it optimizes for both distributed and stand-alone operation as far as possible.
Storage
Different storage devices differ in price, speed, and capacity. GDBT adapts to different storage configurations, optimizing both storage access speed and storage usage efficiency.
Network
By rationally designing computation modes and network communication, GDBT optimizes communication latency and network utilization.
Efficient disaster recovery
Because machine learning algorithms have many intermediate states, GDBT's disaster recovery emphasizes the core parameters of the algorithm in order to avoid excessive overhead. At the same time, different disaster recovery strategies are chosen for different computation scales.
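The idea of checkpointing only the core parameters, at a coarse granularity, can be sketched as follows (a generic illustration under our own assumptions, not GDBT's actual mechanism): the model weights are written atomically every N steps, and recovery resumes from the last complete checkpoint rather than replaying every intermediate state.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, params):
    """Persist only the core model parameters (not intermediate state),
    written via a temp file + rename so a crash never leaves a
    half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "params": params}, f)
    os.replace(tmp, path)   # atomic rename

def load_checkpoint(path):
    """Resume from the last complete checkpoint, or start fresh."""
    if not os.path.exists(path):
        return 0, {"w": 0.0}
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["params"]

ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
step, params = load_checkpoint(ckpt)
for step in range(step, 100):
    params["w"] += 0.01                 # stand-in for one training step
    if step % 25 == 0:                  # coarse granularity: cheap, loses
        save_checkpoint(ckpt, step + 1, params)  # at most 25 steps of work

step2, params2 = load_checkpoint(ckpt)  # what a restarted job would see
print(step2, round(params2["w"], 2))
```

The checkpoint interval is exactly the tradeoff the text describes: finer granularity loses less work on failure but pays more I/O overhead during normal execution.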
2. Intelligence
Algorithm intelligence
Feature engineering and model tuning in machine learning require data scientists to have a deep understanding of both the algorithms and the actual business. Therefore, an advanced machine learning system needs to provide automatic or semi-automatic feature engineering. For example, GDBT provides automatic feature engineering including automatic feature generation, automatic feature selection, automatic feature combination, and automatic model-parameter tuning.
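To illustrate what automatic feature combination means, here is a minimal sketch (our own toy construction, not GDBT's implementation): enumerate pairwise crosses of categorical features, score each crossed column with a crude usefulness measure, and keep the best. The XOR-like labels below are predictable only by the cross of features A and C, which is exactly the case a human feature engineer would otherwise have to find by hand.

```python
from itertools import combinations

def cross(a, b):
    """Combine two categorical feature columns into one crossed column."""
    return [f"{x}&{y}" for x, y in zip(a, b)]

def score(col, labels):
    """Crude usefulness score: how far the per-value label means deviate
    from the global mean (a stand-in for real criteria such as
    information gain)."""
    mean = sum(labels) / len(labels)
    groups = {}
    for v, y in zip(col, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) * (sum(g) / len(g) - mean) ** 2
               for g in groups.values()) / len(labels)

def auto_combine(features, labels, top_k=1):
    """Enumerate pairwise feature crosses, keep the top_k best-scoring."""
    scored = []
    for (na, a), (nb, b) in combinations(features.items(), 2):
        scored.append((score(cross(a, b), labels), f"{na}x{nb}"))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# XOR-like labels: neither raw feature predicts them, but the A-C cross does.
features = {"A": ["a", "a", "b", "b"] * 10,
            "C": ["c", "d", "c", "d"] * 10,
            "E": ["e"] * 40}
labels = [1, 0, 0, 1] * 10
print(auto_combine(features, labels))
```

A production system would of course score candidates with proper statistics and prune the combinatorial search, but the shape of the problem is the same.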
Runtime intelligence
GDBT automatically adapts its operating mode to different application scenarios, achieving higher operating efficiency.
3. Easy to develop
GDBT provides industrial-grade ease of development, shielding algorithm developers from underlying details as much as possible, packaging machine learning components well, and making it easy to implement the various distributed modes that machine learning requires. In GDBT, only a few hundred lines of code are needed to implement distributed versions of algorithms such as logistic regression and matrix factorization.
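For a sense of scale, the core of logistic regression trained by SGD fits in a few dozen lines even in plain Python. The sketch below is a single-machine illustration under our own assumptions, not GDBT code; a framework's job is to let the same core logic run distributed, e.g. by sharding the data across workers and merging gradients through a parameter server.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logreg(data, dim, lr=0.5, epochs=200):
    """Minimal single-machine logistic regression trained by SGD; a
    distributed version would shard `data` across workers and merge
    their gradient updates."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y                     # gradient of log loss wrt the logit
            for i in range(dim):
                w[i] -= lr * g * x[i]
    return w

random.seed(0)
# Toy separable data: label 1 when x1 + x2 > 1; bias folded in as x0 = 1.
data = [([1.0, a, b], 1 if a + b > 1 else 0)
        for a in [i / 10 for i in range(11)]
        for b in [i / 10 for i in range(11)]
        if abs(a + b - 1) > 0.05]         # drop points on the boundary
w = sgd_logreg(data, 3)
acc = sum((sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5) == (y == 1)
          for x, y in data) / len(data)
print(round(acc, 2))
```

The point is that the algorithm itself is short; the hundreds of lines the text mentions are mostly the distribution, communication, and fault-tolerance scaffolding that a good framework absorbs.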
4. Easy to deploy and maintain
GDBT supports a variety of platforms, such as Yarn, Hadoop MR, and MPI, and facilitates cross-platform migration. It can monitor running status and progress in real time, making debugging and error tracking easier.
5. Covering a wide range of application scenarios
By redesigning and deeply integrating existing models and algorithms, and by rationally designing computation patterns and processes, GDBT provides more efficient algorithms that match practical application scenarios. For example, algorithms on GDBT can handle both discrete and continuous features, and they optimize the usage efficiency of I/O and computing resources.