
Hierarchical Cluster Engine Project

Developer(s): HCE Team
Development status: Active
Initial release: 2013
Written in: C++, PHP, Python
Stable release: 1.4.4 / 31 August 2015
Preview release: 1.4.4 / 25 September 2015

Hierarchical Cluster Engine (HCE) is a FOSS solution for constructing a custom network mesh or distributed network cluster structure with several types of relations between nodes. It formalizes data-flow processing from an upper-level central source point down to lower-level nodes and back, as well as the handling of management requests from multiple source points. It natively supports reducing of results from multiple nodes (aggregation, duplicate elimination, sorting and so on), internally integrates a powerful full-text search engine and data storage, provides both transaction-less and transactional request processing, supports flexible run-time changes of the cluster infrastructure, and offers client-side integration APIs with bindings for many languages, all in one product built in C++.
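To picture the native reducing step, the following minimal Python sketch merges per-node result lists with aggregation, duplicate elimination and sorting; the data shapes and function names are illustrative assumptions, not HCE's actual code.

    # Toy illustration of the native reducing step: merge result lists
    # from several nodes, drop duplicates, sort. Data shapes and names
    # are assumptions, not HCE's actual code.
    def reduce_node_results(per_node_results):
        merged = {}
        for results in per_node_results:              # aggregation
            for doc in results:
                seen = merged.get(doc["id"])
                if seen is None or doc["score"] > seen["score"]:
                    merged[doc["id"]] = doc           # duplicates elimination
        return sorted(merged.values(),                # sorting
                      key=lambda d: d["score"], reverse=True)

    node_a = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}]
    node_b = [{"id": 2, "score": 0.7}, {"id": 3, "score": 0.5}]
    print(reduce_node_results([node_a, node_b]))
    # [{'id': 1, 'score': 0.9}, {'id': 2, 'score': 0.7}, {'id': 3, 'score': 0.5}]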


This project is the successor of the Associative Search Machine (ASM) full-text web search engine project, which was developed from 2006 to 2012 by IOIX Ukraine.

The HCE project products

  • The hce-node core (HCE-node application): the network transport cluster infrastructure engine.
  • The Bundle:
      • Distributed Crawler service (DC),
      • Distributed Tasks Manager service (DTM),
      • PHP language API and management tools,
      • Python language API and management tools.
  • Utilities.

  • All of them are sets of applications that can be used to construct different distributed solutions, such as: remote process execution management; data processing (including text mining with NLP); web-site crawling (incremental, periodic, with flexible and adaptive scheduling, RSS feeds and custom structures); web-site data scraping (with pre-defined and custom scrapers, XPath templates, and sequential and optimized scraping algorithms); web search engines (a complete cycle including crawling, scraping and a distributed search index based on the Sphinx indexing engine); corporate integrated full-text search based on a distributed Sphinx engine index; and many other applied solutions with similar business logic.

    HCE-node application

    The heart and main component of the HCE project is the hce-node application. This application integrates a complete set of base functionality to support the network infrastructure, hierarchical cluster construction, full-text search system integration and so on.

  • Implemented for the Linux OS environment and distributed as a source-code tarball archive and as a Debian Linux binary package with dependency packages.
  • Supports a configuration-less single-instance start, or takes a set of options used to build the corresponding network cluster architecture.
  • Intended for use with client-side applications or an integrated API.
  • The first implementation of the client-side API and CLI utilities is bound to PHP.
  • HCE application areas:

  • As a network infrastructure and message transport layer provider, HCE can be used in any big-data solution that needs a custom network structure to build a distributed, high-performance data processing or data-mining architecture that scales easily both vertically and horizontally.
  • As a natively supported full-text search engine interface provider, HCE can be used in web or corporate network solutions that need fast, powerful full-text search and NoSQL distributed data storage smoothly integrated using the target project's native languages. Currently the Sphinx search engine with an extended data model is supported internally.
  • As a Distributed Remote Command Execution service provider, HCE can be used to automate the administration of many host servers in ensemble mode for OS and service deployment, maintenance and support tasks.
  • Hierarchical Cluster as an engine:

  • Provides the hierarchical cluster infrastructure: node connection schema, relations between nodes, node roles, request typification, data-processing sequence algorithms, data sharding modes, and so on.
  • Provides the network transport layer for client application data and administration management messages.
  • Manages the natively supported integrated NoSQL data storage, the Sphinx search index and Distributed Remote Command Execution.
  • Collects, reduces and sorts the results of native and custom data processing.
  • Ready to support transactional message processing.
  • Hce-node roles in the cluster structure: internally, the hce-node application contains seven basic handler threads. Each handler acts as a special black-box message processor/dispatcher and is used in combination with the others so that a node works in one of four different roles (see the sketch after this list):

  • Router – the upper end-point of the cluster hierarchy. Has three server-type connections: it handles client API connections, connections from instances of other node roles (typically shard or replica managers) and admin connections.
  • Shard manager – an intermediate point of the cluster hierarchy. Routes messages between the upper and lower layers using data sharding and multicast message dispatching algorithms. Has two server-type connections and one client connection.
  • Replica manager – the same as the shard manager, but routes messages between the upper and lower layers using data balancing and round-robin message dispatching algorithms.
  • Replica – the lower end-point of the cluster hierarchy. A data node: it interacts with the data storage and/or processes data with the target algorithm(s), provides the interface to the full-text search engine, and is the target host for Distributed Remote Command Execution. Has one server-side and one client-side connection used for the cluster infrastructure, and can also have several data-storage-dependent connections.
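    As a sketch of how these roles compose, the following Python snippet models a small hierarchy of one router, one shard manager and two replica managers with two replicas each; the classes and field names are illustrative assumptions, not the hce-node configuration format.

      # Hypothetical model of the node roles described above; the classes
      # and field names are illustrative, not the hce-node configuration.
      from dataclasses import dataclass, field

      @dataclass
      class Node:
          role: str            # "router" | "shard-mgr" | "replica-mgr" | "replica"
          children: list = field(default_factory=list)

      # Router -> one shard manager -> two replica managers -> two replicas each:
      # requests fan out across shards, then balance round-robin inside a shard.
      cluster = Node("router", [
          Node("shard-mgr", [
              Node("replica-mgr", [Node("replica"), Node("replica")]),
              Node("replica-mgr", [Node("replica"), Node("replica")]),
          ]),
      ])

      def count_replicas(node):
          """Count data nodes (replicas) at the bottom of the hierarchy."""
          if node.role == "replica":
              return 1
          return sum(count_replicas(child) for child in node.children)

      print(count_replicas(cluster))  # 4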
    Bundle

    Both the DTM and DC applications are provided with a set of functional tests and demo-operation automation scripts based on the Linux shell. The Bundle distribution is provided as a zip archive and needs some environment setup before its functionality is ready to use.

    Distributed Crawler service (DC)

    It is a Linux OS daemon application that implements the business logic of a distributed web crawler and document data processor. It is based on the main functionality of the DTM application and the hce-node DRCE Functional Object, and runs web crawling, processing and other related tasks as isolated session executable modules that encapsulate common business logic. The crawler also contains a raw-content storage subsystem based on the file system (which can be customized to use a key-value store or SQL). The application uses several DRCE clusters to construct the network infrastructure, MySQL and SQLite back-ends for indexed data (sites, URLs, contents and configuration properties), and a key-value data storage for the processed contents of pages or documents.
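    A hypothetical usage sketch of talking to the DC daemon follows; HCE's real client interaction goes through the PHP and Python API bindings, and the DCClient class, command names and payload fields here are assumptions for illustration only.

      # Hypothetical sketch only: the DCClient class, command names and
      # payload fields are assumptions, not the real DC client API.
      import json

      class DCClient:
          """Toy stand-in for a Distributed Crawler client connection."""
          def __init__(self, host, port):
              self.endpoint = (host, port)   # where the DC daemon would listen

          def request(self, command, payload):
              # In a real client this JSON message would be sent over the
              # cluster transport; here it is only printed.
              message = {"command": command, "payload": payload}
              print(json.dumps(message, indent=2))
              return {"errorCode": 0}

      dc = DCClient("localhost", 5501)       # host and port are assumptions
      dc.request("SITE_NEW", {"url": "http://example.com", "maxURLs": 100})
      dc.request("SITE_CRAWL", {"url": "http://example.com", "depth": 2})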

    Distributed Tasks Manager service (DTM)

    It is a Linux OS multi-threaded daemon application that implements the business logic of task management, using the DRCE Cluster Execution Environment to manage tasks as remote processes. It implements general management operations for distributed task scheduling, execution, state checks, OS resource monitoring and so on. The application can be used for parallel task execution with state monitoring on a hierarchical network cluster infrastructure with a custom node connection schema. It is a multipurpose application aimed at covering the needs of projects involving big-data computation, distributed data processing and multi-host data processing with OS resource balancing and limits. It supports several balancing modes, including multicast, random, round-robin and system-resource-usage algorithms, and provides high-level state checks, statistics and diagnostic automation based on the natural hierarchy and relations between nodes. Message routing is supported as a method of task and data balancing as well as of task management.
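    The following sketch shows what a DTM-style task request could look like, covering the balancing mode and OS resource limits mentioned above; every field name is an assumption for illustration, not the service's actual message schema.

      # Hypothetical DTM-style task request; every field name here is an
      # assumption for illustration, not the service's message schema.
      import json, time

      def new_task(command, balancing="round-robin", max_cpu=50):
          """Describe a remote process to run, how to balance it across
          nodes, and the OS resource limits to respect."""
          return {
              "id": int(time.time() * 1000),  # unique task id
              "command": command,             # remote process command line
              "balancing": balancing,         # multicast | random | round-robin | resources
              "limits": {"cpu_percent": max_cpu},
              "schedule": "now",
          }

      task = new_task("python3 process_chunk.py --part=7")
      print(json.dumps(task, indent=2))       # would be sent to the DTM daemon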

    Utilities

    It is a set of console applications, separated by role and functionality, that can be united in chains for sequential server-side data processing or used as self-sufficient tools.

    The utilities are designed as common functional units for typical web projects that need to fetch large amounts of data from the web or other sources and parse, convert and process it. They support a unified input-output interface with JSON-format message interaction. The first implemented utility is the Highlighter: a tool for fast, parallel, multi-algorithm highlighting of textual patterns. It provides a CLI, works as a console filter tool and uses JSON-format protocol messages for input and output. Highlighting is a text-processing algorithm that takes a search query string and textual content as input and returns the textual content with marks at the occurrences of the query patterns, plus additional statistical information. Patterns are usually lexical words, but depending on the stemming and tokenizing processes they can be more complex constructions.
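    Since the Highlighter works as a console filter with JSON input and output, it could be driven as in the following hypothetical sketch; the binary name and message fields are assumptions for illustration.

      # Hypothetical invocation of the Highlighter console filter: JSON
      # request on stdin, JSON response on stdout. The binary name and
      # message fields are assumptions for illustration.
      import json, subprocess

      request = {
          "query": "cluster engine",          # search query string
          "content": "The cluster engine reduces node results.",
          "marks": ["<b>", "</b>"],           # how to mark pattern occurrences
      }

      proc = subprocess.run(
          ["./highlighter"],                  # assumed utility name
          input=json.dumps(request), capture_output=True, text=True,
      )
      response = json.loads(proc.stdout)      # highlighted content plus statistics
      print(response["content"])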

    Tags Reaper

    Tags Reaper is an online data-mining service that crawls web content of different kinds (dynamic or static) from multiple IPs and locations, avoiding IP bans and tracking; scrapes web pages; validates data; saves it to archives (JSON, XML, CSV); and transmits it to an external application or database. It is based on a Distributed Crawler (DC) service installation and has a web user interface (UI). Its functionality includes crawling of web resources (HTML and XML pages, binary documents such as PDF, DOC and PPT, and binary images such as JPG, GIF and PNG), including collecting URL links up to a maximum limit. It allows monitoring running spiders, scheduling new jobs and stopping running ones, all from a single page. The Tags Reaper service implements platform-level engineering that can be used for a wide range of purposes: web page content analysis for ad targeting; media monitoring and collection of public company, brand and people information; and statistics collection (announcements, documents, etc.).

    The TR service also includes a web administration management console for all operations on users, sites and resources, as well as for statistical reports and logs. DC and DTM service management will be added soon.

    Demo installations

    Several pre-configured VM images for VMware and VirtualBox have been uploaded to speed up the getting-started process. The user name is "root" and the password is the same. The target user for the DTS archive is "hce", with the same password. The VM files are zipped here.[1]

    License

    GNU General Public License version 2

    References

    [1] Hierarchical Cluster Engine Project, Wikipedia.