AWK++
Abstract
Keywords
Introduction
AWK plays a significant role in UNIX culture and as a toolkit for data science tasks. However, it offers few mechanisms for manipulating data in modern formats or for retrieving data through modern channels. Those formats include JSON and Apache Arrow.
AWK Plus Plus, conceptualized as a metadata layer or storage-system middleware, introduces an architecture that separates resources from resource settings and extends functionality with semantic search, fuzzy algorithms, and an in-process database. We propose a system similar to AWK in the sense that it takes various unstructured data sources and transforms them into structured data.
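The transformation from unstructured lines to structured data can be illustrated with a minimal Python sketch. It is not the AWK Plus Plus implementation: the function names `parse_line` and `load_records`, the column names, and the choice of SQLite as the in-process database are assumptions made only for illustration. Lines are split into fields in the manner of AWK's default field separator, and the resulting records are loaded into an in-memory relational table.

```python
import sqlite3

def parse_line(line, field_names, sep=None):
    """Split one line into fields and pair them with column names (hypothetical helper)."""
    # sep=None splits on runs of whitespace, similar to AWK's default FS.
    fields = line.split(sep)
    if len(fields) != len(field_names):
        return None  # skip lines that do not match the expected shape
    return dict(zip(field_names, fields))

def load_records(lines, field_names):
    """Load parsed records into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")  # in-process database, assumed for illustration
    columns = ", ".join(f'"{name}" TEXT' for name in field_names)
    conn.execute(f"CREATE TABLE records ({columns})")
    placeholders = ", ".join("?" for _ in field_names)
    for line in lines:
        record = parse_line(line, field_names)
        if record is not None:
            conn.execute(
                f"INSERT INTO records VALUES ({placeholders})",
                [record[name] for name in field_names],
            )
    return conn

raw = ["alice 30 london", "bob 25 paris", "malformed line with too many fields", ""]
conn = load_records(raw, ["name", "age", "city"])
rows = conn.execute(
    "SELECT name, city FROM records WHERE CAST(age AS INTEGER) > 26"
).fetchall()
print(rows)
```

Once the records are in a relational table, the filtering that an AWK pattern would express becomes an ordinary SQL predicate.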
Related work
Data and storage systems
By definition, structured data conforms to a data model, so we adopt the relational model as the underlying data model, with JSON Net as the build language. However, data is distributed, dynamic, and unstructured.
In AWK Plus Plus, a distributed resource system maps URLs to the corresponding data and mediates reads and writes through specific communication protocols among processes. At the heart of this system are Uniform Resource Identifiers (URIs), which direct the system from a path to an inode; URLs establish inodes according to predefined permissions. The inode, in turn, serves as the gateway to the actual data, and applications dynamically define the methods for manipulating that data based on the associated metadata.
A crucial aspect of this architecture is the clear separation between data and its metadata. Resources are represented by URIs that encapsulate methods, typically in plain-text JSON, which allows resources to be mounted across diverse file systems, including non-POSIX-compliant ones. The design incorporates data location pointers: a lookup mechanism directs to either metadata or data through explicit pointers, as described in the existing literature [14]. Custom inode properties are feasible, but plain-text metadata is more portable across file systems beyond the POSIX standard.
This approach ensures that mounted file systems permit leases for caching reads and buffering writes, which reduces latency through service worker mediation. The architecture also contemplates redundancy and internal consistency mechanisms, which are needed for multiclient coordination and rely on client-server collaboration. Coordination is further refined through redundancy strategies and by partitioning the file system namespace, a process in which remote links connect resource entries in one process to resources in another.
Together, path-to-inode translation, inode-to-data resolution, and flexible granularity of data targets allow the system to manage data efficiently and scalably within distributed resource systems.
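The two-step lookup described in this section, from path to inode and from inode to data, can be sketched as follows. This is a simplified illustration under stated assumptions, not the system's actual design: the `res://` scheme, the metadata fields (`owner`, `readers`, `format`, `data_ptr`), and the in-memory dictionaries standing in for the namespace, inode table, and data store are all hypothetical.

```python
import json

# Hypothetical in-memory stand-ins for the namespace, inode table, and data store.
NAMESPACE = {"res://demo/logs/app.log": 1}
INODES = {
    # Metadata is kept as plain-text JSON, separate from the data it describes.
    1: json.dumps({
        "owner": "alice",
        "readers": ["alice", "bob"],
        "format": "text/plain",
        "data_ptr": "blk-0001",  # explicit pointer to the data location
    }),
}
DATA = {"blk-0001": b"GET /index.html 200\nGET /missing 404\n"}

def resolve(uri):
    """Path-to-inode step: map a URI to an inode number."""
    try:
        return NAMESPACE[uri]
    except KeyError:
        raise FileNotFoundError(uri) from None

def stat(inode):
    """Return the plain-text JSON metadata of an inode as a dict."""
    return json.loads(INODES[inode])

def read(uri, user):
    """Inode-to-data step: check permissions, then follow the data pointer."""
    meta = stat(resolve(uri))
    if user not in meta["readers"]:
        raise PermissionError(f"{user} may not read {uri}")
    return DATA[meta["data_ptr"]]

content = read("res://demo/logs/app.log", "bob")
print(content.decode())
```

Because the metadata is ordinary JSON text rather than a file-system-specific inode attribute, the same lookup could in principle run on top of any store that can hold a text file, which is the portability argument made above.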
Results
Conclusions
Future work