What is big data?
Big data refers to extremely large and complex data sets that are difficult to manage and process. Doug Laney is widely credited with popularising the term “big data”, although Kitchin and McArdle[6] trace its use in computer science academia back to the mid-1990s.
While there is no formal definition of big data, the term is used conceptually to describe the volume, velocity, variety and value of data, all of which have changed dramatically since the turn of the century as data generation has grown exponentially.
Conceptually, big data is about more than storing large amounts of data. It is also about the knowledge derived from data and the process of extracting insights from it for financial, operational and other efficiency gains. Nor is it confined to one type of data structure: it covers both structured data (traditional relational databases) and unstructured data (for example object-oriented or graph formats).
The five “V’s” that distinguish big data from traditional data
The five key characteristics of big data are often called the five “V’s”:
- Volume: The sheer scale of data matters. Traditional systems dealt in terabytes to petabytes; according to the Exploding Topics data[7], we are now into zettabytes and beyond. The biggest data centres in the world are in the USA, Germany, the UK, China and Canada.
- Velocity: The speed at which data is generated and processed has changed. In the past, data was handled by RDBMS (relational database management systems): it was structured mainly in tables of rows and columns and could easily be processed by servers and mainframes. Big data, by contrast, requires advanced parallel processing and specialised hardware; tools such as Hadoop and Spark are used to sift through structured, semi-structured and unstructured data in relational, graph, object or document formats.
- Variety: Data comes in many types (structured, unstructured and semi-structured) and is heterogeneous, e.g. text, video, images and near-field sensor data used in IoT. Video, social media and gaming are now the biggest generators of this variety of data[4]. In the early 2000s, nobody predicted that entertainment would generate the biggest volume and variety of data; previously, financial data and other “serious” forms of business data dominated.
- Veracity: Data quality and integrity are critical. With big data being generated by bots and malicious actors, or at a rate at which it is impossible to verify and clean the data sets, a key question for big data, machine learning and gen-AI is how accurate, reliable and trustworthy the underlying data being processed actually is. Data quality is therefore essential if real value is to be derived from the data.
- Value: Data has potential value, but insights must be discovered and realised. Traditionally, only simple, descriptive analytics were provided. Big data is now analysed differently: machines analyse complex data patterns very rapidly, providing prescriptive insights, and analytics can deliver predictive models, forecasts and actionable recommendations from the underlying data sets.
New data processing software - Hadoop the front runner
With big data, new software to manage this data came onto the market. The need stemmed from the growth of web search[8]: companies such as Google, Yahoo, and AltaVista began building frameworks to automate search results.
Hadoop (sometimes expanded as the High Availability Distributed Object Oriented Platform) was developed by Doug Cutting, who was working with Yahoo. The primary component of the Hadoop ecosystem is its distributed file system, the Hadoop Distributed File System (HDFS).
Today, Hadoop is part of the open-source Apache Software Foundation and is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models[13]. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
MapReduce is the processing engine of Hadoop. While HDFS is responsible for storing the data, MapReduce uses parallel processing to handle big data and analytics jobs, breaking workloads down into smaller chunks that can run at the same time. In the first phase the data is mapped, i.e. broken down into smaller chunks; in the reduce phase the data is aggregated and merged to produce the final, filtered output.[11]
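To make the map and reduce phases concrete, here is a minimal, single-machine sketch of the classic word-count pattern in plain Python. It is purely illustrative: the function names are not Hadoop APIs, and a real Hadoop job would distribute the same logic across HDFS blocks and cluster nodes.

```python
from collections import defaultdict

# Map phase: each input chunk is turned into (key, value) pairs.
def map_phase(chunk):
    for word in chunk.split():
        yield (word.lower(), 1)

# Shuffle: pairs with the same key are grouped together
# (Hadoop performs this step between the map and reduce phases).
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each group of values is aggregated into a final result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data needs big tools", "data about data"]
mapped = (pair for chunk in chunks for pair in map_phase(chunk))
print(reduce_phase(shuffle(mapped)))
# {'big': 2, 'data': 3, 'needs': 1, 'tools': 1, 'about': 1}
```

Because each chunk is mapped independently, the map work can run on many machines at once; only the shuffle and reduce steps need to bring related keys together.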
Yet Another Resource Negotiator (YARN) is the part of the Hadoop ecosystem responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.
Finally, Hadoop Common includes the libraries and utilities used and shared by the other Hadoop modules.
While Hadoop has been a front-runner in big-data processing, newer technologies have been widely adopted to manage and analyze modern data workloads.
Within the Apache Software Foundation ecosystem, tools such as Apache Spark, an in-memory distributed processing engine; Apache Flink, which supports both streaming and batch data processing; Apache Kafka, a real-time data-streaming platform; and Apache Hive and Presto/Trino, SQL engines for querying large datasets, have become widely used.
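As a small illustration of Spark’s in-memory approach, the following sketch performs the same kind of word-count aggregation through a chain of DataFrame transformations rather than explicit map and reduce jobs. It assumes the pyspark package and a local Spark runtime (with Java) are available; the application name and sample data are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a real cluster the master URL would differ.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("wordcount-sketch")
         .getOrCreate())

lines = spark.createDataFrame(
    [("big data needs big tools",), ("data about data",)], ["line"])

# Transformations are lazy and executed in memory across the cluster;
# explode() splits each line into one row per word.
counts = (lines
          .select(F.explode(F.split(F.col("line"), " ")).alias("word"))
          .groupBy("word")
          .count())

counts.show()   # triggers the actual distributed computation
spark.stop()
```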
Beyond Hadoop-based tools, NoSQL databases such as MongoDB, Cassandra, and Redis have emerged, and graph databases such as Neo4j and graph-based query engines are now being used for storing and querying highly interconnected data. Kubernetes-based data platforms now provide scalable, container-oriented data-pipeline orchestration, further advancing the big-data landscape.
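For flavour, here is a minimal document-store sketch using the pymongo driver. It assumes a MongoDB instance is reachable on localhost:27017; the database, collection and field names are purely illustrative.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (hypothetical host/port).
client = MongoClient("mongodb://localhost:27017")
collection = client["retail_demo"]["events"]

# Documents need no fixed schema: fields can vary from record to record.
collection.insert_many([
    {"user": "alice", "action": "view", "item": "laptop", "tags": ["electronics"]},
    {"user": "bob", "action": "purchase", "item": "laptop", "price": 899.0},
])

# Query by field, much as you would filter rows in SQL.
for doc in collection.find({"item": "laptop", "action": "purchase"}):
    print(doc["user"], doc.get("price"))
```

A graph database such as Neo4j would instead model the same events as nodes and relationships and query them with Cypher, which suits highly interconnected data better than either tables or documents.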
Major cloud providers have also reshaped the industry. Amazon, which began as an online bookstore, is now the largest cloud service provider, having overtaken its rivals Google Cloud and Microsoft Azure. AWS (Amazon Web Services) is now a data-platform provider offering cloud-native data-warehousing solutions such as Redshift, while independent cloud data warehouses such as Snowflake compete alongside Google’s BigQuery, contributing to the shift toward scalable, elastic, cloud-driven analytics platforms.
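As one example of how such warehouses are typically accessed, the sketch below uses the google-cloud-bigquery client. It assumes a Google Cloud project with billing and application credentials configured; the project, dataset and table names are illustrative only.

```python
from google.cloud import bigquery

# Requires Google Cloud credentials, e.g. via GOOGLE_APPLICATION_CREDENTIALS.
client = bigquery.Client()

# Standard SQL runs against data held in the warehouse; compute scales elastically.
query = """
    SELECT item, COUNT(*) AS purchases
    FROM `my_project.sales.transactions`   -- illustrative dataset and table
    GROUP BY item
    ORDER BY purchases DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.item, row.purchases)
```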
Big data benefits
Big data provides several tangible advantages:
- Better insights: More data + more types = broader and deeper understanding; uncover hidden patterns.
- Improved decision-making: Data-driven decisions with more reliable predictions (market trends, customer behaviour, risk).
- Personalised customer experiences: Combining diverse data sources (sales, social media, campaign data) enables more granular customer profiles and tailored experiences.
- Operational efficiency: Analysing internal and external data to detect anomalies, optimise processes and reduce downtime.
Big data use cases
Some representative use cases include:
- Retail / e-commerce: Predicting customer demand and launching new products using data from social media, focus groups and early roll-outs.
- Healthcare: Combining electronic records, wearable devices, staffing and supplier data to optimise care and operations.
- Financial services: Fraud detection, regulatory reporting and trend spotting across large volumes of financial transaction data.
- Manufacturing: Predictive maintenance by analysing sensor data, logs and equipment performance to reduce downtime.
- Government / public services: Using data from public services, traffic, schools, etc. to optimise resource allocation and improve transparency and public trust.
Big data challenges
There are significant hurdles to leveraging big data:
- Data growth: Data volumes are growing rapidly (doubling approximately every two years). Storing the data is only part of the challenge; processing, curating and making sense of it is harder.
- Data storage: Storage is not an insignificant consideration. Large volumes of data need to be stored, backed up and restored regularly for data safety and security, as well as to prevent data loss. Modern data centres are noisy, consume large quantities of energy and need cooling to keep the environment safe; noise-cancellation systems are required where data centres are close to habitation, and they have become expensive to run and maintain. Traditionally, data was at a smaller scale and managed with conventional relational databases, on premises or in the cloud; now several cloud-based data centres are required to manage the increasing speed and volume at which data is being generated.
- Data curation: In many organisations, data scientists spend 50-80% of their time simply cleaning, preparing and organising data (a minimal cleaning sketch follows this list).
- Security & privacy: Compliance, encryption, role-based access and regional/industry regulations add complexity.
- Cultural change: Leveraging big data often requires shifting organisational culture from legacy practices to data-driven processes, self-service analytics and training.
- Technology evolution: The tools and frameworks evolve quickly; keeping up can be a challenge.
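To illustrate the data-curation point above, here is a small pandas sketch of the kind of routine cleaning that consumes so much of a data scientist’s time. The column names, sample values and validity rules are illustrative, not taken from any real dataset.

```python
import pandas as pd

# Raw extract with typical problems: a duplicate row, a missing key,
# an unparseable date and a negative spend value.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, None, 104],
    "signup_date": ["2024-01-05", "2024-01-05", "2024-02-05", "2024-03-01", "not known"],
    "spend_gbp":   ["120.50", "120.50", "-5", "80", "60"],
})

cleaned = (raw
           .drop_duplicates()                      # remove exact duplicate records
           .dropna(subset=["customer_id"])         # drop rows missing a key identifier
           .assign(
               # coerce types; unparseable values become NaT/NaN rather than bad data
               signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
               spend_gbp=lambda d: pd.to_numeric(d["spend_gbp"], errors="coerce"),
           )
           .query("spend_gbp >= 0"))               # simple validity rule: no negative spend

print(cleaned)
```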
How big data works
Oracle[1] outlines a three-step workflow (a small illustrative sketch follows the list):
- Integrate: Ingest data from many disparate sources; traditional ETL may not suffice at large scale.
- Manage: Store and manage the data (cloud, on-premises, hybrid), often using data lakes and elastic compute to scale.
- Analyse: Perform analysis and take action, e.g. visual analytics, machine learning/AI models and discovery of new insights.
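The toy sketch below walks through the integrate, manage and analyse steps using pandas, with a Parquet file standing in for a data lake. It assumes pandas and pyarrow are installed; the tables, columns and file name are illustrative, and a production pipeline would of course use distributed storage and processing engines.

```python
import pandas as pd

# Integrate: in practice these frames would be ingested from files, APIs or streams.
sales = pd.DataFrame({"customer_id": [1, 2, 1], "spend": [120.0, 80.0, 40.0]})
web = pd.DataFrame({"customer_id": [1, 2], "visits": [14, 3]})
combined = sales.merge(web, on="customer_id", how="left")

# Manage: persist to columnar storage (a stand-in for a data lake in this sketch).
combined.to_parquet("customer_activity.parquet", index=False)

# Analyse: read back and derive a simple insight, e.g. spend and visits per customer.
activity = pd.read_parquet("customer_activity.parquet")
print(activity.groupby("customer_id").agg(total_spend=("spend", "sum"),
                                          visits=("visits", "max")))
```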
Best practices for big data
- Align with business goals: Ensure big data initiatives support concrete business and IT priorities, not just novelty.
- Address skills shortages via governance & standards: Big data requires expertise; governance and standardisation help mitigate risk.
- Establish a centre of excellence: Share knowledge, oversight and resources across the enterprise to scale big data competencies.
- Integrate unstructured with structured data: The richest insights often come from linking new big-data sources to existing structured data.
- Create sandbox/discovery labs for performance: Provide high-performance experimental environments for analytics, data scientists and business users.
- Adopt a cloud operating model: Big data solutions benefit from elastic scalability, on-demand compute/storage and quick provisioning.
Evolution of big data: past, present & future
- Past: Managing large data sets dates back decades (1960s/70s relational databases), but big data surged around 2005 with the advent of large-scale internet usage and frameworks like Hadoop; NoSQL databases began growing.
- Present: Open-source frameworks (Hadoop, Spark) make big data more accessible; the IoT (Internet of Things) and machine learning generate even more data.
- Future: Technologies such as cloud scalability, gen-AI (generative artificial intelligence) and graph databases will push the big data boundaries further. Cyber-security attacks and fraud have accelerated data controllers’ responsibilities for keeping data safe and secure.
Conclusion
Big data has evolved from a niche technical concept into a foundational pillar of the modern digital economy. It represents far more than the accumulation of massive datasets — it reflects a shift in how organisations think, operate, and compete. At its core, big data combines scale, speed, and complexity, but the real value lies in the ability to transform raw information into meaningful insight, innovation, and measurable business outcomes.
The traditional data landscape, once dominated by structured relational databases and batch processing, has expanded to include real‑time streaming, unstructured multimedia, IoT sensor outputs, and highly interconnected graph‑based information. With this shift, organisations now rely on sophisticated distributed architectures and cloud‑native solutions to store, process, and extract insights from data at unprecedented scale. Frameworks such as Hadoop laid the early foundation, but modern ecosystems increasingly leverage in‑memory computing (e.g., Spark), distributed messaging (Kafka), and specialised NoSQL and graph databases to meet diverse analytical and operational demands.
Big data’s promise is reflected across industries: precision health models in medicine, real‑time fraud detection in finance, digital twins in manufacturing, and personalised recommendation systems in retail and entertainment. These capabilities not only enhance performance and user experience but also unlock new revenue streams, competitive differentiation, and strategic agility.
Yet, the power of big data brings parallel responsibilities and challenges. Issues such as data quality, ethical use, privacy protection, energy consumption, and regulatory compliance continue to shape the field. Organisations must balance technological advancement with governance, security, transparency, and user trust. Additionally, cultural transformation remains critical — success increasingly depends on data‑literate teams, interdisciplinary collaboration, and investment in skilled talent.
Ultimately, big data is a continual journey rather than a fixed destination. As artificial intelligence, automation, cloud infrastructure, and edge computing converge, data will continue to expand in both scale and significance. Organisations that adopt a strategic, ethical, and innovation‑driven approach to big data will be best positioned to harness its full potential and unlock future growth.
Further Reading
[1] Oracle, “Oracle Documentation Resource,” Oracle, Available, Accessed: 2 Nov. 2025
[2] Harvard Business Review, “HBR Article (PDF)”, Available, Accessed: 2 Nov. 2025
[3] I. Yaqoob et al., “Big Data: From Beginning to Future,” Available, Accessed: 2 Nov. 2025
[4] Natalia Yerashenia (2025), Big Data Theory and Practice, Lecture 1: Module Introduction, Big Data Definition, Features of Big Data, Big Data Analytics, Ethics and Risks. PDF slides and Lecture 1 Panopto recordings available to MSc Computer Science students on module 7BDIN006W.1 Big Data Theory and Practice.
[5] Doug Laney official website, Available, Accessed: 2 Nov. 2025
[6] Rob Kitchin and Gavin McArdle, “What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets”, Available, Accessed: 2 Nov. 2025
[7] Exploding Topics, “Amount of Data Created Daily,” Available, Accessed: 2 Nov. 2025
[8] Google Cloud Learn, “What is Hadoop,” Available, Accessed: 2 Nov. 2025
[9] GeeksforGeeks, “Hadoop Architecture,” Available, Accessed: 2 Nov. 2025
[10] IBM Topics, “Hadoop,” Available, Accessed: 2 Nov. 2025
[11] GeeksforGeeks, “What is MapReduce in Hadoop,” Available, Accessed: 2 Nov. 2025
[12] Apache Software Foundation official website, Available, Accessed: 2 Nov. 2025
[13] Apache Hadoop official website, Available, Accessed: 2 Nov. 2025