
All about Hashing


What is a Hashing Function?
#

A hashing function is a mathematical algorithm that converts an input (or “message”) of any size into a fixed-size string of bytes, called a hash code, hash value, or digest. The output is a short fingerprint of the input data, usually written as a hexadecimal number that appears random. It is not guaranteed to be unique, because infinitely many inputs map to a finite set of outputs, but a good hash function makes two inputs with the same hash (a collision) extremely unlikely. Examples of common hash functions include MD5, SHA-1, and SHA-256.

Hashing is the process of applying such a function. Its main purposes are to check data integrity and to compare large amounts of data quickly by comparing their short hash values instead of the data itself.

Here’s a high-level overview of how hashing works:

1. Input Data
#

The data to be hashed, which can be of any size (e.g., a string, a file, etc.).

2. Hash Function
#

The hash function processes the input data and performs several operations to transform the input into a fixed-size hash value.

3. Fixed-Size Output
#

Regardless of the input size, the output (hash value) will have a fixed size. For example, SHA-256 always produces a 256-bit (32-byte) hash value.
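The fixed-size property can be checked directly with Python's hashlib module. This is a minimal sketch; the sample inputs are illustrative values and not taken from the article.

```python
import hashlib

# Inputs of very different sizes (illustrative values)
inputs = [b"", b"a", b"hello world", b"x" * 1_000_000]

# SHA-256 digests are always 32 bytes, i.e. 64 hexadecimal characters
digest_lengths = [len(hashlib.sha256(data).digest()) for data in inputs]
hex_lengths = [len(hashlib.sha256(data).hexdigest()) for data in inputs]

print(digest_lengths)  # [32, 32, 32, 32]
print(hex_lengths)     # [64, 64, 64, 64]
```

An empty input and a one-megabyte input both produce a digest of the same length.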

Properties of a Good Hash Function
#

  • Deterministic: The same input will always produce the same hash output.
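  • Fast Computation: The hash value should be quick to compute. (Password hashing is the exception: dedicated functions such as bcrypt or Argon2 are deliberately slow to make guessing expensive.)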
  • Avalanche Effect: A small change in the input should produce a significantly different hash.
  • Fixed Output Length: The output hash length should be fixed, regardless of input size.
  • Pre-image Resistance: Given a hash value, it should be infeasible to find any input that produces it.
  • Collision Resistance: It should be infeasible to find two different inputs that produce the same hash value.
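Two of these properties, determinism and the avalanche effect, are easy to observe in practice. The sketch below uses SHA-256 and two hypothetical helper functions (`sha256_bits`, `differing_bits`) introduced only for this illustration; it counts how many output bits change when one input character changes.

```python
import hashlib

def sha256_bits(data):
    """Return the SHA-256 digest of data as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def differing_bits(a, b):
    """Count the bit positions where the two SHA-256 digests differ."""
    return bin(sha256_bits(a) ^ sha256_bits(b)).count("1")

# Deterministic: hashing the same input twice gives the same digest
same = hashlib.sha256(b"hello").hexdigest() == hashlib.sha256(b"hello").hexdigest()

# Avalanche effect: the two inputs differ in a single character
changed = differing_bits(b"hello", b"hellp")

print(same)
print(f"{changed} of 256 bits differ")
```

For a well-behaved hash function, roughly half of the output bits are expected to flip, even though the inputs are almost identical.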

Hashing functions are widely used in various applications, including:

  1. Data Retrieval: Hash functions are used in hash tables to quickly locate data records. By hashing a key, the system can quickly access the corresponding value in the table.

  2. Data Integrity: Hash functions can generate a checksum or digest that verifies data integrity. If the data changes, the hash value also changes, indicating corruption or tampering. For downloads, users compare the hash of the downloaded file with the hash published on the website: a match shows the file was not corrupted or altered on the way, and a mismatch shows the file should not be trusted. When software is offered from multiple sources or mirrors, hashes also confirm that each source provides the exact same file.

  3. Cryptography: In cryptographic applications, hash functions are used to create digital signatures and perform password hashing. They are designed to be secure against various attacks, ensuring that it’s infeasible to reverse-engineer the original input from the hash value.

  4. Data Deduplication: Hash functions help identify duplicate data by computing a hash for each piece of data. Identical data always has the same hash value, so matching hashes flag likely duplicates.
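As an illustration of the deduplication use case, the following sketch groups items by their SHA-256 digest. The function name `find_duplicates` and the in-memory "files" are hypothetical and exist only for this example; a real tool would read file contents from disk.

```python
import hashlib
from collections import defaultdict

def find_duplicates(items):
    """Group item names whose contents share the same SHA-256 digest."""
    groups = defaultdict(list)
    for name, content in items.items():
        groups[hashlib.sha256(content).hexdigest()].append(name)
    # Keep only digests that were seen more than once
    return [names for names in groups.values() if len(names) > 1]

# Hypothetical in-memory "files" mapping a name to its bytes
files = {
    "a.txt": b"quarterly report",
    "b.txt": b"meeting notes",
    "copy_of_a.txt": b"quarterly report",
}

duplicates = find_duplicates(files)
print(duplicates)  # [['a.txt', 'copy_of_a.txt']]
```

The dictionary keyed by digest is itself a hash table, so the same sketch also shows the data retrieval use case: looking up a digest takes roughly constant time regardless of how many items are stored.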

How to Use Hashes for Verification
#

  1. Download the File: First, download the package file, for example mistral_inference-0.0.0.tar.gz.

  2. Calculate the Hash: Use a hashing tool to calculate the hash of the downloaded file. Here are some examples of commands you can use to calculate different types of hashes:

    • SHA256:

      sha256sum mistral_inference-0.0.0.tar.gz
      
    • MD5:

      md5sum mistral_inference-0.0.0.tar.gz
      
    • BLAKE2b-256 (b2sum produces a 512-bit digest by default, so set the length to 256 bits):

      b2sum -l 256 mistral_inference-0.0.0.tar.gz
      
  3. Compare the Hash: Compare the calculated hash with the hash provided on the PyPI website. If they match, the file is verified.
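If the command-line tools are not at hand, the same three digests can be computed in Python in a single pass over the file. This is a sketch under stated assumptions: `file_digests` is a hypothetical helper, and the demo hashes a small temporary file so the snippet runs anywhere; in practice you would pass the path of the downloaded package instead.

```python
import hashlib
import os
import tempfile

def file_digests(path, chunk_size=8192):
    """Return SHA-256, MD5 and BLAKE2b-256 hex digests of a file, read once."""
    hashers = {
        "sha256": hashlib.sha256(),
        "md5": hashlib.md5(),
        "blake2b_256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as f:
        # Feed every chunk to all three hash objects
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

# Demo on a small temporary file; replace demo_path with your downloaded file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example package contents")
    demo_path = tmp.name

digests = file_digests(demo_path)
os.remove(demo_path)

for name, value in digests.items():
    print(f"{name}: {value}")
```

Reading the file once and updating several hash objects avoids re-reading large downloads for each algorithm.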

Example of Verification Process
#

Let’s say you downloaded mistral_inference-0.0.0.tar.gz and want to verify it using SHA256:

  1. Calculate the SHA256 hash of the file:

    sha256sum mistral_inference-0.0.0.tar.gz
    
  2. Compare the output with the hash provided on the website:

    f69c6cb3852e2d937b8d196845bdf6fba9dafa12fc89b795d96b92a3d987cf9c
    

If the hash values match, the file is intact and has not been tampered with. If they do not match, the file may have been corrupted or altered, and you should not use it.

Summary
#

The hashes provided on the PyPI website are essential for ensuring the security and integrity of downloaded files. They allow users to verify that they have received the correct and unaltered version of the software.

Common Hash Functions
#

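  • MD5: Produces a 128-bit hash value. It is cryptographically broken (collisions can be generated easily) and unsuitable for security purposes, although it is still seen as a checksum against accidental corruption.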
  • SHA-1: Produces a 160-bit hash value. It has vulnerabilities and is not recommended for security-sensitive applications.
  • SHA-256: Part of the SHA-2 family, produces a 256-bit hash value. Widely used for security applications.
  • SHA-3: The latest member of the Secure Hash Algorithm family, standardized in 2015.
  • BLAKE2: A fast and secure hash function, designed as an alternative to MD5 and SHA-2.

Understanding how hashing works is crucial for ensuring data integrity, securing data, and verifying data authenticity in various applications.
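The output sizes listed above can be confirmed with hashlib. In this sketch the algorithm names are the ones accepted by `hashlib.new()`; SHA-3 is represented by its 256-bit variant and BLAKE2 by BLAKE2b at its default length, both chosen here only for illustration.

```python
import hashlib

# Names as accepted by hashlib.new(); blake2b is shown at its default length
algorithms = ["md5", "sha1", "sha256", "sha3_256", "blake2b"]

# digest_size is reported in bytes, so multiply by 8 to get bits
digest_bits = {name: hashlib.new(name).digest_size * 8 for name in algorithms}

for name, bits in digest_bits.items():
    print(f"{name}: {bits} bits")
```

Note that BLAKE2b reports 512 bits here: its digest length is configurable, and 512 bits is the default.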

Can the same input give different hash values? How to check a hash value?
#

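A hash function is deterministic, so the same input hashed with the same algorithm and parameters always gives the same value. If two tools report different hash values for the same file, for example for a BLAKE2b-256 hash, then something differs between the two computations: the file itself, the algorithm, or the digest length. Work through the following checks.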

1. Verify the Correct File is Being Hashed
#

Ensure that the file you downloaded is the exact file you are calculating the hash for.

2. Check for Partial or Corrupted Download
#

Sometimes a file may not be fully downloaded. You can check the file size and compare it with the expected size listed on the website. Re-download the file if necessary.

3. Calculate the BLAKE2b-256 Hash Correctly
#

Ensure you are using the correct command and syntax. b2sum produces a 512-bit BLAKE2b digest by default, so for BLAKE2b-256 set the digest length to 256 bits with the -l option:

b2sum -l 256 mistral_inference-0.0.0.tar.gz

If your system does not have the b2sum command, you might need to install it. Alternatively, you can use Python to calculate the hash.

4. Use Python to Calculate the BLAKE2b-256 Hash
#

If b2sum is not available or not giving the correct result, you can use Python to calculate the BLAKE2b-256 hash:

How to compute the BLAKE2b hash of a file in Python:
#

import hashlib

# Initialize the BLAKE2b hash function
blake2b_hash = hashlib.blake2b(digest_size=32)

# Open the file in binary mode and read it in chunks
with open('example_file.txt', 'rb') as f:
    while chunk := f.read(8192):
        blake2b_hash.update(chunk)

# Finalize the hash computation
hash_value = blake2b_hash.hexdigest()

# Print the computed hash value
print(hash_value)
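The digest_size=32 argument matters. A short sketch, using an illustrative input, shows that BLAKE2b with a 256-bit digest is a different hash from BLAKE2b with the default 512-bit digest, and not simply a shortened copy of it:

```python
import hashlib

data = b"example data"  # illustrative input

blake2b_512 = hashlib.blake2b(data).hexdigest()                  # default: 64 bytes
blake2b_256 = hashlib.blake2b(data, digest_size=32).hexdigest()  # 32 bytes

# The 256-bit digest is not a prefix of the 512-bit one
is_truncation = blake2b_512.startswith(blake2b_256)

print(len(blake2b_512), len(blake2b_256))  # 128 64
print(is_truncation)                       # False
```

This is a common reason for an apparent mismatch: a tool run with its default 512-bit length cannot reproduce a published BLAKE2b-256 value.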

Can we hash binary files?
#

Yes, you can hash binary files like MP3, XLS, JPG, BMP, and other file types. Hashing a binary file is essentially the same as hashing a text file, as hash functions operate on raw bytes. Here’s a step-by-step process to hash binary files in Python using the hashlib module.

Example: Hashing a Binary File
#

Let’s hash an example binary file, such as an image (JPG).

import hashlib

# Function to calculate the hash of a file
def calculate_file_hash(file_path, hash_algorithm='sha256'):
    # Initialize the hash function
    hash_func = hashlib.new(hash_algorithm)

    # Open the file in binary mode
    with open(file_path, 'rb') as file:
        # Read and update the hash in chunks
        for byte_block in iter(lambda: file.read(4096), b""):
            hash_func.update(byte_block)

    # Return the hexadecimal representation of the hash
    return hash_func.hexdigest()

# Path to the binary file
file_path = 'example_image.jpg'

# Calculate the hash of the file
hash_value = calculate_file_hash(file_path, 'sha256')

# Print the computed hash value
print(f"The SHA-256 hash of the file is: {hash_value}")

Hashing Different File Types
#

The above method works for any binary file type, including MP3, XLS, JPG, BMP, etc. Simply change the file_path variable to the path of your binary file.

Verifying the Hash
#

If you want to verify that the hash matches the expected value (as given on a website or other source), compare the computed hash value with the provided one:

expected_hash = 'expected_hash_value_from_website'

if hash_value == expected_hash:
    print("The hash matches the expected value.")
else:
    print("The hash does not match the expected value.")

Understanding Hash Mismatches
#

If the computed hash does not match the expected value, consider the following:

  • Ensure the file is not altered or corrupted.
  • Verify the hash algorithm used is the same as specified.
  • Check for extra bytes (like newlines) in text files if converted from one format to another.
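  • Ensure the file was not re-saved by another program. The hash covers only the file's bytes, not its name or timestamps, but a tool that rewrites embedded metadata (for example, image tags) changes those bytes and therefore the hash.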

This approach ensures data integrity and authenticity across different file types and scenarios.
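Two of the mismatch causes above can be reproduced in a few lines. This sketch uses illustrative byte strings and a hypothetical helper, `hashes_match`; it shows that converting line endings changes the hash, and that a published hash may differ from a computed one only in letter case or surrounding whitespace.

```python
import hashlib

unix_text = b"line one\nline two\n"          # LF line endings
windows_text = b"line one\r\nline two\r\n"   # CRLF line endings

unix_hash = hashlib.sha256(unix_text).hexdigest()
windows_hash = hashlib.sha256(windows_text).hexdigest()
print(unix_hash == windows_hash)  # False

def hashes_match(computed, expected):
    """Compare hex digests, ignoring surrounding whitespace and letter case."""
    return computed.strip().lower() == expected.strip().lower()

# A published value copied with upper-case letters and stray whitespace
published = "  " + unix_hash.upper() + "\n"
print(hashes_match(unix_hash, published))  # True
```

Normalizing both strings before comparing avoids false mismatches, while any real difference in the file's bytes still produces a different digest.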

Author
Dr Hari Thapliyaal
dasarpai.com
linkedin.com/in/harithapliyal

