I’m trying to create a dataframe containing the values from field_1 and field_2 in a single column. I haven’t used pandas a whole lot before, so I’m sure this is naive.
# Create a dataset in CSV format
field_names = ["description", "comments"]
writer = csv.writer(open("dataset.csv", "w"), quoting=csv.QUOTE_ALL, delimiter=",")
writer.writerow(field_names)
for instance in Order.objects.all():
writer.writerow([str(getattr(instance, f)) for f in field_names])
# Read CSV
data_frame = pd.read_csv("dataset.csv", index_col=0)
# Combine columns
df2 = data_frame.apply(lambda x: ", ".join(x[x.notnull()]), axis=1)
If I’m working with a fairly large dataset, is there a way I can make this more efficient? I would like to eliminate the step that creates the CSV entirely, if possible. If anybody can point me in the right direction, that’d be fantastic.
>Solution :
You really don’t need.
- The order objects and
getattr; use.values_list()to get an iterable of 2-tuples (assumingfield_namesare actual fields on the model). - CSV – now that you have an iterable of 2-tuples, pass them to the
DataFrameconstructor along with the respective column names.
field_names = ["description", "comments"]
df = pd.DataFrame(Order.objects.all().values_list(field_names), columns=field_names)
Also, you wouldn’t really need Pandas at all…
from django.db.models import F, Value
from django.db.models.functions import Concat
# ...
my_data = list(
Order.objects.annotate(
x=Concat(
F("description"),
Value(", "),
F("comments"),
)
).values_list("x", flat=True)
)
and you have a list of description, comments strings, just like the series you’d get with Pandas.